CSN snapshots in hot standby

Started by Heikki Linnakangas, almost 2 years ago, 15 messages

hlinnaka@iki.fi

almost 2 years ago

1 attachment(s)

You cannot run queries on a Hot Standby server until the standby has
seen a running-xacts record. Furthermore, if the subxids cache had
overflowed, you also need to wait for those transactions to finish. That
is usually not a problem, because we write a running-xacts record after
each checkpoint, and most systems don't use so many subtransactions that
the cache would overflow. Still, you can run into it if you're unlucky,
and it's annoying when you do.

It occurred to me that we could replace the known-assigned-xids
machinery with CSN snapshots. We've talked about CSN snapshots many
times in the past, and I think it would make sense on the primary too,
but for starters, we could use it just during Hot Standby.

With CSN-based snapshots, you don't have the limitation with the
fixed-size known-assigned-xids array, and overflowed sub-XIDs are not a
problem either. You can always enter Hot Standby and start accepting
queries as soon as the standby is in a physically consistent state.

I dusted off and rebased the last CSN patch that I found on the mailing
list [1], and modified it so that it's only used during recovery. That
makes some things simpler and less scary. There are no changes to how
transaction commit happens in the primary; the CSN log is only kept
up-to-date in the standby, when commit/abort records are replayed. The
CSN of each transaction is the LSN of its commit record.
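
To make the idea concrete, here is a minimal standalone sketch, not code
from the patch. It assumes a toy in-memory array in place of the pg_csn
SLRU, a snapshot that is just a single LSN, and an "at or before the
snapshot LSN" visibility rule. All demo_* / Demo* names and the sentinel
values are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PostgreSQL's TransactionId and XLogRecPtr. */
typedef uint32_t DemoXid;
typedef uint64_t DemoLsn;

#define DEMO_MAX_XIDS       64
#define DEMO_CSN_INPROGRESS ((DemoLsn) 0)	/* no commit/abort replayed yet */
#define DEMO_CSN_ABORTED    ((DemoLsn) 1)	/* assumed sentinel for aborts */

/* Toy CSN log: one slot per xid, zero-initialized (everything in progress). */
DemoLsn		demo_csn_log[DEMO_MAX_XIDS];

/*
 * Replay of a commit or abort record: stamp the top-level xid and all of
 * its subxids with the same CSN.  For a commit, csn is the LSN of the
 * commit record; for an abort, the caller passes DEMO_CSN_ABORTED.
 */
void
demo_set_csn(DemoXid xid, int nsubxids, const DemoXid *subxids, DemoLsn csn)
{
	assert(xid < DEMO_MAX_XIDS);
	for (int i = 0; i < nsubxids; i++)
	{
		assert(subxids[i] < DEMO_MAX_XIDS);
		demo_csn_log[subxids[i]] = csn;
	}
	demo_csn_log[xid] = csn;
}

/*
 * Visibility test against a snapshot that is a single LSN.  An xid is
 * visible only if it committed at or before the snapshot's LSN (the
 * inclusive comparison is an assumption of this sketch).
 */
bool
demo_xid_visible(DemoXid xid, DemoLsn snapshot_csn)
{
	DemoLsn		csn;

	assert(xid < DEMO_MAX_XIDS);
	csn = demo_csn_log[xid];

	if (csn == DEMO_CSN_INPROGRESS || csn == DEMO_CSN_ABORTED)
		return false;
	return csn <= snapshot_csn;
}
```

In this sketch the snapshot has no list of in-progress XIDs at all, so
the number of subtransactions does not affect the snapshot's size; each
visibility check is a lookup keyed by xid. Calling
demo_set_csn(10, 2, subs, 1000) and then demo_xid_visible(10, 1000)
reports the transaction as visible, while a snapshot taken at LSN 999
does not see it.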

The CSN approach is much simpler than the existing known-assigned-XIDs
machinery, as you can see from "git diff --stat" with this patch:

32 files changed, 773 insertions(+), 1711 deletions(-)

With CSN snapshots, we don't need the known-assigned-XIDs machinery, and
we can get rid of the xact-assignment records altogether. We no longer
need the running-xacts records for Hot Standby either, but I wasn't able
to remove that because it's still used by logical replication, in
snapbuild.c. I have a feeling that that could somehow be simplified too,
but didn't look into it.

This is obviously v18 material, so I'll park this at the July commitfest
for now. There are a bunch of little FIXMEs in the code, and it needs
performance testing, but overall I was surprised how easy this was.

(We ran into this issue particularly hard with Neon, because with Neon
you don't need to perform WAL replay at standby startup. However, when
you don't perform WAL replay, you don't get to see the running-xact
record after the checkpoint either. If the primary is idle, it doesn't
generate new running-xact records, and the standby cannot start Hot
Standby until the next time something happens in the primary. It's
always a potential problem with overflowed sub-XIDs cache, but the lack
of WAL replay made it happen even when there are no subtransactions
involved.)

[1]: /messages/by-id/2020081009525213277261@highgo.ca

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v1-0001-Use-CSN-snapshots-during-Hot-Standby.patch (text/x-patch; charset=UTF-8)

From 1691eea14d1d3593395dec2f513697bf5145135c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 29 Mar 2024 19:53:19 +0200
Subject: [PATCH v1 1/1] Use CSN snapshots during Hot Standby

Replace the known-assigned-XIDs mechanism with a CSN log. The CSN log
(pg_csn) tracks the commit LSN of each transaction, when replaying the
WAL on a standby. It's only used on the standby, and is initialized
from scratch at server startup like pg_subtrans.

Based on 0001-CSN-base-snapshot.patch from
https://www.postgresql.org/message-id/2020081009525213277261%40highgo.ca.
This patch has a long lineage, various CSN patches have been posted
with parts from Stas Kelvich, Movead Li, Ants Aasma, Heikki
Linnakangas, Alexander Kuzmenkov
---
 contrib/pg_visibility/pg_visibility.c         |   12 +-
 src/backend/access/rmgrdesc/xactdesc.c        |   26 -
 src/backend/access/transam/Makefile           |    1 +
 src/backend/access/transam/csn_log.c          |  469 ++++++
 src/backend/access/transam/meson.build        |    1 +
 src/backend/access/transam/transam.c          |    3 +
 src/backend/access/transam/twophase.c         |   34 +-
 src/backend/access/transam/varsup.c           |    2 +
 src/backend/access/transam/xact.c             |  126 +-
 src/backend/access/transam/xlog.c             |  111 +-
 src/backend/access/transam/xlogrecovery.c     |   13 +-
 src/backend/access/transam/xlogutils.c        |    2 +-
 src/backend/postmaster/startup.c              |    2 +-
 src/backend/replication/logical/decode.c      |    8 -
 src/backend/replication/logical/snapbuild.c   |    2 +-
 src/backend/storage/ipc/ipci.c                |    3 +
 src/backend/storage/ipc/procarray.c           | 1453 ++---------------
 src/backend/storage/ipc/standby.c             |   42 +-
 src/backend/storage/lmgr/lwlock.c             |    2 +
 .../utils/activity/wait_event_names.txt       |    1 +
 src/backend/utils/probes.d                    |    2 +
 src/backend/utils/time/snapmgr.c              |   37 +-
 src/bin/initdb/initdb.c                       |    3 +-
 src/include/access/csn_log.h                  |   30 +
 src/include/access/transam.h                  |    3 +
 src/include/access/twophase.h                 |    3 +-
 src/include/access/xact.h                     |   12 +-
 src/include/access/xlogutils.h                |   33 +-
 src/include/storage/lwlock.h                  |    2 +
 src/include/storage/procarray.h               |   13 +-
 src/include/storage/standby.h                 |   26 +-
 src/include/utils/snapshot.h                  |    7 +
 32 files changed, 773 insertions(+), 1711 deletions(-)
 create mode 100644 src/backend/access/transam/csn_log.c
 create mode 100644 src/include/access/csn_log.h

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 1a1a4ff7be7..cb0c49c7a44 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -558,15 +558,10 @@ collect_visibility_data(Oid relid, bool include_pd)
 static TransactionId
 GetStrictOldestNonRemovableTransactionId(Relation rel)
 {
-	RunningTransactions runningTransactions;
-
 	if (rel == NULL || rel->rd_rel->relisshared || RecoveryInProgress())
 	{
 		/* Shared relation: take into account all running xids */
-		runningTransactions = GetRunningTransactionData();
-		LWLockRelease(ProcArrayLock);
-		LWLockRelease(XidGenLock);
-		return runningTransactions->oldestRunningXid;
+		return GetOldestActiveTransactionId(true);
 	}
 	else if (!RELATION_IS_LOCAL(rel))
 	{
@@ -574,10 +569,7 @@ GetStrictOldestNonRemovableTransactionId(Relation rel)
 		 * Normal relation: take into account xids running within the current
 		 * database
 		 */
-		runningTransactions = GetRunningTransactionData();
-		LWLockRelease(ProcArrayLock);
-		LWLockRelease(XidGenLock);
-		return runningTransactions->oldestDatabaseRunningXid;
+		return GetOldestActiveTransactionId(false);
 	}
 	else
 	{
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 41b842d80ec..c1c3974ac9b 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -421,17 +421,6 @@ xact_desc_prepare(StringInfo buf, uint8 info, xl_xact_prepare *xlrec, RepOriginI
 						 timestamptz_to_str(parsed.origin_timestamp));
 }
 
-static void
-xact_desc_assignment(StringInfo buf, xl_xact_assignment *xlrec)
-{
-	int			i;
-
-	appendStringInfoString(buf, "subxacts:");
-
-	for (i = 0; i < xlrec->nsubxacts; i++)
-		appendStringInfo(buf, " %u", xlrec->xsub[i]);
-}
-
 void
 xact_desc(StringInfo buf, XLogReaderState *record)
 {
@@ -459,18 +448,6 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		xact_desc_prepare(buf, XLogRecGetInfo(record), xlrec,
 						  XLogRecGetOrigin(record));
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) rec;
-
-		/*
-		 * Note that we ignore the WAL record's xid, since we're more
-		 * interested in the top-level xid that issued the record and which
-		 * xids are being reported here.
-		 */
-		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
-		xact_desc_assignment(buf, xlrec);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		xl_xact_invals *xlrec = (xl_xact_invals *) rec;
@@ -502,9 +479,6 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ABORT_PREPARED:
 			id = "ABORT_PREPARED";
 			break;
-		case XLOG_XACT_ASSIGNMENT:
-			id = "ASSIGNMENT";
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			id = "INVALIDATION";
 			break;
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..2520d77c7c8 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	clog.o \
 	commit_ts.o \
+	csn_log.o \
 	generic_xlog.o \
 	multixact.o \
 	parallel.o \
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
new file mode 100644
index 00000000000..fcaf3a3bca4
--- /dev/null
+++ b/src/backend/access/transam/csn_log.c
@@ -0,0 +1,469 @@
+/*-----------------------------------------------------------------------------
+ *
+ * csn_log.c
+ *		Track commit sequence numbers of finished transactions
+ *
+ * This module provides an SLRU to store the CSN of each transaction.  This
+ * mapping needs to be kept only for xids greater than oldestXid, but
+ * that can require arbitrarily large amounts of memory in case of long-lived
+ * transactions.  Because of the same lifetime and persistency requirements,
+ * this module is quite similar to subtrans.c
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_log.c
+ *
+ *-----------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/clog.h"
+#include "access/csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+/*
+ * Defines for CSNLog page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE, and CSNLog segment numbering at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCSNLog (see CSNLogPagePrecedes).
+ */
+
+/* We store the commit CSN for each xid */
+#define CSN_LOG_XACTS_PER_PAGE (BLCKSZ / sizeof(XLogRecPtr))
+
+#define TransactionIdToPage(xid)	((xid) / (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+
+#define PgIndexToTransactionId(pageno, idx) (CSN_LOG_XACTS_PER_PAGE * (pageno) + idx)
+
+
+
+/*
+ * Link to shared-memory data structures for CSNLog control
+ */
+static SlruCtlData CSNLogCtlData;
+#define CsnlogCtl (&CSNLogCtlData)
+
+static int	ZeroCSNLogPage(int pageno);
+static bool CSNLogPagePrecedes(int64 page1, int64 page2);
+static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+									  TransactionId *subxids,
+									  XLogRecPtr csn, int pageno);
+static void CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn,
+									  int slotno);
+
+
+/*
+ * CSNLogSetCSN
+ *
+ * Record XidCSN of transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transactionid for a top level commit or abort. It can also be a
+ * subtransaction when we record transaction aborts.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ *
+ * csn is the commit sequence number of the transaction. It should be
+ * AbortedCSN for abort cases.
+ */
+void
+CSNLogSetCSN(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, XLogRecPtr csn)
+{
+	int			pageno;
+	int			i = 0;
+	int			offset = 0;
+
+	Assert(TransactionIdIsValid(xid));
+
+	pageno = TransactionIdToPage(xid);		/* get page of parent */
+	for (;;)
+	{
+		int			num_on_page = 0;
+
+		while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+		{
+			num_on_page++;
+			i++;
+		}
+
+		CSNLogSetPageStatus(xid,
+							num_on_page, subxids + offset,
+							csn, pageno);
+		if (i >= nsubxids)
+			break;
+
+		offset = i;
+		pageno = TransactionIdToPage(subxids[offset]);
+		xid = InvalidTransactionId;
+	}
+}
+
+/*
+ * Record the final state of transaction entries in the csn log for
+ * all entries on a single page.  Atomic only on this page.
+ *
+ * Otherwise API is same as TransactionIdSetTreeStatus()
+ */
+static void
+CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+						   TransactionId *subxids,
+						   XLogRecPtr csn, int pageno)
+{
+	int			slotno;
+	int			i;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CsnlogCtl, pageno, true, xid);
+
+	/* Subtransactions first, if needed ... */
+	for (i = 0; i < nsubxids; i++)
+	{
+		Assert(CsnlogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+		CSNLogSetCSNInSlot(subxids[i],	csn, slotno);
+	}
+
+	/* ... then the main transaction */
+	if (TransactionIdIsValid(xid))
+		CSNLogSetCSNInSlot(xid, csn, slotno);
+
+	CsnlogCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Sets the commit status of a single transaction.
+ */
+static void
+CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn, int slotno)
+{
+	int			entryno = TransactionIdToPgIndex(xid);
+	XLogRecPtr *ptr;
+
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+
+	*ptr = csn;
+}
+
+/*
+ * Interrogate the state of a transaction in the log.
+ *
+ * NB: this is a low-level routine and is NOT the preferred entry point
+ * for most uses; TransactionIdGetXidCSN() in csn_snapshot.c is the
+ * intended caller.
+ */
+XLogRecPtr
+CSNLogGetCSNByXid(TransactionId xid)
+{
+	int			pageno = TransactionIdToPage(xid);
+	int			entryno = TransactionIdToPgIndex(xid);
+	int			slotno;
+	XLogRecPtr *ptr;
+	XLogRecPtr	xid_csn;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Can't ask about stuff that might not be around anymore */
+	Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+
+	slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+	xid_csn = *ptr;
+
+	LWLockRelease(SimpleLruGetBankLock(CsnlogCtl, pageno));
+
+	return xid_csn;
+}
+
+/*
+ * Number of shared CSNLog buffers.
+ */
+static Size
+CSNLogShmemBuffers(void)
+{
+	return Min(32, Max(16, NBuffers / 512));
+}
+
+/*
+ * Reserve shared memory for CsnlogCtl.
+ */
+Size
+CSNLogShmemSize(void)
+{
+	// FIXME: skip if not InHotStandby?
+	return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
+}
+
+/*
+ * Initialization of shared memory for CSNLog.
+ */
+void
+CSNLogShmemInit(void)
+{
+	CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
+	SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
+				  "pg_csn", LWTRANCHE_CSN_LOG_BUFFER,
+				  LWTRANCHE_CSN_LOG_SLRU, SYNC_HANDLER_NONE, false);
+	//SlruPagePrecedesUnitTests(CsnlogCtl, SUBTRANS_XACTS_PER_PAGE);
+}
+
+/*
+ * This func must be called ONCE on system install.  It creates the initial
+ * CSNLog segment.  The pg_csn directory is assumed to have been
+ * created by initdb, and CSNLogShmemInit must have been called already.
+ *
+ * Note: it's not really necessary to create the initial segment now,
+ * since slru.c would create it on first write anyway.  But we may as well
+ * do it to be sure the directory is set up correctly.
+ */
+void
+BootStrapCSNLog(void)
+{
+	int			slotno;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, 0);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Create and zero the first page of the commit log */
+	slotno = ZeroCSNLogPage(0);
+
+	/* Make sure it's written out */
+	SimpleLruWritePage(CsnlogCtl, slotno);
+	Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Initialize (or reinitialize) a page of CSNLog to zeroes.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCSNLogPage(int pageno)
+{
+	return SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * Initialize a page of CSNLog based on pg_xact.
+ *
+ */
+static void
+InitCSNLogPage(int pageno, TransactionId *xid, TransactionId nextXid, XLogRecPtr csn)
+{
+	XLogRecPtr	dummy;
+	int			slotno;
+
+	slotno = ZeroCSNLogPage(pageno);
+
+	while (*xid < nextXid && TransactionIdToPage(*xid) == pageno)
+	{
+		XidStatus status = TransactionIdGetStatus(*xid, &dummy);
+		if (status == TRANSACTION_STATUS_COMMITTED ||
+			status == TRANSACTION_STATUS_ABORTED)
+			CSNLogSetCSNInSlot(*xid, csn, slotno);
+
+		TransactionIdAdvance(*xid);
+	}
+	SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * after StartupXLOG has initialized ShmemVariableCache->nextXid.
+ *
+ * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
+ * if there are none.
+ *
+ * All transactions that have already completed are marked with 'csn'.
+ */
+void
+StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn)
+{
+	TransactionId xid;
+	FullTransactionId nextXid;
+	int			startPage;
+	int			endPage;
+	LWLock	   *prevlock = NULL;
+	LWLock	   *lock;
+
+	/*
+	 * Since we don't expect pg_csn to be valid across crashes, we
+	 * initialize the currently-active page(s) to zeroes during startup.
+	 * Whenever we advance into a new page, ExtendCSNLog will likewise
+	 * zero the new page without regard to whatever was previously on disk.
+	 */
+	startPage = TransactionIdToPage(oldestActiveXID);
+	nextXid = TransamVariables->nextXid;
+	endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid));
+
+	Assert(TransactionIdIsValid(oldestActiveXID));
+	Assert(FullTransactionIdIsValid(nextXid));
+
+	xid = oldestActiveXID;
+	for (;;)
+	{
+		lock = SimpleLruGetBankLock(CsnlogCtl, startPage);
+		if (prevlock != lock)
+		{
+			if (prevlock)
+				LWLockRelease(prevlock);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			prevlock = lock;
+		}
+
+		InitCSNLogPage(startPage, &xid, XidFromFullTransactionId(nextXid), csn);
+		if (startPage == endPage)
+			break;
+
+		startPage++;
+		/* must account for wraparound */
+		if (startPage > TransactionIdToPage(MaxTransactionId))
+			startPage = 0;
+	}
+
+	LWLockRelease(lock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely as a debugging aid.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(false);
+	SimpleLruWriteAll(CsnlogCtl, false);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely to improve the odds that writing of dirty pages is done by
+	 * the checkpoint process and not by backends.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(true);
+	SimpleLruWriteAll(CsnlogCtl, true);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(true);
+}
+
+/*
+ * Make sure that CSNLog has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty clog or xlog page to make room
+ * in shared memory.
+ */
+void
+ExtendCSNLog(TransactionId newestXact)
+{
+	int64		pageno;
+	LWLock	   *lock;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToPgIndex(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToPage(newestXact);
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Zero the page and make an XLOG entry about it */
+	ZeroCSNLogPage(pageno);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Remove all CSNLog segments before the one holding the passed
+ * transaction ID.
+ *
+ * This is normally called during checkpoint, with oldestXact being the
+ * oldest TransactionXmin of any running transaction.
+ */
+void
+TruncateCSNLog(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate. We step
+	 * back one transaction to avoid passing a cutoff page that hasn't been
+	 * created yet in the rare case that oldestXact would be the first item on
+	 * a page and oldestXact == next XID.  In that case, if we didn't subtract
+	 * one, we'd trigger SimpleLruTruncate's wraparound detection.
+	 */
+	TransactionIdRetreat(oldestXact);
+	cutoffPage = TransactionIdToPage(oldestXact);
+
+	SimpleLruTruncate(CsnlogCtl, cutoffPage);
+}
+
+/*
+ * Decide which of two CSNLog page numbers is "older" for truncation
+ * purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic.  However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CSNLogPagePrecedes(int64 page1, int64 page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * CSN_LOG_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId;
+	xid2 = ((TransactionId) page2) * CSN_LOG_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId;
+
+	return TransactionIdPrecedes(xid1, xid2);
+}
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 8a3522557cd..cf41df2971f 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -2,6 +2,7 @@
 
 backend_sources += files(
   'clog.c',
+  'csn_log.c',
   'commit_ts.c',
   'generic_xlog.c',
   'multixact.c',
diff --git a/src/backend/access/transam/transam.c b/src/backend/access/transam/transam.c
index 75b5325df8b..2db17fa6928 100644
--- a/src/backend/access/transam/transam.c
+++ b/src/backend/access/transam/transam.c
@@ -377,6 +377,9 @@ TransactionIdLatest(TransactionId mainxid,
  * Also, because we group transactions on the same clog page to conserve
  * storage, we might return the LSN of a later transaction that falls into
  * the same group.
+ *
+ * XXX: Now that we have CSN-log, should we use that during recovery? Or
+ * rename this function to reduce confusion.
  */
 XLogRecPtr
 TransactionIdGetCommitLSN(TransactionId xid)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8090ac9fc19..99b1f4f8979 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1953,20 +1954,13 @@ restoreTwoPhaseData(void)
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return TransamVariables->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
- *
- * If xids_p and nxids_p are not NULL, pointer to a palloc'd array of all
- * top-level xids is stored in *xids_p. The number of entries in the array
- * is returned in *nxids_p.
  */
 TransactionId
-PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
+PrescanPreparedTransactions(void)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
 	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId result = origNextXid;
-	TransactionId *xids = NULL;
-	int			nxids = 0;
-	int			allocsize = 0;
 	int			i;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
@@ -1994,34 +1988,10 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		if (TransactionIdPrecedes(xid, result))
 			result = xid;
 
-		if (xids_p)
-		{
-			if (nxids == allocsize)
-			{
-				if (nxids == 0)
-				{
-					allocsize = 10;
-					xids = palloc(allocsize * sizeof(TransactionId));
-				}
-				else
-				{
-					allocsize = allocsize * 2;
-					xids = repalloc(xids, allocsize * sizeof(TransactionId));
-				}
-			}
-			xids[nxids++] = xid;
-		}
-
 		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 
-	if (xids_p)
-	{
-		*xids_p = xids;
-		*nxids_p = nxids;
-	}
-
 	return result;
 }
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index fb6a86afcb1..f1162ff1393 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -15,6 +15,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -202,6 +203,7 @@ GetNewTransactionId(bool isSubXact)
 	 * Extend pg_subtrans and pg_commit_ts too.
 	 */
 	ExtendCLOG(xid);
+	ExtendCSNLog(xid);
 	ExtendCommitTs(xid);
 	ExtendSUBTRANS(xid);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index df5a67e4c31..54569546025 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -207,7 +208,6 @@ typedef struct TransactionStateData
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;	/* entry-time xact r/o state */
 	bool		startedInRecovery;	/* did we start in recovery? */
-	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		parallelChildXact;	/* is any parent transaction parallel? */
 	bool		chain;			/* start a new block after this one */
@@ -247,13 +247,6 @@ static TransactionStateData TopTransactionStateData = {
 	.topXidLogged = false,
 };
 
-/*
- * unreportedXids holds XIDs of all subtransactions that have not yet been
- * reported in an XLOG_XACT_ASSIGNMENT record.
- */
-static int	nUnreportedXids;
-static TransactionId unreportedXids[PGPROC_MAX_CACHED_SUBXIDS];
-
 static TransactionState CurrentTransactionState = &TopTransactionStateData;
 
 /*
@@ -529,18 +522,6 @@ GetCurrentFullTransactionIdIfAny(void)
 	return CurrentTransactionState->fullTransactionId;
 }
 
-/*
- *	MarkCurrentTransactionIdLoggedIfAny
- *
- * Remember that the current xid - if it is assigned - now has been wal logged.
- */
-void
-MarkCurrentTransactionIdLoggedIfAny(void)
-{
-	if (FullTransactionIdIsValid(CurrentTransactionState->fullTransactionId))
-		CurrentTransactionState->didLogXid = true;
-}
-
 /*
  * IsSubxactTopXidLogPending
  *
@@ -633,7 +614,6 @@ AssignTransactionId(TransactionState s)
 {
 	bool		isSubXact = (s->parent != NULL);
 	ResourceOwner currentOwner;
-	bool		log_unknown_top = false;
 
 	/* Assert that caller didn't screw up */
 	Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -677,20 +657,6 @@ AssignTransactionId(TransactionState s)
 		pfree(parents);
 	}
 
-	/*
-	 * When wal_level=logical, guarantee that a subtransaction's xid can only
-	 * be seen in the WAL stream if its toplevel xid has been logged before.
-	 * If necessary we log an xact_assignment record with fewer than
-	 * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
-	 * for a transaction even though it appears in a WAL record, we just might
-	 * superfluously log something. That can happen when an xid is included
-	 * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
-	 * xl_standby_locks.
-	 */
-	if (isSubXact && XLogLogicalInfoActive() &&
-		!TopTransactionStateData.didLogXid)
-		log_unknown_top = true;
-
 	/*
 	 * Generate a new FullTransactionId and record its xid in PGPROC and
 	 * pg_subtrans.
@@ -726,59 +692,6 @@ AssignTransactionId(TransactionState s)
 	XactLockTableInsert(XidFromFullTransactionId(s->fullTransactionId));
 
 	CurrentResourceOwner = currentOwner;
-
-	/*
-	 * Every PGPROC_MAX_CACHED_SUBXIDS assigned transaction ids within each
-	 * top-level transaction we issue a WAL record for the assignment. We
-	 * include the top-level xid and all the subxids that have not yet been
-	 * reported using XLOG_XACT_ASSIGNMENT records.
-	 *
-	 * This is required to limit the amount of shared memory required in a hot
-	 * standby server to keep track of in-progress XIDs. See notes for
-	 * RecordKnownAssignedTransactionIds().
-	 *
-	 * We don't keep track of the immediate parent of each subxid, only the
-	 * top-level transaction that each subxact belongs to. This is correct in
-	 * recovery only because aborted subtransactions are separately WAL
-	 * logged.
-	 *
-	 * This is correct even for the case where several levels above us didn't
-	 * have an xid assigned as we recursed up to them beforehand.
-	 */
-	if (isSubXact && XLogStandbyInfoActive())
-	{
-		unreportedXids[nUnreportedXids] = XidFromFullTransactionId(s->fullTransactionId);
-		nUnreportedXids++;
-
-		/*
-		 * ensure this test matches similar one in
-		 * RecoverPreparedTransactions()
-		 */
-		if (nUnreportedXids >= PGPROC_MAX_CACHED_SUBXIDS ||
-			log_unknown_top)
-		{
-			xl_xact_assignment xlrec;
-
-			/*
-			 * xtop is always set by now because we recurse up transaction
-			 * stack to the highest unassigned xid and then come back down
-			 */
-			xlrec.xtop = GetTopTransactionId();
-			Assert(TransactionIdIsValid(xlrec.xtop));
-			xlrec.nsubxacts = nUnreportedXids;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, MinSizeOfXactAssignment);
-			XLogRegisterData((char *) unreportedXids,
-							 nUnreportedXids * sizeof(TransactionId));
-
-			(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT);
-
-			nUnreportedXids = 0;
-			/* mark top, not current xact as having been logged */
-			TopTransactionStateData.didLogXid = true;
-		}
-	}
 }
 
 /*
@@ -1540,6 +1453,7 @@ RecordTransactionCommit(void)
 
 	/* Reset XactLastRecEnd until the next transaction writes something */
 	XactLastRecEnd = 0;
+
 cleanup:
 	/* Clean up local data */
 	if (rels)
@@ -1922,13 +1836,6 @@ AtSubAbort_childXids(void)
 	s->childXids = NULL;
 	s->nChildXids = 0;
 	s->maxChildXids = 0;
-
-	/*
-	 * We could prune the unreportedXids array here. But we don't bother. That
-	 * would potentially reduce number of XLOG_XACT_ASSIGNMENT records but it
-	 * would likely introduce more CPU time into the more common paths, so we
-	 * choose not to do that.
-	 */
 }
 
 /* ----------------------------------------------------------------
@@ -2092,12 +1999,6 @@ StartTransaction(void)
 	currentCommandId = FirstCommandId;
 	currentCommandIdUsed = false;
 
-	/*
-	 * initialize reported xid accounting
-	 */
-	nUnreportedXids = 0;
-	s->didLogXid = false;
-
 	/*
 	 * must initialize resource-management stuff first
 	 */
@@ -6118,7 +6019,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 	TransactionTreeSetCommitTsData(xid, parsed->nsubxacts, parsed->subxacts,
 								   commit_time, origin_id);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/*
 		 * Mark the transaction committed in pg_xact.
@@ -6138,6 +6039,11 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		 */
 		RecordKnownAssignedTransactionIds(max_xid);
 
+		/*
+		 * TODO: We must mark CSNLOG first
+		 */
+		CSNLogSetCSN(xid, parsed->nsubxacts, parsed->subxacts, lsn);
+
 		/*
 		 * Mark the transaction committed in pg_xact. We use async commit
 		 * protocol during recovery to provide information on database
@@ -6150,9 +6056,9 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		TransactionIdAsyncCommitTree(xid, parsed->nsubxacts, parsed->subxacts, lsn);
 
 		/*
-		 * We must mark clog before we update the ProcArray.
+		 * We must mark clog and csnlog before we update the ProcArray.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ExpireTreeKnownAssignedTransactionIds(max_xid, lsn);
 
 		/*
 		 * Send any cache invalidations attached to the commit. We must
@@ -6258,7 +6164,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 								  parsed->subxacts);
 	AdvanceNextFullTransactionIdPastXid(max_xid);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/* Mark the transaction aborted in pg_xact, no need for async stuff */
 		TransactionIdAbortTree(xid, parsed->nsubxacts, parsed->subxacts);
@@ -6282,7 +6188,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 		/*
 		 * We must update the ProcArray after we have marked clog.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ExpireTreeKnownAssignedTransactionIds(max_xid, lsn);
 
 		/*
 		 * There are no invalidation messages to send or undo.
@@ -6390,14 +6296,6 @@ xact_redo(XLogReaderState *record)
 					   XLogRecGetOrigin(record));
 		LWLockRelease(TwoPhaseStateLock);
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record);
-
-		if (standbyState >= STANDBY_INITIALIZED)
-			ProcArrayApplyXidAssignment(xlrec->xtop,
-										xlrec->nsubxacts, xlrec->xsub);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3bdd9a2ddd3..3c5b36daf25 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -48,6 +48,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
@@ -950,8 +951,6 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	END_CRIT_SECTION();
 
-	MarkCurrentTransactionIdLoggedIfAny();
-
 	/*
 	 * Mark top transaction id is logged (if needed) so that we should not try
 	 * to log it again with the next WAL record in the current subtransaction.
@@ -5099,6 +5098,7 @@ BootStrapXLOG(void)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCSNLog();
 	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
@@ -5693,16 +5693,16 @@ StartupXLOG(void)
 		 */
 		if (ArchiveRecoveryRequested && EnableHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
+			FullTransactionId latestCompletedXid;
 
 			ereport(DEBUG1,
 					(errmsg_internal("initializing for hot standby")));
+			InHotStandby = true;
 
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
-				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanPreparedTransactions();
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -5717,38 +5717,19 @@ StartupXLOG(void)
 			 */
 			StartupSUBTRANS(oldestActiveXID);
 
+			latestCompletedXid = checkPoint.nextXid;
+			FullTransactionIdRetreat(&latestCompletedXid);
+			TransamVariables->latestCompletedXid = latestCompletedXid;
+
+			StartupCSNLog(oldestActiveXID, RedoRecPtr);
+
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
+
 			/*
-			 * If we're beginning at a shutdown checkpoint, we know that
-			 * nothing was running on the primary at this point. So fake-up an
-			 * empty running-xacts record and use that here and now. Recover
-			 * additional standby state for prepared transactions.
+			 * Recover additional standby state for prepared transactions.
 			 */
 			if (wasShutdown)
-			{
-				RunningTransactionsData running;
-				TransactionId latestCompletedXid;
-
-				/*
-				 * Construct a RunningTransactions snapshot representing a
-				 * shut down server, with only prepared transactions still
-				 * alive. We're never overflowed at this point because all
-				 * subxids are listed with their parent prepared transactions.
-				 */
-				running.xcnt = nxids;
-				running.subxcnt = 0;
-				running.subxid_overflow = false;
-				running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-				running.oldestRunningXid = oldestActiveXID;
-				latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-				TransactionIdRetreat(latestCompletedXid);
-				Assert(TransactionIdIsNormal(latestCompletedXid));
-				running.latestCompletedXid = latestCompletedXid;
-				running.xids = xids;
-
-				ProcArrayApplyRecoveryInfo(&running);
-
 				StandbyRecoverPreparedTransactions();
-			}
 		}
 
 		/*
@@ -5832,7 +5813,7 @@ StartupXLOG(void)
 	 * This information is not quite needed yet, but it is positioned here so
 	 * as potential problems are detected before any on-disk change is done.
 	 */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanPreparedTransactions();
 
 	/*
 	 * Allow ordinary WAL segment creation before possibly switching to a new
@@ -5993,9 +5974,18 @@ StartupXLOG(void)
 	 * Start up subtrans, if not already done for hot standby.  (commit
 	 * timestamps are started below, if necessary.)
 	 */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
+	{
 		StartupSUBTRANS(oldestActiveXID);
 
+		/*
+		 * TODO: we don't need to update CSN log from now on, but it's still
+		 * required by snapshots that were taken before recovery ended.  We
+		 * just let it be, but it would be nice to truncate it to 0 after all
+		 * the snapshots are gone.
+		 */
+	}
+
 	/*
 	 * Perform end of recovery actions for any SLRUs that need it.
 	 */
@@ -6092,7 +6082,7 @@ StartupXLOG(void)
 	 * particularly critical for prepared 2PC transactions, that would still
 	 * need to be included in snapshots once recovery has ended.
 	 */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/*
@@ -6842,7 +6832,7 @@ CreateCheckPoint(int flags)
 	 * starting snapshot of locks and transactions.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		checkPoint.oldestActiveXid = GetOldestActiveTransactionId();
+		checkPoint.oldestActiveXid = GetOldestActiveTransactionId(true);
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -7226,7 +7216,10 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
+	{
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+	}
 
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(false);
@@ -7396,6 +7389,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
 	CheckPointCLOG();
+	CheckPointCSNLog();
 	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
@@ -7692,7 +7686,10 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
+	{
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+	}
 
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(true);
@@ -8177,38 +8174,15 @@ xlog_redo(XLogReaderState *record)
 
 		/*
 		 * If we see a shutdown checkpoint, we know that nothing was running
-		 * on the primary at this point. So fake-up an empty running-xacts
-		 * record and use that here and now. Recover additional standby state
-		 * for prepared transactions.
+		 * on the primary at this point, except for prepared transactions.
+		 * Recover additional standby state for prepared transactions.
 		 */
-		if (standbyState >= STANDBY_INITIALIZED)
+		if (InHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
 			TransactionId oldestActiveXID;
-			TransactionId latestCompletedXid;
-			RunningTransactionsData running;
 
-			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
-
-			/*
-			 * Construct a RunningTransactions snapshot representing a shut
-			 * down server, with only prepared transactions still alive. We're
-			 * never overflowed at this point because all subxids are listed
-			 * with their parent prepared transactions.
-			 */
-			running.xcnt = nxids;
-			running.subxcnt = 0;
-			running.subxid_overflow = false;
-			running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-			running.oldestRunningXid = oldestActiveXID;
-			latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-			TransactionIdRetreat(latestCompletedXid);
-			Assert(TransactionIdIsNormal(latestCompletedXid));
-			running.latestCompletedXid = latestCompletedXid;
-			running.xids = xids;
-
-			ProcArrayApplyRecoveryInfo(&running);
+			oldestActiveXID = PrescanPreparedTransactions();
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
 
 			StandbyRecoverPreparedTransactions();
 		}
@@ -8274,6 +8248,13 @@ xlog_redo(XLogReaderState *record)
 								  checkPoint.oldestXid))
 			SetTransactionIdLimit(checkPoint.oldestXid,
 								  checkPoint.oldestXidDB);
+
+		/*
+		 * Update oldestActiveXid
+		 */
+		if (InHotStandby)
+			ProcArrayUpdateOldestRunningXid(checkPoint.oldestActiveXid);
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index b2fe2d04ccf..dac9f639e73 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1985,10 +1985,9 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	SpinLockRelease(&XLogRecoveryCtl->info_lck);
 
 	/*
-	 * If we are attempting to enter Hot Standby mode, process XIDs we see
+	 * In Hot Standby mode, process XIDs we see
 	 */
-	if (standbyState >= STANDBY_INITIALIZED &&
-		TransactionIdIsValid(record->xl_xid))
+	if (InHotStandby && TransactionIdIsValid(record->xl_xid))
 		RecordKnownAssignedTransactionIds(record->xl_xid);
 
 	/*
@@ -2264,7 +2263,7 @@ CheckRecoveryConsistency(void)
 	 * run? If so, we can tell postmaster that the database is consistent now,
 	 * enabling connections.
 	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY &&
+	if (InHotStandby &&
 		!LocalHotStandbyActive &&
 		reachedConsistency &&
 		IsUnderPostmaster)
@@ -3709,9 +3708,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						elog(LOG, "waiting for WAL to become available at %X/%X",
 							 LSN_FORMAT_ARGS(RecPtr));
 
-						/* Do background tasks that might benefit us later. */
-						KnownAssignedTransactionIdsIdleMaintenance();
-
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,
@@ -3978,9 +3974,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						streaming_reply_sent = true;
 					}
 
-					/* Do any background tasks that might benefit us later. */
-					KnownAssignedTransactionIdsIdleMaintenance();
-
 					/* Update pg_stat_recovery_prefetch before sleeping. */
 					XLogPrefetcherComputeStats(xlogprefetcher);
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5295b85fe07..bf08c60e93a 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -50,7 +50,7 @@ bool		ignore_invalid_pages = false;
 bool		InRecovery = false;
 
 /* Are we in Hot Standby mode? Only valid in startup process, see xlogutils.h */
-HotStandbyState standbyState = STANDBY_DISABLED;
+bool		InHotStandby = false;
 
 /*
  * During XLOG replay, we may see XLOG records for incremental updates of
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index ef6f98ebcd7..a975865fdd9 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -203,7 +203,7 @@ static void
 StartupProcExit(int code, Datum arg)
 {
 	/* Shutdown the recovery environment */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 }
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7a86f8481db..daf6eb02759 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -270,14 +270,6 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
-		case XLOG_XACT_ASSIGNMENT:
-
-			/*
-			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here. See
-			 * LogicalDecodingProcessRecord.
-			 */
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			{
 				TransactionId xid;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e37e22f4417..3261ba12832 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -27,7 +27,7 @@
  * removed. This is achieved by using the replication slot mechanism.
  *
  * As the percentage of transactions modifying the catalog normally is fairly
- * small in comparisons to ones only manipulating user data, we keep track of
+ * small in comparison to ones only manipulating user data, we keep track of
  * the committed catalog modifying ones inside [xmin, xmax) instead of keeping
  * track of all running transactions like it's done in a normal snapshot. Note
  * that we're generally only looking at transactions that have acquired an
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 90c84fec27a..465a04770be 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/subtrans.h"
@@ -126,6 +127,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, XLogRecoveryShmemSize());
 	size = add_size(size, CLOGShmemSize());
+	size = add_size(size, CSNLogShmemSize());
 	size = add_size(size, CommitTsShmemSize());
 	size = add_size(size, SUBTRANSShmemSize());
 	size = add_size(size, TwoPhaseShmemSize());
@@ -304,6 +306,7 @@ CreateOrAttachShmemStructs(void)
 	XLogPrefetchShmemInit();
 	XLogRecoveryShmemInit();
 	CLOGShmemInit();
+	CSNLogShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 88a6d504dff..0689d5b9838 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -19,20 +19,10 @@
  * myProcLocks lists.  They can be distinguished from regular backend PGPROCs
  * at need by checking for pid == 0.
  *
- * During hot standby, we also keep a list of XIDs representing transactions
- * that are known to be running on the primary (or more precisely, were running
- * as of the current point in the WAL stream).  This list is kept in the
- * KnownAssignedXids array, and is updated by watching the sequence of
- * arriving XIDs.  This is necessary because if we leave those XIDs out of
- * snapshots taken for standby queries, then they will appear to be already
- * complete, leading to MVCC failures.  Note that in hot standby, the PGPROC
- * array represents standby processes, which by definition are not running
- * transactions that have XIDs.
- *
- * It is perhaps possible for a backend on the primary to terminate without
- * writing an abort record for its transaction.  While that shouldn't really
- * happen, it would tie up KnownAssignedXids indefinitely, so we protect
- * ourselves by pruning the array when a valid list of running XIDs arrives.
+ * During hot standby, we don't have PGPROC entries representing transactions
+ * running on the primary.  Instead, each snapshot taken during recovery
+ * contains a Commit-Sequence Number (CSN), which is used to determine which
+ * XIDs the snapshot still considers to be running.
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -47,6 +37,7 @@
 
 #include <signal.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -73,22 +64,8 @@ typedef struct ProcArrayStruct
 	int			numProcs;		/* number of valid procs entries */
 	int			maxProcs;		/* allocated size of procs array */
 
-	/*
-	 * Known assigned XIDs handling
-	 */
-	int			maxKnownAssignedXids;	/* allocated size of array */
-	int			numKnownAssignedXids;	/* current # of valid entries */
-	int			tailKnownAssignedXids;	/* index of oldest valid element */
-	int			headKnownAssignedXids;	/* index of newest element, + 1 */
-
-	/*
-	 * Highest subxid that has been removed from KnownAssignedXids array to
-	 * prevent overflow; or InvalidTransactionId if none.  We track this for
-	 * similar reasons to tracking overflowing cached subxids in PGPROC
-	 * entries.  Must hold exclusive ProcArrayLock to change this, and shared
-	 * lock to read it.
-	 */
-	TransactionId lastOverflowedXid;
+	/* In recovery, oldest XID that could still be running on the primary */
+	TransactionId oldest_running_primary_xid;
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
@@ -254,17 +231,6 @@ typedef enum GlobalVisHorizonKind
 	VISHORIZON_TEMP,
 } GlobalVisHorizonKind;
 
-/*
- * Reason codes for KnownAssignedXidsCompress().
- */
-typedef enum KAXCompressReason
-{
-	KAX_NO_SPACE,				/* need to free up space at array end */
-	KAX_PRUNE,					/* we just pruned old entries */
-	KAX_TRANSACTION_END,		/* we just committed/removed some XIDs */
-	KAX_STARTUP_PROCESS_IDLE,	/* startup process is about to sleep */
-} KAXCompressReason;
-
 
 static ProcArrayStruct *procArray;
 
@@ -278,17 +244,8 @@ static TransactionId cachedXidIsNotInProgress = InvalidTransactionId;
 /*
  * Bookkeeping for tracking emulated transactions in recovery
  */
-static TransactionId *KnownAssignedXids;
-static bool *KnownAssignedXidsValid;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
-/*
- * If we're in STANDBY_SNAPSHOT_PENDING state, standbySnapshotPendingXmin is
- * the highest xid that might still be running that we don't have in
- * KnownAssignedXids.
- */
-static TransactionId standbySnapshotPendingXmin;
-
 /*
  * State for visibility checks on different types of relations. See struct
  * GlobalVisState for details. As shared, catalog, normal and temporary
@@ -315,7 +272,7 @@ static long xc_by_my_xact = 0;
 static long xc_by_latest_xid = 0;
 static long xc_by_main_xid = 0;
 static long xc_by_child_xid = 0;
-static long xc_by_known_assigned = 0;
+static long xc_during_recovery = 0;
 static long xc_no_overflow = 0;
 static long xc_slow_answer = 0;
 
@@ -325,7 +282,7 @@ static long xc_slow_answer = 0;
 #define xc_by_latest_xid_inc()		(xc_by_latest_xid++)
 #define xc_by_main_xid_inc()		(xc_by_main_xid++)
 #define xc_by_child_xid_inc()		(xc_by_child_xid++)
-#define xc_by_known_assigned_inc()	(xc_by_known_assigned++)
+#define xc_during_recovery_inc()	(xc_during_recovery++)
 #define xc_no_overflow_inc()		(xc_no_overflow++)
 #define xc_slow_answer_inc()		(xc_slow_answer++)
 
@@ -338,28 +295,12 @@ static void DisplayXidCache(void);
 #define xc_by_latest_xid_inc()		((void) 0)
 #define xc_by_main_xid_inc()		((void) 0)
 #define xc_by_child_xid_inc()		((void) 0)
-#define xc_by_known_assigned_inc()	((void) 0)
+#define xc_during_recovery_inc()	((void) 0)
 #define xc_no_overflow_inc()		((void) 0)
 #define xc_slow_answer_inc()		((void) 0)
 #endif							/* XIDCACHE_DEBUG */
 
-/* Primitives for KnownAssignedXids array handling for standby */
-static void KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock);
-static void KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-								 bool exclusive_lock);
-static bool KnownAssignedXidsSearch(TransactionId xid, bool remove);
-static bool KnownAssignedXidExists(TransactionId xid);
-static void KnownAssignedXidsRemove(TransactionId xid);
-static void KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-										TransactionId *subxids);
-static void KnownAssignedXidsRemovePreceding(TransactionId removeXid);
-static int	KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax);
-static int	KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
-										   TransactionId *xmin,
-										   TransactionId xmax);
-static TransactionId KnownAssignedXidsGetOldestXmin(void);
-static void KnownAssignedXidsDisplay(int trace_level);
-static void KnownAssignedXidsReset(void);
+
 static inline void ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 static void MaintainLatestCompletedXid(TransactionId latestXid);
@@ -395,19 +336,12 @@ ProcArrayShmemSize(void)
 	 * Ideally we'd only create this structure if we were actually doing hot
 	 * standby in the current run, but we don't know that yet at the time
 	 * shared memory is being set up.
+	 *
+	 * XXX: misplaced now
 	 */
 #define TOTAL_MAX_CACHED_SUBXIDS \
 	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
 
-	if (EnableHotStandby)
-	{
-		size = add_size(size,
-						mul_size(sizeof(TransactionId),
-								 TOTAL_MAX_CACHED_SUBXIDS));
-		size = add_size(size,
-						mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS));
-	}
-
 	return size;
 }
 
@@ -434,31 +368,12 @@ CreateSharedProcArray(void)
 		 */
 		procArray->numProcs = 0;
 		procArray->maxProcs = PROCARRAY_MAXPROCS;
-		procArray->maxKnownAssignedXids = TOTAL_MAX_CACHED_SUBXIDS;
-		procArray->numKnownAssignedXids = 0;
-		procArray->tailKnownAssignedXids = 0;
-		procArray->headKnownAssignedXids = 0;
-		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
 		TransamVariables->xactCompletionCount = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
-
-	/* Create or attach to the KnownAssignedXids arrays too, if needed */
-	if (EnableHotStandby)
-	{
-		KnownAssignedXids = (TransactionId *)
-			ShmemInitStruct("KnownAssignedXids",
-							mul_size(sizeof(TransactionId),
-									 TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-		KnownAssignedXidsValid = (bool *)
-			ShmemInitStruct("KnownAssignedXidsValid",
-							mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-	}
 }
 
 /*
@@ -1022,343 +937,30 @@ MaintainLatestCompletedXidRecovery(TransactionId latestXid)
 void
 ProcArrayInitRecovery(TransactionId initializedUptoXID)
 {
-	Assert(standbyState == STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsNormal(initializedUptoXID));
 
 	/*
-	 * we set latestObservedXid to the xid SUBTRANS has been initialized up
-	 * to, so we can extend it from that point onwards in
-	 * RecordKnownAssignedTransactionIds, and when we get consistent in
-	 * ProcArrayApplyRecoveryInfo().
+	 * we set latestObservedXid to the xid SUBTRANS and CSN log have been
+	 * initialized up to, so we can extend it from that point onwards whenever
+	 * we observe new XIDs.
 	 */
 	latestObservedXid = initializedUptoXID;
 	TransactionIdRetreat(latestObservedXid);
 }
 
 /*
- * ProcArrayApplyRecoveryInfo -- apply recovery info about xids
- *
- * Takes us through 3 states: Initialized, Pending and Ready.
- * Normal case is to go all the way to Ready straight away, though there
- * are atypical cases where we need to take it in steps.
- *
- * Use the data about running transactions on the primary to create the initial
- * state of KnownAssignedXids. We also use these records to regularly prune
- * KnownAssignedXids because we know it is possible that some transactions
- * with FATAL errors fail to write abort records, which could cause eventual
- * overflow.
- *
- * See comments for LogStandbySnapshot().
+ * Update the oldest running XID from a checkpoint record.  This allows
+ * truncating SUBTRANS and the CSN log.
  */
 void
-ProcArrayApplyRecoveryInfo(RunningTransactions running)
+ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID)
 {
-	TransactionId *xids;
-	TransactionId advanceNextXid;
-	int			nxids;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-	Assert(TransactionIdIsValid(running->nextXid));
-	Assert(TransactionIdIsValid(running->oldestRunningXid));
-	Assert(TransactionIdIsNormal(running->latestCompletedXid));
-
-	/*
-	 * Remove stale transactions, if any.
-	 */
-	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
-
-	/*
-	 * Adjust TransamVariables->nextXid before StandbyReleaseOldLocks(),
-	 * because we will need it up to date for accessing two-phase transactions
-	 * in StandbyReleaseOldLocks().
-	 */
-	advanceNextXid = running->nextXid;
-	TransactionIdRetreat(advanceNextXid);
-	AdvanceNextFullTransactionIdPastXid(advanceNextXid);
-	Assert(FullTransactionIdIsValid(TransamVariables->nextXid));
-
-	/*
-	 * Remove stale locks, if any.
-	 */
-	StandbyReleaseOldLocks(running->oldestRunningXid);
-
-	/*
-	 * If our snapshot is already valid, nothing else to do...
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		return;
-
-	/*
-	 * If our initial RunningTransactionsData had an overflowed snapshot then
-	 * we knew we were missing some subxids from our snapshot. If we continue
-	 * to see overflowed snapshots then we might never be able to start up, so
-	 * we make another test to see if our snapshot is now valid. We know that
-	 * the missing subxids are equal to or earlier than nextXid. After we
-	 * initialise we continue to apply changes during recovery, so once the
-	 * oldestRunningXid is later than the nextXid from the initial snapshot we
-	 * know that we no longer have missing information and can mark the
-	 * snapshot as valid.
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_PENDING)
-	{
-		/*
-		 * If the snapshot isn't overflowed or if its empty we can reset our
-		 * pending state and use this snapshot instead.
-		 */
-		if (!running->subxid_overflow || running->xcnt == 0)
-		{
-			/*
-			 * If we have already collected known assigned xids, we need to
-			 * throw them away before we apply the recovery snapshot.
-			 */
-			KnownAssignedXidsReset();
-			standbyState = STANDBY_INITIALIZED;
-		}
-		else
-		{
-			if (TransactionIdPrecedes(standbySnapshotPendingXmin,
-									  running->oldestRunningXid))
-			{
-				standbyState = STANDBY_SNAPSHOT_READY;
-				elog(DEBUG1,
-					 "recovery snapshots are now enabled");
-			}
-			else
-				elog(DEBUG1,
-					 "recovery snapshot waiting for non-overflowed snapshot or "
-					 "until oldest active xid on standby is at least %u (now %u)",
-					 standbySnapshotPendingXmin,
-					 running->oldestRunningXid);
-			return;
-		}
-	}
-
-	Assert(standbyState == STANDBY_INITIALIZED);
-
-	/*
-	 * NB: this can be reached at least twice, so make sure new code can deal
-	 * with that.
-	 */
-
-	/*
-	 * Nobody else is running yet, but take locks anyhow
-	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * KnownAssignedXids is sorted so we cannot just add the xids, we have to
-	 * sort them first.
-	 *
-	 * Some of the new xids are top-level xids and some are subtransactions.
-	 * We don't call SubTransSetParent because it doesn't matter yet. If we
-	 * aren't overflowed then all xids will fit in snapshot and so we don't
-	 * need subtrans. If we later overflow, an xid assignment record will add
-	 * xids to subtrans. If RunningTransactionsData is overflowed then we
-	 * don't have enough information to correctly update subtrans anyway.
-	 */
-
-	/*
-	 * Allocate a temporary array to avoid modifying the array passed as
-	 * argument.
-	 */
-	xids = palloc(sizeof(TransactionId) * (running->xcnt + running->subxcnt));
-
-	/*
-	 * Add to the temp array any xids which have not already completed.
-	 */
-	nxids = 0;
-	for (i = 0; i < running->xcnt + running->subxcnt; i++)
-	{
-		TransactionId xid = running->xids[i];
-
-		/*
-		 * The running-xacts snapshot can contain xids that were still visible
-		 * in the procarray when the snapshot was taken, but were already
-		 * WAL-logged as completed. They're not running anymore, so ignore
-		 * them.
-		 */
-		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-			continue;
-
-		xids[nxids++] = xid;
-	}
-
-	if (nxids > 0)
-	{
-		if (procArray->numKnownAssignedXids != 0)
-		{
-			LWLockRelease(ProcArrayLock);
-			elog(ERROR, "KnownAssignedXids is not empty");
-		}
-
-		/*
-		 * Sort the array so that we can add them safely into
-		 * KnownAssignedXids.
-		 *
-		 * We have to sort them logically, because in KnownAssignedXidsAdd we
-		 * call TransactionIdFollowsOrEquals and so on. But we know these XIDs
-		 * come from RUNNING_XACTS, which means there are only normal XIDs
-		 * from the same epoch, so this is safe.
-		 */
-		qsort(xids, nxids, sizeof(TransactionId), xidLogicalComparator);
-
-		/*
-		 * Add the sorted snapshot into KnownAssignedXids.  The running-xacts
-		 * snapshot may include duplicated xids because of prepared
-		 * transactions, so ignore them.
-		 */
-		for (i = 0; i < nxids; i++)
-		{
-			if (i > 0 && TransactionIdEquals(xids[i - 1], xids[i]))
-			{
-				elog(DEBUG1,
-					 "found duplicated transaction %u for KnownAssignedXids insertion",
-					 xids[i]);
-				continue;
-			}
-			KnownAssignedXidsAdd(xids[i], xids[i], true);
-		}
-
-		KnownAssignedXidsDisplay(DEBUG3);
-	}
-
-	pfree(xids);
-
-	/*
-	 * latestObservedXid is at least set to the point where SUBTRANS was
-	 * started up to (cf. ProcArrayInitRecovery()) or to the biggest xid
-	 * RecordKnownAssignedTransactionIds() was called for.  Initialize
-	 * subtrans from thereon, up to nextXid - 1.
-	 *
-	 * We need to duplicate parts of RecordKnownAssignedTransactionId() here,
-	 * because we've just added xids to the known assigned xids machinery that
-	 * haven't gone through RecordKnownAssignedTransactionId().
-	 */
-	Assert(TransactionIdIsNormal(latestObservedXid));
-	TransactionIdAdvance(latestObservedXid);
-	while (TransactionIdPrecedes(latestObservedXid, running->nextXid))
-	{
-		ExtendSUBTRANS(latestObservedXid);
-		TransactionIdAdvance(latestObservedXid);
-	}
-	TransactionIdRetreat(latestObservedXid);	/* = running->nextXid - 1 */
-
-	/* ----------
-	 * Now we've got the running xids we need to set the global values that
-	 * are used to track snapshots as they evolve further.
-	 *
-	 * - latestCompletedXid which will be the xmax for snapshots
-	 * - lastOverflowedXid which shows whether snapshots overflow
-	 * - nextXid
-	 *
-	 * If the snapshot overflowed, then we still initialise with what we know,
-	 * but the recovery snapshot isn't fully valid yet because we know there
-	 * are some subxids missing. We don't know the specific subxids that are
-	 * missing, so conservatively assume the last one is latestObservedXid.
-	 * ----------
-	 */
-	if (running->subxid_overflow)
-	{
-		standbyState = STANDBY_SNAPSHOT_PENDING;
-
-		standbySnapshotPendingXmin = latestObservedXid;
-		procArray->lastOverflowedXid = latestObservedXid;
-	}
-	else
-	{
-		standbyState = STANDBY_SNAPSHOT_READY;
-
-		standbySnapshotPendingXmin = InvalidTransactionId;
-	}
-
-	/*
-	 * If a transaction wrote a commit record in the gap between taking and
-	 * logging the snapshot then latestCompletedXid may already be higher than
-	 * the value from the snapshot, so check before we use the incoming value.
-	 * It also might not yet be set at all.
-	 */
-	MaintainLatestCompletedXidRecovery(running->latestCompletedXid);
-
-	/*
-	 * NB: No need to increment TransamVariables->xactCompletionCount here,
-	 * nobody can see it yet.
-	 */
-
+	procArray->oldest_running_primary_xid = oldestRunningXID;
 	LWLockRelease(ProcArrayLock);
-
-	KnownAssignedXidsDisplay(DEBUG3);
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		elog(DEBUG1, "recovery snapshots are now enabled");
-	else
-		elog(DEBUG1,
-			 "recovery snapshot waiting for non-overflowed snapshot or "
-			 "until oldest active xid on standby is at least %u (now %u)",
-			 standbySnapshotPendingXmin,
-			 running->oldestRunningXid);
 }
 
-/*
- * ProcArrayApplyXidAssignment
- *		Process an XLOG_XACT_ASSIGNMENT WAL record
- */
-void
-ProcArrayApplyXidAssignment(TransactionId topxid,
-							int nsubxids, TransactionId *subxids)
-{
-	TransactionId max_xid;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-
-	max_xid = TransactionIdLatest(topxid, nsubxids, subxids);
-
-	/*
-	 * Mark all the subtransactions as observed.
-	 *
-	 * NOTE: This will fail if the subxid contains too many previously
-	 * unobserved xids to fit into known-assigned-xids. That shouldn't happen
-	 * as the code stands, because xid-assignment records should never contain
-	 * more than PGPROC_MAX_CACHED_SUBXIDS entries.
-	 */
-	RecordKnownAssignedTransactionIds(max_xid);
-
-	/*
-	 * Notice that we update pg_subtrans with the top-level xid, rather than
-	 * the parent xid. This is a difference between normal processing and
-	 * recovery, yet is still correct in all cases. The reason is that
-	 * subtransaction commit is not marked in clog until commit processing, so
-	 * all aborted subtransactions have already been clearly marked in clog.
-	 * As a result we are able to refer directly to the top-level
-	 * transaction's state rather than skipping through all the intermediate
-	 * states in the subtransaction tree. This should be the first time we
-	 * have attempted to SubTransSetParent().
-	 */
-	for (i = 0; i < nsubxids; i++)
-		SubTransSetParent(subxids[i], topxid);
-
-	/* KnownAssignedXids isn't maintained yet, so we're done for now */
-	if (standbyState == STANDBY_INITIALIZED)
-		return;
-
-	/*
-	 * Uses same locking as transaction commit
-	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * Remove subxids from known-assigned-xacts.
-	 */
-	KnownAssignedXidsRemoveTree(InvalidTransactionId, nsubxids, subxids);
-
-	/*
-	 * Advance lastOverflowedXid to be at least the last of these subxids.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, max_xid))
-		procArray->lastOverflowedXid = max_xid;
-
-	LWLockRelease(ProcArrayLock);
-}
 
 /*
  * TransactionIdIsInProgress -- is given transaction running in some backend
@@ -1366,15 +968,17 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * Aside from some shortcuts such as checking RecentXmin and our own Xid,
  * there are four possibilities for finding a running transaction:
  *
- * 1. The given Xid is a main transaction Id.  We will find this out cheaply
+ *
+ * 1. In Hot Standby mode, there are no transactions with XIDs active in the
+ * standby. Check pg_xact to see if the transaction might still be running on
+ * the primary.
+ *
+ * 2. The given Xid is a main transaction Id.  We will find this out cheaply
  * by looking at ProcGlobal->xids.
  *
- * 2. The given Xid is one of the cached subxact Xids in the PGPROC array.
+ * 3. The given Xid is one of the cached subxact Xids in the PGPROC array.
  * We can find this out cheaply too.
  *
- * 3. In Hot Standby mode, we must search the KnownAssignedXids list to see
- * if the Xid is running on the primary.
- *
  * 4. Search the SubTrans tree to find the Xid's topmost parent, and then see
  * if that is running according to ProcGlobal->xids[] or KnownAssignedXids.
  * This is the slowest way, but sadly it has to be done always if the others
@@ -1423,6 +1027,28 @@ TransactionIdIsInProgress(TransactionId xid)
 		return false;
 	}
 
+	/*
+	 * In hot standby mode, check pg_xact.
+	 *
+	 * With normal non-CSN snapshots, you must be careful to check
+	 * TransactionIdIsInProgress() before checking pg_xact, because a
+	 * transaction is marked as committed before it's removed from PGPROC. But
+	 * during recovery we now use CSN snapshots, so I think checking pg_xact
+	 * directly is OK. See the "NOTE" at the top of heapam_visibility.c.
+	 *
+	 * During recovery, the XID cannot be our own transaction, and the CSN
+	 * check handles subtransactions too, so we can skip the rest of the
+	 * function.
+	 */
+	if (RecoveryInProgress())
+	{
+		xc_during_recovery_inc();
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+			return false;
+		else
+			return true;
+	}
+
 	/*
 	 * Also, we can handle our own transaction (and subtransactions) without
 	 * any access to shared memory.
@@ -1439,12 +1065,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	 */
 	if (xids == NULL)
 	{
-		/*
-		 * In hot standby mode, reserve enough space to hold all xids in the
-		 * known-assigned list. If we later finish recovery, we no longer need
-		 * the bigger array, but we don't bother to shrink it.
-		 */
-		int			maxxids = RecoveryInProgress() ? TOTAL_MAX_CACHED_SUBXIDS : arrayP->maxProcs;
+		int			maxxids = arrayP->maxProcs;
 
 		xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId));
 		if (xids == NULL)
@@ -1539,33 +1160,6 @@ TransactionIdIsInProgress(TransactionId xid)
 			xids[nxids++] = pxid;
 	}
 
-	/*
-	 * Step 3: in hot standby mode, check the known-assigned-xids list.  XIDs
-	 * in the list must be treated as running.
-	 */
-	if (RecoveryInProgress())
-	{
-		/* none of the PGPROC entries should have XIDs in hot standby mode */
-		Assert(nxids == 0);
-
-		if (KnownAssignedXidExists(xid))
-		{
-			LWLockRelease(ProcArrayLock);
-			xc_by_known_assigned_inc();
-			return true;
-		}
-
-		/*
-		 * If the KnownAssignedXids overflowed, we have to check pg_subtrans
-		 * too.  Fetch all xids from KnownAssignedXids that are lower than
-		 * xid, since if xid is a subtransaction its parent will always have a
-		 * lower value.  Note we will collect both main and subXIDs here, but
-		 * there's no help for it.
-		 */
-		if (TransactionIdPrecedesOrEquals(xid, procArray->lastOverflowedXid))
-			nxids = KnownAssignedXidsGet(xids, xid);
-	}
-
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -1839,8 +1433,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 * can't be tied to a specific database.)
 		 *
 		 * Also, while in recovery we cannot compute an accurate per-database
-		 * horizon, as all xids are managed via the KnownAssignedXids
-		 * machinery.
+		 * horizon, as all xids are managed via the CSN log machinery.
 		 */
 		if (proc->databaseId == MyDatabaseId ||
 			MyDatabaseId == InvalidOid ||
@@ -1853,11 +1446,14 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	}
 
 	/*
-	 * If in recovery fetch oldest xid in KnownAssignedXids, will be applied
-	 * after lock is released.
+	 * If in recovery, fetch the oldest xid from the last checkpoint.
+	 *
+	 * XXX: that can be much older than what we had previously with the
+	 * known-assigned-xids machinery. I think that's OK, given what this
+	 * function is used for during recovery?
 	 */
 	if (in_recovery)
-		kaxmin = KnownAssignedXidsGetOldestXmin();
+		kaxmin = procArray->oldest_running_primary_xid;
 
 	/*
 	 * No other information from shared state is needed, release the lock
@@ -2176,7 +1772,7 @@ GetSnapshotData(Snapshot snapshot)
 	int			mypgxactoff;
 	TransactionId myxid;
 	uint64		curXactCompletionCount;
-
+	XLogRecPtr	csn = InvalidXLogRecPtr;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -2356,27 +1952,8 @@ GetSnapshotData(Snapshot snapshot)
 	else
 	{
 		/*
-		 * We're in hot standby, so get XIDs from KnownAssignedXids.
-		 *
-		 * We store all xids directly into subxip[]. Here's why:
-		 *
-		 * In recovery we don't know which xids are top-level and which are
-		 * subxacts, a design choice that greatly simplifies xid processing.
-		 *
-		 * It seems like we would want to try to put xids into xip[] only, but
-		 * that is fairly small. We would either need to make that bigger or
-		 * to increase the rate at which we WAL-log xid assignment; neither is
-		 * an appealing choice.
-		 *
-		 * We could try to store xids into xip[] first and then into subxip[]
-		 * if there are too many xids. That only works if the snapshot doesn't
-		 * overflow because we do not search subxip[] in that case. A simpler
-		 * way is to just store all xids in the subxip array because this is
-		 * by far the bigger array. We just leave the xip array empty.
-		 *
-		 * Either way we need to change the way XidInMVCCSnapshot() works
-		 * depending upon when the snapshot was taken, or change normal
-		 * snapshot processing so it matches.
+		 * We're in hot standby, so get the current CSN. That's used to
+		 * determine which transactions committed before this snapshot.
 		 *
 		 * Note: It is possible for recovery to end before we finish taking
 		 * the snapshot, and for newly assigned transaction ids to be added to
@@ -2384,14 +1961,15 @@ GetSnapshotData(Snapshot snapshot)
 		 * those newly added transaction ids would be filtered away, so we
 		 * need not be concerned about them.
 		 */
-		subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
-												  xmax);
+		xmin = procArray->oldest_running_primary_xid;
 
-		if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
-			suboverflowed = true;
+		/*
+		 * Take CSN under ProcArrayLock so the snapshot stays synchronized.
+		 * XXX: not sure that's strictly required.
+		 */
+		csn = TransamVariables->latestCommitLSN;
 	}
 
-
 	/*
 	 * Fetch into local variable while ProcArrayLock is held - the
 	 * LWLockRelease below is a barrier, ensuring this happens inside the
@@ -2507,6 +2085,8 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->lsn = InvalidXLogRecPtr;
 	snapshot->whenTaken = 0;
 
+	snapshot->snapshotCsn = csn;
+
 	return snapshot;
 }
 
@@ -2856,15 +2436,16 @@ GetRunningTransactionData(void)
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
- * This is never executed during recovery so there is no need to look at
- * KnownAssignedXids.
+ * If allDbs is false, skip processes attached to other databases.
+ *
+ * This is never executed during recovery.
  *
  * We don't worry about updating other counters, we want to keep this as
  * simple as possible and leave GetSnapshotData() as the primary code for
  * that bookkeeping.
  */
 TransactionId
-GetOldestActiveTransactionId(void)
+GetOldestActiveTransactionId(bool allDbs)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId *other_xids = ProcGlobal->xids;
@@ -2885,11 +2466,13 @@ GetOldestActiveTransactionId(void)
 	LWLockRelease(XidGenLock);
 
 	/*
-	 * Spin over procArray collecting all xids and subxids.
+	 * Spin over procArray checking each xid.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		PGPROC	   *proc = &allProcs[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -2898,6 +2481,9 @@ GetOldestActiveTransactionId(void)
 		if (!TransactionIdIsNormal(xid))
 			continue;
 
+		if (!allDbs && proc->databaseId != MyDatabaseId)
+			continue;
+
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
@@ -2978,6 +2564,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 * above, so we'll have to wait a bit longer there. We unfortunately can
 	 * *not* use KnownAssignedXidsGetOldestXmin() since the KnownAssignedXids
 	 * machinery can miss values and return an older value than is safe.
+	 * XXX: obsolete comment as KnownAssignedXids is gone. I believe no code
+	 * changes are required here, but TBH I don't understand this function.
 	 */
 	if (!recovery_in_progress)
 	{
@@ -3395,6 +2983,9 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
  * but that would not be true in the case of FATAL errors lagging in array,
  * but we already know those are bogus anyway, so we skip that test.
  *
+ * XXX: KnownAssignedXids is gone so the above comment needs updating. Is
+ * the code still correct? I think so but need to double-check.
+ *
  * If dbOid is valid we skip backends attached to other databases.
  *
  * Be careful to *not* pfree the result from this function. We reuse
@@ -4063,14 +3654,14 @@ static void
 DisplayXidCache(void)
 {
 	fprintf(stderr,
-			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, knownassigned: %ld, nooflo: %ld, slow: %ld\n",
+			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, during_recovery: %ld, nooflo: %ld, slow: %ld\n",
 			xc_by_recent_xmin,
 			xc_by_known_xact,
 			xc_by_my_xact,
 			xc_by_latest_xid,
 			xc_by_main_xid,
 			xc_by_child_xid,
-			xc_by_known_assigned,
+			xc_during_recovery,
 			xc_no_overflow,
 			xc_slow_answer);
 }
@@ -4347,61 +3938,6 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 }
 
 
-/* ----------------------------------------------
- *		KnownAssignedTransactionIds sub-module
- * ----------------------------------------------
- */
-
-/*
- * In Hot Standby mode, we maintain a list of transactions that are (or were)
- * running on the primary at the current point in WAL.  These XIDs must be
- * treated as running by standby transactions, even though they are not in
- * the standby server's PGPROC array.
- *
- * We record all XIDs that we know have been assigned.  That includes all the
- * XIDs seen in WAL records, plus all unobserved XIDs that we can deduce have
- * been assigned.  We can deduce the existence of unobserved XIDs because we
- * know XIDs are assigned in sequence, with no gaps.  The KnownAssignedXids
- * list expands as new XIDs are observed or inferred, and contracts when
- * transaction completion records arrive.
- *
- * During hot standby we do not fret too much about the distinction between
- * top-level XIDs and subtransaction XIDs. We store both together in the
- * KnownAssignedXids list.  In backends, this is copied into snapshots in
- * GetSnapshotData(), taking advantage of the fact that XidInMVCCSnapshot()
- * doesn't care about the distinction either.  Subtransaction XIDs are
- * effectively treated as top-level XIDs and in the typical case pg_subtrans
- * links are *not* maintained (which does not affect visibility).
- *
- * We have room in KnownAssignedXids and in snapshots to hold maxProcs *
- * (1 + PGPROC_MAX_CACHED_SUBXIDS) XIDs, so every primary transaction must
- * report its subtransaction XIDs in a WAL XLOG_XACT_ASSIGNMENT record at
- * least every PGPROC_MAX_CACHED_SUBXIDS.  When we receive one of these
- * records, we mark the subXIDs as children of the top XID in pg_subtrans,
- * and then remove them from KnownAssignedXids.  This prevents overflow of
- * KnownAssignedXids and snapshots, at the cost that status checks for these
- * subXIDs will take a slower path through TransactionIdIsInProgress().
- * This means that KnownAssignedXids is not necessarily complete for subXIDs,
- * though it should be complete for top-level XIDs; this is the same situation
- * that holds with respect to the PGPROC entries in normal running.
- *
- * When we throw away subXIDs from KnownAssignedXids, we need to keep track of
- * that, similarly to tracking overflow of a PGPROC's subxids array.  We do
- * that by remembering the lastOverflowedXid, ie the last thrown-away subXID.
- * As long as that is within the range of interesting XIDs, we have to assume
- * that subXIDs are missing from snapshots.  (Note that subXID overflow occurs
- * on primary when 65th subXID arrives, whereas on standby it occurs when 64th
- * subXID arrives - that is not an error.)
- *
- * Should a backend on primary somehow disappear before it can write an abort
- * record, then we just leave those XIDs in KnownAssignedXids. They actually
- * aborted but we think they were running; the distinction is irrelevant
- * because either way any changes done by the transaction are not visible to
- * backends in the standby.  We prune KnownAssignedXids when
- * XLOG_RUNNING_XACTS arrives, to forestall possible overflow of the
- * array due to such dead XIDs.
- */
-
 /*
  * RecordKnownAssignedTransactionIds
  *		Record the given XID in KnownAssignedXids, as well as any preceding
@@ -4416,7 +3952,7 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 void
 RecordKnownAssignedTransactionIds(TransactionId xid)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsValid(xid));
 	Assert(TransactionIdIsValid(latestObservedXid));
 
@@ -4434,38 +3970,19 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 
 		/*
 		 * Extend subtrans like we do in GetNewTransactionId() during normal
-		 * operation using individual extend steps. Note that we do not need
-		 * to extend clog since its extensions are WAL logged.
-		 *
-		 * This part has to be done regardless of standbyState since we
-		 * immediately start assigning subtransactions to their toplevel
-		 * transactions.
+		 * operation using individual extend steps, and extend the CSN log
+		 * too.  Note that we do not need to extend clog since its extensions
+		 * are WAL logged.
 		 */
 		next_expected_xid = latestObservedXid;
 		while (TransactionIdPrecedes(next_expected_xid, xid))
 		{
 			TransactionIdAdvance(next_expected_xid);
 			ExtendSUBTRANS(next_expected_xid);
+			ExtendCSNLog(next_expected_xid);
 		}
 		Assert(next_expected_xid == xid);
 
-		/*
-		 * If the KnownAssignedXids machinery isn't up yet, there's nothing
-		 * more to do since we don't track assigned xids yet.
-		 */
-		if (standbyState <= STANDBY_INITIALIZED)
-		{
-			latestObservedXid = xid;
-			return;
-		}
-
-		/*
-		 * Add (latestObservedXid, xid] onto the KnownAssignedXids array.
-		 */
-		next_expected_xid = latestObservedXid;
-		TransactionIdAdvance(next_expected_xid);
-		KnownAssignedXidsAdd(next_expected_xid, xid, false);
-
 		/*
 		 * Now we can advance latestObservedXid
 		 */
@@ -4478,780 +3995,54 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 
 /*
  * ExpireTreeKnownAssignedTransactionIds
- *		Remove the given XIDs from KnownAssignedXids.
  *
- * Called during recovery in analogy with and in place of ProcArrayEndTransaction()
+ * Called during recovery in analogy with and in place of
+ * ProcArrayEndTransaction()
  */
 void
-ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
-									  TransactionId *subxids, TransactionId max_xid)
+ExpireTreeKnownAssignedTransactionIds(TransactionId max_xid, XLogRecPtr lsn)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	TransactionId oldest_running_primary_xid;
+
+	Assert(InHotStandby);
+
+	/*
+	 * XXX: this is necessary to keep xmin more recent, which in turn is
+	 * needed to avoid unnecessary recovery conflicts.
+	 *
+	 * XXX: no locking needed because this runs in the startup process.
+	 *
+	 * XXX: the caller actually has a list of XIDs. We could save some clog
+	 * lookups by taking advantage of that list.
+	 */
+	oldest_running_primary_xid = procArray->oldest_running_primary_xid;
+	while (TransactionIdPrecedes(oldest_running_primary_xid, max_xid))
+	{
+		if (!TransactionIdDidCommit(oldest_running_primary_xid) &&
+			!TransactionIdDidAbort(oldest_running_primary_xid))
+		{
+			break;
+		}
+		TransactionIdAdvance(oldest_running_primary_xid);
+	}
+	if (TransactionIdEquals(max_xid, oldest_running_primary_xid))
+		TransactionIdAdvance(oldest_running_primary_xid);
 
 	/*
 	 * Uses same locking as transaction commit
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
-	KnownAssignedXidsRemoveTree(xid, nsubxids, subxids);
-
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
 	/* ... and xactCompletionCount */
 	TransamVariables->xactCompletionCount++;
 
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireAllKnownAssignedTransactionIds
- *		Remove all entries in KnownAssignedXids and reset lastOverflowedXid.
- */
-void
-ExpireAllKnownAssignedTransactionIds(void)
-{
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	KnownAssignedXidsRemovePreceding(InvalidTransactionId);
-
-	/*
-	 * Reset lastOverflowedXid.  Currently, lastOverflowedXid has no use after
-	 * the call of this function.  But do this for unification with what
-	 * ExpireOldKnownAssignedTransactionIds() do.
-	 */
-	procArray->lastOverflowedXid = InvalidTransactionId;
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireOldKnownAssignedTransactionIds
- *		Remove KnownAssignedXids entries preceding the given XID and
- *		potentially reset lastOverflowedXid.
- */
-void
-ExpireOldKnownAssignedTransactionIds(TransactionId xid)
-{
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * Reset lastOverflowedXid if we know all transactions that have been
-	 * possibly running are being gone.  Not doing so could cause an incorrect
-	 * lastOverflowedXid value, which makes extra snapshots be marked as
-	 * suboverflowed.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, xid))
-		procArray->lastOverflowedXid = InvalidTransactionId;
-	KnownAssignedXidsRemovePreceding(xid);
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * KnownAssignedTransactionIdsIdleMaintenance
- *		Opportunistically do maintenance work when the startup process
- *		is about to go idle.
- */
-void
-KnownAssignedTransactionIdsIdleMaintenance(void)
-{
-	KnownAssignedXidsCompress(KAX_STARTUP_PROCESS_IDLE, false);
-}
-
-
-/*
- * Private module functions to manipulate KnownAssignedXids
- *
- * There are 5 main uses of the KnownAssignedXids data structure:
- *
- *	* backends taking snapshots - all valid XIDs need to be copied out
- *	* backends seeking to determine presence of a specific XID
- *	* startup process adding new known-assigned XIDs
- *	* startup process removing specific XIDs as transactions end
- *	* startup process pruning array when special WAL records arrive
- *
- * This data structure is known to be a hot spot during Hot Standby, so we
- * go to some lengths to make these operations as efficient and as concurrent
- * as possible.
- *
- * The XIDs are stored in an array in sorted order --- TransactionIdPrecedes
- * order, to be exact --- to allow binary search for specific XIDs.  Note:
- * in general TransactionIdPrecedes would not provide a total order, but
- * we know that the entries present at any instant should not extend across
- * a large enough fraction of XID space to wrap around (the primary would
- * shut down for fear of XID wrap long before that happens).  So it's OK to
- * use TransactionIdPrecedes as a binary-search comparator.
- *
- * It's cheap to maintain the sortedness during insertions, since new known
- * XIDs are always reported in XID order; we just append them at the right.
- *
- * To keep individual deletions cheap, we need to allow gaps in the array.
- * This is implemented by marking array elements as valid or invalid using
- * the parallel boolean array KnownAssignedXidsValid[].  A deletion is done
- * by setting KnownAssignedXidsValid[i] to false, *without* clearing the
- * XID entry itself.  This preserves the property that the XID entries are
- * sorted, so we can do binary searches easily.  Periodically we compress
- * out the unused entries; that's much cheaper than having to compress the
- * array immediately on every deletion.
- *
- * The actually valid items in KnownAssignedXids[] and KnownAssignedXidsValid[]
- * are those with indexes tail <= i < head; items outside this subscript range
- * have unspecified contents.  When head reaches the end of the array, we
- * force compression of unused entries rather than wrapping around, since
- * allowing wraparound would greatly complicate the search logic.  We maintain
- * an explicit tail pointer so that pruning of old XIDs can be done without
- * immediately moving the array contents.  In most cases only a small fraction
- * of the array contains valid entries at any instant.
- *
- * Although only the startup process can ever change the KnownAssignedXids
- * data structure, we still need interlocking so that standby backends will
- * not observe invalid intermediate states.  The convention is that backends
- * must hold shared ProcArrayLock to examine the array.  To remove XIDs from
- * the array, the startup process must hold ProcArrayLock exclusively, for
- * the usual transactional reasons (compare commit/abort of a transaction
- * during normal running).  Compressing unused entries out of the array
- * likewise requires exclusive lock.  To add XIDs to the array, we just insert
- * them into slots to the right of the head pointer and then advance the head
- * pointer.  This doesn't require any lock at all, but on machines with weak
- * memory ordering, we need to be careful that other processors see the array
- * element changes before they see the head pointer change.  We handle this by
- * using memory barriers when reading or writing the head/tail pointers (unless
- * the caller holds ProcArrayLock exclusively).
- *
- * Algorithmic analysis:
- *
- * If we have a maximum of M slots, with N XIDs currently spread across
- * S elements then we have N <= S <= M always.
- *
- *	* Adding a new XID is O(1) and needs no lock (unless compression must
- *		happen)
- *	* Compressing the array is O(S) and requires exclusive lock
- *	* Removing an XID is O(logS) and requires exclusive lock
- *	* Taking a snapshot is O(S) and requires shared lock
- *	* Checking for an XID is O(logS) and requires shared lock
- *
- * In comparison, using a hash table for KnownAssignedXids would mean that
- * taking snapshots would be O(M). If we can maintain S << M then the
- * sorted array technique will deliver significantly faster snapshots.
- * If we try to keep S too small then we will spend too much time compressing,
- * so there is an optimal point for any workload mix. We use a heuristic to
- * decide when to compress the array, though trimming also helps reduce
- * frequency of compressing. The heuristic requires us to track the number of
- * currently valid XIDs in the array (N).  Except in special cases, we'll
- * compress when S >= 2N.  Bounding S at 2N in turn bounds the time for
- * taking a snapshot to be O(N), which it would have to be anyway.
- */
-
-
-/*
- * Compress KnownAssignedXids by shifting valid data down to the start of the
- * array, removing any gaps.
- *
- * A compression step is forced if "reason" is KAX_NO_SPACE, otherwise
- * we do it only if a heuristic indicates it's a good time to do it.
- *
- * Compression requires holding ProcArrayLock in exclusive mode.
- * Caller must pass haveLock = true if it already holds the lock.
- */
-static void
-KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			head,
-				tail,
-				nelements;
-	int			compress_index;
-	int			i;
-
-	/* Counters for compression heuristics */
-	static unsigned int transactionEndsCounter;
-	static TimestampTz lastCompressTs;
-
-	/* Tuning constants */
-#define KAX_COMPRESS_FREQUENCY 128	/* in transactions */
-#define KAX_COMPRESS_IDLE_INTERVAL 1000 /* in ms */
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-	nelements = head - tail;
-
-	/*
-	 * If we can choose whether to compress, use a heuristic to avoid
-	 * compressing too often or not often enough.  "Compress" here simply
-	 * means moving the values to the beginning of the array, so it is not as
-	 * complex or costly as typical data compression algorithms.
-	 */
-	if (nelements == pArray->numKnownAssignedXids)
-	{
-		/*
-		 * When there are no gaps between head and tail, don't bother to
-		 * compress, except in the KAX_NO_SPACE case where we must compress to
-		 * create some space after the head.
-		 */
-		if (reason != KAX_NO_SPACE)
-			return;
-	}
-	else if (reason == KAX_TRANSACTION_END)
-	{
-		/*
-		 * Consider compressing only once every so many commits.  Frequency
-		 * determined by benchmarks.
-		 */
-		if ((transactionEndsCounter++) % KAX_COMPRESS_FREQUENCY != 0)
-			return;
-
-		/*
-		 * Furthermore, compress only if the used part of the array is less
-		 * than 50% full (see comments above).
-		 */
-		if (nelements < 2 * pArray->numKnownAssignedXids)
-			return;
-	}
-	else if (reason == KAX_STARTUP_PROCESS_IDLE)
-	{
-		/*
-		 * We're about to go idle for lack of new WAL, so we might as well
-		 * compress.  But not too often, to avoid ProcArray lock contention
-		 * with readers.
-		 */
-		if (lastCompressTs != 0)
-		{
-			TimestampTz compress_after;
-
-			compress_after = TimestampTzPlusMilliseconds(lastCompressTs,
-														 KAX_COMPRESS_IDLE_INTERVAL);
-			if (GetCurrentTimestamp() < compress_after)
-				return;
-		}
-	}
-
-	/* Need to compress, so get the lock if we don't have it. */
-	if (!haveLock)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * We compress the array by reading the valid values from tail to head,
-	 * re-aligning data to 0th element.
-	 */
-	compress_index = 0;
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			KnownAssignedXids[compress_index] = KnownAssignedXids[i];
-			KnownAssignedXidsValid[compress_index] = true;
-			compress_index++;
-		}
-	}
-	Assert(compress_index == pArray->numKnownAssignedXids);
-
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = compress_index;
-
-	if (!haveLock)
-		LWLockRelease(ProcArrayLock);
-
-	/* Update timestamp for maintenance.  No need to hold lock for this. */
-	lastCompressTs = GetCurrentTimestamp();
-}
-
-/*
- * Add xids into KnownAssignedXids at the head of the array.
- *
- * xids from from_xid to to_xid, inclusive, are added to the array.
- *
- * If exclusive_lock is true then caller already holds ProcArrayLock in
- * exclusive mode, so we need no extra locking here.  Else caller holds no
- * lock, so we need to be sure we maintain sufficient interlocks against
- * concurrent readers.  (Only the startup process ever calls this, so no need
- * to worry about concurrent writers.)
- */
-static void
-KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-					 bool exclusive_lock)
-{
-	ProcArrayStruct *pArray = procArray;
-	TransactionId next_xid;
-	int			head,
-				tail;
-	int			nxids;
-	int			i;
-
-	Assert(TransactionIdPrecedesOrEquals(from_xid, to_xid));
-
-	/*
-	 * Calculate how many array slots we'll need.  Normally this is cheap; in
-	 * the unusual case where the XIDs cross the wrap point, we do it the hard
-	 * way.
-	 */
-	if (to_xid >= from_xid)
-		nxids = to_xid - from_xid + 1;
-	else
-	{
-		nxids = 1;
-		next_xid = from_xid;
-		while (TransactionIdPrecedes(next_xid, to_xid))
-		{
-			nxids++;
-			TransactionIdAdvance(next_xid);
-		}
-	}
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-
-	Assert(head >= 0 && head <= pArray->maxKnownAssignedXids);
-	Assert(tail >= 0 && tail < pArray->maxKnownAssignedXids);
-
-	/*
-	 * Verify that insertions occur in TransactionId sequence.  Note that even
-	 * if the last existing element is marked invalid, it must still have a
-	 * correctly sequenced XID value.
-	 */
-	if (head > tail &&
-		TransactionIdFollowsOrEquals(KnownAssignedXids[head - 1], from_xid))
-	{
-		KnownAssignedXidsDisplay(LOG);
-		elog(ERROR, "out-of-order XID insertion in KnownAssignedXids");
-	}
-
-	/*
-	 * If our xids won't fit in the remaining space, compress out free space
-	 */
-	if (head + nxids > pArray->maxKnownAssignedXids)
-	{
-		KnownAssignedXidsCompress(KAX_NO_SPACE, exclusive_lock);
-
-		head = pArray->headKnownAssignedXids;
-		/* note: we no longer care about the tail pointer */
-
-		/*
-		 * If it still won't fit then we're out of memory
-		 */
-		if (head + nxids > pArray->maxKnownAssignedXids)
-			elog(ERROR, "too many KnownAssignedXids");
-	}
-
-	/* Now we can insert the xids into the space starting at head */
-	next_xid = from_xid;
-	for (i = 0; i < nxids; i++)
-	{
-		KnownAssignedXids[head] = next_xid;
-		KnownAssignedXidsValid[head] = true;
-		TransactionIdAdvance(next_xid);
-		head++;
-	}
-
-	/* Adjust count of number of valid entries */
-	pArray->numKnownAssignedXids += nxids;
-
-	/*
-	 * Now update the head pointer.  We use a write barrier to ensure that
-	 * other processors see the above array updates before they see the head
-	 * pointer change.  The barrier isn't required if we're holding
-	 * ProcArrayLock exclusively.
-	 */
-	if (!exclusive_lock)
-		pg_write_barrier();
-
-	pArray->headKnownAssignedXids = head;
-}
-
-/*
- * KnownAssignedXidsSearch
- *
- * Searches KnownAssignedXids for a specific xid and optionally removes it.
- * Returns true if it was found, false if not.
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- * Exclusive lock must be held for remove = true.
- */
-static bool
-KnownAssignedXidsSearch(TransactionId xid, bool remove)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			first,
-				last;
-	int			head;
-	int			tail;
-	int			result_index = -1;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	/*
-	 * Only the startup process removes entries, so we don't need the read
-	 * barrier in that case.
-	 */
-	if (!remove)
-		pg_read_barrier();		/* pairs with KnownAssignedXidsAdd */
-
-	/*
-	 * Standard binary search.  Note we can ignore the KnownAssignedXidsValid
-	 * array here, since even invalid entries will contain sorted XIDs.
-	 */
-	first = tail;
-	last = head - 1;
-	while (first <= last)
-	{
-		int			mid_index;
-		TransactionId mid_xid;
-
-		mid_index = (first + last) / 2;
-		mid_xid = KnownAssignedXids[mid_index];
-
-		if (xid == mid_xid)
-		{
-			result_index = mid_index;
-			break;
-		}
-		else if (TransactionIdPrecedes(xid, mid_xid))
-			last = mid_index - 1;
-		else
-			first = mid_index + 1;
-	}
-
-	if (result_index < 0)
-		return false;			/* not in array */
-
-	if (!KnownAssignedXidsValid[result_index])
-		return false;			/* in array, but invalid */
-
-	if (remove)
-	{
-		KnownAssignedXidsValid[result_index] = false;
-
-		pArray->numKnownAssignedXids--;
-		Assert(pArray->numKnownAssignedXids >= 0);
-
-		/*
-		 * If we're removing the tail element then advance tail pointer over
-		 * any invalid elements.  This will speed future searches.
-		 */
-		if (result_index == tail)
-		{
-			tail++;
-			while (tail < head && !KnownAssignedXidsValid[tail])
-				tail++;
-			if (tail >= head)
-			{
-				/* Array is empty, so we can reset both pointers */
-				pArray->headKnownAssignedXids = 0;
-				pArray->tailKnownAssignedXids = 0;
-			}
-			else
-			{
-				pArray->tailKnownAssignedXids = tail;
-			}
-		}
-	}
-
-	return true;
-}
-
-/*
- * Is the specified XID present in KnownAssignedXids[]?
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- */
-static bool
-KnownAssignedXidExists(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	return KnownAssignedXidsSearch(xid, false);
-}
-
-/*
- * Remove the specified XID from KnownAssignedXids[].
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemove(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	elog(DEBUG4, "remove KnownAssignedXid %u", xid);
-
-	/*
-	 * Note: we cannot consider it an error to remove an XID that's not
-	 * present.  We intentionally remove subxact IDs while processing
-	 * XLOG_XACT_ASSIGNMENT, to avoid array overflow.  Then those XIDs will be
-	 * removed again when the top-level xact commits or aborts.
-	 *
-	 * It might be possible to track such XIDs to distinguish this case from
-	 * actual errors, but it would be complicated and probably not worth it.
-	 * So, just ignore the search result.
-	 */
-	(void) KnownAssignedXidsSearch(xid, true);
-}
-
-/*
- * KnownAssignedXidsRemoveTree
- *		Remove xid (if it's not InvalidTransactionId) and all the subxids.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-							TransactionId *subxids)
-{
-	int			i;
-
-	if (TransactionIdIsValid(xid))
-		KnownAssignedXidsRemove(xid);
-
-	for (i = 0; i < nsubxids; i++)
-		KnownAssignedXidsRemove(subxids[i]);
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_TRANSACTION_END, true);
-}
-
-/*
- * Prune KnownAssignedXids up to, but *not* including xid. If xid is invalid
- * then clear the whole table.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemovePreceding(TransactionId removeXid)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			count = 0;
-	int			head,
-				tail,
-				i;
-
-	if (!TransactionIdIsValid(removeXid))
-	{
-		elog(DEBUG4, "removing all KnownAssignedXids");
-		pArray->numKnownAssignedXids = 0;
-		pArray->headKnownAssignedXids = pArray->tailKnownAssignedXids = 0;
-		return;
-	}
-
-	elog(DEBUG4, "prune KnownAssignedXids to %u", removeXid);
-
-	/*
-	 * Mark entries invalid starting at the tail.  Since array is sorted, we
-	 * can stop as soon as we reach an entry >= removeXid.
-	 */
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			if (TransactionIdFollowsOrEquals(knownXid, removeXid))
-				break;
-
-			if (!StandbyTransactionIdIsPrepared(knownXid))
-			{
-				KnownAssignedXidsValid[i] = false;
-				count++;
-			}
-		}
-	}
-
-	pArray->numKnownAssignedXids -= count;
-	Assert(pArray->numKnownAssignedXids >= 0);
-
-	/*
-	 * Advance the tail pointer if we've marked the tail item invalid.
-	 */
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-			break;
-	}
-	if (i >= head)
-	{
-		/* Array is empty, so we can reset both pointers */
-		pArray->headKnownAssignedXids = 0;
-		pArray->tailKnownAssignedXids = 0;
-	}
-	else
-	{
-		pArray->tailKnownAssignedXids = i;
-	}
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_PRUNE, true);
-}
-
-/*
- * KnownAssignedXidsGet - Get an array of xids by scanning KnownAssignedXids.
- * We filter out anything >= xmax.
- *
- * Returns the number of XIDs stored into xarray[].  Caller is responsible
- * that array is large enough.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax)
-{
-	TransactionId xtmp = InvalidTransactionId;
-
-	return KnownAssignedXidsGetAndSetXmin(xarray, &xtmp, xmax);
-}
-
-/*
- * KnownAssignedXidsGetAndSetXmin - as KnownAssignedXidsGet, plus
- * we reduce *xmin to the lowest xid value seen if not already lower.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
-							   TransactionId xmax)
-{
-	int			count = 0;
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop. We can stop
-	 * once we reach the initially seen head, since we are certain that an xid
-	 * cannot enter and then leave the array while we hold ProcArrayLock.  We
-	 * might miss newly-added xids, but they should be >= xmax so irrelevant
-	 * anyway.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			/*
-			 * Update xmin if required.  Only the first XID need be checked,
-			 * since the array is sorted.
-			 */
-			if (count == 0 &&
-				TransactionIdPrecedes(knownXid, *xmin))
-				*xmin = knownXid;
-
-			/*
-			 * Filter out anything >= xmax, again relying on sorted property
-			 * of array.
-			 */
-			if (TransactionIdIsValid(xmax) &&
-				TransactionIdFollowsOrEquals(knownXid, xmax))
-				break;
-
-			/* Add knownXid into output array */
-			xarray[count++] = knownXid;
-		}
-	}
-
-	return count;
-}
-
-/*
- * Get oldest XID in the KnownAssignedXids array, or InvalidTransactionId
- * if nothing there.
- */
-static TransactionId
-KnownAssignedXidsGetOldestXmin(void)
-{
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-			return KnownAssignedXids[i];
-	}
-
-	return InvalidTransactionId;
-}
-
-/*
- * Display KnownAssignedXids to provide debug trail
- *
- * Currently this is only called within startup process, so we need no
- * special locking.
- *
- * Note this is pretty expensive, and much of the expense will be incurred
- * even if the elog message will get discarded.  It's not currently called
- * in any performance-critical places, however, so no need to be tenser.
- */
-static void
-KnownAssignedXidsDisplay(int trace_level)
-{
-	ProcArrayStruct *pArray = procArray;
-	StringInfoData buf;
-	int			head,
-				tail,
-				i;
-	int			nxids = 0;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	initStringInfo(&buf);
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			nxids++;
-			appendStringInfo(&buf, "[%d]=%u ", i, KnownAssignedXids[i]);
-		}
-	}
-
-	elog(trace_level, "%d KnownAssignedXids (num=%d tail=%d head=%d) %s",
-		 nxids,
-		 pArray->numKnownAssignedXids,
-		 pArray->tailKnownAssignedXids,
-		 pArray->headKnownAssignedXids,
-		 buf.data);
-
-	pfree(buf.data);
-}
-
-/*
- * KnownAssignedXidsReset
- *		Resets KnownAssignedXids to be empty
- */
-static void
-KnownAssignedXidsReset(void)
-{
-	ProcArrayStruct *pArray = procArray;
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(lsn > TransamVariables->latestCommitLSN);
+	TransamVariables->latestCommitLSN = lsn;
 
-	pArray->numKnownAssignedXids = 0;
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = 0;
+	procArray->oldest_running_primary_xid = oldest_running_primary_xid;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 87b04e51b36..078ed53acda 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -139,8 +139,6 @@ InitRecoveryTransactionEnvironment(void)
 	vxid.procNumber = MyProcNumber;
 	vxid.localTransactionId = GetNextLocalTransactionId();
 	VirtualXactLockTableInsert(vxid);
-
-	standbyState = STANDBY_INITIALIZED;
 }
 
 /*
@@ -168,9 +166,6 @@ ShutdownRecoveryTransactionEnvironment(void)
 	if (RecoveryLockHash == NULL)
 		return;
 
-	/* Mark all tracked in-progress transactions as finished. */
-	ExpireAllKnownAssignedTransactionIds();
-
 	/* Release all locks the tracked transactions were holding */
 	StandbyReleaseAllLocks();
 
@@ -487,6 +482,19 @@ ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
 	Assert(TransactionIdIsNormal(snapshotConflictHorizon));
 	backends = GetConflictingVirtualXIDs(snapshotConflictHorizon,
 										 locator.dbOid);
+	{
+		StringInfoData buf;
+		VirtualTransactionId *vxid;
+
+		initStringInfo(&buf);
+		for (vxid = backends; vxid->procNumber != INVALID_PROC_NUMBER; vxid++)
+		{
+			appendStringInfo(&buf, " %d", GetPGProcByNumber(vxid->procNumber)->pid);
+		}
+
+		elog(LOG, "ResolveRecoveryConflictWithSnapshot called for xid %u isCatalogRel %d, conflicts with: [%s ]",
+		 snapshotConflictHorizon, isCatalogRel, buf.data);
+	}
 	ResolveRecoveryConflictWithVirtualXIDs(backends,
 										   PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
 										   WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
@@ -1164,7 +1172,7 @@ standby_redo(XLogReaderState *record)
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
 	/* Do nothing if we're not in hot standby mode */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 		return;
 
 	if (info == XLOG_STANDBY_LOCK)
@@ -1179,18 +1187,16 @@ standby_redo(XLogReaderState *record)
 	}
 	else if (info == XLOG_RUNNING_XACTS)
 	{
+		/*
+		 * XXX: running-xacts records were previously used to update
+		 * known-assigned xids, but now we only need them for the logical
+		 * replication snapbuilder stuff, and for the
+		 * pg_stat_report_stat(true) call below.
+		 */
 		xl_running_xacts *xlrec = (xl_running_xacts *) XLogRecGetData(record);
-		RunningTransactionsData running;
-
-		running.xcnt = xlrec->xcnt;
-		running.subxcnt = xlrec->subxcnt;
-		running.subxid_overflow = xlrec->subxid_overflow;
-		running.nextXid = xlrec->nextXid;
-		running.latestCompletedXid = xlrec->latestCompletedXid;
-		running.oldestRunningXid = xlrec->oldestRunningXid;
-		running.xids = xlrec->xids;
 
-		ProcArrayApplyRecoveryInfo(&running);
+		/* not strictly required, but update oldestRunningXid because we can */
+		ProcArrayUpdateOldestRunningXid(xlrec->oldestRunningXid);
 
 		/*
 		 * The startup process currently has no convenient way to schedule
@@ -1280,6 +1286,10 @@ standby_redo(XLogReaderState *record)
  *
  *
  * Returns the RecPtr of the last inserted record.
+ *
+ * XXX: We only need the running-xacts record for logical replication
+ * snapshot builder stuff now. If we stop emitting it here, will the callers
+ * be happy if we return InvalidXLogRecPtr?
  */
 XLogRecPtr
 LogStandbySnapshot(void)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index b1e388dc7c9..d3de6babe67 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -133,6 +133,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_XACT_BUFFER] = "XactBuffer",
 	[LWTRANCHE_COMMITTS_BUFFER] = "CommitTsBuffer",
 	[LWTRANCHE_SUBTRANS_BUFFER] = "SubtransBuffer",
+	[LWTRANCHE_CSN_LOG_BUFFER] = "CsnLogBuffer",
 	[LWTRANCHE_MULTIXACTOFFSET_BUFFER] = "MultiXactOffsetBuffer",
 	[LWTRANCHE_MULTIXACTMEMBER_BUFFER] = "MultiXactMemberBuffer",
 	[LWTRANCHE_NOTIFY_BUFFER] = "NotifyBuffer",
@@ -169,6 +170,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
 	[LWTRANCHE_XACT_SLRU] = "XactSLRU",
 	[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+	[LWTRANCHE_CSN_LOG_SLRU] = "CsnLogSLRU",
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0d288d6b3d8..5822af41eaa 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -342,6 +342,7 @@ SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> s
 XactBuffer	"Waiting for I/O on a transaction status SLRU buffer."
 CommitTsBuffer	"Waiting for I/O on a commit timestamp SLRU buffer."
 SubtransBuffer	"Waiting for I/O on a sub-transaction SLRU buffer."
+CsnLogBuffer	"Waiting for I/O on a CSN log SLRU buffer."
 MultiXactOffsetBuffer	"Waiting for I/O on a multixact offset SLRU buffer."
 MultiXactMemberBuffer	"Waiting for I/O on a multixact member SLRU buffer."
 NotifyBuffer	"Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index d7725443774..ffbfae84b80 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -77,6 +77,8 @@ provider postgresql {
 	probe clog__checkpoint__done(bool);
 	probe subtrans__checkpoint__start(bool);
 	probe subtrans__checkpoint__done(bool);
+	probe csnlog__checkpoint__start(bool);
+	probe csnlog__checkpoint__done(bool);
 	probe multixact__checkpoint__start(bool);
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f20..358665b6dd9 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -48,6 +48,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -201,6 +202,7 @@ typedef struct SerializedSnapshotData
 	CommandId	curcid;
 	TimestampTz whenTaken;
 	XLogRecPtr	lsn;
+	XLogRecPtr	snapshotCsn;
 } SerializedSnapshotData;
 
 /*
@@ -1729,6 +1731,7 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
 	serialized_snapshot.curcid = snapshot->curcid;
 	serialized_snapshot.whenTaken = snapshot->whenTaken;
 	serialized_snapshot.lsn = snapshot->lsn;
+	serialized_snapshot.snapshotCsn = snapshot->snapshotCsn;
 
 	/*
 	 * Ignore the SubXID array if it has overflowed, unless the snapshot was
@@ -1803,6 +1806,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->curcid = serialized_snapshot.curcid;
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
+	snapshot->snapshotCsn = serialized_snapshot.snapshotCsn;
 	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
@@ -1913,36 +1917,11 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 	}
 	else
 	{
-		/*
-		 * In recovery we store all xids in the subxip array because it is by
-		 * far the bigger array, and we mostly don't know which xids are
-		 * top-level and which are subxacts. The xip array is empty.
-		 *
-		 * We start by searching subtrans, if we overflowed.
-		 */
-		if (snapshot->suboverflowed)
-		{
-			/*
-			 * Snapshot overflowed, so convert xid to top-level.  This is safe
-			 * because we eliminated too-old XIDs above.
-			 */
-			xid = SubTransGetTopmostTransaction(xid);
+		XLogRecPtr csn = CSNLogGetCSNByXid(xid);
 
-			/*
-			 * If xid was indeed a subxact, we might now have an xid < xmin,
-			 * so recheck to avoid an array scan.  No point in rechecking
-			 * xmax.
-			 */
-			if (TransactionIdPrecedes(xid, snapshot->xmin))
-				return false;
-		}
-
-		/*
-		 * We now have either a top-level xid higher than xmin or an
-		 * indeterminate xid. We don't know whether it's top level or subxact
-		 * but it doesn't matter. If it's present, the xid is visible.
-		 */
-		if (pg_lfind32(xid, snapshot->subxip, snapshot->subxcnt))
+		if (csn != InvalidXLogRecPtr && csn <= snapshot->snapshotCsn)
+			return false;
+		else
 			return true;
 	}
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 30e17bd1d1e..0c62d9791da 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -249,7 +249,8 @@ static const char *const subdirs[] = {
 	"pg_xact",
 	"pg_logical",
 	"pg_logical/snapshots",
-	"pg_logical/mappings"
+	"pg_logical/mappings",
+	"pg_csn"
 };
 
 
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
new file mode 100644
index 00000000000..95913e63c90
--- /dev/null
+++ b/src/include/access/csn_log.h
@@ -0,0 +1,30 @@
+/*
+ * csn_log.h
+ *
+ * Commit-Sequence-Number log.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_log.h
+ */
+#ifndef CSNLOG_H
+#define CSNLOG_H
+
+#include "access/xlog.h"
+#include "utils/snapshot.h"
+
+extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
+							   TransactionId *subxids, XLogRecPtr csn);
+extern XLogRecPtr CSNLogGetCSNByXid(TransactionId xid);
+
+extern Size CSNLogShmemSize(void);
+extern void CSNLogShmemInit(void);
+extern void BootStrapCSNLog(void);
+extern void StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn);
+extern void ShutdownCSNLog(void);
+extern void CheckPointCSNLog(void);
+extern void ExtendCSNLog(TransactionId newestXact);
+extern void TruncateCSNLog(TransactionId oldestXact);
+
+#endif   /* CSNLOG_H */
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 28a2d287fd5..d216ed18282 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -238,6 +238,9 @@ typedef struct TransamVariablesData
 	FullTransactionId latestCompletedXid;	/* newest full XID that has
 											 * committed or aborted */
 
+	/* During recovery, LSN of latest replayed commit record */
+	XLogRecPtr latestCommitLSN;
+
 	/*
 	 * Number of top-level transactions with xids (i.e. which may have
 	 * modified the database) that completed in some form since the start of
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 56248c00063..44f7d6fe965 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -47,8 +47,7 @@ extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
 
-extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
-												 int *nxids_p);
+extern TransactionId PrescanPreparedTransactions(void);
 extern void StandbyRecoverPreparedTransactions(void);
 extern void RecoverPreparedTransactions(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6d4439f0524..df0af5ea209 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -171,7 +171,7 @@ typedef struct SavedTransactionCharacteristics
 #define XLOG_XACT_ABORT				0x20
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
-#define XLOG_XACT_ASSIGNMENT		0x50
+/* 0x50 is unused, was XLOG_XACT_ASSIGNMENT */
 #define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
@@ -215,15 +215,6 @@ typedef struct SavedTransactionCharacteristics
 #define XactCompletionForceSyncCommit(xinfo) \
 	((xinfo & XACT_COMPLETION_FORCE_SYNC_COMMIT) != 0)
 
-typedef struct xl_xact_assignment
-{
-	TransactionId xtop;			/* assigned XID's top-level XID */
-	int			nsubxacts;		/* number of subtransaction XIDs */
-	TransactionId xsub[FLEXIBLE_ARRAY_MEMBER];	/* assigned subxids */
-} xl_xact_assignment;
-
-#define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
-
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -442,7 +433,6 @@ extern FullTransactionId GetTopFullTransactionId(void);
 extern FullTransactionId GetTopFullTransactionIdIfAny(void);
 extern FullTransactionId GetCurrentFullTransactionId(void);
 extern FullTransactionId GetCurrentFullTransactionIdIfAny(void);
-extern void MarkCurrentTransactionIdLoggedIfAny(void);
 extern bool SubTransactionIsActive(SubTransactionId subxid);
 extern CommandId GetCurrentCommandId(bool used);
 extern void SetParallelStartTimestamps(TimestampTz xact_ts, TimestampTz stmt_ts);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index e24613e8f81..7848cf7d130 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -24,37 +24,10 @@
 extern PGDLLIMPORT bool InRecovery;
 
 /*
- * Like InRecovery, standbyState is only valid in the startup process.
- * In all other processes it will have the value STANDBY_DISABLED (so
- * InHotStandby will read as false).
- *
- * In DISABLED state, we're performing crash recovery or hot standby was
- * disabled in postgresql.conf.
- *
- * In INITIALIZED state, we've run InitRecoveryTransactionEnvironment, but
- * we haven't yet processed a RUNNING_XACTS or shutdown-checkpoint WAL record
- * to initialize our primary-transaction tracking system.
- *
- * When the transaction tracking is initialized, we enter the SNAPSHOT_PENDING
- * state. The tracked information might still be incomplete, so we can't allow
- * connections yet, but redo functions must update the in-memory state when
- * appropriate.
- *
- * In SNAPSHOT_READY mode, we have full knowledge of transactions that are
- * (or were) running on the primary at the current WAL location. Snapshots
- * can be taken, and read-only queries can be run.
+ * Like InRecovery, InHotStandby is only valid in the startup process.
+ * In all other processes it will be false.
  */
-typedef enum
-{
-	STANDBY_DISABLED,
-	STANDBY_INITIALIZED,
-	STANDBY_SNAPSHOT_PENDING,
-	STANDBY_SNAPSHOT_READY,
-} HotStandbyState;
-
-extern PGDLLIMPORT HotStandbyState standbyState;
-
-#define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)
+extern PGDLLIMPORT bool InHotStandby;
 
 
 extern bool XLogHaveInvalidPages(void);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e09..c2156aca12d 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -179,6 +179,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_XACT_BUFFER = NUM_INDIVIDUAL_LWLOCKS,
 	LWTRANCHE_COMMITTS_BUFFER,
 	LWTRANCHE_SUBTRANS_BUFFER,
+	LWTRANCHE_CSN_LOG_BUFFER,
 	LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 	LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 	LWTRANCHE_NOTIFY_BUFFER,
@@ -215,6 +216,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_SUBTRANS_SLRU,
 	LWTRANCHE_XACT_SLRU,
 	LWTRANCHE_PARALLEL_VACUUM_DSA,
+	LWTRANCHE_CSN_LOG_SLRU,
 	LWTRANCHE_FIRST_USER_DEFINED,
 }			BuiltinTrancheIds;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 8ca60504622..95f2f39e192 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -28,18 +28,11 @@ extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayClearTransaction(PGPROC *proc);
 
+extern void ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID);
 extern void ProcArrayInitRecovery(TransactionId initializedUptoXID);
-extern void ProcArrayApplyRecoveryInfo(RunningTransactions running);
-extern void ProcArrayApplyXidAssignment(TransactionId topxid,
-										int nsubxids, TransactionId *subxids);
 
 extern void RecordKnownAssignedTransactionIds(TransactionId xid);
-extern void ExpireTreeKnownAssignedTransactionIds(TransactionId xid,
-												  int nsubxids, TransactionId *subxids,
-												  TransactionId max_xid);
-extern void ExpireAllKnownAssignedTransactionIds(void);
-extern void ExpireOldKnownAssignedTransactionIds(TransactionId xid);
-extern void KnownAssignedTransactionIdsIdleMaintenance(void);
+extern void ExpireTreeKnownAssignedTransactionIds(TransactionId max_xid, XLogRecPtr lsn);
 
 extern int	GetMaxSnapshotXidCount(void);
 extern int	GetMaxSnapshotSubxidCount(void);
@@ -56,7 +49,7 @@ extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
 extern TransactionId GetOldestNonRemovableTransactionId(Relation rel);
 extern TransactionId GetOldestTransactionIdConsideredRunning(void);
-extern TransactionId GetOldestActiveTransactionId(void);
+extern TransactionId GetOldestActiveTransactionId(bool allDbs);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *xmin, TransactionId *catalog_xmin);
 
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 0fc0804e266..aee7f5592cd 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -47,19 +47,6 @@ extern void LogRecoveryConflict(ProcSignalReason reason, TimestampTz wait_start,
 								TimestampTz now, VirtualTransactionId *wait_list,
 								bool still_waiting);
 
-/*
- * Standby Rmgr (RM_STANDBY_ID)
- *
- * Standby recovery manager exists to perform actions that are required
- * to make hot standby work. That includes logging AccessExclusiveLocks taken
- * by transactions and running-xacts snapshots.
- */
-extern void StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid);
-extern void StandbyReleaseLockTree(TransactionId xid,
-								   int nsubxids, TransactionId *subxids);
-extern void StandbyReleaseAllLocks(void);
-extern void StandbyReleaseOldLocks(TransactionId oldxid);
-
 #define MinSizeOfXactRunningXacts offsetof(xl_running_xacts, xids)
 
 
@@ -91,6 +78,19 @@ typedef struct RunningTransactionsData
 
 typedef RunningTransactionsData *RunningTransactions;
 
+/*
+ * Standby Rmgr (RM_STANDBY_ID)
+ *
+ * Standby recovery manager exists to perform actions that are required
+ * to make hot standby work. That includes logging AccessExclusiveLocks taken
+ * by transactions and running-xacts snapshots.
+ */
+extern void StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid);
+extern void StandbyReleaseLockTree(TransactionId xid,
+								   int nsubxids, TransactionId *subxids);
+extern void StandbyReleaseAllLocks(void);
+extern void StandbyReleaseOldLocks(TransactionId oldxid);
+
 extern void LogAccessExclusiveLock(Oid dbOid, Oid relOid);
 extern void LogAccessExclusiveLockPrepare(void);
 
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 8d1e31e888e..1fda5b06f67 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -181,6 +181,13 @@ typedef struct SnapshotData
 	int32		subxcnt;		/* # of xact ids in subxip[] */
 	bool		suboverflowed;	/* has the subxip array overflowed? */
 
+	/*
+	 * MVCC snapshots taken during recovery use this CSN instead of the xip
+	 * and subxip arrays. Any transactions that committed at or before this
+	 * LSN are considered as visible.
+	 */
+	XLogRecPtr	snapshotCsn;
+
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 	bool		copied;			/* false if it's a static snapshot */
 
-- 
2.39.2

Kirill Reshke

reshkekirill@gmail.com

almost 2 years ago

In reply to: Heikki Linnakangas (#1)

1 attachment(s)

Re: CSN snapshots in hot standby

Hi,

On Thu, 4 Apr 2024 at 22:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

You cannot run queries on a Hot Standby server until the standby has
seen a running-xacts record. Furthermore if the subxids cache had
overflowed, you also need to wait for those transactions to finish. That
is usually not a problem, because we write a running-xacts record after
each checkpoint, and most systems don't use so many subtransactions that
the cache would overflow. Still, you can run into it if you're unlucky,
and it's annoying when you do.

It occurred to me that we could replace the known-assigned-xids
machinery with CSN snapshots. We've talked about CSN snapshots many
times in the past, and I think it would make sense on the primary too,
but for starters, we could use it just during Hot Standby.

With CSN-based snapshots, you don't have the limitation with the
fixed-size known-assigned-xids array, and overflowed sub-XIDs are not a
problem either. You can always enter Hot Standby and start accepting
queries as soon as the standby is in a physically consistent state.

I dusted up and rebased the last CSN patch that I found on the mailing
list [1], and modified it so that it's only used during recovery. That
makes some things simpler and less scary. There are no changes to how
transaction commit happens in the primary, the CSN log is only kept
up-to-date in the standby, when commit/abort records are replayed. The
CSN of each transaction is the LSN of its commit record.

The CSN approach is much simpler than the existing known-assigned-XIDs
machinery, as you can see from "git diff --stat" with this patch:

32 files changed, 773 insertions(+), 1711 deletions(-)

With CSN snapshots, we don't need the known-assigned-XIDs machinery, and
we can get rid of the xact-assignment records altogether. We no longer
need the running-xacts records for Hot Standby either, but I wasn't able
to remove that because it's still used by logical replication, in
snapbuild.c. I have a feeling that that could somehow be simplified too,
but didn't look into it.

This is obviously v18 material, so I'll park this at the July commitfest
for now. There are a bunch of little FIXMEs in the code, and needs
performance testing, but overall I was surprised how easy this was.

(We ran into this issue particularly hard with Neon, because with Neon
you don't need to perform WAL replay at standby startup. However, when
you don't perform WAL replay, you don't get to see the running-xact
record after the checkpoint either. If the primary is idle, it doesn't
generate new running-xact records, and the standby cannot start Hot
Standby until the next time something happens in the primary. It's
always a potential problem with overflowed sub-XIDs cache, but the lack
of WAL replay made it happen even when there are no subtransactions
involved.)

[1]
/messages/by-id/2020081009525213277261@highgo.ca

--
Heikki Linnakangas
Neon (https://neon.tech)

Great. I really like the idea of removing KnownAssignedXids instead of
optimizing it (if optimizations are even possible).

+ /*
+ * TODO: We must mark CSNLOG first
+ */
+ CSNLogSetCSN(xid, parsed->nsubxacts, parsed->subxacts, lsn);
+

As far as I understand, we simply use the LSN of the current WAL record as
the CSN of the XID. OK.
This seems to work for standby snapshots, but this patch could also be really
useful for distributed PostgreSQL solutions that use CSNs to work
with a distributed database snapshot (across multiple shards). These
solutions need to set the CSN to some other value (a timestamp from
TrueTime/ClockSI or whatever).
So, maybe we need some hooks here? Or maybe we can take the CSN here from
an extension somehow. For example, we can define
some interface and extend it. Does this sound reasonable to you?
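
To make the question concrete, here is a minimal standalone sketch of what
such a hook could look like. Nothing in it comes from the patch:
csn_provider_hook, assign_csn() and clock_csn_provider() are hypothetical
names, the typedefs are stand-ins so that it compiles on its own, and the
"clock" is just a counter standing in for a real time source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL typedefs, so the sketch compiles on its own. */
typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;

/*
 * Hypothetical hook type: given the XID being committed and the LSN of its
 * commit record, return the CSN to store for it.
 */
typedef uint64_t (*csn_provider_hook_type) (TransactionId xid,
											XLogRecPtr commit_lsn);

/* Hypothetical hook variable; NULL means no extension is installed. */
static csn_provider_hook_type csn_provider_hook = NULL;

/* Pick the CSN for a commit: ask the extension if present, else use the LSN. */
static uint64_t
assign_csn(TransactionId xid, XLogRecPtr commit_lsn)
{
	assert(xid != 0);			/* 0 is InvalidTransactionId */

	if (csn_provider_hook != NULL)
		return csn_provider_hook(xid, commit_lsn);
	return commit_lsn;
}

/* Example provider: a fake global clock, standing in for TrueTime/ClockSI. */
static uint64_t fake_clock = 1000;

static uint64_t
clock_csn_provider(TransactionId xid, XLogRecPtr commit_lsn)
{
	(void) xid;
	(void) commit_lsn;
	return ++fake_clock;
}
```

With no provider installed, assign_csn() falls back to the commit LSN, which
is the behaviour of the patch as posted. An extension would set
csn_provider_hook to its own function to override that.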

Also, I attached a patch which adds some more todos.

Attachments:

v1-0001-Point-comments-needed-to-be-edited.patch (application/octet-stream)

From 4a361053b5947fa209fadd9d95cc9213e4052e31 Mon Sep 17 00:00:00 2001
From: reshke <reshke@double.cloud>
Date: Thu, 4 Apr 2024 20:50:54 +0000
Subject: [PATCH v1] Point comments needed to be edited

---
 contrib/pg_visibility/pg_visibility.c | 1 +
 src/backend/access/transam/xact.c     | 1 +
 src/backend/storage/ipc/standby.c     | 1 +
 3 files changed, 3 insertions(+)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index cb0c49c7a4..c5f46a3fc7 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -548,6 +548,7 @@ collect_visibility_data(Oid relid, bool include_pd)
  *    databases that were ignored before.
  * 2. Ignore KnownAssignedXids, because they are not database-aware. At the
  *    same time, the primary could compute its horizons database-aware.
+ * XXX: KnownAssignedXids is gone so the above comment needs updating.
  * 3. Ignore walsender xmin, because it could go backward if some replication
  *    connections don't use replication slots.
  *
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5456954602..c81b9362e2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1378,6 +1378,7 @@ RecordTransactionCommit(void)
 	 * KnownAssignedXids machinery requires tracking every XID assignment.  It
 	 * might be OK to skip it only when wal_level < replica, but for now we
 	 * don't.)
+	 * XXX: KnownAssignedXids is gone so the above comment needs updating.
 	 *
 	 * However, if we're doing cleanup of any non-temp rels or committing any
 	 * command that wanted to force sync commit, then we must flush XLOG
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 078ed53acd..a5d8a846cd 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1267,6 +1267,7 @@ standby_redo(XLogReaderState *record)
  * an entry are expected and must not cause an error when we are in state
  * STANDBY_INITIALIZED. This is implemented in StandbyReleaseLocks() and
  * KnownAssignedXidsRemove().
+ * XXX: KnownAssignedXids is gone so the above comment needs updating.
  *
  * Later, when we apply the running xact data we must be careful to ignore
  * transactions already committed, since those commits raced ahead when
-- 
2.25.1

Andrey M. Borodin

x4mmm@yandex-team.ru

almost 2 years ago

In reply to: Kirill Reshke (#2)

Re: CSN snapshots in hot standby

On 5 Apr 2024, at 02:08, Kirill Reshke <reshkekirill@gmail.com> wrote:

maybe we need some hooks here? Or maybe, we can take CSN here from extension somehow.

I really like the idea of CSN-provider-as-extension.
But it's very important to move on with CSN, at least on standby, to make CSN actually happen some day.
So, from my perspective, having LSN-as-CSN is already huge step forward.

Best regards, Andrey Borodin.

Heikki Linnakangas

hlinnaka@iki.fi

over 1 year ago

In reply to: Andrey M. Borodin (#3)

6 attachment(s)

Re: CSN snapshots in hot standby

On 05/04/2024 13:49, Andrey M. Borodin wrote:

On 5 Apr 2024, at 02:08, Kirill Reshke <reshkekirill@gmail.com> wrote:

Thanks for taking a look, Kirill!

maybe we need some hooks here? Or maybe, we can take CSN here from extension somehow.

I really like the idea of CSN-provider-as-extension.
But it's very important to move on with CSN, at least on standby, to make CSN actually happen some day.
So, from my perspective, having LSN-as-CSN is already huge step forward.

Yeah, I really don't want to expand the scope of this.

Here's a new version. Rebased, and lots of comments updated.

I added a tiny cache of the CSN lookups into SnapshotData, which can
hold the values of 4 XIDs that are known to be visible to the snapshot,
and 4 invisible XIDs. This is pretty arbitrary, but the idea is to have
something very small to speed up the common case where the same 1-2 XIDs
are looked up repeatedly, without adding too much overhead.
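
As a rough illustration of the shape of such a cache (this is not the code
from the patch; the struct, enum and function names are made up for the
example, and the round-robin replacement is an assumption of the sketch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

#define CSN_CACHE_SIZE 4

/* Hypothetical per-snapshot cache: 4 known-visible, 4 known-invisible XIDs. */
typedef struct CsnLookupCache
{
	TransactionId visible[CSN_CACHE_SIZE];
	TransactionId invisible[CSN_CACHE_SIZE];
	int			next_visible;	/* next slot to overwrite, round-robin */
	int			next_invisible;
} CsnLookupCache;

typedef enum
{
	CACHE_MISS,
	CACHE_VISIBLE,
	CACHE_INVISIBLE
} CacheResult;

/* Look up an XID; 0 (InvalidTransactionId) marks an empty slot. */
static CacheResult
csn_cache_lookup(const CsnLookupCache *cache, TransactionId xid)
{
	assert(xid != 0);
	for (int i = 0; i < CSN_CACHE_SIZE; i++)
	{
		if (cache->visible[i] == xid)
			return CACHE_VISIBLE;
		if (cache->invisible[i] == xid)
			return CACHE_INVISIBLE;
	}
	return CACHE_MISS;
}

/* Remember the outcome of a full CSN lookup, overwriting the oldest entry. */
static void
csn_cache_remember(CsnLookupCache *cache, TransactionId xid, bool is_visible)
{
	assert(xid != 0);
	if (is_visible)
	{
		cache->visible[cache->next_visible] = xid;
		cache->next_visible = (cache->next_visible + 1) % CSN_CACHE_SIZE;
	}
	else
	{
		cache->invisible[cache->next_invisible] = xid;
		cache->next_invisible = (cache->next_invisible + 1) % CSN_CACHE_SIZE;
	}
}
```

A visibility check would call csn_cache_lookup() first and fall back to the
CSN log only on CACHE_MISS, then record the answer with csn_cache_remember().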

I did some performance testing of the visibility checks using these CSN
snapshots. The tests run SELECTs with a SeqScan in a standby, over a
table where all the rows have xmin/xmax values that are still
in-progress in the primary.

Three test scenarios:

1. large-xact: one large transaction inserted all the rows. All rows
have the same XMIN, which is still in progress

2. many-subxacts: one large transaction inserted each row in a separate
subtransaction. All rows have a different XMIN, but they're all
subtransactions of the same top-level transaction. (This causes the
subxids cache in the proc array to overflow)

3. few-subxacts: All rows are inserted, committed, and vacuum frozen.
Then, using 10 separate subtransactions, DELETE the rows, in an
interleaved fashion. The XMAX values cycle like this "1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 1, 2, 3, 4, 5, ...". The point of this is that these
sub-XIDs fit in the subxids cache in the procarray, but the pattern
defeats the simple 4-element cache that I added.
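
A toy model shows why the interleaved pattern in scenario 3 defeats a
4-element cache. It assumes that entries are replaced in round-robin order,
which is an assumption of this sketch and not something taken from the
patch; count_cache_hits() is likewise an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_CACHE_SIZE 4

/*
 * Count cache hits when scanning nrows rows whose XMAX cycles through
 * ndistinct different XIDs (1, 2, ..., ndistinct, 1, 2, ...), using a
 * TOY_CACHE_SIZE-entry cache with round-robin replacement.
 */
static int
count_cache_hits(int nrows, int ndistinct)
{
	uint32_t	cache[TOY_CACHE_SIZE] = {0};	/* 0 = empty slot */
	int			next = 0;
	int			hits = 0;

	assert(nrows >= 0 && ndistinct > 0);

	for (int row = 0; row < nrows; row++)
	{
		uint32_t	xid = (uint32_t) (row % ndistinct) + 1;
		int			found = 0;

		for (int i = 0; i < TOY_CACHE_SIZE; i++)
		{
			if (cache[i] == xid)
			{
				found = 1;
				break;
			}
		}

		if (found)
			hits++;
		else
		{
			/* miss: this is where the full CSN lookup would happen */
			cache[next] = xid;
			next = (next + 1) % TOY_CACHE_SIZE;
		}
	}
	return hits;
}
```

In this model, count_cache_hits(100000, 10) returns 0, because each XID has
been evicted by the time it comes around again, while
count_cache_hits(100000, 2) returns 99998.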

The test script I used is attached. I repeated it a few times with
master and the patches here, and picked the fastest runs for each. Just
eyeballing the results, there's about 10% variance in these numbers.
Times are elapsed seconds for 5000 repetitions of the query; smaller is
better.

Master:

large-xact: 4.57732510566711
many-subxacts: 18.6958119869232
few-subxacts: 16.467698097229

Patched:

large-xact: 10.2999930381775
many-subxacts: 11.6501438617706
few-subxacts: 19.8457028865814

With cache:

large-xact: 3.68792295455933
many-subxacts: 13.3662350177765
few-subxacts: 21.4426419734955

The 'large-xact' results show that the CSN lookups are slower than the
binary search on the 'xids' array. Not a surprise. The 4-element cache
fixes the regression, which is also not a surprise.

The 'many-subxacts' results show that the CSN lookups are faster than
the current method in master, when the subxids cache has overflowed.
That makes sense: on master, we always perform a lookup in pg_subtrans
if the subxids cache has overflowed, which is more or less the same
overhead as the CSN lookup. But we avoid the binary search on the xids
array after that.
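
For reference, the check that replaces the pg_subtrans lookup plus binary
search boils down to one lookup and one comparison: an XID is visible if a
commit LSN has been recorded for it and that LSN is at or before the
snapshot's CSN. A minimal sketch, with a plain array standing in for the CSN
log; toy_csn_log, toy_set_csn() and xid_visible_in_snapshot() are
illustrative names rather than functions from the patch, and treating 0 as
"no commit replayed" is an assumption of the sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL typedefs. */
typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;

#define TOY_LOG_SIZE 1024

/* Toy CSN log: commit LSN per XID; 0 means no commit has been replayed. */
static XLogRecPtr toy_csn_log[TOY_LOG_SIZE];

/* Called when a commit record at commit_lsn is replayed. */
static void
toy_set_csn(TransactionId xid, XLogRecPtr commit_lsn)
{
	assert(xid < TOY_LOG_SIZE && commit_lsn != 0);
	toy_csn_log[xid] = commit_lsn;
}

/* Visible iff the XID committed at or before the snapshot's CSN. */
static bool
xid_visible_in_snapshot(TransactionId xid, XLogRecPtr snapshot_csn)
{
	XLogRecPtr	csn;

	assert(xid < TOY_LOG_SIZE);
	csn = toy_csn_log[xid];
	return csn != 0 && csn <= snapshot_csn;
}
```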

The 'few-subxacts' results show a regression, because the 4-element cache
is not effective there. I think that's acceptable, the CSN approach has many
benefits, and I don't think this is a very common scenario. But if
necessary, it could perhaps be alleviated with more caching, or by
trying to compensate by optimizing elsewhere.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v2-0001-Update-outdated-comment-on-WAL-logged-locks-with-.patch (text/x-patch; charset=UTF-8)

From 4bc7a2b3c9b7437871a22caa8f5ee8548face4dd Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 20:47:42 +0300
Subject: [PATCH v2 1/6] Update outdated comment on WAL-logged locks with
 invalid XID

We haven't generated those for a long time.
---
 src/backend/storage/ipc/standby.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 872679ca44..25267f0f85 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1121,6 +1121,9 @@ StandbyReleaseAllLocks(void)
  * StandbyReleaseOldLocks
  *		Release standby locks held by top-level XIDs that aren't running,
  *		as long as they're not prepared transactions.
+ *
+ * This is needed to prune the locks of crashed transactions, which didn't
+ * write an ABORT/COMMIT record.
  */
 void
 StandbyReleaseOldLocks(TransactionId oldxid)
@@ -1266,13 +1269,6 @@ standby_redo(XLogReaderState *record)
  * transactions already committed, since those commits raced ahead when
  * making WAL entries.
  *
- * The loose timing also means that locks may be recorded that have a
- * zero xid, since xids are removed from procs before locks are removed.
- * So we must prune the lock list down to ensure we hold locks only for
- * currently running xids, performed by StandbyReleaseOldLocks().
- * Zero xids should no longer be possible, but we may be replaying WAL
- * from a time when they were possible.
- *
  * For logical decoding only the running xacts information is needed;
  * there's no need to look at the locking information, but it's logged anyway,
  * as there's no independent knob to just enable logical decoding. For
-- 
2.39.2

v2-0002-XXX-add-perf-test.patch (text/x-patch; charset=UTF-8)

From f2478ab2dfe7ff45b8fb01ba016d4d0cfa8a909e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 13:27:03 +0300
Subject: [PATCH v2 2/6] XXX: add perf test

This is not intended to be merged. But it's been useful to have this
in the tree for some quick perf testing during development.

To run it, I've used:

(cd build-release && ninja &&  rm -rf tmp_install && meson test --suite setup --suite test_misc; grep TEST testrun/test_misc/000_csn_perf/log/regress_log_000_csn_perf )

It runs the other test_misc tests concurrently, but they finish a lot
faster so they don't affect the results much.
---
 src/test/modules/test_misc/meson.build       |   1 +
 src/test/modules/test_misc/t/000_csn_perf.pl | 139 +++++++++++++++++++
 2 files changed, 140 insertions(+)
 create mode 100644 src/test/modules/test_misc/t/000_csn_perf.pl

diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 283ffa751a..e55e80af54 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -9,6 +9,7 @@ tests += {
        'enable_injection_points': get_option('injection_points') ? 'yes' : 'no',
     },
     'tests': [
+      't/000_csn_perf.pl',
       't/001_constraint_validation.pl',
       't/002_tablespace.pl',
       't/003_check_guc.pl',
diff --git a/src/test/modules/test_misc/t/000_csn_perf.pl b/src/test/modules/test_misc/t/000_csn_perf.pl
new file mode 100644
index 0000000000..4ad7d7e5eb
--- /dev/null
+++ b/src/test/modules/test_misc/t/000_csn_perf.pl
@@ -0,0 +1,139 @@
+
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Measure the performance of visibility checks on a hot standby
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(time);
+
+# Initialize a test cluster
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init();
+# Enable streaming replication so that a standby can be attached
+$primary->append_conf('postgresql.conf', 'max_wal_senders = 5');
+$primary->append_conf('postgresql.conf', 'wal_level=replica');
+$primary->start;
+$primary->backup('bkp');
+
+my $replica = PostgreSQL::Test::Cluster->new('replica');
+$replica->init_from_backup($primary, 'bkp', has_streaming => 1);
+$replica->append_conf('postgresql.conf', "shared_buffers='1 GB'");
+$replica->start;
+
+sub wait_catchup
+{
+	my ($primary, $replica) = @_;
+	
+	my $primary_lsn =
+	  $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+	my $caughtup_query =
+	  "SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()";
+	$replica->poll_query_until('postgres', $caughtup_query)
+	  or die "Timed out while waiting for standby to catch up";
+}
+
+sub repeat_and_time_sql
+{
+  	my ($name, $node, $repeats, $sql) = @_;
+
+    my $begin_time = time();
+
+	local $ENV{PGOPTIONS} = "-c max_parallel_workers_per_gather=0";
+	$node->pgbench(
+		"--no-vacuum --client=1 --protocol=prepared --transactions=$repeats",
+	0,
+	[qr{processed: $repeats/$repeats}],
+	[qr{^$}],
+	$name,
+	{
+		"000_csn_perf_$name" => $sql
+	});
+
+	my $end_time = time();
+	my $elapsed = $end_time - $begin_time;
+
+	pass ("TEST $name: $elapsed");
+}
+
+# TEST 1: A transaction is open in primary that inserted a lot of
+# rows. SeqScan the table on the replica. It sees all the XIDs as not
+# in-progress
+
+$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+
+my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+$primary_session->query_safe("BEGIN;");
+$primary_session->query_safe("INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+
+# Consume one more XID, to bump up "last committed XID"
+$primary->safe_psql('postgres', "select txid_current()");
+
+wait_catchup($primary, $replica);
+
+repeat_and_time_sql("large-xact", $replica, 5000, "select count(*) from tbl");
+
+$primary_session->quit;
+$primary->safe_psql('postgres', "DROP TABLE tbl");
+
+# TEST 2: Like 'large-xact', but with lots of subxacts
+
+$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+
+$primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+$primary_session->query_safe("BEGIN;");
+$primary_session->query_safe(q{
+do $$
+  begin
+    for i in 1..100000 loop
+      begin
+        insert into tbl values (i);
+      exception
+        when others then raise 'fail: %', sqlerrm;
+      end;
+    end loop;
+  end
+$$;
+});
+
+# Consume one more XID, to bump up "last committed XID"
+$primary->safe_psql('postgres', "select txid_current()");
+
+wait_catchup($primary, $replica);
+
+repeat_and_time_sql("many-subxacts", $replica, 5000, "select count(*) from tbl");
+
+$primary_session->quit;
+$primary->safe_psql('postgres', "DROP TABLE tbl");
+
+
+# TEST 3: A mix of a handful of different subxids
+
+$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+
+$primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+$primary_session->query_safe("INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+$primary_session->query_safe("VACUUM FREEZE tbl;");
+$primary_session->query_safe("BEGIN;");
+
+my $batches = 10;
+for(my $i = 0; $i < $batches; $i++) {
+	$primary_session->query_safe("SAVEPOINT sp$i");
+	$primary_session->query_safe("DELETE FROM tbl WHERE i % $batches = $i");
+}
+
+# Consume one more XID, to bump up "last committed XID"
+$primary->safe_psql('postgres', "select txid_current()");
+
+wait_catchup($primary, $replica);
+
+repeat_and_time_sql("few-subxacts", $replica, 5000, "select count(*) from tbl");
+
+$primary_session->quit;
+$primary->safe_psql('postgres', "DROP TABLE tbl");
+
+
+done_testing();
-- 
2.39.2

v2-0003-Use-CSN-snapshots-during-Hot-Standby.patch (text/x-patch; charset=UTF-8)

From 216c32bc9041df74d43e170654d2e4a1eb8195ed Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 20:26:40 +0300
Subject: [PATCH v2 3/6] Use CSN snapshots during Hot Standby

Replace the known-assigned-XIDs mechanism with a CSN log. The CSN log
(pg_csn) tracks the commit LSN of each transaction, when replaying the
WAL on a standby. It's only used on the standby, and is initialized
from scratch at server startup like pg_subtrans.

Based on 0001-CSN-base-snapshot.patch from
https://www.postgresql.org/message-id/2020081009525213277261%40highgo.ca.
This patch has a long lineage, various CSN patches have been posted
with parts from Stas Kelvich, Movead Li, Ants Aasma, Heikki
Linnakangas, Alexander Kuzmenkov
---
 contrib/pg_visibility/pg_visibility.c         |    2 +
 src/backend/access/rmgrdesc/xactdesc.c        |   26 -
 src/backend/access/transam/Makefile           |    1 +
 src/backend/access/transam/csn_log.c          |  473 ++++++
 src/backend/access/transam/meson.build        |    1 +
 src/backend/access/transam/transam.c          |    3 +
 src/backend/access/transam/twophase.c         |   34 +-
 src/backend/access/transam/varsup.c           |    1 +
 src/backend/access/transam/xact.c             |  138 +-
 src/backend/access/transam/xlog.c             |  118 +-
 src/backend/access/transam/xlogrecovery.c     |   13 +-
 src/backend/access/transam/xlogutils.c        |    2 +-
 src/backend/postmaster/startup.c              |    2 +-
 src/backend/replication/logical/decode.c      |    8 -
 src/backend/replication/logical/snapbuild.c   |    2 +-
 src/backend/storage/ipc/ipci.c                |    3 +
 src/backend/storage/ipc/procarray.c           | 1512 ++---------------
 src/backend/storage/ipc/standby.c             |  102 +-
 src/backend/storage/lmgr/lwlock.c             |    2 +
 .../utils/activity/wait_event_names.txt       |    1 +
 src/backend/utils/probes.d                    |    2 +
 src/backend/utils/time/snapmgr.c              |   37 +-
 src/bin/initdb/initdb.c                       |    3 +-
 src/include/access/csn_log.h                  |   30 +
 src/include/access/transam.h                  |    3 +
 src/include/access/twophase.h                 |    3 +-
 src/include/access/xact.h                     |   12 +-
 src/include/access/xlogutils.h                |   33 +-
 src/include/storage/lwlock.h                  |    2 +
 src/include/storage/procarray.h               |   13 +-
 src/include/utils/snapshot.h                  |    7 +
 31 files changed, 821 insertions(+), 1768 deletions(-)
 create mode 100644 src/backend/access/transam/csn_log.c
 create mode 100644 src/include/access/csn_log.h

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 1a1a4ff7be..3e096b99e3 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -548,6 +548,8 @@ collect_visibility_data(Oid relid, bool include_pd)
  *    databases that were ignored before.
  * 2. Ignore KnownAssignedXids, because they are not database-aware. At the
  *    same time, the primary could compute its horizons database-aware.
+ *    XXX KnownAssignedXids is gone. But see how this plays out:
+ *    https://www.postgresql.org/message-id/42218c4f-2c8d-40a3-8743-4d34dd0e4cce%40iki.fi
  * 3. Ignore walsender xmin, because it could go backward if some replication
  *    connections don't use replication slots.
  *
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index dccca201e0..cbcde73a9f 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -421,17 +421,6 @@ xact_desc_prepare(StringInfo buf, uint8 info, xl_xact_prepare *xlrec, RepOriginI
 						 timestamptz_to_str(parsed.origin_timestamp));
 }
 
-static void
-xact_desc_assignment(StringInfo buf, xl_xact_assignment *xlrec)
-{
-	int			i;
-
-	appendStringInfoString(buf, "subxacts:");
-
-	for (i = 0; i < xlrec->nsubxacts; i++)
-		appendStringInfo(buf, " %u", xlrec->xsub[i]);
-}
-
 void
 xact_desc(StringInfo buf, XLogReaderState *record)
 {
@@ -459,18 +448,6 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		xact_desc_prepare(buf, XLogRecGetInfo(record), xlrec,
 						  XLogRecGetOrigin(record));
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) rec;
-
-		/*
-		 * Note that we ignore the WAL record's xid, since we're more
-		 * interested in the top-level xid that issued the record and which
-		 * xids are being reported here.
-		 */
-		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
-		xact_desc_assignment(buf, xlrec);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		xl_xact_invals *xlrec = (xl_xact_invals *) rec;
@@ -502,9 +479,6 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ABORT_PREPARED:
 			id = "ABORT_PREPARED";
 			break;
-		case XLOG_XACT_ASSIGNMENT:
-			id = "ASSIGNMENT";
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			id = "INVALIDATION";
 			break;
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db..2520d77c7c 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	clog.o \
 	commit_ts.o \
+	csn_log.o \
 	generic_xlog.o \
 	multixact.o \
 	parallel.o \
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
new file mode 100644
index 0000000000..723d3b633f
--- /dev/null
+++ b/src/backend/access/transam/csn_log.c
@@ -0,0 +1,473 @@
+/*-----------------------------------------------------------------------------
+ *
+ * csn_log.c
+ *		Track commit record LSNs of finished transactions
+ *
+ * This module provides an SLRU to store the LSN of the commit record of each
+ * transaction. CSN stands for Commit Sequence Number, and in principle we
+ * could use a separate counter that is incremented at every commit. For
+ * simplicity, though, we use the commit record's LSN as the sequence number.
+ *
+ * Like pg_subtrans, this mapping needs to be kept only for XIDs greater than
+ * oldestXmin, and doesn't need to be preserved over crashes.  Also, this is
+ * only needed in hot standby mode, and immediately after exiting hot standby
+ * mode, until all old snapshots taken during standby mode are gone.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_log.c
+ *
+ *-----------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/clog.h"
+#include "access/csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+/*
+ * Defines for CSNLog page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE, and CSNLog segment numbering at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCSNLog (see CSNLogPagePrecedes).
+ */
+
+/* We store the commit CSN for each xid */
+#define CSN_LOG_XACTS_PER_PAGE (BLCKSZ / sizeof(XLogRecPtr))
+
+#define TransactionIdToPage(xid)	((xid) / (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+
+#define PgIndexToTransactionId(pageno, idx) (CSN_LOG_XACTS_PER_PAGE * (pageno) + idx)
+
+
+
+/*
+ * Link to shared-memory data structures for CSNLog control
+ */
+static SlruCtlData CSNLogCtlData;
+#define CsnlogCtl (&CSNLogCtlData)
+
+static int	ZeroCSNLogPage(int pageno);
+static bool CSNLogPagePrecedes(int64 page1, int64 page2);
+static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+								TransactionId *subxids,
+								XLogRecPtr csn, int pageno);
+static void CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn,
+							   int slotno);
+
+
+/*
+ * Record commit LSN of a transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transaction ID for a top level commit.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ *
+ * commitLsn is the LSN of the commit record.  This is currently never called
+ * for aborted transactions.
+ */
+void
+CSNLogSetCSN(TransactionId xid, int nsubxids, TransactionId *subxids,
+			 XLogRecPtr commitLsn)
+{
+	int			pageno;
+	int			i = 0;
+	int			offset = 0;
+
+	Assert(TransactionIdIsValid(xid));
+
+	pageno = TransactionIdToPage(xid);	/* get page of parent */
+	for (;;)
+	{
+		int			num_on_page = 0;
+
+		while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+		{
+			num_on_page++;
+			i++;
+		}
+
+		CSNLogSetPageStatus(xid,
+							num_on_page, subxids + offset,
+							commitLsn, pageno);
+		if (i >= nsubxids)
+			break;
+
+		offset = i;
+		pageno = TransactionIdToPage(subxids[offset]);
+		xid = InvalidTransactionId;
+	}
+}
+
+/*
+ * Record the final state of transaction entries in the CSN log for all
+ * entries on a single page.  Atomic only on this page.
+ *
+ * Otherwise API is same as CSNLogSetCSN()
+ */
+static void
+CSNLogSetPageStatus(TransactionId xid, int nsubxids, TransactionId *subxids,
+					XLogRecPtr commitLsn, int pageno)
+{
+	int			slotno;
+	int			i;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CsnlogCtl, pageno, true, xid);
+
+	/* Subtransactions first, if needed ... */
+	for (i = 0; i < nsubxids; i++)
+	{
+		Assert(CsnlogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+		CSNLogSetCSNInSlot(subxids[i], commitLsn, slotno);
+	}
+
+	/* ... then the main transaction */
+	if (TransactionIdIsValid(xid))
+		CSNLogSetCSNInSlot(xid, commitLsn, slotno);
+
+	CsnlogCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Sets the commit status of a single transaction.
+ */
+static void
+CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn, int slotno)
+{
+	int			entryno = TransactionIdToPgIndex(xid);
+	XLogRecPtr *ptr;
+
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+
+	*ptr = csn;
+}
+
+/*
+ * Interrogate the state of a transaction in the log.
+ *
+ * NB: this is a low-level routine and is NOT the preferred entry point
+ * for most uses; TransactionIdGetXidCSN() in csn_snapshot.c is the
+ * intended caller.
+ */
+XLogRecPtr
+CSNLogGetCSNByXid(TransactionId xid)
+{
+	int			pageno = TransactionIdToPage(xid);
+	int			entryno = TransactionIdToPgIndex(xid);
+	int			slotno;
+	XLogRecPtr *ptr;
+	XLogRecPtr	xid_csn;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Can't ask about stuff that might not be around anymore */
+	Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+
+	slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+	xid_csn = *ptr;
+
+	LWLockRelease(SimpleLruGetBankLock(CsnlogCtl, pageno));
+
+	return xid_csn;
+}
+
+/*
+ * Number of shared CSNLog buffers.
+ */
+static Size
+CSNLogShmemBuffers(void)
+{
+	return Min(32, Max(16, NBuffers / 512));
+}
+
+/*
+ * Reserve shared memory for CsnlogCtl.
+ */
+Size
+CSNLogShmemSize(void)
+{
+	// FIXME: skip if not InHotStandby?
+	return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
+}
+
+/*
+ * Initialization of shared memory for CSNLog.
+ */
+void
+CSNLogShmemInit(void)
+{
+	CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
+	SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
+				  "pg_csn", LWTRANCHE_CSN_LOG_BUFFER,
+				  LWTRANCHE_CSN_LOG_SLRU, SYNC_HANDLER_NONE, false);
+	//SlruPagePrecedesUnitTests(CsnlogCtl, SUBTRANS_XACTS_PER_PAGE);
+}
+
+/*
+ * This func must be called ONCE on system install.  It creates the initial
+ * CSNLog segment.  The pg_csn directory is assumed to have been
+ * created by initdb, and CSNLogShmemInit must have been called already.
+ *
+ * Note: it's not really necessary to create the initial segment now,
+ * since slru.c would create it on first write anyway.  But we may as well
+ * do it to be sure the directory is set up correctly.
+ */
+void
+BootStrapCSNLog(void)
+{
+	int			slotno;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, 0);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Create and zero the first page of the CSN log */
+	slotno = ZeroCSNLogPage(0);
+
+	/* Make sure it's written out */
+	SimpleLruWritePage(CsnlogCtl, slotno);
+	Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Initialize (or reinitialize) a page of CSNLog to zeroes.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCSNLogPage(int pageno)
+{
+	return SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * Initialize a page of CSNLog based on pg_xact.
+ *
+ * All committed transactions are stamped with 'csn'
+ */
+static void
+InitCSNLogPage(int pageno, TransactionId *xid, TransactionId nextXid, XLogRecPtr csn)
+{
+	XLogRecPtr	dummy;
+	int			slotno;
+
+	slotno = ZeroCSNLogPage(pageno);
+
+	while (*xid < nextXid && TransactionIdToPage(*xid) == pageno)
+	{
+		XidStatus	status = TransactionIdGetStatus(*xid, &dummy);
+
+		if (status == TRANSACTION_STATUS_COMMITTED ||
+			status == TRANSACTION_STATUS_ABORTED)
+			CSNLogSetCSNInSlot(*xid, csn, slotno);
+
+		TransactionIdAdvance(*xid);
+	}
+	SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * after StartupXLOG has initialized ShmemVariableCache->nextXid, and after
+ * initializing the CLOG.
+ *
+ * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
+ * if there are none.
+ *
+ * All transactions that have already completed are marked with 'csn'. ('csn'
+ * is supposed to be an "older than anything we'll ever need to compare with")
+ */
+void
+StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn)
+{
+	TransactionId xid;
+	FullTransactionId nextXid;
+	int			startPage;
+	int			endPage;
+	LWLock	   *prevlock = NULL;
+	LWLock	   *lock;
+
+	/*
+	 * Since we don't expect pg_csn to be valid across crashes, we initialize
+	 * the currently-active page(s) to zeroes during startup. Whenever we
+	 * advance into a new page, ExtendCSNLog will likewise zero the new page
+	 * without regard to whatever was previously on disk.
+	 */
+	startPage = TransactionIdToPage(oldestActiveXID);
+	nextXid = TransamVariables->nextXid;
+	endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid));
+
+	Assert(TransactionIdIsValid(oldestActiveXID));
+	Assert(FullTransactionIdIsValid(nextXid));
+
+	xid = oldestActiveXID;
+	for (;;)
+	{
+		lock = SimpleLruGetBankLock(CsnlogCtl, startPage);
+		if (prevlock != lock)
+		{
+			if (prevlock)
+				LWLockRelease(prevlock);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			prevlock = lock;
+		}
+
+		InitCSNLogPage(startPage, &xid, XidFromFullTransactionId(nextXid), csn);
+		if (startPage == endPage)
+			break;
+
+		startPage++;
+		/* must account for wraparound */
+		if (startPage > TransactionIdToPage(MaxTransactionId))
+			startPage = 0;
+	}
+
+	LWLockRelease(lock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely as a debugging aid.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(false);
+	SimpleLruWriteAll(CsnlogCtl, false);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely to improve the odds that writing of dirty pages is done by
+	 * the checkpoint process and not by backends.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(true);
+	SimpleLruWriteAll(CsnlogCtl, true);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(true);
+}
+
+/*
+ * Make sure that CSNLog has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CSN log page to make room
+ * in shared memory.
+ */
+void
+ExtendCSNLog(TransactionId newestXact)
+{
+	int64		pageno;
+	LWLock	   *lock;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToPgIndex(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToPage(newestXact);
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Zero the page (not WAL-logged; pg_csn is reinitialized at startup) */
+	ZeroCSNLogPage(pageno);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Remove all CSNLog segments before the one holding the passed
+ * transaction ID.
+ *
+ * This is normally called during checkpoint, with oldestXact being the
+ * oldest TransactionXmin of any running transaction.
+ */
+void
+TruncateCSNLog(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate. We step
+	 * back one transaction to avoid passing a cutoff page that hasn't been
+	 * created yet in the rare case that oldestXact would be the first item on
+	 * a page and oldestXact == next XID.  In that case, if we didn't subtract
+	 * one, we'd trigger SimpleLruTruncate's wraparound detection.
+	 */
+	TransactionIdRetreat(oldestXact);
+	cutoffPage = TransactionIdToPage(oldestXact);
+
+	SimpleLruTruncate(CsnlogCtl, cutoffPage);
+}
+
+/*
+ * Decide which of two CSNLog page numbers is "older" for truncation
+ * purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic.  However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CSNLogPagePrecedes(int64 page1, int64 page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * CSN_LOG_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId;
+	xid2 = ((TransactionId) page2) * CSN_LOG_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId;
+
+	return TransactionIdPrecedes(xid1, xid2);
+}
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 8a3522557c..cf41df2971 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -2,6 +2,7 @@
 
 backend_sources += files(
   'clog.c',
+  'csn_log.c',
   'commit_ts.c',
   'generic_xlog.c',
   'multixact.c',
diff --git a/src/backend/access/transam/transam.c b/src/backend/access/transam/transam.c
index 75b5325df8..93c4d495e4 100644
--- a/src/backend/access/transam/transam.c
+++ b/src/backend/access/transam/transam.c
@@ -377,6 +377,9 @@ TransactionIdLatest(TransactionId mainxid,
  * Also, because we group transactions on the same clog page to conserve
  * storage, we might return the LSN of a later transaction that falls into
  * the same group.
+ *
+ * XXX: Now that we have the CSN log, should we use that during recovery? Or
+ * should we rename this function to reduce confusion?
  */
 XLogRecPtr
 TransactionIdGetCommitLSN(TransactionId xid)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e98286d768..9cea5ad2e9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1953,20 +1954,13 @@ restoreTwoPhaseData(void)
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return TransamVariables->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
- *
- * If xids_p and nxids_p are not NULL, pointer to a palloc'd array of all
- * top-level xids is stored in *xids_p. The number of entries in the array
- * is returned in *nxids_p.
  */
 TransactionId
-PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
+PrescanPreparedTransactions(void)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
 	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId result = origNextXid;
-	TransactionId *xids = NULL;
-	int			nxids = 0;
-	int			allocsize = 0;
 	int			i;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
@@ -1994,34 +1988,10 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		if (TransactionIdPrecedes(xid, result))
 			result = xid;
 
-		if (xids_p)
-		{
-			if (nxids == allocsize)
-			{
-				if (nxids == 0)
-				{
-					allocsize = 10;
-					xids = palloc(allocsize * sizeof(TransactionId));
-				}
-				else
-				{
-					allocsize = allocsize * 2;
-					xids = repalloc(xids, allocsize * sizeof(TransactionId));
-				}
-			}
-			xids[nxids++] = xid;
-		}
-
 		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 
-	if (xids_p)
-	{
-		*xids_p = xids;
-		*nxids_p = nxids;
-	}
-
 	return result;
 }
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index fb6a86afcb..a37d17886c 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -15,6 +15,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index dfc8cf2dcf..09ed510989 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -209,7 +210,6 @@ typedef struct TransactionStateData
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;	/* entry-time xact r/o state */
 	bool		startedInRecovery;	/* did we start in recovery? */
-	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		parallelChildXact;	/* is any parent transaction parallel? */
 	bool		chain;			/* start a new block after this one */
@@ -249,13 +249,6 @@ static TransactionStateData TopTransactionStateData = {
 	.topXidLogged = false,
 };
 
-/*
- * unreportedXids holds XIDs of all subtransactions that have not yet been
- * reported in an XLOG_XACT_ASSIGNMENT record.
- */
-static int	nUnreportedXids;
-static TransactionId unreportedXids[PGPROC_MAX_CACHED_SUBXIDS];
-
 static TransactionState CurrentTransactionState = &TopTransactionStateData;
 
 /*
@@ -531,18 +524,6 @@ GetCurrentFullTransactionIdIfAny(void)
 	return CurrentTransactionState->fullTransactionId;
 }
 
-/*
- *	MarkCurrentTransactionIdLoggedIfAny
- *
- * Remember that the current xid - if it is assigned - now has been wal logged.
- */
-void
-MarkCurrentTransactionIdLoggedIfAny(void)
-{
-	if (FullTransactionIdIsValid(CurrentTransactionState->fullTransactionId))
-		CurrentTransactionState->didLogXid = true;
-}
-
 /*
  * IsSubxactTopXidLogPending
  *
@@ -635,7 +616,6 @@ AssignTransactionId(TransactionState s)
 {
 	bool		isSubXact = (s->parent != NULL);
 	ResourceOwner currentOwner;
-	bool		log_unknown_top = false;
 
 	/* Assert that caller didn't screw up */
 	Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -679,20 +659,6 @@ AssignTransactionId(TransactionState s)
 		pfree(parents);
 	}
 
-	/*
-	 * When wal_level=logical, guarantee that a subtransaction's xid can only
-	 * be seen in the WAL stream if its toplevel xid has been logged before.
-	 * If necessary we log an xact_assignment record with fewer than
-	 * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
-	 * for a transaction even though it appears in a WAL record, we just might
-	 * superfluously log something. That can happen when an xid is included
-	 * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
-	 * xl_standby_locks.
-	 */
-	if (isSubXact && XLogLogicalInfoActive() &&
-		!TopTransactionStateData.didLogXid)
-		log_unknown_top = true;
-
 	/*
 	 * Generate a new FullTransactionId and record its xid in PGPROC and
 	 * pg_subtrans.
@@ -728,59 +694,6 @@ AssignTransactionId(TransactionState s)
 	XactLockTableInsert(XidFromFullTransactionId(s->fullTransactionId));
 
 	CurrentResourceOwner = currentOwner;
-
-	/*
-	 * Every PGPROC_MAX_CACHED_SUBXIDS assigned transaction ids within each
-	 * top-level transaction we issue a WAL record for the assignment. We
-	 * include the top-level xid and all the subxids that have not yet been
-	 * reported using XLOG_XACT_ASSIGNMENT records.
-	 *
-	 * This is required to limit the amount of shared memory required in a hot
-	 * standby server to keep track of in-progress XIDs. See notes for
-	 * RecordKnownAssignedTransactionIds().
-	 *
-	 * We don't keep track of the immediate parent of each subxid, only the
-	 * top-level transaction that each subxact belongs to. This is correct in
-	 * recovery only because aborted subtransactions are separately WAL
-	 * logged.
-	 *
-	 * This is correct even for the case where several levels above us didn't
-	 * have an xid assigned as we recursed up to them beforehand.
-	 */
-	if (isSubXact && XLogStandbyInfoActive())
-	{
-		unreportedXids[nUnreportedXids] = XidFromFullTransactionId(s->fullTransactionId);
-		nUnreportedXids++;
-
-		/*
-		 * ensure this test matches similar one in
-		 * RecoverPreparedTransactions()
-		 */
-		if (nUnreportedXids >= PGPROC_MAX_CACHED_SUBXIDS ||
-			log_unknown_top)
-		{
-			xl_xact_assignment xlrec;
-
-			/*
-			 * xtop is always set by now because we recurse up transaction
-			 * stack to the highest unassigned xid and then come back down
-			 */
-			xlrec.xtop = GetTopTransactionId();
-			Assert(TransactionIdIsValid(xlrec.xtop));
-			xlrec.nsubxacts = nUnreportedXids;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, MinSizeOfXactAssignment);
-			XLogRegisterData((char *) unreportedXids,
-							 nUnreportedXids * sizeof(TransactionId));
-
-			(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT);
-
-			nUnreportedXids = 0;
-			/* mark top, not current xact as having been logged */
-			TopTransactionStateData.didLogXid = true;
-		}
-	}
 }
 
 /*
@@ -1470,11 +1383,11 @@ RecordTransactionCommit(void)
 	 * temp tables will be lost anyway, unlogged tables will be truncated and
 	 * HOT pruning will be done again later. (Given the foregoing, you might
 	 * think that it would be unnecessary to emit the XLOG record at all in
-	 * this case, but we don't currently try to do that.  It would certainly
-	 * cause problems at least in Hot Standby mode, where the
-	 * KnownAssignedXids machinery requires tracking every XID assignment.  It
-	 * might be OK to skip it only when wal_level < replica, but for now we
-	 * don't.)
+	 * this case, but we don't currently try to do that.  It might cause
+	 * inefficiencies in Hot Standby mode, if nothing else, where the
+	 * commit/abort records allow advancing the xmin horizon for new
+	 * snapshots. It might be OK to skip it only when wal_level < replica, but
+	 * for now we don't.)
 	 *
 	 * However, if we're doing cleanup of any non-temp rels or committing any
 	 * command that wanted to force sync commit, then we must flush XLOG
@@ -1942,13 +1855,6 @@ AtSubAbort_childXids(void)
 	s->childXids = NULL;
 	s->nChildXids = 0;
 	s->maxChildXids = 0;
-
-	/*
-	 * We could prune the unreportedXids array here. But we don't bother. That
-	 * would potentially reduce number of XLOG_XACT_ASSIGNMENT records but it
-	 * would likely introduce more CPU time into the more common paths, so we
-	 * choose not to do that.
-	 */
 }
 
 /* ----------------------------------------------------------------
@@ -2131,12 +2037,6 @@ StartTransaction(void)
 	currentCommandId = FirstCommandId;
 	currentCommandIdUsed = false;
 
-	/*
-	 * initialize reported xid accounting
-	 */
-	nUnreportedXids = 0;
-	s->didLogXid = false;
-
 	/*
 	 * must initialize resource-management stuff first
 	 */
@@ -6141,7 +6041,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 	TransactionTreeSetCommitTsData(xid, parsed->nsubxacts, parsed->subxacts,
 								   commit_time, origin_id);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/*
 		 * Mark the transaction committed in pg_xact.
@@ -6161,6 +6061,12 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		 */
 		RecordKnownAssignedTransactionIds(max_xid);
 
+		/*
+		 * Mark the CSNLOG first.  The transaction won't become visible to new
+		 * snapshots until the call to ProcArrayRecoveryEndTransaction().
+		 */
+		CSNLogSetCSN(xid, parsed->nsubxacts, parsed->subxacts, lsn);
+
 		/*
 		 * Mark the transaction committed in pg_xact. We use async commit
 		 * protocol during recovery to provide information on database
@@ -6173,9 +6079,9 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		TransactionIdAsyncCommitTree(xid, parsed->nsubxacts, parsed->subxacts, lsn);
 
 		/*
-		 * We must mark clog before we update the ProcArray.
+		 * Make the commit visible to new snapshots in the ProcArray.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ProcArrayRecoveryEndTransaction(max_xid, lsn);
 
 		/*
 		 * Send any cache invalidations attached to the commit. We must
@@ -6281,7 +6187,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 								  parsed->subxacts);
 	AdvanceNextFullTransactionIdPastXid(max_xid);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/* Mark the transaction aborted in pg_xact, no need for async stuff */
 		TransactionIdAbortTree(xid, parsed->nsubxacts, parsed->subxacts);
@@ -6299,13 +6205,15 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 		 */
 		RecordKnownAssignedTransactionIds(max_xid);
 
+		/* Note: we don't need to update the CSN log on abort. */
+
 		/* Mark the transaction aborted in pg_xact, no need for async stuff */
 		TransactionIdAbortTree(xid, parsed->nsubxacts, parsed->subxacts);
 
 		/*
 		 * We must update the ProcArray after we have marked clog.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ProcArrayRecoveryEndTransaction(max_xid, lsn);
 
 		/*
 		 * There are no invalidation messages to send or undo.
@@ -6413,14 +6321,6 @@ xact_redo(XLogReaderState *record)
 					   XLogRecGetOrigin(record));
 		LWLockRelease(TwoPhaseStateLock);
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record);
-
-		if (standbyState >= STANDBY_INITIALIZED)
-			ProcArrayApplyXidAssignment(xlrec->xtop,
-										xlrec->nsubxacts, xlrec->xsub);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ee0fb0e28f..b1af9332a3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -48,6 +48,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
@@ -950,8 +951,6 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	END_CRIT_SECTION();
 
-	MarkCurrentTransactionIdLoggedIfAny();
-
 	/*
 	 * Mark top transaction id is logged (if needed) so that we should not try
 	 * to log it again with the next WAL record in the current subtransaction.
@@ -5153,6 +5152,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCSNLog();
 	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
@@ -5754,16 +5754,16 @@ StartupXLOG(void)
 		 */
 		if (ArchiveRecoveryRequested && EnableHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
+			FullTransactionId latestCompletedXid;
 
 			ereport(DEBUG1,
 					(errmsg_internal("initializing for hot standby")));
+			InHotStandby = true;
 
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
-				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanPreparedTransactions();
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -5778,39 +5778,17 @@ StartupXLOG(void)
 			 */
 			StartupSUBTRANS(oldestActiveXID);
 
-			/*
-			 * If we're beginning at a shutdown checkpoint, we know that
-			 * nothing was running on the primary at this point. So fake-up an
-			 * empty running-xacts record and use that here and now. Recover
-			 * additional standby state for prepared transactions.
-			 */
-			if (wasShutdown)
-			{
-				RunningTransactionsData running;
-				TransactionId latestCompletedXid;
+			latestCompletedXid = checkPoint.nextXid;
+			FullTransactionIdRetreat(&latestCompletedXid);
+			TransamVariables->latestCompletedXid = latestCompletedXid;
 
-				/* Update pg_subtrans entries for any prepared transactions */
-				StandbyRecoverPreparedTransactions();
+			StartupCSNLog(oldestActiveXID, RedoRecPtr);
 
-				/*
-				 * Construct a RunningTransactions snapshot representing a
-				 * shut down server, with only prepared transactions still
-				 * alive. We're never overflowed at this point because all
-				 * subxids are listed with their parent prepared transactions.
-				 */
-				running.xcnt = nxids;
-				running.subxcnt = 0;
-				running.subxid_status = SUBXIDS_IN_SUBTRANS;
-				running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-				running.oldestRunningXid = oldestActiveXID;
-				latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-				TransactionIdRetreat(latestCompletedXid);
-				Assert(TransactionIdIsNormal(latestCompletedXid));
-				running.latestCompletedXid = latestCompletedXid;
-				running.xids = xids;
-
-				ProcArrayApplyRecoveryInfo(&running);
-			}
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
+
+			/* Update pg_subtrans entries for any prepared transactions */
+			if (wasShutdown)
+				StandbyRecoverPreparedTransactions();
 		}
 
 		/*
@@ -5894,7 +5872,7 @@ StartupXLOG(void)
 	 * This information is not quite needed yet, but it is positioned here so
 	 * as potential problems are detected before any on-disk change is done.
 	 */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanPreparedTransactions();
 
 	/*
 	 * Allow ordinary WAL segment creation before possibly switching to a new
@@ -6060,9 +6038,18 @@ StartupXLOG(void)
 	 * Start up subtrans, if not already done for hot standby.  (commit
 	 * timestamps are started below, if necessary.)
 	 */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
+	{
 		StartupSUBTRANS(oldestActiveXID);
 
+		/*
+		 * TODO: we don't need to update CSN log from now on, but it's still
+		 * required by snapshots that were taken before recovery ended.  We
+		 * just let it be, but it would be nice to truncate it to 0 after all
+		 * the snapshots are gone.
+		 */
+	}
+
 	/*
 	 * Perform end of recovery actions for any SLRUs that need it.
 	 */
@@ -6154,12 +6141,12 @@ StartupXLOG(void)
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
 	 * and after switching SharedRecoveryState to RECOVERY_STATE_DONE so as
-	 * any session building a snapshot will not rely on KnownAssignedXids as
+	 * any session building a snapshot will not rely on the CSN log as
 	 * RecoveryInProgress() would return false at this stage.  This is
 	 * particularly critical for prepared 2PC transactions, that would still
 	 * need to be included in snapshots once recovery has ended.
 	 */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/*
@@ -6926,7 +6913,7 @@ CreateCheckPoint(int flags)
 	 * starting snapshot of locks and transactions.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		checkPoint.oldestActiveXid = GetOldestActiveTransactionId();
+		checkPoint.oldestActiveXid = GetOldestActiveTransactionId(true);
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -7318,7 +7305,10 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
+	{
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+	}
 
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(false);
@@ -7489,6 +7479,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
 	CheckPointCLOG();
+	CheckPointCSNLog();
 	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
@@ -7785,7 +7776,10 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
+	{
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+	}
 
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(true);
@@ -8270,41 +8264,17 @@ xlog_redo(XLogReaderState *record)
 
 		/*
 		 * If we see a shutdown checkpoint, we know that nothing was running
-		 * on the primary at this point. So fake-up an empty running-xacts
-		 * record and use that here and now. Recover additional standby state
-		 * for prepared transactions.
+		 * on the primary at this point, except for prepared transactions.
 		 */
-		if (standbyState >= STANDBY_INITIALIZED)
+		if (InHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
 			TransactionId oldestActiveXID;
-			TransactionId latestCompletedXid;
-			RunningTransactionsData running;
 
-			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanPreparedTransactions();
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
 
 			/* Update pg_subtrans entries for any prepared transactions */
 			StandbyRecoverPreparedTransactions();
-
-			/*
-			 * Construct a RunningTransactions snapshot representing a shut
-			 * down server, with only prepared transactions still alive. We're
-			 * never overflowed at this point because all subxids are listed
-			 * with their parent prepared transactions.
-			 */
-			running.xcnt = nxids;
-			running.subxcnt = 0;
-			running.subxid_status = SUBXIDS_IN_SUBTRANS;
-			running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-			running.oldestRunningXid = oldestActiveXID;
-			latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-			TransactionIdRetreat(latestCompletedXid);
-			Assert(TransactionIdIsNormal(latestCompletedXid));
-			running.latestCompletedXid = latestCompletedXid;
-			running.xids = xids;
-
-			ProcArrayApplyRecoveryInfo(&running);
 		}
 
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
@@ -8368,6 +8338,16 @@ xlog_redo(XLogReaderState *record)
 								  checkPoint.oldestXid))
 			SetTransactionIdLimit(checkPoint.oldestXid,
 								  checkPoint.oldestXidDB);
+
+		/*
+		 * Remember the oldest XID that was running at the time.  Normally,
+		 * all transaction aborts and commits are WAL-logged, so our
+		 * oldestRunningXid value should be up-to-date, but if not, this
+		 * allows us to resynchronize.
+		 */
+		if (InHotStandby)
+			ProcArrayUpdateOldestRunningXid(checkPoint.oldestActiveXid);
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index ad817fbca6..324a935d77 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1986,10 +1986,9 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	SpinLockRelease(&XLogRecoveryCtl->info_lck);
 
 	/*
-	 * If we are attempting to enter Hot Standby mode, process XIDs we see
+	 * In Hot Standby mode, process XIDs we see
 	 */
-	if (standbyState >= STANDBY_INITIALIZED &&
-		TransactionIdIsValid(record->xl_xid))
+	if (InHotStandby && TransactionIdIsValid(record->xl_xid))
 		RecordKnownAssignedTransactionIds(record->xl_xid);
 
 	/*
@@ -2265,7 +2264,7 @@ CheckRecoveryConsistency(void)
 	 * run? If so, we can tell postmaster that the database is consistent now,
 	 * enabling connections.
 	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY &&
+	if (InHotStandby &&
 		!LocalHotStandbyActive &&
 		reachedConsistency &&
 		IsUnderPostmaster)
@@ -3710,9 +3709,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						elog(LOG, "waiting for WAL to become available at %X/%X",
 							 LSN_FORMAT_ARGS(RecPtr));
 
-						/* Do background tasks that might benefit us later. */
-						KnownAssignedTransactionIdsIdleMaintenance();
-
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,
@@ -3979,9 +3975,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						streaming_reply_sent = true;
 					}
 
-					/* Do any background tasks that might benefit us later. */
-					KnownAssignedTransactionIdsIdleMaintenance();
-
 					/* Update pg_stat_recovery_prefetch before sleeping. */
 					XLogPrefetcherComputeStats(xlogprefetcher);
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5295b85fe0..bf08c60e93 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -50,7 +50,7 @@ bool		ignore_invalid_pages = false;
 bool		InRecovery = false;
 
 /* Are we in Hot Standby mode? Only valid in startup process, see xlogutils.h */
-HotStandbyState standbyState = STANDBY_DISABLED;
+bool		InHotStandby = false;
 
 /*
  * During XLOG replay, we may see XLOG records for incremental updates of
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index ef6f98ebcd..a975865fdd 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -203,7 +203,7 @@ static void
 StartupProcExit(int code, Datum arg)
 {
 	/* Shutdown the recovery environment */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 }
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d687ceee33..caae8f75c2 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -270,14 +270,6 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
-		case XLOG_XACT_ASSIGNMENT:
-
-			/*
-			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here. See
-			 * LogicalDecodingProcessRecord.
-			 */
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			{
 				TransactionId xid;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ae676145e6..ea2f8e25cd 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -27,7 +27,7 @@
  * removed. This is achieved by using the replication slot mechanism.
  *
  * As the percentage of transactions modifying the catalog normally is fairly
- * small in comparisons to ones only manipulating user data, we keep track of
+ * small in comparison to ones only manipulating user data, we keep track of
  * the committed catalog modifying ones inside [xmin, xmax) instead of keeping
  * track of all running transactions like it's done in a normal snapshot. Note
  * that we're generally only looking at transactions that have acquired an
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 35fa2e1dda..ecb5d81543 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/subtrans.h"
@@ -124,6 +125,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, XLogRecoveryShmemSize());
 	size = add_size(size, CLOGShmemSize());
+	size = add_size(size, CSNLogShmemSize());
 	size = add_size(size, CommitTsShmemSize());
 	size = add_size(size, SUBTRANSShmemSize());
 	size = add_size(size, TwoPhaseShmemSize());
@@ -283,6 +285,7 @@ CreateOrAttachShmemStructs(void)
 	XLogPrefetchShmemInit();
 	XLogRecoveryShmemInit();
 	CLOGShmemInit();
+	CSNLogShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index af3b15e93d..a6e11dece2 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -19,20 +19,10 @@
  * myProcLocks lists.  They can be distinguished from regular backend PGPROCs
  * at need by checking for pid == 0.
  *
- * During hot standby, we also keep a list of XIDs representing transactions
- * that are known to be running on the primary (or more precisely, were running
- * as of the current point in the WAL stream).  This list is kept in the
- * KnownAssignedXids array, and is updated by watching the sequence of
- * arriving XIDs.  This is necessary because if we leave those XIDs out of
- * snapshots taken for standby queries, then they will appear to be already
- * complete, leading to MVCC failures.  Note that in hot standby, the PGPROC
- * array represents standby processes, which by definition are not running
- * transactions that have XIDs.
- *
- * It is perhaps possible for a backend on the primary to terminate without
- * writing an abort record for its transaction.  While that shouldn't really
- * happen, it would tie up KnownAssignedXids indefinitely, so we protect
- * ourselves by pruning the array when a valid list of running XIDs arrives.
+ * During hot standby, we don't have PGPROC entries representing transactions
+ * running in the primary.  Instead, each snapshot taken during recovery
+ * contains a Commit-Sequence Number (CSN), which is used to determine which
+ * XIDs the snapshot still considers to be running.
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -47,6 +37,7 @@
 
 #include <signal.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -73,22 +64,8 @@ typedef struct ProcArrayStruct
 	int			numProcs;		/* number of valid procs entries */
 	int			maxProcs;		/* allocated size of procs array */
 
-	/*
-	 * Known assigned XIDs handling
-	 */
-	int			maxKnownAssignedXids;	/* allocated size of array */
-	int			numKnownAssignedXids;	/* current # of valid entries */
-	int			tailKnownAssignedXids;	/* index of oldest valid element */
-	int			headKnownAssignedXids;	/* index of newest element, + 1 */
-
-	/*
-	 * Highest subxid that has been removed from KnownAssignedXids array to
-	 * prevent overflow; or InvalidTransactionId if none.  We track this for
-	 * similar reasons to tracking overflowing cached subxids in PGPROC
-	 * entries.  Must hold exclusive ProcArrayLock to change this, and shared
-	 * lock to read it.
-	 */
-	TransactionId lastOverflowedXid;
+	/* In recovery, oldest XID that could be still running in primary */
+	TransactionId oldest_running_primary_xid;
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
@@ -99,6 +76,21 @@ typedef struct ProcArrayStruct
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
+#define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
+
+/*
+ * TOTAL_MAX_CACHED_SUBXIDS is the total number of XIDs that fit in the proc
+ * array, as top XIDs and in the subxids caches.
+ *
+ * Local data structures are also created in various backends during
+ * GetSnapshotData(), TransactionIdIsInProgress() and
+ * GetRunningTransactionData(). All of the main structures created in those
+ * functions must be identically sized, since we may at times copy the whole
+ * of the data structures around.
+ */
+#define TOTAL_MAX_CACHED_SUBXIDS \
+	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
+
 /*
  * State for the GlobalVisTest* family of functions. Those functions can
  * e.g. be used to decide if a deleted row can be removed without violating
@@ -254,17 +246,6 @@ typedef enum GlobalVisHorizonKind
 	VISHORIZON_TEMP,
 } GlobalVisHorizonKind;
 
-/*
- * Reason codes for KnownAssignedXidsCompress().
- */
-typedef enum KAXCompressReason
-{
-	KAX_NO_SPACE,				/* need to free up space at array end */
-	KAX_PRUNE,					/* we just pruned old entries */
-	KAX_TRANSACTION_END,		/* we just committed/removed some XIDs */
-	KAX_STARTUP_PROCESS_IDLE,	/* startup process is about to sleep */
-} KAXCompressReason;
-
 
 static ProcArrayStruct *procArray;
 
@@ -278,17 +259,8 @@ static TransactionId cachedXidIsNotInProgress = InvalidTransactionId;
 /*
  * Bookkeeping for tracking emulated transactions in recovery
  */
-static TransactionId *KnownAssignedXids;
-static bool *KnownAssignedXidsValid;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
-/*
- * If we're in STANDBY_SNAPSHOT_PENDING state, standbySnapshotPendingXmin is
- * the highest xid that might still be running that we don't have in
- * KnownAssignedXids.
- */
-static TransactionId standbySnapshotPendingXmin;
-
 /*
  * State for visibility checks on different types of relations. See struct
  * GlobalVisState for details. As shared, catalog, normal and temporary
@@ -315,7 +287,7 @@ static long xc_by_my_xact = 0;
 static long xc_by_latest_xid = 0;
 static long xc_by_main_xid = 0;
 static long xc_by_child_xid = 0;
-static long xc_by_known_assigned = 0;
+static long xc_during_recovery = 0;
 static long xc_no_overflow = 0;
 static long xc_slow_answer = 0;
 
@@ -325,7 +297,7 @@ static long xc_slow_answer = 0;
 #define xc_by_latest_xid_inc()		(xc_by_latest_xid++)
 #define xc_by_main_xid_inc()		(xc_by_main_xid++)
 #define xc_by_child_xid_inc()		(xc_by_child_xid++)
-#define xc_by_known_assigned_inc()	(xc_by_known_assigned++)
+#define xc_during_recovery_inc()	(xc_during_recovery++)
 #define xc_no_overflow_inc()		(xc_no_overflow++)
 #define xc_slow_answer_inc()		(xc_slow_answer++)
 
@@ -338,28 +310,12 @@ static void DisplayXidCache(void);
 #define xc_by_latest_xid_inc()		((void) 0)
 #define xc_by_main_xid_inc()		((void) 0)
 #define xc_by_child_xid_inc()		((void) 0)
-#define xc_by_known_assigned_inc()	((void) 0)
+#define xc_during_recovery_inc()	((void) 0)
 #define xc_no_overflow_inc()		((void) 0)
 #define xc_slow_answer_inc()		((void) 0)
 #endif							/* XIDCACHE_DEBUG */
 
-/* Primitives for KnownAssignedXids array handling for standby */
-static void KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock);
-static void KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-								 bool exclusive_lock);
-static bool KnownAssignedXidsSearch(TransactionId xid, bool remove);
-static bool KnownAssignedXidExists(TransactionId xid);
-static void KnownAssignedXidsRemove(TransactionId xid);
-static void KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-										TransactionId *subxids);
-static void KnownAssignedXidsRemovePreceding(TransactionId removeXid);
-static int	KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax);
-static int	KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
-										   TransactionId *xmin,
-										   TransactionId xmax);
-static TransactionId KnownAssignedXidsGetOldestXmin(void);
-static void KnownAssignedXidsDisplay(int trace_level);
-static void KnownAssignedXidsReset(void);
+
 static inline void ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 static void MaintainLatestCompletedXid(TransactionId latestXid);
@@ -383,31 +339,6 @@ ProcArrayShmemSize(void)
 	size = offsetof(ProcArrayStruct, pgprocnos);
 	size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS));
 
-	/*
-	 * During Hot Standby processing we have a data structure called
-	 * KnownAssignedXids, created in shared memory. Local data structures are
-	 * also created in various backends during GetSnapshotData(),
-	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
-	 * main structures created in those functions must be identically sized,
-	 * since we may at times copy the whole of the data structures around. We
-	 * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
-	 *
-	 * Ideally we'd only create this structure if we were actually doing hot
-	 * standby in the current run, but we don't know that yet at the time
-	 * shared memory is being set up.
-	 */
-#define TOTAL_MAX_CACHED_SUBXIDS \
-	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
-
-	if (EnableHotStandby)
-	{
-		size = add_size(size,
-						mul_size(sizeof(TransactionId),
-								 TOTAL_MAX_CACHED_SUBXIDS));
-		size = add_size(size,
-						mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS));
-	}
-
 	return size;
 }
 
@@ -434,31 +365,12 @@ CreateSharedProcArray(void)
 		 */
 		procArray->numProcs = 0;
 		procArray->maxProcs = PROCARRAY_MAXPROCS;
-		procArray->maxKnownAssignedXids = TOTAL_MAX_CACHED_SUBXIDS;
-		procArray->numKnownAssignedXids = 0;
-		procArray->tailKnownAssignedXids = 0;
-		procArray->headKnownAssignedXids = 0;
-		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
 		TransamVariables->xactCompletionCount = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
-
-	/* Create or attach to the KnownAssignedXids arrays too, if needed */
-	if (EnableHotStandby)
-	{
-		KnownAssignedXids = (TransactionId *)
-			ShmemInitStruct("KnownAssignedXids",
-							mul_size(sizeof(TransactionId),
-									 TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-		KnownAssignedXidsValid = (bool *)
-			ShmemInitStruct("KnownAssignedXidsValid",
-							mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-	}
 }
 
 /*
@@ -1022,355 +934,35 @@ MaintainLatestCompletedXidRecovery(TransactionId latestXid)
 void
 ProcArrayInitRecovery(TransactionId initializedUptoXID)
 {
-	Assert(standbyState == STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsNormal(initializedUptoXID));
 
 	/*
-	 * we set latestObservedXid to the xid SUBTRANS has been initialized up
-	 * to, so we can extend it from that point onwards in
-	 * RecordKnownAssignedTransactionIds, and when we get consistent in
-	 * ProcArrayApplyRecoveryInfo().
+	 * we set latestObservedXid to the xid SUBTRANS and CSN log have been
+	 * initialized up to, so we can extend it from that point onwards whenever
+	 * we observe new XIDs.
 	 */
 	latestObservedXid = initializedUptoXID;
 	TransactionIdRetreat(latestObservedXid);
 }
 
 /*
- * ProcArrayApplyRecoveryInfo -- apply recovery info about xids
- *
- * Takes us through 3 states: Initialized, Pending and Ready.
- * Normal case is to go all the way to Ready straight away, though there
- * are atypical cases where we need to take it in steps.
- *
- * Use the data about running transactions on the primary to create the initial
- * state of KnownAssignedXids. We also use these records to regularly prune
- * KnownAssignedXids because we know it is possible that some transactions
- * with FATAL errors fail to write abort records, which could cause eventual
- * overflow.
- *
- * See comments for LogStandbySnapshot().
+ * Update oldest running XID from a checkpoint record. This allows truncating
+ * SUBTRANS and the CSN log.
  */
 void
-ProcArrayApplyRecoveryInfo(RunningTransactions running)
+ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID)
 {
-	TransactionId *xids;
-	TransactionId advanceNextXid;
-	int			nxids;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-	Assert(TransactionIdIsValid(running->nextXid));
-	Assert(TransactionIdIsValid(running->oldestRunningXid));
-	Assert(TransactionIdIsNormal(running->latestCompletedXid));
-
-	/*
-	 * Remove stale transactions, if any.
-	 */
-	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
-
-	/*
-	 * Adjust TransamVariables->nextXid before StandbyReleaseOldLocks(),
-	 * because we will need it up to date for accessing two-phase transactions
-	 * in StandbyReleaseOldLocks().
-	 */
-	advanceNextXid = running->nextXid;
-	TransactionIdRetreat(advanceNextXid);
-	AdvanceNextFullTransactionIdPastXid(advanceNextXid);
-	Assert(FullTransactionIdIsValid(TransamVariables->nextXid));
-
 	/*
 	 * Remove stale locks, if any.
 	 */
-	StandbyReleaseOldLocks(running->oldestRunningXid);
-
-	/*
-	 * If our snapshot is already valid, nothing else to do...
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		return;
-
-	/*
-	 * If our initial RunningTransactionsData had an overflowed snapshot then
-	 * we knew we were missing some subxids from our snapshot. If we continue
-	 * to see overflowed snapshots then we might never be able to start up, so
-	 * we make another test to see if our snapshot is now valid. We know that
-	 * the missing subxids are equal to or earlier than nextXid. After we
-	 * initialise we continue to apply changes during recovery, so once the
-	 * oldestRunningXid is later than the nextXid from the initial snapshot we
-	 * know that we no longer have missing information and can mark the
-	 * snapshot as valid.
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_PENDING)
-	{
-		/*
-		 * If the snapshot isn't overflowed or if its empty we can reset our
-		 * pending state and use this snapshot instead.
-		 */
-		if (running->subxid_status != SUBXIDS_MISSING || running->xcnt == 0)
-		{
-			/*
-			 * If we have already collected known assigned xids, we need to
-			 * throw them away before we apply the recovery snapshot.
-			 */
-			KnownAssignedXidsReset();
-			standbyState = STANDBY_INITIALIZED;
-		}
-		else
-		{
-			if (TransactionIdPrecedes(standbySnapshotPendingXmin,
-									  running->oldestRunningXid))
-			{
-				standbyState = STANDBY_SNAPSHOT_READY;
-				elog(DEBUG1,
-					 "recovery snapshots are now enabled");
-			}
-			else
-				elog(DEBUG1,
-					 "recovery snapshot waiting for non-overflowed snapshot or "
-					 "until oldest active xid on standby is at least %u (now %u)",
-					 standbySnapshotPendingXmin,
-					 running->oldestRunningXid);
-			return;
-		}
-	}
-
-	Assert(standbyState == STANDBY_INITIALIZED);
-
-	/*
-	 * NB: this can be reached at least twice, so make sure new code can deal
-	 * with that.
-	 */
+	StandbyReleaseOldLocks(oldestRunningXID);
 
-	/*
-	 * Nobody else is running yet, but take locks anyhow
-	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * KnownAssignedXids is sorted so we cannot just add the xids, we have to
-	 * sort them first.
-	 *
-	 * Some of the new xids are top-level xids and some are subtransactions.
-	 * We don't call SubTransSetParent because it doesn't matter yet. If we
-	 * aren't overflowed then all xids will fit in snapshot and so we don't
-	 * need subtrans. If we later overflow, an xid assignment record will add
-	 * xids to subtrans. If RunningTransactionsData is overflowed then we
-	 * don't have enough information to correctly update subtrans anyway.
-	 */
-
-	/*
-	 * Allocate a temporary array to avoid modifying the array passed as
-	 * argument.
-	 */
-	xids = palloc(sizeof(TransactionId) * (running->xcnt + running->subxcnt));
-
-	/*
-	 * Add to the temp array any xids which have not already completed.
-	 */
-	nxids = 0;
-	for (i = 0; i < running->xcnt + running->subxcnt; i++)
-	{
-		TransactionId xid = running->xids[i];
-
-		/*
-		 * The running-xacts snapshot can contain xids that were still visible
-		 * in the procarray when the snapshot was taken, but were already
-		 * WAL-logged as completed. They're not running anymore, so ignore
-		 * them.
-		 */
-		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-			continue;
-
-		xids[nxids++] = xid;
-	}
-
-	if (nxids > 0)
-	{
-		if (procArray->numKnownAssignedXids != 0)
-		{
-			LWLockRelease(ProcArrayLock);
-			elog(ERROR, "KnownAssignedXids is not empty");
-		}
-
-		/*
-		 * Sort the array so that we can add them safely into
-		 * KnownAssignedXids.
-		 *
-		 * We have to sort them logically, because in KnownAssignedXidsAdd we
-		 * call TransactionIdFollowsOrEquals and so on. But we know these XIDs
-		 * come from RUNNING_XACTS, which means there are only normal XIDs
-		 * from the same epoch, so this is safe.
-		 */
-		qsort(xids, nxids, sizeof(TransactionId), xidLogicalComparator);
-
-		/*
-		 * Add the sorted snapshot into KnownAssignedXids.  The running-xacts
-		 * snapshot may include duplicated xids because of prepared
-		 * transactions, so ignore them.
-		 */
-		for (i = 0; i < nxids; i++)
-		{
-			if (i > 0 && TransactionIdEquals(xids[i - 1], xids[i]))
-			{
-				elog(DEBUG1,
-					 "found duplicated transaction %u for KnownAssignedXids insertion",
-					 xids[i]);
-				continue;
-			}
-			KnownAssignedXidsAdd(xids[i], xids[i], true);
-		}
-
-		KnownAssignedXidsDisplay(DEBUG3);
-	}
-
-	pfree(xids);
-
-	/*
-	 * latestObservedXid is at least set to the point where SUBTRANS was
-	 * started up to (cf. ProcArrayInitRecovery()) or to the biggest xid
-	 * RecordKnownAssignedTransactionIds() was called for.  Initialize
-	 * subtrans from thereon, up to nextXid - 1.
-	 *
-	 * We need to duplicate parts of RecordKnownAssignedTransactionId() here,
-	 * because we've just added xids to the known assigned xids machinery that
-	 * haven't gone through RecordKnownAssignedTransactionId().
-	 */
-	Assert(TransactionIdIsNormal(latestObservedXid));
-	TransactionIdAdvance(latestObservedXid);
-	while (TransactionIdPrecedes(latestObservedXid, running->nextXid))
-	{
-		ExtendSUBTRANS(latestObservedXid);
-		TransactionIdAdvance(latestObservedXid);
-	}
-	TransactionIdRetreat(latestObservedXid);	/* = running->nextXid - 1 */
-
-	/* ----------
-	 * Now we've got the running xids we need to set the global values that
-	 * are used to track snapshots as they evolve further.
-	 *
-	 * - latestCompletedXid which will be the xmax for snapshots
-	 * - lastOverflowedXid which shows whether snapshots overflow
-	 * - nextXid
-	 *
-	 * If the snapshot overflowed, then we still initialise with what we know,
-	 * but the recovery snapshot isn't fully valid yet because we know there
-	 * are some subxids missing. We don't know the specific subxids that are
-	 * missing, so conservatively assume the last one is latestObservedXid.
-	 * ----------
-	 */
-	if (running->subxid_status == SUBXIDS_MISSING)
-	{
-		standbyState = STANDBY_SNAPSHOT_PENDING;
-
-		standbySnapshotPendingXmin = latestObservedXid;
-		procArray->lastOverflowedXid = latestObservedXid;
-	}
-	else
-	{
-		standbyState = STANDBY_SNAPSHOT_READY;
-
-		standbySnapshotPendingXmin = InvalidTransactionId;
-
-		/*
-		 * If the 'xids' array didn't include all subtransactions, we have to
-		 * mark any snapshots taken as overflowed.
-		 */
-		if (running->subxid_status == SUBXIDS_IN_SUBTRANS)
-			procArray->lastOverflowedXid = latestObservedXid;
-		else
-		{
-			Assert(running->subxid_status == SUBXIDS_IN_ARRAY);
-			procArray->lastOverflowedXid = InvalidTransactionId;
-		}
-	}
-
-	/*
-	 * If a transaction wrote a commit record in the gap between taking and
-	 * logging the snapshot then latestCompletedXid may already be higher than
-	 * the value from the snapshot, so check before we use the incoming value.
-	 * It also might not yet be set at all.
-	 */
-	MaintainLatestCompletedXidRecovery(running->latestCompletedXid);
-
-	/*
-	 * NB: No need to increment TransamVariables->xactCompletionCount here,
-	 * nobody can see it yet.
-	 */
-
+	procArray->oldest_running_primary_xid = oldestRunningXID;
 	LWLockRelease(ProcArrayLock);
-
-	KnownAssignedXidsDisplay(DEBUG3);
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		elog(DEBUG1, "recovery snapshots are now enabled");
-	else
-		elog(DEBUG1,
-			 "recovery snapshot waiting for non-overflowed snapshot or "
-			 "until oldest active xid on standby is at least %u (now %u)",
-			 standbySnapshotPendingXmin,
-			 running->oldestRunningXid);
 }
 
-/*
- * ProcArrayApplyXidAssignment
- *		Process an XLOG_XACT_ASSIGNMENT WAL record
- */
-void
-ProcArrayApplyXidAssignment(TransactionId topxid,
-							int nsubxids, TransactionId *subxids)
-{
-	TransactionId max_xid;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-
-	max_xid = TransactionIdLatest(topxid, nsubxids, subxids);
-
-	/*
-	 * Mark all the subtransactions as observed.
-	 *
-	 * NOTE: This will fail if the subxid contains too many previously
-	 * unobserved xids to fit into known-assigned-xids. That shouldn't happen
-	 * as the code stands, because xid-assignment records should never contain
-	 * more than PGPROC_MAX_CACHED_SUBXIDS entries.
-	 */
-	RecordKnownAssignedTransactionIds(max_xid);
-
-	/*
-	 * Notice that we update pg_subtrans with the top-level xid, rather than
-	 * the parent xid. This is a difference between normal processing and
-	 * recovery, yet is still correct in all cases. The reason is that
-	 * subtransaction commit is not marked in clog until commit processing, so
-	 * all aborted subtransactions have already been clearly marked in clog.
-	 * As a result we are able to refer directly to the top-level
-	 * transaction's state rather than skipping through all the intermediate
-	 * states in the subtransaction tree. This should be the first time we
-	 * have attempted to SubTransSetParent().
-	 */
-	for (i = 0; i < nsubxids; i++)
-		SubTransSetParent(subxids[i], topxid);
-
-	/* KnownAssignedXids isn't maintained yet, so we're done for now */
-	if (standbyState == STANDBY_INITIALIZED)
-		return;
-
-	/*
-	 * Uses same locking as transaction commit
-	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * Remove subxids from known-assigned-xacts.
-	 */
-	KnownAssignedXidsRemoveTree(InvalidTransactionId, nsubxids, subxids);
-
-	/*
-	 * Advance lastOverflowedXid to be at least the last of these subxids.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, max_xid))
-		procArray->lastOverflowedXid = max_xid;
-
-	LWLockRelease(ProcArrayLock);
-}
 
 /*
  * TransactionIdIsInProgress -- is given transaction running in some backend
@@ -1378,23 +970,24 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * Aside from some shortcuts such as checking RecentXmin and our own Xid,
  * there are four possibilities for finding a running transaction:
  *
- * 1. The given Xid is a main transaction Id.  We will find this out cheaply
+ * 1. In Hot Standby mode, there are no transactions with XIDs active in the
+ * standby. Check pg_xact to see if the transaction is known to have committed
+ * or aborted; otherwise it's considered to be running.
+ *
+ * 2. The given Xid is a main transaction Id.  We will find this out cheaply
  * by looking at ProcGlobal->xids.
  *
- * 2. The given Xid is one of the cached subxact Xids in the PGPROC array.
+ * 3. The given Xid is one of the cached subxact Xids in the PGPROC array.
  * We can find this out cheaply too.
  *
- * 3. In Hot Standby mode, we must search the KnownAssignedXids list to see
- * if the Xid is running on the primary.
- *
  * 4. Search the SubTrans tree to find the Xid's topmost parent, and then see
- * if that is running according to ProcGlobal->xids[] or KnownAssignedXids.
+ * if that is running according to ProcGlobal->xids[].
  * This is the slowest way, but sadly it has to be done always if the others
  * failed, unless we see that the cached subxact sets are complete (none have
  * overflowed).
  *
- * ProcArrayLock has to be held while we do 1, 2, 3.  If we save the top Xids
- * while doing 1 and 3, we can release the ProcArrayLock while we do 4.
+ * ProcArrayLock has to be held while we do 2 and 3.  If we save the top Xids
+ * while doing 2 and 3, we can release the ProcArrayLock while we do 4.
  * This buys back some concurrency (and we can't retrieve the main Xids from
  * ProcGlobal->xids[] again anyway; see GetNewTransactionId).
  */
@@ -1435,6 +1028,28 @@ TransactionIdIsInProgress(TransactionId xid)
 		return false;
 	}
 
+	/*
+	 * In hot standby mode, check pg_xact.
+	 *
+	 * With normal non-CSN snapshots, you must be careful to check
+	 * TransactionIdIsInProgress() before checking pg_xact, because a
+	 * transaction is marked as committed before it's removed from PGPROC. But
+	 * during recovery we now use CSN snapshots, so I think checking pg_xact
+	 * directly is OK. See the "NOTE" at the top of heapam_visibility.c.
+	 *
+	 * During recovery, the XID cannot be our own transaction, and the CSN
+	 * check handles subtransactions too, so we can skip the rest of the
+	 * function.
+	 */
+	if (RecoveryInProgress())
+	{
+		xc_during_recovery_inc();
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+			return false;
+		else
+			return true;
+	}
+
 	/*
 	 * Also, we can handle our own transaction (and subtransactions) without
 	 * any access to shared memory.
@@ -1451,12 +1066,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	 */
 	if (xids == NULL)
 	{
-		/*
-		 * In hot standby mode, reserve enough space to hold all xids in the
-		 * known-assigned list. If we later finish recovery, we no longer need
-		 * the bigger array, but we don't bother to shrink it.
-		 */
-		int			maxxids = RecoveryInProgress() ? TOTAL_MAX_CACHED_SUBXIDS : arrayP->maxProcs;
+		int			maxxids = arrayP->maxProcs;
 
 		xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId));
 		if (xids == NULL)
@@ -1551,33 +1161,6 @@ TransactionIdIsInProgress(TransactionId xid)
 			xids[nxids++] = pxid;
 	}
 
-	/*
-	 * Step 3: in hot standby mode, check the known-assigned-xids list.  XIDs
-	 * in the list must be treated as running.
-	 */
-	if (RecoveryInProgress())
-	{
-		/* none of the PGPROC entries should have XIDs in hot standby mode */
-		Assert(nxids == 0);
-
-		if (KnownAssignedXidExists(xid))
-		{
-			LWLockRelease(ProcArrayLock);
-			xc_by_known_assigned_inc();
-			return true;
-		}
-
-		/*
-		 * If the KnownAssignedXids overflowed, we have to check pg_subtrans
-		 * too.  Fetch all xids from KnownAssignedXids that are lower than
-		 * xid, since if xid is a subtransaction its parent will always have a
-		 * lower value.  Note we will collect both main and subXIDs here, but
-		 * there's no help for it.
-		 */
-		if (TransactionIdPrecedesOrEquals(xid, procArray->lastOverflowedXid))
-			nxids = KnownAssignedXidsGet(xids, xid);
-	}
-
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -1851,8 +1434,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 * can't be tied to a specific database.)
 		 *
 		 * Also, while in recovery we cannot compute an accurate per-database
-		 * horizon, as all xids are managed via the KnownAssignedXids
-		 * machinery.
+		 * horizon, as all xids are managed via the CSN log machinery.
 		 */
 		if (proc->databaseId == MyDatabaseId ||
 			MyDatabaseId == InvalidOid ||
@@ -1865,11 +1447,14 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	}
 
 	/*
-	 * If in recovery fetch oldest xid in KnownAssignedXids, will be applied
-	 * after lock is released.
+	 * If in recovery, fetch the oldest running xid from the last checkpoint.
+	 *
+	 * XXX: that can be much older than what we had previously with the
+	 * known-assigned-xids machinery. I think that's OK, given what this
+	 * function is used for during recovery?
 	 */
 	if (in_recovery)
-		kaxmin = KnownAssignedXidsGetOldestXmin();
+		kaxmin = procArray->oldest_running_primary_xid;
 
 	/*
 	 * No other information from shared state is needed, release the lock
@@ -2188,7 +1773,7 @@ GetSnapshotData(Snapshot snapshot)
 	int			mypgxactoff;
 	TransactionId myxid;
 	uint64		curXactCompletionCount;
-
+	XLogRecPtr	csn = InvalidXLogRecPtr;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -2368,27 +1953,8 @@ GetSnapshotData(Snapshot snapshot)
 	else
 	{
 		/*
-		 * We're in hot standby, so get XIDs from KnownAssignedXids.
-		 *
-		 * We store all xids directly into subxip[]. Here's why:
-		 *
-		 * In recovery we don't know which xids are top-level and which are
-		 * subxacts, a design choice that greatly simplifies xid processing.
-		 *
-		 * It seems like we would want to try to put xids into xip[] only, but
-		 * that is fairly small. We would either need to make that bigger or
-		 * to increase the rate at which we WAL-log xid assignment; neither is
-		 * an appealing choice.
-		 *
-		 * We could try to store xids into xip[] first and then into subxip[]
-		 * if there are too many xids. That only works if the snapshot doesn't
-		 * overflow because we do not search subxip[] in that case. A simpler
-		 * way is to just store all xids in the subxip array because this is
-		 * by far the bigger array. We just leave the xip array empty.
-		 *
-		 * Either way we need to change the way XidInMVCCSnapshot() works
-		 * depending upon when the snapshot was taken, or change normal
-		 * snapshot processing so it matches.
+		 * We're in hot standby, so get the current CSN. That's used to
+		 * determine which transactions committed before this snapshot.
 		 *
 		 * Note: It is possible for recovery to end before we finish taking
 		 * the snapshot, and for newly assigned transaction ids to be added to
@@ -2396,14 +1962,17 @@ GetSnapshotData(Snapshot snapshot)
 		 * those newly added transaction ids would be filtered away, so we
 		 * need not be concerned about them.
 		 */
-		subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
-												  xmax);
+		xmin = procArray->oldest_running_primary_xid;
 
-		if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
-			suboverflowed = true;
+		/*
+		 * Take CSN under ProcArrayLock so the snapshot stays synchronized.
+		 * (XXX: not sure that's strictly required.)  The CSN is what
+		 * determines which transactions we consider finished and which are
+		 * still in progress.
+		 */
+		csn = TransamVariables->latestCommitLSN;
 	}
 
-
 	/*
 	 * Fetch into local variable while ProcArrayLock is held - the
 	 * LWLockRelease below is a barrier, ensuring this happens inside the
@@ -2519,6 +2088,8 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->lsn = InvalidXLogRecPtr;
 	snapshot->whenTaken = 0;
 
+	snapshot->snapshotCsn = csn;
+
 	return snapshot;
 }
 
@@ -2674,9 +2245,6 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * The returned data structure is statically allocated; caller should not
  * modify it, and must not assume it is valid past the next call.
  *
- * This is never executed during recovery so there is no need to look at
- * KnownAssignedXids.
- *
  * Dummy PGPROCs from prepared transaction are included, meaning that this
  * may return entries with duplicated TransactionId values coming from
  * transaction finishing to prepare.  Nothing is done about duplicated
@@ -2707,6 +2275,7 @@ GetRunningTransactionData(void)
 	int			subcount;
 	bool		suboverflowed;
 
+	/* This is never executed during recovery */
 	Assert(!RecoveryInProgress());
 
 	/*
@@ -2873,15 +2442,16 @@ GetRunningTransactionData(void)
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
- * This is never executed during recovery so there is no need to look at
- * KnownAssignedXids.
+ * If allDbs is false, skip processes attached to other databases.
+ *
+ * This is never executed during recovery.
  *
  * We don't worry about updating other counters, we want to keep this as
  * simple as possible and leave GetSnapshotData() as the primary code for
  * that bookkeeping.
  */
 TransactionId
-GetOldestActiveTransactionId(void)
+GetOldestActiveTransactionId(bool allDbs)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId *other_xids = ProcGlobal->xids;
@@ -2902,11 +2472,13 @@ GetOldestActiveTransactionId(void)
 	LWLockRelease(XidGenLock);
 
 	/*
-	 * Spin over procArray collecting all xids and subxids.
+	 * Spin over procArray checking each xid.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		PGPROC	   *proc = &allProcs[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -2915,6 +2487,9 @@ GetOldestActiveTransactionId(void)
 		if (!TransactionIdIsNormal(xid))
 			continue;
 
+		if (!allDbs && proc->databaseId != MyDatabaseId)
+			continue;
+
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
@@ -2993,8 +2568,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 *
 	 * In recovery we can't lower the safe value besides what we've computed
 	 * above, so we'll have to wait a bit longer there. We unfortunately can
-	 * *not* use KnownAssignedXidsGetOldestXmin() since the KnownAssignedXids
-	 * machinery can miss values and return an older value than is safe.
+	 * *not* use oldest_running_primary_xid since the XID tracking machinery
+	 * can miss values and return an older value than is safe.
 	 */
 	if (!recovery_in_progress)
 	{
@@ -3412,6 +2987,9 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
  * but that would not be true in the case of FATAL errors lagging in array,
  * but we already know those are bogus anyway, so we skip that test.
  *
+ * XXX: KnownAssignedXids is gone so the above comment needs updating. Is
+ * the code still correct? I think so but need to double-check.
+ *
  * If dbOid is valid we skip backends attached to other databases.
  *
  * Be careful to *not* pfree the result from this function. We reuse
@@ -4083,14 +3661,14 @@ static void
 DisplayXidCache(void)
 {
 	fprintf(stderr,
-			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, knownassigned: %ld, nooflo: %ld, slow: %ld\n",
+			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, during_recovery: %ld, nooflo: %ld, slow: %ld\n",
 			xc_by_recent_xmin,
 			xc_by_known_xact,
 			xc_by_my_xact,
 			xc_by_latest_xid,
 			xc_by_main_xid,
 			xc_by_child_xid,
-			xc_by_known_assigned,
+			xc_during_recovery,
 			xc_no_overflow,
 			xc_slow_answer);
 }
@@ -4337,61 +3915,6 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 }
 
 
-/* ----------------------------------------------
- *		KnownAssignedTransactionIds sub-module
- * ----------------------------------------------
- */
-
-/*
- * In Hot Standby mode, we maintain a list of transactions that are (or were)
- * running on the primary at the current point in WAL.  These XIDs must be
- * treated as running by standby transactions, even though they are not in
- * the standby server's PGPROC array.
- *
- * We record all XIDs that we know have been assigned.  That includes all the
- * XIDs seen in WAL records, plus all unobserved XIDs that we can deduce have
- * been assigned.  We can deduce the existence of unobserved XIDs because we
- * know XIDs are assigned in sequence, with no gaps.  The KnownAssignedXids
- * list expands as new XIDs are observed or inferred, and contracts when
- * transaction completion records arrive.
- *
- * During hot standby we do not fret too much about the distinction between
- * top-level XIDs and subtransaction XIDs. We store both together in the
- * KnownAssignedXids list.  In backends, this is copied into snapshots in
- * GetSnapshotData(), taking advantage of the fact that XidInMVCCSnapshot()
- * doesn't care about the distinction either.  Subtransaction XIDs are
- * effectively treated as top-level XIDs and in the typical case pg_subtrans
- * links are *not* maintained (which does not affect visibility).
- *
- * We have room in KnownAssignedXids and in snapshots to hold maxProcs *
- * (1 + PGPROC_MAX_CACHED_SUBXIDS) XIDs, so every primary transaction must
- * report its subtransaction XIDs in a WAL XLOG_XACT_ASSIGNMENT record at
- * least every PGPROC_MAX_CACHED_SUBXIDS.  When we receive one of these
- * records, we mark the subXIDs as children of the top XID in pg_subtrans,
- * and then remove them from KnownAssignedXids.  This prevents overflow of
- * KnownAssignedXids and snapshots, at the cost that status checks for these
- * subXIDs will take a slower path through TransactionIdIsInProgress().
- * This means that KnownAssignedXids is not necessarily complete for subXIDs,
- * though it should be complete for top-level XIDs; this is the same situation
- * that holds with respect to the PGPROC entries in normal running.
- *
- * When we throw away subXIDs from KnownAssignedXids, we need to keep track of
- * that, similarly to tracking overflow of a PGPROC's subxids array.  We do
- * that by remembering the lastOverflowedXid, ie the last thrown-away subXID.
- * As long as that is within the range of interesting XIDs, we have to assume
- * that subXIDs are missing from snapshots.  (Note that subXID overflow occurs
- * on primary when 65th subXID arrives, whereas on standby it occurs when 64th
- * subXID arrives - that is not an error.)
- *
- * Should a backend on primary somehow disappear before it can write an abort
- * record, then we just leave those XIDs in KnownAssignedXids. They actually
- * aborted but we think they were running; the distinction is irrelevant
- * because either way any changes done by the transaction are not visible to
- * backends in the standby.  We prune KnownAssignedXids when
- * XLOG_RUNNING_XACTS arrives, to forestall possible overflow of the
- * array due to such dead XIDs.
- */
-
 /*
  * RecordKnownAssignedTransactionIds
  *		Record the given XID in KnownAssignedXids, as well as any preceding
@@ -4406,7 +3929,7 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 void
 RecordKnownAssignedTransactionIds(TransactionId xid)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsValid(xid));
 	Assert(TransactionIdIsValid(latestObservedXid));
 
@@ -4424,38 +3947,19 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 
 		/*
 		 * Extend subtrans like we do in GetNewTransactionId() during normal
-		 * operation using individual extend steps. Note that we do not need
-		 * to extend clog since its extensions are WAL logged.
-		 *
-		 * This part has to be done regardless of standbyState since we
-		 * immediately start assigning subtransactions to their toplevel
-		 * transactions.
+		 * operation using individual extend steps, and extend the CSN log
+		 * the same way. Note that we do not need to extend clog since its
+		 * extensions are WAL logged.
 		 */
 		next_expected_xid = latestObservedXid;
 		while (TransactionIdPrecedes(next_expected_xid, xid))
 		{
 			TransactionIdAdvance(next_expected_xid);
 			ExtendSUBTRANS(next_expected_xid);
+			ExtendCSNLog(next_expected_xid);
 		}
 		Assert(next_expected_xid == xid);
 
-		/*
-		 * If the KnownAssignedXids machinery isn't up yet, there's nothing
-		 * more to do since we don't track assigned xids yet.
-		 */
-		if (standbyState <= STANDBY_INITIALIZED)
-		{
-			latestObservedXid = xid;
-			return;
-		}
-
-		/*
-		 * Add (latestObservedXid, xid] onto the KnownAssignedXids array.
-		 */
-		next_expected_xid = latestObservedXid;
-		TransactionIdAdvance(next_expected_xid);
-		KnownAssignedXidsAdd(next_expected_xid, xid, false);
-
 		/*
 		 * Now we can advance latestObservedXid
 		 */
@@ -4467,781 +3971,61 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 }
 
 /*
- * ExpireTreeKnownAssignedTransactionIds
- *		Remove the given XIDs from KnownAssignedXids.
+ * ProcArrayRecoveryEndTransaction
+ *
+ * Called during recovery in analogy with and in place of
+ * ProcArrayEndTransaction(). The transaction becomes visible to any new
+ * snapshots taken after this. 'max_xid' is the highest (sub)XID of the
+ * committed transaction, and 'lsn' is the LSN of the commit record.
  *
- * Called during recovery in analogy with and in place of ProcArrayEndTransaction()
+ * The transaction and all its subtransactions have already been marked as
+ * committed in the CLOG and in the CSNLOG.
  */
 void
-ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
-									  TransactionId *subxids, TransactionId max_xid)
+ProcArrayRecoveryEndTransaction(TransactionId max_xid, XLogRecPtr lsn)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	TransactionId oldest_running_primary_xid;
+
+	Assert(InHotStandby);
+
+	/*
+	 * If this was the oldest XID that was still running, advance it.
+	 * This is important for advancing the global xmin, which avoids
+	 * unnecessary recovery conflicts.
+	 *
+	 * No locking required because this runs in the startup process.
+	 *
+	 * XXX: the caller actually has a list of XIDs that just committed. We
+	 * could save some clog lookups by taking advantage of that list.
+	 */
+	oldest_running_primary_xid = procArray->oldest_running_primary_xid;
+	while (TransactionIdPrecedes(oldest_running_primary_xid, max_xid))
+	{
+		if (!TransactionIdDidCommit(oldest_running_primary_xid) &&
+			!TransactionIdDidAbort(oldest_running_primary_xid))
+		{
+			break;
+		}
+		TransactionIdAdvance(oldest_running_primary_xid);
+	}
+	if (max_xid == oldest_running_primary_xid)
+		TransactionIdAdvance(oldest_running_primary_xid);
 
 	/*
 	 * Uses same locking as transaction commit
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
-	KnownAssignedXidsRemoveTree(xid, nsubxids, subxids);
-
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
 	/* ... and xactCompletionCount */
 	TransamVariables->xactCompletionCount++;
 
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireAllKnownAssignedTransactionIds
- *		Remove all entries in KnownAssignedXids and reset lastOverflowedXid.
- */
-void
-ExpireAllKnownAssignedTransactionIds(void)
-{
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	KnownAssignedXidsRemovePreceding(InvalidTransactionId);
-
-	/*
-	 * Reset lastOverflowedXid.  Currently, lastOverflowedXid has no use after
-	 * the call of this function.  But do this for unification with what
-	 * ExpireOldKnownAssignedTransactionIds() do.
-	 */
-	procArray->lastOverflowedXid = InvalidTransactionId;
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireOldKnownAssignedTransactionIds
- *		Remove KnownAssignedXids entries preceding the given XID and
- *		potentially reset lastOverflowedXid.
- */
-void
-ExpireOldKnownAssignedTransactionIds(TransactionId xid)
-{
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * Reset lastOverflowedXid if we know all transactions that have been
-	 * possibly running are being gone.  Not doing so could cause an incorrect
-	 * lastOverflowedXid value, which makes extra snapshots be marked as
-	 * suboverflowed.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, xid))
-		procArray->lastOverflowedXid = InvalidTransactionId;
-	KnownAssignedXidsRemovePreceding(xid);
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * KnownAssignedTransactionIdsIdleMaintenance
- *		Opportunistically do maintenance work when the startup process
- *		is about to go idle.
- */
-void
-KnownAssignedTransactionIdsIdleMaintenance(void)
-{
-	KnownAssignedXidsCompress(KAX_STARTUP_PROCESS_IDLE, false);
-}
-
-
-/*
- * Private module functions to manipulate KnownAssignedXids
- *
- * There are 5 main uses of the KnownAssignedXids data structure:
- *
- *	* backends taking snapshots - all valid XIDs need to be copied out
- *	* backends seeking to determine presence of a specific XID
- *	* startup process adding new known-assigned XIDs
- *	* startup process removing specific XIDs as transactions end
- *	* startup process pruning array when special WAL records arrive
- *
- * This data structure is known to be a hot spot during Hot Standby, so we
- * go to some lengths to make these operations as efficient and as concurrent
- * as possible.
- *
- * The XIDs are stored in an array in sorted order --- TransactionIdPrecedes
- * order, to be exact --- to allow binary search for specific XIDs.  Note:
- * in general TransactionIdPrecedes would not provide a total order, but
- * we know that the entries present at any instant should not extend across
- * a large enough fraction of XID space to wrap around (the primary would
- * shut down for fear of XID wrap long before that happens).  So it's OK to
- * use TransactionIdPrecedes as a binary-search comparator.
- *
- * It's cheap to maintain the sortedness during insertions, since new known
- * XIDs are always reported in XID order; we just append them at the right.
- *
- * To keep individual deletions cheap, we need to allow gaps in the array.
- * This is implemented by marking array elements as valid or invalid using
- * the parallel boolean array KnownAssignedXidsValid[].  A deletion is done
- * by setting KnownAssignedXidsValid[i] to false, *without* clearing the
- * XID entry itself.  This preserves the property that the XID entries are
- * sorted, so we can do binary searches easily.  Periodically we compress
- * out the unused entries; that's much cheaper than having to compress the
- * array immediately on every deletion.
- *
- * The actually valid items in KnownAssignedXids[] and KnownAssignedXidsValid[]
- * are those with indexes tail <= i < head; items outside this subscript range
- * have unspecified contents.  When head reaches the end of the array, we
- * force compression of unused entries rather than wrapping around, since
- * allowing wraparound would greatly complicate the search logic.  We maintain
- * an explicit tail pointer so that pruning of old XIDs can be done without
- * immediately moving the array contents.  In most cases only a small fraction
- * of the array contains valid entries at any instant.
- *
- * Although only the startup process can ever change the KnownAssignedXids
- * data structure, we still need interlocking so that standby backends will
- * not observe invalid intermediate states.  The convention is that backends
- * must hold shared ProcArrayLock to examine the array.  To remove XIDs from
- * the array, the startup process must hold ProcArrayLock exclusively, for
- * the usual transactional reasons (compare commit/abort of a transaction
- * during normal running).  Compressing unused entries out of the array
- * likewise requires exclusive lock.  To add XIDs to the array, we just insert
- * them into slots to the right of the head pointer and then advance the head
- * pointer.  This doesn't require any lock at all, but on machines with weak
- * memory ordering, we need to be careful that other processors see the array
- * element changes before they see the head pointer change.  We handle this by
- * using memory barriers when reading or writing the head/tail pointers (unless
- * the caller holds ProcArrayLock exclusively).
- *
- * Algorithmic analysis:
- *
- * If we have a maximum of M slots, with N XIDs currently spread across
- * S elements then we have N <= S <= M always.
- *
- *	* Adding a new XID is O(1) and needs no lock (unless compression must
- *		happen)
- *	* Compressing the array is O(S) and requires exclusive lock
- *	* Removing an XID is O(logS) and requires exclusive lock
- *	* Taking a snapshot is O(S) and requires shared lock
- *	* Checking for an XID is O(logS) and requires shared lock
- *
- * In comparison, using a hash table for KnownAssignedXids would mean that
- * taking snapshots would be O(M). If we can maintain S << M then the
- * sorted array technique will deliver significantly faster snapshots.
- * If we try to keep S too small then we will spend too much time compressing,
- * so there is an optimal point for any workload mix. We use a heuristic to
- * decide when to compress the array, though trimming also helps reduce
- * frequency of compressing. The heuristic requires us to track the number of
- * currently valid XIDs in the array (N).  Except in special cases, we'll
- * compress when S >= 2N.  Bounding S at 2N in turn bounds the time for
- * taking a snapshot to be O(N), which it would have to be anyway.
- */
-
-
-/*
- * Compress KnownAssignedXids by shifting valid data down to the start of the
- * array, removing any gaps.
- *
- * A compression step is forced if "reason" is KAX_NO_SPACE, otherwise
- * we do it only if a heuristic indicates it's a good time to do it.
- *
- * Compression requires holding ProcArrayLock in exclusive mode.
- * Caller must pass haveLock = true if it already holds the lock.
- */
-static void
-KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			head,
-				tail,
-				nelements;
-	int			compress_index;
-	int			i;
-
-	/* Counters for compression heuristics */
-	static unsigned int transactionEndsCounter;
-	static TimestampTz lastCompressTs;
-
-	/* Tuning constants */
-#define KAX_COMPRESS_FREQUENCY 128	/* in transactions */
-#define KAX_COMPRESS_IDLE_INTERVAL 1000 /* in ms */
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-	nelements = head - tail;
-
-	/*
-	 * If we can choose whether to compress, use a heuristic to avoid
-	 * compressing too often or not often enough.  "Compress" here simply
-	 * means moving the values to the beginning of the array, so it is not as
-	 * complex or costly as typical data compression algorithms.
-	 */
-	if (nelements == pArray->numKnownAssignedXids)
-	{
-		/*
-		 * When there are no gaps between head and tail, don't bother to
-		 * compress, except in the KAX_NO_SPACE case where we must compress to
-		 * create some space after the head.
-		 */
-		if (reason != KAX_NO_SPACE)
-			return;
-	}
-	else if (reason == KAX_TRANSACTION_END)
-	{
-		/*
-		 * Consider compressing only once every so many commits.  Frequency
-		 * determined by benchmarks.
-		 */
-		if ((transactionEndsCounter++) % KAX_COMPRESS_FREQUENCY != 0)
-			return;
-
-		/*
-		 * Furthermore, compress only if the used part of the array is less
-		 * than 50% full (see comments above).
-		 */
-		if (nelements < 2 * pArray->numKnownAssignedXids)
-			return;
-	}
-	else if (reason == KAX_STARTUP_PROCESS_IDLE)
-	{
-		/*
-		 * We're about to go idle for lack of new WAL, so we might as well
-		 * compress.  But not too often, to avoid ProcArray lock contention
-		 * with readers.
-		 */
-		if (lastCompressTs != 0)
-		{
-			TimestampTz compress_after;
-
-			compress_after = TimestampTzPlusMilliseconds(lastCompressTs,
-														 KAX_COMPRESS_IDLE_INTERVAL);
-			if (GetCurrentTimestamp() < compress_after)
-				return;
-		}
-	}
-
-	/* Need to compress, so get the lock if we don't have it. */
-	if (!haveLock)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * We compress the array by reading the valid values from tail to head,
-	 * re-aligning data to 0th element.
-	 */
-	compress_index = 0;
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			KnownAssignedXids[compress_index] = KnownAssignedXids[i];
-			KnownAssignedXidsValid[compress_index] = true;
-			compress_index++;
-		}
-	}
-	Assert(compress_index == pArray->numKnownAssignedXids);
-
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = compress_index;
-
-	if (!haveLock)
-		LWLockRelease(ProcArrayLock);
-
-	/* Update timestamp for maintenance.  No need to hold lock for this. */
-	lastCompressTs = GetCurrentTimestamp();
-}
-
-/*
- * Add xids into KnownAssignedXids at the head of the array.
- *
- * xids from from_xid to to_xid, inclusive, are added to the array.
- *
- * If exclusive_lock is true then caller already holds ProcArrayLock in
- * exclusive mode, so we need no extra locking here.  Else caller holds no
- * lock, so we need to be sure we maintain sufficient interlocks against
- * concurrent readers.  (Only the startup process ever calls this, so no need
- * to worry about concurrent writers.)
- */
-static void
-KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-					 bool exclusive_lock)
-{
-	ProcArrayStruct *pArray = procArray;
-	TransactionId next_xid;
-	int			head,
-				tail;
-	int			nxids;
-	int			i;
-
-	Assert(TransactionIdPrecedesOrEquals(from_xid, to_xid));
-
-	/*
-	 * Calculate how many array slots we'll need.  Normally this is cheap; in
-	 * the unusual case where the XIDs cross the wrap point, we do it the hard
-	 * way.
-	 */
-	if (to_xid >= from_xid)
-		nxids = to_xid - from_xid + 1;
-	else
-	{
-		nxids = 1;
-		next_xid = from_xid;
-		while (TransactionIdPrecedes(next_xid, to_xid))
-		{
-			nxids++;
-			TransactionIdAdvance(next_xid);
-		}
-	}
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-
-	Assert(head >= 0 && head <= pArray->maxKnownAssignedXids);
-	Assert(tail >= 0 && tail < pArray->maxKnownAssignedXids);
-
-	/*
-	 * Verify that insertions occur in TransactionId sequence.  Note that even
-	 * if the last existing element is marked invalid, it must still have a
-	 * correctly sequenced XID value.
-	 */
-	if (head > tail &&
-		TransactionIdFollowsOrEquals(KnownAssignedXids[head - 1], from_xid))
-	{
-		KnownAssignedXidsDisplay(LOG);
-		elog(ERROR, "out-of-order XID insertion in KnownAssignedXids");
-	}
-
-	/*
-	 * If our xids won't fit in the remaining space, compress out free space
-	 */
-	if (head + nxids > pArray->maxKnownAssignedXids)
-	{
-		KnownAssignedXidsCompress(KAX_NO_SPACE, exclusive_lock);
-
-		head = pArray->headKnownAssignedXids;
-		/* note: we no longer care about the tail pointer */
-
-		/*
-		 * If it still won't fit then we're out of memory
-		 */
-		if (head + nxids > pArray->maxKnownAssignedXids)
-			elog(ERROR, "too many KnownAssignedXids");
-	}
-
-	/* Now we can insert the xids into the space starting at head */
-	next_xid = from_xid;
-	for (i = 0; i < nxids; i++)
-	{
-		KnownAssignedXids[head] = next_xid;
-		KnownAssignedXidsValid[head] = true;
-		TransactionIdAdvance(next_xid);
-		head++;
-	}
-
-	/* Adjust count of number of valid entries */
-	pArray->numKnownAssignedXids += nxids;
-
-	/*
-	 * Now update the head pointer.  We use a write barrier to ensure that
-	 * other processors see the above array updates before they see the head
-	 * pointer change.  The barrier isn't required if we're holding
-	 * ProcArrayLock exclusively.
-	 */
-	if (!exclusive_lock)
-		pg_write_barrier();
-
-	pArray->headKnownAssignedXids = head;
-}
-
-/*
- * KnownAssignedXidsSearch
- *
- * Searches KnownAssignedXids for a specific xid and optionally removes it.
- * Returns true if it was found, false if not.
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- * Exclusive lock must be held for remove = true.
- */
-static bool
-KnownAssignedXidsSearch(TransactionId xid, bool remove)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			first,
-				last;
-	int			head;
-	int			tail;
-	int			result_index = -1;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	/*
-	 * Only the startup process removes entries, so we don't need the read
-	 * barrier in that case.
-	 */
-	if (!remove)
-		pg_read_barrier();		/* pairs with KnownAssignedXidsAdd */
-
-	/*
-	 * Standard binary search.  Note we can ignore the KnownAssignedXidsValid
-	 * array here, since even invalid entries will contain sorted XIDs.
-	 */
-	first = tail;
-	last = head - 1;
-	while (first <= last)
-	{
-		int			mid_index;
-		TransactionId mid_xid;
-
-		mid_index = (first + last) / 2;
-		mid_xid = KnownAssignedXids[mid_index];
-
-		if (xid == mid_xid)
-		{
-			result_index = mid_index;
-			break;
-		}
-		else if (TransactionIdPrecedes(xid, mid_xid))
-			last = mid_index - 1;
-		else
-			first = mid_index + 1;
-	}
-
-	if (result_index < 0)
-		return false;			/* not in array */
-
-	if (!KnownAssignedXidsValid[result_index])
-		return false;			/* in array, but invalid */
-
-	if (remove)
-	{
-		KnownAssignedXidsValid[result_index] = false;
-
-		pArray->numKnownAssignedXids--;
-		Assert(pArray->numKnownAssignedXids >= 0);
-
-		/*
-		 * If we're removing the tail element then advance tail pointer over
-		 * any invalid elements.  This will speed future searches.
-		 */
-		if (result_index == tail)
-		{
-			tail++;
-			while (tail < head && !KnownAssignedXidsValid[tail])
-				tail++;
-			if (tail >= head)
-			{
-				/* Array is empty, so we can reset both pointers */
-				pArray->headKnownAssignedXids = 0;
-				pArray->tailKnownAssignedXids = 0;
-			}
-			else
-			{
-				pArray->tailKnownAssignedXids = tail;
-			}
-		}
-	}
-
-	return true;
-}
-
-/*
- * Is the specified XID present in KnownAssignedXids[]?
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- */
-static bool
-KnownAssignedXidExists(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	return KnownAssignedXidsSearch(xid, false);
-}
-
-/*
- * Remove the specified XID from KnownAssignedXids[].
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemove(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	elog(DEBUG4, "remove KnownAssignedXid %u", xid);
-
-	/*
-	 * Note: we cannot consider it an error to remove an XID that's not
-	 * present.  We intentionally remove subxact IDs while processing
-	 * XLOG_XACT_ASSIGNMENT, to avoid array overflow.  Then those XIDs will be
-	 * removed again when the top-level xact commits or aborts.
-	 *
-	 * It might be possible to track such XIDs to distinguish this case from
-	 * actual errors, but it would be complicated and probably not worth it.
-	 * So, just ignore the search result.
-	 */
-	(void) KnownAssignedXidsSearch(xid, true);
-}
-
-/*
- * KnownAssignedXidsRemoveTree
- *		Remove xid (if it's not InvalidTransactionId) and all the subxids.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-							TransactionId *subxids)
-{
-	int			i;
-
-	if (TransactionIdIsValid(xid))
-		KnownAssignedXidsRemove(xid);
-
-	for (i = 0; i < nsubxids; i++)
-		KnownAssignedXidsRemove(subxids[i]);
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_TRANSACTION_END, true);
-}
-
-/*
- * Prune KnownAssignedXids up to, but *not* including xid. If xid is invalid
- * then clear the whole table.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemovePreceding(TransactionId removeXid)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			count = 0;
-	int			head,
-				tail,
-				i;
-
-	if (!TransactionIdIsValid(removeXid))
-	{
-		elog(DEBUG4, "removing all KnownAssignedXids");
-		pArray->numKnownAssignedXids = 0;
-		pArray->headKnownAssignedXids = pArray->tailKnownAssignedXids = 0;
-		return;
-	}
-
-	elog(DEBUG4, "prune KnownAssignedXids to %u", removeXid);
-
-	/*
-	 * Mark entries invalid starting at the tail.  Since array is sorted, we
-	 * can stop as soon as we reach an entry >= removeXid.
-	 */
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			if (TransactionIdFollowsOrEquals(knownXid, removeXid))
-				break;
-
-			if (!StandbyTransactionIdIsPrepared(knownXid))
-			{
-				KnownAssignedXidsValid[i] = false;
-				count++;
-			}
-		}
-	}
-
-	pArray->numKnownAssignedXids -= count;
-	Assert(pArray->numKnownAssignedXids >= 0);
-
-	/*
-	 * Advance the tail pointer if we've marked the tail item invalid.
-	 */
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-			break;
-	}
-	if (i >= head)
-	{
-		/* Array is empty, so we can reset both pointers */
-		pArray->headKnownAssignedXids = 0;
-		pArray->tailKnownAssignedXids = 0;
-	}
-	else
-	{
-		pArray->tailKnownAssignedXids = i;
-	}
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_PRUNE, true);
-}
-
-/*
- * KnownAssignedXidsGet - Get an array of xids by scanning KnownAssignedXids.
- * We filter out anything >= xmax.
- *
- * Returns the number of XIDs stored into xarray[].  Caller is responsible
- * that array is large enough.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax)
-{
-	TransactionId xtmp = InvalidTransactionId;
-
-	return KnownAssignedXidsGetAndSetXmin(xarray, &xtmp, xmax);
-}
-
-/*
- * KnownAssignedXidsGetAndSetXmin - as KnownAssignedXidsGet, plus
- * we reduce *xmin to the lowest xid value seen if not already lower.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
-							   TransactionId xmax)
-{
-	int			count = 0;
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop. We can stop
-	 * once we reach the initially seen head, since we are certain that an xid
-	 * cannot enter and then leave the array while we hold ProcArrayLock.  We
-	 * might miss newly-added xids, but they should be >= xmax so irrelevant
-	 * anyway.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			/*
-			 * Update xmin if required.  Only the first XID need be checked,
-			 * since the array is sorted.
-			 */
-			if (count == 0 &&
-				TransactionIdPrecedes(knownXid, *xmin))
-				*xmin = knownXid;
-
-			/*
-			 * Filter out anything >= xmax, again relying on sorted property
-			 * of array.
-			 */
-			if (TransactionIdIsValid(xmax) &&
-				TransactionIdFollowsOrEquals(knownXid, xmax))
-				break;
-
-			/* Add knownXid into output array */
-			xarray[count++] = knownXid;
-		}
-	}
-
-	return count;
-}
-
-/*
- * Get oldest XID in the KnownAssignedXids array, or InvalidTransactionId
- * if nothing there.
- */
-static TransactionId
-KnownAssignedXidsGetOldestXmin(void)
-{
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-			return KnownAssignedXids[i];
-	}
-
-	return InvalidTransactionId;
-}
-
-/*
- * Display KnownAssignedXids to provide debug trail
- *
- * Currently this is only called within startup process, so we need no
- * special locking.
- *
- * Note this is pretty expensive, and much of the expense will be incurred
- * even if the elog message will get discarded.  It's not currently called
- * in any performance-critical places, however, so no need to be tenser.
- */
-static void
-KnownAssignedXidsDisplay(int trace_level)
-{
-	ProcArrayStruct *pArray = procArray;
-	StringInfoData buf;
-	int			head,
-				tail,
-				i;
-	int			nxids = 0;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	initStringInfo(&buf);
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			nxids++;
-			appendStringInfo(&buf, "[%d]=%u ", i, KnownAssignedXids[i]);
-		}
-	}
-
-	elog(trace_level, "%d KnownAssignedXids (num=%d tail=%d head=%d) %s",
-		 nxids,
-		 pArray->numKnownAssignedXids,
-		 pArray->tailKnownAssignedXids,
-		 pArray->headKnownAssignedXids,
-		 buf.data);
-
-	pfree(buf.data);
-}
-
-/*
- * KnownAssignedXidsReset
- *		Resets KnownAssignedXids to be empty
- */
-static void
-KnownAssignedXidsReset(void)
-{
-	ProcArrayStruct *pArray = procArray;
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(lsn > TransamVariables->latestCommitLSN);
+	TransamVariables->latestCommitLSN = lsn;
 
-	pArray->numKnownAssignedXids = 0;
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = 0;
+	procArray->oldest_running_primary_xid = oldest_running_primary_xid;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 25267f0f85..e02c9ab842 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -139,8 +139,6 @@ InitRecoveryTransactionEnvironment(void)
 	vxid.procNumber = MyProcNumber;
 	vxid.localTransactionId = GetNextLocalTransactionId();
 	VirtualXactLockTableInsert(vxid);
-
-	standbyState = STANDBY_INITIALIZED;
 }
 
 /*
@@ -168,9 +166,6 @@ ShutdownRecoveryTransactionEnvironment(void)
 	if (RecoveryLockHash == NULL)
 		return;
 
-	/* Mark all tracked in-progress transactions as finished. */
-	ExpireAllKnownAssignedTransactionIds();
-
 	/* Release all locks the tracked transactions were holding */
 	StandbyReleaseAllLocks();
 
@@ -1167,7 +1162,7 @@ standby_redo(XLogReaderState *record)
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
 	/* Do nothing if we're not in hot standby mode */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 		return;
 
 	if (info == XLOG_STANDBY_LOCK)
@@ -1182,18 +1177,21 @@ standby_redo(XLogReaderState *record)
 	}
 	else if (info == XLOG_RUNNING_XACTS)
 	{
+		/*
+		 * XXX: running-xacts records were previously used to update
+		 * known-assigned xids, but now we only need them for the logical
+		 * replication snapbuilder stuff, and for the
+		 * pgstat_report_stat(true) call below.
+		 */
 		xl_running_xacts *xlrec = (xl_running_xacts *) XLogRecGetData(record);
-		RunningTransactionsData running;
 
-		running.xcnt = xlrec->xcnt;
-		running.subxcnt = xlrec->subxcnt;
-		running.subxid_status = xlrec->subxid_overflow ? SUBXIDS_MISSING : SUBXIDS_IN_ARRAY;
-		running.nextXid = xlrec->nextXid;
-		running.latestCompletedXid = xlrec->latestCompletedXid;
-		running.oldestRunningXid = xlrec->oldestRunningXid;
-		running.xids = xlrec->xids;
-
-		ProcArrayApplyRecoveryInfo(&running);
+		/*
+		 * Remember the oldest XID that was running at the time. Normally, all
+		 * transaction aborts and commits are WAL-logged, so our
+		 * oldestRunningXid value should be up-to-date, but if not, this
+		 * allows us to resynchronize.
+		 */
+		ProcArrayUpdateOldestRunningXid(xlrec->oldestRunningXid);
 
 		/*
 		 * The startup process currently has no convenient way to schedule
@@ -1224,50 +1222,46 @@ standby_redo(XLogReaderState *record)
  *
  * This is used for Hot Standby as follows:
  *
- * We can move directly to STANDBY_SNAPSHOT_READY at startup if we
- * start from a shutdown checkpoint because we know nothing was running
- * at that time and our recovery snapshot is known empty. In the more
- * typical case of an online checkpoint we need to jump through a few
- * hoops to get a correct recovery snapshot and this requires a two or
- * sometimes a three stage process.
+ * We can enter hot standby mode and start accepting read-only queries
+ * immediately at startup if we start from a shutdown checkpoint, because we
+ * know nothing was running at that time and our recovery snapshot is known
+ * empty. In the more typical case of an online checkpoint, the checkpoint
+ * record doesn't contain all the necessary information about running
+ * transaction state, and we need to jump through a few hoops to get a correct
+ * recovery snapshot.
  *
- * The initial snapshot must contain all running xids and all current
- * AccessExclusiveLocks at a point in time on the standby. Assembling
- * that information while the server is running requires many and
- * various LWLocks, so we choose to derive that information piece by
- * piece and then re-assemble that info on the standby. When that
- * information is fully assembled we move to STANDBY_SNAPSHOT_READY.
+ * The initial snapshot must contain all current AccessExclusiveLocks at a
+ * point in time on the standby. Assembling that information while the server
+ * is running requires many and various LWLocks, so we choose to derive that
+ * information piece by piece and then re-assemble that info on the standby.
  *
- * Since locking on the primary when we derive the information is not
- * strict, we note that there is a time window between the derivation and
- * writing to WAL of the derived information. That allows race conditions
- * that we must resolve, since xids and locks may enter or leave the
- * snapshot during that window. This creates the issue that an xid or
- * lock may start *after* the snapshot has been derived yet *before* the
- * snapshot is logged in the running xacts WAL record. We resolve this by
- * starting to accumulate changes at a point just prior to when we derive
- * the snapshot on the primary, then ignore duplicates when we later apply
- * the snapshot from the running xacts record. This is implemented during
- * CreateCheckPoint() where we use the logical checkpoint location as
- * our starting point and then write the running xacts record immediately
- * before writing the main checkpoint WAL record. Since we always start
- * up from a checkpoint and are immediately at our starting point, we
- * unconditionally move to STANDBY_INITIALIZED. After this point we
- * must do 4 things:
+ * Since locking on the primary when we derive the information is not strict,
+ * there is a time window between the derivation and writing to WAL of the
+ * derived information. That allows race conditions that we must resolve,
+ * since xids and locks may enter or leave the snapshot during that
+ * window. This creates the issue that an xid or lock may start *after* the
+ * snapshot has been derived yet *before* the snapshot is logged in the
+ * running xacts WAL record. We resolve this by starting to accumulate changes
+ * at a point just prior to when we collect the lock information on the
+ * primary, then ignore duplicates when we later apply the snapshot from the
+ * running xacts record. This is implemented during CreateCheckPoint() where
+ * we use the logical checkpoint location as our starting point and then write
+ * the running xacts record immediately before writing the main checkpoint WAL
+ * record. Since we always start up from a checkpoint's redo pointer, we will
+ * always see a running-xacts record before reaching the checkpoint
+ * record, and can immediately enter hot standby mode. After this point we
+ * must do 3 things:
  *	* move shared nextXid forwards as we see new xids
  *	* extend the clog and subtrans with each new xid
- *	* keep track of uncommitted known assigned xids
  *	* keep track of uncommitted AccessExclusiveLocks
  *
- * When we see a commit/abort we must remove known assigned xids and locks
- * from the completing transaction. Attempted removals that cannot locate
- * an entry are expected and must not cause an error when we are in state
- * STANDBY_INITIALIZED. This is implemented in StandbyReleaseLocks() and
- * KnownAssignedXidsRemove().
- *
- * Later, when we apply the running xact data we must be careful to ignore
- * transactions already committed, since those commits raced ahead when
- * making WAL entries.
+ * When we see a commit/abort we must advance oldest_running_primary_xid and
+ * remove locks from the completing transaction. Attempted removals that
+ * cannot locate an entry are expected and must not cause an error until we
+ * have seen the running-xacts record. (We don't throw an error even after
+ * that, because whatever the reason was, after the transaction has completed
+ * the issue has already been resolved anyway.) This is implemented in
+ * StandbyReleaseLocks().
  *
  * For logical decoding only the running xacts information is needed;
  * there's no need to look at the locking information, but it's logged anyway,
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index e765754d80..ae29055935 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -130,6 +130,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_XACT_BUFFER] = "XactBuffer",
 	[LWTRANCHE_COMMITTS_BUFFER] = "CommitTsBuffer",
 	[LWTRANCHE_SUBTRANS_BUFFER] = "SubtransBuffer",
+	[LWTRANCHE_CSN_LOG_BUFFER] = "CsnLogBuffer",
 	[LWTRANCHE_MULTIXACTOFFSET_BUFFER] = "MultiXactOffsetBuffer",
 	[LWTRANCHE_MULTIXACTMEMBER_BUFFER] = "MultiXactMemberBuffer",
 	[LWTRANCHE_NOTIFY_BUFFER] = "NotifyBuffer",
@@ -166,6 +167,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
 	[LWTRANCHE_XACT_SLRU] = "XactSLRU",
 	[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+	[LWTRANCHE_CSN_LOG_SLRU] = "CsnLogSLRU",
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d10ca723dc..3aea62a49e 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -359,6 +359,7 @@ WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 XactBuffer	"Waiting for I/O on a transaction status SLRU buffer."
 CommitTsBuffer	"Waiting for I/O on a commit timestamp SLRU buffer."
 SubtransBuffer	"Waiting for I/O on a sub-transaction SLRU buffer."
+CsnLogBuffer	"Waiting for I/O on a CSN log SLRU buffer."
 MultiXactOffsetBuffer	"Waiting for I/O on a multixact offset SLRU buffer."
 MultiXactMemberBuffer	"Waiting for I/O on a multixact member SLRU buffer."
 NotifyBuffer	"Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index d772544377..ffbfae84b8 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -77,6 +77,8 @@ provider postgresql {
 	probe clog__checkpoint__done(bool);
 	probe subtrans__checkpoint__start(bool);
 	probe subtrans__checkpoint__done(bool);
+	probe csnlog__checkpoint__start(bool);
+	probe csnlog__checkpoint__done(bool);
 	probe multixact__checkpoint__start(bool);
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f2..da82def846 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -48,6 +48,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -201,6 +202,7 @@ typedef struct SerializedSnapshotData
 	CommandId	curcid;
 	TimestampTz whenTaken;
 	XLogRecPtr	lsn;
+	XLogRecPtr	snapshotCsn;
 } SerializedSnapshotData;
 
 /*
@@ -1729,6 +1731,7 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
 	serialized_snapshot.curcid = snapshot->curcid;
 	serialized_snapshot.whenTaken = snapshot->whenTaken;
 	serialized_snapshot.lsn = snapshot->lsn;
+	serialized_snapshot.snapshotCsn = snapshot->snapshotCsn;
 
 	/*
 	 * Ignore the SubXID array if it has overflowed, unless the snapshot was
@@ -1803,6 +1806,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->curcid = serialized_snapshot.curcid;
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
+	snapshot->snapshotCsn = serialized_snapshot.snapshotCsn;
 	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
@@ -1913,36 +1917,11 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 	}
 	else
 	{
-		/*
-		 * In recovery we store all xids in the subxip array because it is by
-		 * far the bigger array, and we mostly don't know which xids are
-		 * top-level and which are subxacts. The xip array is empty.
-		 *
-		 * We start by searching subtrans, if we overflowed.
-		 */
-		if (snapshot->suboverflowed)
-		{
-			/*
-			 * Snapshot overflowed, so convert xid to top-level.  This is safe
-			 * because we eliminated too-old XIDs above.
-			 */
-			xid = SubTransGetTopmostTransaction(xid);
+		XLogRecPtr	csn = CSNLogGetCSNByXid(xid);
 
-			/*
-			 * If xid was indeed a subxact, we might now have an xid < xmin,
-			 * so recheck to avoid an array scan.  No point in rechecking
-			 * xmax.
-			 */
-			if (TransactionIdPrecedes(xid, snapshot->xmin))
-				return false;
-		}
-
-		/*
-		 * We now have either a top-level xid higher than xmin or an
-		 * indeterminate xid. We don't know whether it's top level or subxact
-		 * but it doesn't matter. If it's present, the xid is visible.
-		 */
-		if (pg_lfind32(xid, snapshot->subxip, snapshot->subxcnt))
+		if (csn != InvalidXLogRecPtr && csn <= snapshot->snapshotCsn)
+			return false;
+		else
 			return true;
 	}
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f00718a015..79ad7d6996 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -249,7 +249,8 @@ static const char *const subdirs[] = {
 	"pg_xact",
 	"pg_logical",
 	"pg_logical/snapshots",
-	"pg_logical/mappings"
+	"pg_logical/mappings",
+	"pg_csn"
 };
 
 
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
new file mode 100644
index 0000000000..f8cdf573ae
--- /dev/null
+++ b/src/include/access/csn_log.h
@@ -0,0 +1,30 @@
+/*
+ * csn_log.h
+ *
+ * Mapping from XID to commit record's LSN (Commit Sequence Number).
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_log.h
+ */
+#ifndef CSNLOG_H
+#define CSNLOG_H
+
+#include "access/xlog.h"
+#include "utils/snapshot.h"
+
+extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
+						 TransactionId *subxids, XLogRecPtr csn);
+extern XLogRecPtr CSNLogGetCSNByXid(TransactionId xid);
+
+extern Size CSNLogShmemSize(void);
+extern void CSNLogShmemInit(void);
+extern void BootStrapCSNLog(void);
+extern void StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn);
+extern void ShutdownCSNLog(void);
+extern void CheckPointCSNLog(void);
+extern void ExtendCSNLog(TransactionId newestXact);
+extern void TruncateCSNLog(TransactionId oldestXact);
+
+#endif							/* CSNLOG_H */
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 28a2d287fd..a7054fe11c 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -238,6 +238,9 @@ typedef struct TransamVariablesData
 	FullTransactionId latestCompletedXid;	/* newest full XID that has
 											 * committed or aborted */
 
+	/* During recovery, LSN of latest replayed commit record */
+	XLogRecPtr	latestCommitLSN;
+
 	/*
 	 * Number of top-level transactions with xids (i.e. which may have
 	 * modified the database) that completed in some form since the start of
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index b85b65c604..58ed0fc038 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -47,8 +47,7 @@ extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
 
-extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
-												 int *nxids_p);
+extern TransactionId PrescanPreparedTransactions(void);
 extern void StandbyRecoverPreparedTransactions(void);
 extern void RecoverPreparedTransactions(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6d4439f052..df0af5ea20 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -171,7 +171,7 @@ typedef struct SavedTransactionCharacteristics
 #define XLOG_XACT_ABORT				0x20
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
-#define XLOG_XACT_ASSIGNMENT		0x50
+/* 0x50 is unused, was XLOG_XACT_ASSIGNMENT */
 #define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
@@ -215,15 +215,6 @@ typedef struct SavedTransactionCharacteristics
 #define XactCompletionForceSyncCommit(xinfo) \
 	((xinfo & XACT_COMPLETION_FORCE_SYNC_COMMIT) != 0)
 
-typedef struct xl_xact_assignment
-{
-	TransactionId xtop;			/* assigned XID's top-level XID */
-	int			nsubxacts;		/* number of subtransaction XIDs */
-	TransactionId xsub[FLEXIBLE_ARRAY_MEMBER];	/* assigned subxids */
-} xl_xact_assignment;
-
-#define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
-
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -442,7 +433,6 @@ extern FullTransactionId GetTopFullTransactionId(void);
 extern FullTransactionId GetTopFullTransactionIdIfAny(void);
 extern FullTransactionId GetCurrentFullTransactionId(void);
 extern FullTransactionId GetCurrentFullTransactionIdIfAny(void);
-extern void MarkCurrentTransactionIdLoggedIfAny(void);
 extern bool SubTransactionIsActive(SubTransactionId subxid);
 extern CommandId GetCurrentCommandId(bool used);
 extern void SetParallelStartTimestamps(TimestampTz xact_ts, TimestampTz stmt_ts);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 20950ce033..19cb5f33bd 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -27,37 +27,10 @@ extern PGDLLIMPORT bool ignore_invalid_pages;
 extern PGDLLIMPORT bool InRecovery;
 
 /*
- * Like InRecovery, standbyState is only valid in the startup process.
- * In all other processes it will have the value STANDBY_DISABLED (so
- * InHotStandby will read as false).
- *
- * In DISABLED state, we're performing crash recovery or hot standby was
- * disabled in postgresql.conf.
- *
- * In INITIALIZED state, we've run InitRecoveryTransactionEnvironment, but
- * we haven't yet processed a RUNNING_XACTS or shutdown-checkpoint WAL record
- * to initialize our primary-transaction tracking system.
- *
- * When the transaction tracking is initialized, we enter the SNAPSHOT_PENDING
- * state. The tracked information might still be incomplete, so we can't allow
- * connections yet, but redo functions must update the in-memory state when
- * appropriate.
- *
- * In SNAPSHOT_READY mode, we have full knowledge of transactions that are
- * (or were) running on the primary at the current WAL location. Snapshots
- * can be taken, and read-only queries can be run.
+ * Like InRecovery, InHotStandby is only valid in the startup process.
+ * In all other processes it will be false.
  */
-typedef enum
-{
-	STANDBY_DISABLED,
-	STANDBY_INITIALIZED,
-	STANDBY_SNAPSHOT_PENDING,
-	STANDBY_SNAPSHOT_READY,
-} HotStandbyState;
-
-extern PGDLLIMPORT HotStandbyState standbyState;
-
-#define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)
+extern PGDLLIMPORT bool InHotStandby;
 
 
 extern bool XLogHaveInvalidPages(void);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e0..c2156aca12 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -179,6 +179,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_XACT_BUFFER = NUM_INDIVIDUAL_LWLOCKS,
 	LWTRANCHE_COMMITTS_BUFFER,
 	LWTRANCHE_SUBTRANS_BUFFER,
+	LWTRANCHE_CSN_LOG_BUFFER,
 	LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 	LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 	LWTRANCHE_NOTIFY_BUFFER,
@@ -215,6 +216,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_SUBTRANS_SLRU,
 	LWTRANCHE_XACT_SLRU,
 	LWTRANCHE_PARALLEL_VACUUM_DSA,
+	LWTRANCHE_CSN_LOG_SLRU,
 	LWTRANCHE_FIRST_USER_DEFINED,
 }			BuiltinTrancheIds;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 8ca6050462..7b7cbf47aa 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -28,18 +28,11 @@ extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayClearTransaction(PGPROC *proc);
 
+extern void ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID);
 extern void ProcArrayInitRecovery(TransactionId initializedUptoXID);
-extern void ProcArrayApplyRecoveryInfo(RunningTransactions running);
-extern void ProcArrayApplyXidAssignment(TransactionId topxid,
-										int nsubxids, TransactionId *subxids);
 
 extern void RecordKnownAssignedTransactionIds(TransactionId xid);
-extern void ExpireTreeKnownAssignedTransactionIds(TransactionId xid,
-												  int nsubxids, TransactionId *subxids,
-												  TransactionId max_xid);
-extern void ExpireAllKnownAssignedTransactionIds(void);
-extern void ExpireOldKnownAssignedTransactionIds(TransactionId xid);
-extern void KnownAssignedTransactionIdsIdleMaintenance(void);
+extern void ProcArrayRecoveryEndTransaction(TransactionId max_xid, XLogRecPtr lsn);
 
 extern int	GetMaxSnapshotXidCount(void);
 extern int	GetMaxSnapshotSubxidCount(void);
@@ -56,7 +49,7 @@ extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
 extern TransactionId GetOldestNonRemovableTransactionId(Relation rel);
 extern TransactionId GetOldestTransactionIdConsideredRunning(void);
-extern TransactionId GetOldestActiveTransactionId(void);
+extern TransactionId GetOldestActiveTransactionId(bool allDbs);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *xmin, TransactionId *catalog_xmin);
 
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 8d1e31e888..1fda5b06f6 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -181,6 +181,13 @@ typedef struct SnapshotData
 	int32		subxcnt;		/* # of xact ids in subxip[] */
 	bool		suboverflowed;	/* has the subxip array overflowed? */
 
+	/*
+	 * MVCC snapshots taken during recovery use this CSN instead of the xip
+	 * and subxip arrays. Any transactions that committed at or before this
+	 * LSN are considered as visible.
+	 */
+	XLogRecPtr	snapshotCsn;
+
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 	bool		copied;			/* false if it's a static snapshot */
 
-- 
2.39.2
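
To make the recovery-time visibility rule from the XidInMVCCSnapshot()
hunk above concrete, here is a minimal standalone sketch. It is not code
from the patch: the toy_* names and the fixed-size array standing in for
the CSN log are made up for illustration, and only the comparison itself
mirrors what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;

#define InvalidXLogRecPtr ((XLogRecPtr) 0)
#define TOY_MAX_XID 16

/*
 * Hypothetical stand-in for the CSN log: the commit record's LSN for each
 * XID, or InvalidXLogRecPtr if no commit has been replayed for it.
 */
static XLogRecPtr toy_csn_log[TOY_MAX_XID];

/* Record the commit LSN of an XID, as replaying a commit record would. */
static void
toy_set_csn(TransactionId xid, XLogRecPtr csn)
{
	assert(xid < TOY_MAX_XID);
	toy_csn_log[xid] = csn;
}

/* Look up the commit LSN of an XID; unknown XIDs have no CSN. */
static XLogRecPtr
toy_get_csn(TransactionId xid)
{
	return xid < TOY_MAX_XID ? toy_csn_log[xid] : InvalidXLogRecPtr;
}

/*
 * Returns true if the XID must still be treated as in progress by a
 * snapshot taken at snapshot_csn, false if its commit is visible.
 */
static bool
toy_xid_in_snapshot(TransactionId xid, XLogRecPtr snapshot_csn)
{
	XLogRecPtr	csn = toy_get_csn(xid);

	if (csn != InvalidXLogRecPtr && csn <= snapshot_csn)
		return false;
	return true;
}
```

With commit LSNs 1000 for XID 5 and 3000 for XID 6, a snapshot taken at
CSN 2000 sees XID 5 as committed, while XID 6 and any XID with no CSN
recorded are still treated as in progress.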

Attachment: v2-0004-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch (text/x-patch; charset=UTF-8)

From 737996d139a192ad3de207380dad91cc4b4df8e8 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 20:01:07 +0300
Subject: [PATCH v2 4/6] Make SnapBuildWaitSnapshot work without
 xl_running_xacts.xids array

SnapBuildWaitSnapshot looped through all the XIDs in the
xl_running_xacts, waiting for them to finish. Change it to grab the
list of running XIDs from the proc array instead. This removes the
last usage of the XIDs array in the xl_running_xacts record, allowing
it to be removed in the next commit.

When SnapBuildWaitSnapshot() is called with running->nextXid as the
'cutoff' point, the new code should wait for exactly the same set of
transactions as before. But when called with initial_xmin_horizon as
the 'cutoff', this might wait for more transactions than before: those
between running->nextXid and initial_xmin_horizon. For example,
imagine that we see a running-xacts record with nextXid 100, and
initial_xmin_horizon is 200. Before, we would wait for all XIDs < 100
to complete, and then log the standby snapshot and proceed, but now we
will wait for all XIDs < 200. I believe that's a good thing, because
we won't actually be able to move to the next state in the snapshot
building until all transactions < 200 have completed. The
running-xacts snapshot that we logged after waiting up to XID 100
would not be useful to us either, if there are still XIDs between 100
and 200 running.

SnapBuildWaitSnapshot() used to do useless work when called in a
standby, because in a standby, there are no XID locks and the
XactLockTableWait() calls returned immediately, even if the XIDs were
in fact still running in the primary. But as the comment says, the
waiting isn't necessary for correctness, so that was harmless. In any
case, stop doing the futile work on a standby.
---
 src/backend/replication/logical/snapbuild.c | 50 ++++++++++++++-------
 1 file changed, 34 insertions(+), 16 deletions(-)
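
The nextXid 100 vs. initial_xmin_horizon 200 example from the commit
message can be played through with a small standalone sketch. This is
only an illustration under simplifying assumptions: the toy_* helpers
are hypothetical, the XID comparison is valid for normal XIDs only, and
the exact off-by-one handling at the cutoff is ignored. It counts how
many running XIDs a wait loop would block on when it skips every XID at
or after the cutoff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Wraparound-aware "a is older than b"; valid for normal XIDs only. */
static bool
toy_xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Count the running XIDs that would be waited for, given that every XID
 * at or after 'cutoff' is skipped.
 */
static size_t
toy_count_waited(const TransactionId *running, size_t n, TransactionId cutoff)
{
	size_t		waited = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (!toy_xid_precedes(running[i], cutoff))
			continue;			/* xid >= cutoff: not waited for */
		waited++;
	}
	return waited;
}
```

For running XIDs {90, 95, 150, 180, 210}, a cutoff of 100 waits for two
transactions, while a cutoff of 200 waits for four. The two extra ones
(150 and 180) are the transactions between nextXid and
initial_xmin_horizon that the commit message talks about.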

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ea2f8e25cd..d5315efe2b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -307,7 +307,7 @@ static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, Transaction
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
-static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
+static void SnapBuildWaitSnapshot(TransactionId cutoff);
 
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
@@ -1361,14 +1361,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		NormalTransactionIdPrecedes(running->oldestRunningXid,
 									builder->initial_xmin_horizon))
 	{
+		TransactionId cutoff;
+
 		ereport(DEBUG1,
 				(errmsg_internal("skipping snapshot at %X/%X while building logical decoding snapshot, xmin horizon too low",
 								 LSN_FORMAT_ARGS(lsn)),
 				 errdetail_internal("initial xmin horizon of %u vs the snapshot's %u",
 									builder->initial_xmin_horizon, running->oldestRunningXid)));
 
-
-		SnapBuildWaitSnapshot(running, builder->initial_xmin_horizon);
+		cutoff = builder->initial_xmin_horizon;
+		TransactionIdRetreat(cutoff);
+		SnapBuildWaitSnapshot(cutoff);
 
 		return true;
 	}
@@ -1455,7 +1458,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
 						   running->xcnt, running->nextXid)));
 
-		SnapBuildWaitSnapshot(running, running->nextXid);
+		SnapBuildWaitSnapshot(running->nextXid);
 	}
 
 	/*
@@ -1479,7 +1482,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
 						   running->xcnt, running->nextXid)));
 
-		SnapBuildWaitSnapshot(running, running->nextXid);
+		SnapBuildWaitSnapshot(running->nextXid);
 	}
 
 	/*
@@ -1512,8 +1515,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 }
 
 /* ---
- * Iterate through xids in record, wait for all older than the cutoff to
- * finish.  Then, if possible, log a new xl_running_xacts record.
+ * Wait for all transactions older than or equal to the cutoff to finish.
+ * Then, if possible, log a new xl_running_xacts record.
  *
  * This isn't required for the correctness of decoding, but to:
  * a) allow isolationtester to notice that we're currently waiting for
@@ -1523,13 +1526,31 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
  * ---
  */
 static void
-SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
+SnapBuildWaitSnapshot(TransactionId cutoff)
 {
-	int			off;
+	RunningTransactions running;
+
+	if (RecoveryInProgress())
+	{
+		/*
+		 * During recovery, we have no mechanism for waiting for an XID to
+		 * finish, and we cannot create new running-xacts records either.
+		 */
+		return;
+	}
+
+	running = GetRunningTransactionData();
+
+	/*
+	 * GetRunningTransactionData returns with XidGenLock and ProcArrayLock
+	 * held, but we don't need them.
+	 */
+	LWLockRelease(XidGenLock);
+	LWLockRelease(ProcArrayLock);
 
-	for (off = 0; off < running->xcnt; off++)
+	for (int i = 0; i < running->xcnt; i++)
 	{
-		TransactionId xid = running->xids[off];
+		TransactionId xid = running->xids[i];
 
 		/*
 		 * Upper layers should prevent that we ever need to wait on ourselves.
@@ -1539,7 +1560,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
 		if (TransactionIdIsCurrentTransactionId(xid))
 			elog(ERROR, "waiting for ourselves");
 
-		if (TransactionIdFollows(xid, cutoff))
+		if (TransactionIdFollowsOrEquals(xid, cutoff))
 			continue;
 
 		XactLockTableWait(xid, NULL, NULL, XLTW_None);
@@ -1551,10 +1572,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
 	 * wait for bgwriter or checkpointer to log one.  During recovery we can't
 	 * enforce that, so we'll have to wait.
 	 */
-	if (!RecoveryInProgress())
-	{
-		LogStandbySnapshot();
-	}
+	LogStandbySnapshot();
 }
 
 /* -----------------------------------
-- 
2.39.2

Attachment: v2-0005-Remove-the-now-unused-xids-array-from-xl_running_.patch (text/x-patch; charset=UTF-8)

From 3f05f22871a662d752512fd3d5a1637eb06857a4 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 16:40:57 +0300
Subject: [PATCH v2 5/6] Remove the now-unused xids array from xl_running_xacts

We still generate running-xacts records, because they are still needed
to initialize the snapshot in logical decoding.
---
 src/backend/access/rmgrdesc/standbydesc.c   | 18 ------------
 src/backend/replication/logical/snapbuild.c |  8 +++---
 src/backend/storage/ipc/standby.c           | 32 +++++----------------
 src/include/storage/standby.h               |  2 --
 src/include/storage/standbydefs.h           | 16 +++++++----
 5 files changed, 21 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index 25f870b187..bde9350b92 100644
--- a/src/backend/access/rmgrdesc/standbydesc.c
+++ b/src/backend/access/rmgrdesc/standbydesc.c
@@ -19,28 +19,10 @@
 static void
 standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
 {
-	int			i;
-
 	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u",
 					 xlrec->nextXid,
 					 xlrec->latestCompletedXid,
 					 xlrec->oldestRunningXid);
-	if (xlrec->xcnt > 0)
-	{
-		appendStringInfo(buf, "; %d xacts:", xlrec->xcnt);
-		for (i = 0; i < xlrec->xcnt; i++)
-			appendStringInfo(buf, " %u", xlrec->xids[i]);
-	}
-
-	if (xlrec->subxid_overflow)
-		appendStringInfoString(buf, "; subxid overflowed");
-
-	if (xlrec->subxcnt > 0)
-	{
-		appendStringInfo(buf, "; %d subxacts:", xlrec->subxcnt);
-		for (i = 0; i < xlrec->subxcnt; i++)
-			appendStringInfo(buf, " %u", xlrec->xids[xlrec->xcnt + i]);
-	}
 }
 
 void
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index d5315efe2b..2ec97e460b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1455,8 +1455,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		ereport(LOG,
 				(errmsg("logical decoding found initial starting point at %X/%X",
 						LSN_FORMAT_ARGS(lsn)),
-				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
-						   running->xcnt, running->nextXid)));
+				 errdetail("Waiting for transactions older than %u to end.",
+						   running->nextXid)));
 
 		SnapBuildWaitSnapshot(running->nextXid);
 	}
@@ -1479,8 +1479,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		ereport(LOG,
 				(errmsg("logical decoding found initial consistent point at %X/%X",
 						LSN_FORMAT_ARGS(lsn)),
-				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
-						   running->xcnt, running->nextXid)));
+				 errdetail("Waiting for transactions older than %u to end.",
+						   running->nextXid)));
 
 		SnapBuildWaitSnapshot(running->nextXid);
 	}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index e02c9ab842..6ed46bed03 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1337,9 +1337,6 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	xl_running_xacts xlrec;
 	XLogRecPtr	recptr;
 
-	xlrec.xcnt = CurrRunningXacts->xcnt;
-	xlrec.subxcnt = CurrRunningXacts->subxcnt;
-	xlrec.subxid_overflow = (CurrRunningXacts->subxid_status != SUBXIDS_IN_ARRAY);
 	xlrec.nextXid = CurrRunningXacts->nextXid;
 	xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid;
 	xlrec.latestCompletedXid = CurrRunningXacts->latestCompletedXid;
@@ -1347,31 +1344,16 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	/* Header */
 	XLogBeginInsert();
 	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
-	XLogRegisterData((char *) (&xlrec), MinSizeOfXactRunningXacts);
-
-	/* array of TransactionIds */
-	if (xlrec.xcnt > 0)
-		XLogRegisterData((char *) CurrRunningXacts->xids,
-						 (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
+	XLogRegisterData((char *) (&xlrec), SizeOfXactRunningXacts);
 
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
-	if (xlrec.subxid_overflow)
-		elog(DEBUG2,
-			 "snapshot of %d running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
-			 CurrRunningXacts->xcnt,
-			 LSN_FORMAT_ARGS(recptr),
-			 CurrRunningXacts->oldestRunningXid,
-			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
-	else
-		elog(DEBUG2,
-			 "snapshot of %d+%d running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
-			 CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
-			 LSN_FORMAT_ARGS(recptr),
-			 CurrRunningXacts->oldestRunningXid,
-			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
+	elog(DEBUG2,
+		 "logging running transaction bounds (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
+		 LSN_FORMAT_ARGS(recptr),
+		 CurrRunningXacts->oldestRunningXid,
+		 CurrRunningXacts->latestCompletedXid,
+		 CurrRunningXacts->nextXid);
 
 	/*
 	 * Ensure running_xacts information is synced to disk not too far in the
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index cce0bc521e..9d5a298a39 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -60,8 +60,6 @@ extern void StandbyReleaseLockTree(TransactionId xid,
 extern void StandbyReleaseAllLocks(void);
 extern void StandbyReleaseOldLocks(TransactionId oldxid);
 
-#define MinSizeOfXactRunningXacts offsetof(xl_running_xacts, xids)
-
 
 /*
  * Declarations for GetRunningTransactionData(). Similar to Snapshots, but
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index fe12f463a8..d858209447 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -42,20 +42,24 @@ typedef struct xl_standby_locks
 } xl_standby_locks;
 
 /*
- * When we write running xact data to WAL, we use this structure.
+ * Data included in an XLOG_RUNNING_XACTS record.
+ *
+ * This used to include a list of running XIDs, hence the name, but nowadays
+ * this only contains the min and max bounds of the transactions that were
+ * running when the record was written.  They are needed to initialize logical
+ * decoding.  They are also used in hot standby to prune information about old
+ * running transactions, in case the primary didn't write a COMMIT/ABORT
+ * record for some reason.
  */
 typedef struct xl_running_xacts
 {
-	int			xcnt;			/* # of xact ids in xids[] */
-	int			subxcnt;		/* # of subxact ids in xids[] */
-	bool		subxid_overflow;	/* snapshot overflowed, subxids missing */
 	TransactionId nextXid;		/* xid from TransamVariables->nextXid */
 	TransactionId oldestRunningXid; /* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
-
-	TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
 } xl_running_xacts;
 
+#define SizeOfXactRunningXacts sizeof(xl_running_xacts)
+
 /*
  * Invalidations for standby, currently only when transactions without an
  * assigned xid commit.
-- 
2.39.2

v2-0006-Add-a-small-cache-to-Snapshot-to-avoid-CSN-lookup.patch (text/x-patch; charset=UTF-8)

From 408c97782a9dd21170622630f4c027947a09155a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 18:15:49 +0300
Subject: [PATCH v2 6/6] Add a small cache to Snapshot to avoid CSN lookups

Keep the status of a few recently-looked up XIDs cached in the
SnapshotData. This avoids having to go to the CSN log in the common case
that the same XIDs are looked up over and over again.
---
 src/backend/utils/time/snapmgr.c | 28 +++++++++++++++++++++++++++-
 src/include/utils/snapshot.h     |  4 ++++
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index da82def846..e2b65e0dd5 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1807,6 +1807,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
 	snapshot->snapshotCsn = serialized_snapshot.snapshotCsn;
+	memset(snapshot->visible_cache, 0, sizeof(snapshot->visible_cache));
 	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
@@ -1917,12 +1918,37 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 	}
 	else
 	{
-		XLogRecPtr	csn = CSNLogGetCSNByXid(xid);
+		XLogRecPtr	csn;
 
+		/* see if we have this cached */
+		for (int i = 0; i < VISIBLE_CACHE_XACTS; i++)
+		{
+			if (snapshot->visible_cache[i] == xid)
+				return true;
+		}
+		for (int i = 0; i < VISIBLE_CACHE_XACTS; i++)
+		{
+			if (snapshot->invisible_cache[i] == xid)
+				return false;
+		}
+
+		csn = CSNLogGetCSNByXid(xid);
 		if (csn != InvalidXLogRecPtr && csn <= snapshot->snapshotCsn)
+		{
+			static uint8 last = 0;
+
+			snapshot->invisible_cache[last % VISIBLE_CACHE_XACTS] = xid;
+			last++;
 			return false;
+		}
 		else
+		{
+			static uint8 last = 0;
+
+			snapshot->visible_cache[last % VISIBLE_CACHE_XACTS] = xid;
+			last++;
 			return true;
+		}
 	}
 
 	return false;
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 1fda5b06f6..88cfce2ffe 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -188,6 +188,10 @@ typedef struct SnapshotData
 	 */
 	XLogRecPtr	snapshotCsn;
 
+#define VISIBLE_CACHE_XACTS 4
+	TransactionId visible_cache[VISIBLE_CACHE_XACTS];
+	TransactionId invisible_cache[VISIBLE_CACHE_XACTS];
+
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 	bool		copied;			/* false if it's a static snapshot */
 
-- 
2.39.2

Kirill Reshke

reshkekirill@gmail.com

over 1 year ago

In reply to: Heikki Linnakangas (#4)

Re: CSN snapshots in hot standby

On Wed, 14 Aug 2024 at 01:13, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 05/04/2024 13:49, Andrey M. Borodin wrote:

On 5 Apr 2024, at 02:08, Kirill Reshke <reshkekirill@gmail.com> wrote:

Thanks for taking a look, Kirill!

maybe we need some hooks here? Or maybe, we can take CSN here from extension somehow.

I really like the idea of CSN-provider-as-extension.
But it's very important to move on with CSN, at least on standby, to make CSN actually happen some day.
So, from my perspective, having LSN-as-CSN is already huge step forward.

Yeah, I really don't want to expand the scope of this.

Here's a new version. Rebased, and lots of comments updated.

I added a tiny cache of the CSN lookups into SnapshotData, which can
hold the values of 4 XIDs that are known to be visible to the snapshot,
and 4 invisible XIDs. This is pretty arbitrary, but the idea is to have
something very small to speed up the common cases that 1-2 XIDs are
repeatedly looked up, without adding too much overhead.

I did some performance testing of the visibility checks using these CSN
snapshots. The tests run SELECTs with a SeqScan in a standby, over a
table where all the rows have xmin/xmax values that are still
in-progress in the primary.

Three test scenarios:

1. large-xact: one large transaction inserted all the rows. All rows
have the same XMIN, which is still in progress

2. many-subxacts: one large transaction inserted each row in a separate
subtransaction. All rows have a different XMIN, but they're all
subtransactions of the same top-level transaction. (This causes the
subxids cache in the proc array to overflow)

3. few-subxacts: All rows are inserted, committed, and vacuum frozen.
Then, using 10 separate subtransactions, DELETE the rows, in an
interleaved fashion. The XMAX values cycle like this "1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 1, 2, 3, 4, 5, ...". The point of this is that these
sub-XIDs fit in the subxids cache in the procarray, but the pattern
defeats the simple 4-element cache that I added.

The test script I used is attached. I repeated it a few times with
master and the patches here, and picked the fastest runs for each. Just
eyeballing the results, there's about 10% variance in these numbers.
Smaller is better.

Master:

large-xact: 4.57732510566711
many-subxacts: 18.6958119869232
few-subxacts: 16.467698097229

Patched:

large-xact: 10.2999930381775
many-subxacts: 11.6501438617706
few-subxacts: 19.8457028865814

With cache:

large-xact: 3.68792295455933
many-subxacts: 13.3662350177765
few-subxacts: 21.4426419734955

The 'large-xact' results show that the CSN lookups are slower than the
binary search on the 'xids' array. Not a surprise. The 4-element cache
fixes the regression, which is also not a surprise.

The 'many-subxacts' results show that the CSN lookups are faster than
the current method in master, when the subxids cache has overflowed.
That makes sense: on master, we always perform a lookup in pg_subtrans,
if the subxids cache has overflowed, which is more or less the same
overhead as the CSN lookup. But we avoid the binary search on the xids
array after that.

The 'few-subxacts' shows a regression, when the 4-element cache is not
effective. I think that's acceptable, the CSN approach has many
benefits, and I don't think this is a very common scenario. But if
necessary, it could perhaps be alleviated with more caching, or by
trying to compensate by optimizing elsewhere.

--
Heikki Linnakangas
Neon (https://neon.tech)

Thanks for the update. I will try to find time for perf-testing this.
First, some random suggestions. Sorry for being so nit-picky.

1) in 0002

+/*
+ * Number of shared CSNLog buffers.
+ */
+static Size
+CSNLogShmemBuffers(void)
+{
+ return Min(32, Max(16, NBuffers / 512));
+}

Should we GUC this?

2) In 0002 CSNLogShmemInit:

+ //SlruPagePrecedesUnitTests(CsnlogCtl, SUBTRANS_XACTS_PER_PAGE);

remove this?

3) In 0002 InitCSNLogPage:

+ SimpleLruZeroPage(CsnlogCtl, pageno);

we can use ZeroCSNLogPage here. This will justify the existence of this
function a little bit more.

4) In 0002:

+++ b/src/backend/replication/logical/snapbuild.c
@@ -27,7 +27,7 @@
* removed. This is achieved by using the replication slot mechanism.
*
* As the percentage of transactions modifying the catalog normally is fairly
- * small in comparisons to ones only manipulating user data, we keep track of
+ * small in comparison to ones only manipulating user data, we keep track of
* the committed catalog modifying ones inside [xmin, xmax) instead of keeping
* track of all running transactions like it's done in a normal snapshot. Note
* that we're generally only looking at transactions that have acquired an

This change is unrelated to 0002 patch, let's just push it as a separate change.

Overall, 0002 looks straightforward, though big. However, I wonder how
we can test that this change does not lead to unpleasant problems,
such as observing uncommitted changes on replicas, or corruption.
Maybe a basic injection-point-based TAP test would be desirable
here?

--
Best regards,
Kirill Reshke

Andres Freund

andres@anarazel.de

over 1 year ago

In reply to: Heikki Linnakangas (#4)

Re: CSN snapshots in hot standby

Hi,

On 2024-08-13 23:13:39 +0300, Heikki Linnakangas wrote:

I added a tiny cache of the CSN lookups into SnapshotData, which can hold
the values of 4 XIDs that are known to be visible to the snapshot, and 4
invisible XIDs. This is pretty arbitrary, but the idea is to have something
very small to speed up the common cases that 1-2 XIDs are repeatedly looked
up, without adding too much overhead.

I did some performance testing of the visibility checks using these CSN
snapshots. The tests run SELECTs with a SeqScan in a standby, over a table
where all the rows have xmin/xmax values that are still in-progress in the
primary.

Three test scenarios:

1. large-xact: one large transaction inserted all the rows. All rows have
the same XMIN, which is still in progress

2. many-subxacts: one large transaction inserted each row in a separate
subtransaction. All rows have a different XMIN, but they're all
subtransactions of the same top-level transaction. (This causes the subxids
cache in the proc array to overflow)

3. few-subxacts: All rows are inserted, committed, and vacuum frozen. Then,
using 10 separate subtransactions, DELETE the rows, in an interleaved
fashion. The XMAX values cycle like this "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1,
2, 3, 4, 5, ...". The point of this is that these sub-XIDs fit in the
subxids cache in the procarray, but the pattern defeats the simple 4-element
cache that I added.

I'd like to see some numbers for a workload with many overlapping top-level
transactions. In contrast to 2) HEAD wouldn't need to do subtrans lookups,
whereas this patch would need to do csn lookups. And a four entry cache
probably wouldn't help very much.

+/*
+ * Record commit LSN of a transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transaction ID for a top level commit.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ *
+ * commitLsn is the LSN of the commit record.  This is currently never called
+ * for aborted transactions.
+ */
+void
+CSNLogSetCSN(TransactionId xid, int nsubxids, TransactionId *subxids,
+			 XLogRecPtr commitLsn)
+{
+	int			pageno;
+	int			i = 0;
+	int			offset = 0;
+
+	Assert(TransactionIdIsValid(xid));
+
+	pageno = TransactionIdToPage(xid);	/* get page of parent */
+	for (;;)
+	{
+		int			num_on_page = 0;
+
+		while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+		{
+			num_on_page++;
+			i++;
+		}

Hm - is there any guarantee / documented requirement that subxids is sorted?

+		CSNLogSetPageStatus(xid,
+							num_on_page, subxids + offset,
+							commitLsn, pageno);
+		if (i >= nsubxids)
+			break;
+
+		offset = i;
+		pageno = TransactionIdToPage(subxids[offset]);
+		xid = InvalidTransactionId;
+	}
+}

Hm. Maybe I'm missing something, but what prevents a concurrent transaction from
checking the visibility of a subtransaction between marking the subtransaction
committed and marking the main transaction committed? If subtransaction and
main transaction are on the same page that won't be possible, but if they are
on different ones it does seem possible?

Today XidInMVCCSnapshot() will use pg_subtrans to find the top transaction in
case of a suboverflowed snapshot, but with this patch that's not the case
anymore. Which afaict will mean that repeated snapshot computations could
give different results for the same query?

Greetings,

Andres Freund

Heikki Linnakangas

hlinnaka@iki.fi

about 1 year ago

In reply to: Andres Freund (#6)

5 attachment(s)

Re: CSN snapshots in hot standby

On 24/09/2024 21:08, Andres Freund wrote:

I'd like to see some numbers for a workload with many overlapping top-level
transactions. In contrast to 2) HEAD wouldn't need to do subtrans lookups,
whereas this patch would need to do csn lookups. And a four entry cache
probably wouldn't help very much.

I spent some more time on the tests. Here is a better set of adversarial
tests, which hit the worst case scenarios for this patch.

All the test scenarios have this high-level shape:

1. Create a table with 100000 rows, vacuum freeze it.

2. In primary, open transactions or subtransactions, and DELETE all rows
using the different (sub)transactions, to set the xmax of every row on
the test table. Leave the transactions open.

3. In standby, SELECT COUNT(*) all rows in the table, and measure how
long it takes.

The difference between the test scenarios is in the pattern of xmax
values, i.e. how many transactions or subtransactions were used. All the
rows are visible, the performance differences come just from how
expensive the visibility checks are in different cases.

First, the results on 'master' without patches (smaller is better):

few-xacts: 0.0041 s / iteration
many-xacts: 0.0042 s / iteration
many-xacts-wide-apart: 0.0042 s / iteration
few-subxacts: 0.0042 s / iteration
many-subxacts: 0.0073 s / iteration
many-subxacts-wide-apart: 0.10 s / iteration

So even on master, there are significant differences depending on
whether the sub-XIDs fit in the in-memory caches, or if you need to do
lookups in pg_subtrans. That's not surprising. Note how bad the
"many-subxacts-wide-apart" scenario is, though. It's over 20x slower
than the best case scenario! I was a little taken aback by that. More on
that later.

Descriptions of the test scenarios:

few-xacts: The xmax values on the rows cycle through four different
XIDs, like this: 1001, 1002, 1003, 1004, 1001, 1002, 1003, 1004, ...

many-xacts: like 'few-xacts', but cycle through 100 different XIDs.

many-xacts-wide-apart: like 'many-xacts', but the XIDs used are spread
out, so that there are 1000 unrelated committed XIDs in between each XID
used in the test table. I.e. "1000, 2000, 3000, 4000, 5000, ...". On
master, that doesn't make a difference in the 'many-xacts-wide-apart'
test, but in the 'many-subxacts-wide-apart' variant it does: it makes the
XIDs fall on different SLRU pages, so that there are not enough SLRU
buffers to hold them all.

few-subxacts, many-subxacts, many-subxacts-wide-apart: Same tests, but
instead of using different top-level XIDs, all the XIDs are
subtransactions belonging to a single top-level XID.

Now, with the patch (the unpatched numbers are repeated here for
comparison):

                            master    patched
few-xacts:                  0.0041    0.0040    s / iteration
many-xacts:                 0.0042    0.0053    s / iteration
many-xacts-wide-apart:      0.0042    0.17      s / iteration
few-subxacts:               0.0042    0.0040    s / iteration
many-subxacts:              0.0073    0.0052    s / iteration
many-subxacts-wide-apart:   0.10      0.22      s / iteration

So when the 4-element cache is effective, in the 'few-xacts' case, the
patch performs well. In the 'many-xacts' case, it needs to perform CSN
lookups, making it a little slower. The 'many-xacts-wide-apart'
regresses badly, showing the same SLRU thrashing effect on CSN lookups as
the 'many-subxacts-wide-apart' case does on 'master' on pg_subtrans lookups.

Some thoughts on all this:

1. The many-subxacts-wide-apart performance is horrible even on master.
'perf' shows that about half of the CPU time is spent in open() and
close(). We open and close the SLRU file every time we need to read a
page! That's obviously silly, but also shouldn't be hard to fix.

2. Even if we fix the open/close issue and make the worst case 2x
faster, the worst case is still bad. We could call this a tuning issue;
more SLRU buffers help. But that's not very satisfactory. I really wish
we could make SLRU buffers auto-tuning. Move them to the main buffer
pool. Or something. And I wish SLRU lookups were faster even in the case
that the SLRU page is already in memory. The LWLock acquire+release
shows up in profiles, maybe we could do some kind of optimistic locking
instead.

3. Aside from making SLRUs faster, we could also mask its slowness in
the CSN patch by caching. The 4-element cache in Snapshot that I
implemented is fast when it's sufficient, but we could make it larger to
cover more cases. At the extreme, we could never remove elements from
it, and just let it grow as large as needed.

4. Currently on 'master', the XID list in a snapshot is an array of XIDs
that is binary searched. A different data structure might be better.
When the difference between xmin and xmax is small, a bitmap would be
compact and fast to look up, for example. Or maybe a radix tree or
something. This is an independent optimization that might make
XidInMVCCSnapshot() faster even without the CSN stuff, but if we decide
to go with a large cache (see previous paragraph), it would be nice to
reduce the worst case memory usage of the cache with something like this.

5. I'm not sure how much any of this matters in practice. Performance is
obviously important, but we don't get too many complaints about these
things even though the 'many-subxacts-wide-apart' case is pretty bad
already. It was not that easy to construct these adversarial scenarios.
If we were implementing this from scratch, I think we could easily
accept the performance with the patch. Regressions can be very
unpleasant for existing users, however.

Thoughts? I think the fastest way to make progress with the CSN patch is
to make the cache larger, to hide the SLRU lookups. But those other
things would be interesting to explore too.

+/*
+ * Record commit LSN of a transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transaction ID for a top level commit.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ *
+ * commitLsn is the LSN of the commit record.  This is currently never called
+ * for aborted transactions.
+ */
+void
+CSNLogSetCSN(TransactionId xid, int nsubxids, TransactionId *subxids,
+			 XLogRecPtr commitLsn)
+{
+	int			pageno;
+	int			i = 0;
+	int			offset = 0;
+
+	Assert(TransactionIdIsValid(xid));
+
+	pageno = TransactionIdToPage(xid);	/* get page of parent */
+	for (;;)
+	{
+		int			num_on_page = 0;
+
+		while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+		{
+			num_on_page++;
+			i++;
+		}

Hm - is there any guarantee / documented requirement that subxids is sorted?

Yes, subtransaction XIDs are assigned in order. But point taken; I'll
add a comment about that here too.

+		CSNLogSetPageStatus(xid,
+							num_on_page, subxids + offset,
+							commitLsn, pageno);
+		if (i >= nsubxids)
+			break;
+
+		offset = i;
+		pageno = TransactionIdToPage(subxids[offset]);
+		xid = InvalidTransactionId;
+	}
+}

Hm. Maybe I'm missing something, but what prevents a concurrent transaction from
checking the visibility of a subtransaction between marking the subtransaction
committed and marking the main transaction committed? If subtransaction and
main transaction are on the same page that won't be possible, but if they are
on different ones it does seem possible?

Today XidInMVCCSnapshot() will use pg_subtrans to find the top transaction in
case of a suboverflowed snapshot, but with this patch that's not the case
anymore. Which afaict will mean that repeated snapshot computations could
give different results for the same query?

The concurrent transaction's snapshot would consider the transaction and
all its subtransactions as still in-progress. Depending on the timing,
when XidInMVCCSnapshot() looks up the CSN of the subtransaction XID, it
will either see that it has no CSN which means it's still in progress
and thus invisible, or it has a CSN that is greater than the snapshot's
CSN, and invisible because of that. The global latestCommitLSN value is
advanced only after CSNLogSetCSN() has finished setting the CSN on all
the subtransactions, so before that, there cannot be any snapshots that
would see it as visible yet.

No code changes since v2, except the tests and minor comments, but I'm
including all the patches here again for convenience.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v3-0001-XXX-add-perf-test.patch (text/x-patch; charset=UTF-8)

From ae8a030ff287f94e5c452d20419dfd852348605e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 21 Oct 2024 14:07:38 +0300
Subject: [PATCH v3 1/5] XXX: add perf test

This is not intended to be merged. But it's been useful to have this
in the tree for some quick perf testing during development.

To run it, I've used:

(cd build-release && ninja &&  rm -rf tmp_install && meson test --suite setup --suite test_misc; grep TEST testrun/test_misc/000_csn_perf/log/regress_log_000_csn_perf )

It runs the other test_misc tests concurrently, but they finish a lot
faster so they don't affect the results much.
---
 src/test/modules/test_misc/meson.build       |   1 +
 src/test/modules/test_misc/t/000_csn_perf.pl | 286 +++++++++++++++++++
 2 files changed, 287 insertions(+)
 create mode 100644 src/test/modules/test_misc/t/000_csn_perf.pl

diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 283ffa751a..e55e80af54 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -9,6 +9,7 @@ tests += {
        'enable_injection_points': get_option('injection_points') ? 'yes' : 'no',
     },
     'tests': [
+      't/000_csn_perf.pl',
       't/001_constraint_validation.pl',
       't/002_tablespace.pl',
       't/003_check_guc.pl',
diff --git a/src/test/modules/test_misc/t/000_csn_perf.pl b/src/test/modules/test_misc/t/000_csn_perf.pl
new file mode 100644
index 0000000000..ae13fa8200
--- /dev/null
+++ b/src/test/modules/test_misc/t/000_csn_perf.pl
@@ -0,0 +1,286 @@
+
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Measure the cost of visibility checks on a hot standby
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(time);
+
+my $duration = 15; # seconds
+my $miniterations = 3;
+
+# Initialize a test cluster
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init();
+# Allow streaming replication and enough connections for the test sessions
+$primary->append_conf('postgresql.conf', 'max_wal_senders = 5');
+$primary->append_conf('postgresql.conf', 'wal_level=replica');
+$primary->append_conf('postgresql.conf', 'max_connections = 1000');
+$primary->start;
+$primary->backup('bkp');
+
+my $replica = PostgreSQL::Test::Cluster->new('replica');
+$replica->init_from_backup($primary, 'bkp', has_streaming => 1);
+$replica->append_conf('postgresql.conf', "shared_buffers='1 GB'");
+$replica->start;
+
+sub wait_catchup
+{
+	my ($primary, $replica) = @_;
+	
+	my $primary_lsn =
+	  $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+	my $caughtup_query =
+	  "SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()";
+	$replica->poll_query_until('postgres', $caughtup_query)
+	  or die "Timed out while waiting for standby to catch up";
+}
+
+sub repeat_and_time_sql
+{
+  	my ($name, $node, $sql) = @_;
+
+	my $session =  $node->background_psql('postgres', on_error_die => 1);
+	$session->query_safe("SET max_parallel_workers_per_gather=0");
+
+	my $iterations = 0;
+
+	my $now;
+	my $elapsed;
+    my $begin_time = time();
+	while (1) {
+		$session->query_safe($sql);
+		$now = time();
+		$iterations = $iterations + 1;
+
+		$elapsed = $now - $begin_time;
+		if ($elapsed > $duration && $iterations >= $miniterations) {
+			last;
+		}
+	}
+
+	my $periter = $elapsed / $iterations;
+
+	pass ("TEST $name: $elapsed s, $iterations iterations, $periter s / iteration");
+}
+
+
+$primary->safe_psql('postgres', "CREATE TABLE little (i int);");
+$primary->safe_psql('postgres', "INSERT INTO little VALUES (1);");
+
+sub consume_xids
+{
+	my ($node) = @_;
+
+	my $session = $node->background_psql('postgres', on_error_die => 1);
+	for(my $i = 0; $i < 20; $i++) {
+		$session->query_safe(q{do $$
+  begin
+    for i in 1..50 loop
+      begin
+        DELETE from little;
+        perform 1 / 0;
+      exception
+        when division_by_zero then perform 0 /* do nothing */;
+        when others then raise 'fail: %', sqlerrm;
+      end;
+    end loop;
+  end
+$$;});
+	}
+	$session->quit;
+}
+
+# TEST few-xacts
+#
+# Cycle through 4 different top-level XIDs
+#
+# 1001, 1002, 1003, 1004, 1001, 1002, 1003, 1004, ...
+#
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 4;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("few-xacts", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST many-xacts
+#
+# like few-xacts, but we cycle through 100 different XIDs instead of 4.
+#
+# 1001, 1002, 1003, ... 1100, 1001, 1002, 1003, ... 1100  ....
+#
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 100;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-xacts", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST many-xacts-wide-apart
+#
+# like many-xacts, but the XIDs are more spread out, so that they don't fit in the
+# SLRU caches.
+#
+# 1000, 2000, 3000, 4000, ....
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 100;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+
+		consume_xids($primary);
+
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-xacts-wide-apart", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: few-subxacts
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 4;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("few-subxacts", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+
+# TEST: many-subxacts
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-subxacts", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: many-subxacts-wide-apart
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		consume_xids($primary);
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-subxacts-wide-apart", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+done_testing();
-- 
2.39.5

v3-0002-Use-CSN-snapshots-during-Hot-Standby.patch (text/x-patch; charset=UTF-8)

From 6a9c3b48af0bfcb0fce665850e1206405aeaa37c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 20:26:40 +0300
Subject: [PATCH v3 2/5] Use CSN snapshots during Hot Standby

Replace the known-assigned-XIDs mechanism with a CSN log. The CSN log
(pg_csn) tracks the commit LSN of each transaction, when replaying the
WAL on a standby. It's only used on the standby, and is initialized
from scratch at server startup like pg_subtrans.

Based on 0001-CSN-base-snapshot.patch from
https://www.postgresql.org/message-id/2020081009525213277261%40highgo.ca.
This patch has a long lineage: various CSN patches have been posted,
with parts from Stas Kelvich, Movead Li, Ants Aasma, Heikki
Linnakangas, and Alexander Kuzmenkov.
---
 contrib/pg_visibility/pg_visibility.c         |    1 +
 src/backend/access/rmgrdesc/xactdesc.c        |   26 -
 src/backend/access/transam/Makefile           |    1 +
 src/backend/access/transam/csn_log.c          |  474 ++++++
 src/backend/access/transam/meson.build        |    1 +
 src/backend/access/transam/transam.c          |    3 +
 src/backend/access/transam/twophase.c         |   34 +-
 src/backend/access/transam/varsup.c           |    1 +
 src/backend/access/transam/xact.c             |  138 +-
 src/backend/access/transam/xlog.c             |  118 +-
 src/backend/access/transam/xlogrecovery.c     |   13 +-
 src/backend/access/transam/xlogutils.c        |    2 +-
 src/backend/postmaster/startup.c              |    2 +-
 src/backend/replication/logical/decode.c      |    8 -
 src/backend/replication/logical/snapbuild.c   |    2 +-
 src/backend/storage/ipc/ipci.c                |    3 +
 src/backend/storage/ipc/procarray.c           | 1512 ++---------------
 src/backend/storage/ipc/standby.c             |  102 +-
 src/backend/storage/lmgr/lwlock.c             |    2 +
 .../utils/activity/wait_event_names.txt       |    1 +
 src/backend/utils/probes.d                    |    2 +
 src/backend/utils/time/snapmgr.c              |   37 +-
 src/bin/initdb/initdb.c                       |    3 +-
 src/include/access/csn_log.h                  |   30 +
 src/include/access/transam.h                  |    3 +
 src/include/access/twophase.h                 |    3 +-
 src/include/access/xact.h                     |   12 +-
 src/include/access/xlogutils.h                |   33 +-
 src/include/storage/lwlock.h                  |    2 +
 src/include/storage/procarray.h               |   13 +-
 src/include/utils/snapshot.h                  |    7 +
 31 files changed, 821 insertions(+), 1768 deletions(-)
 create mode 100644 src/backend/access/transam/csn_log.c
 create mode 100644 src/include/access/csn_log.h
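
Note (not part of the patch): the new csn_log.c stores one 8-byte commit
LSN per XID, so an XID maps to a page and a slot within that page, and
CSNLogSetCSN() walks the sorted subxid array in per-page runs. Below is a
minimal standalone sketch of that arithmetic, assuming the default 8 kB
BLCKSZ and an 8-byte XLogRecPtr. All Sketch*/sketch_* names are
hypothetical stand-ins for illustration, not PostgreSQL identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real types are TransactionId and XLogRecPtr. */
typedef uint32_t SketchXid;
typedef uint64_t SketchLsn;

#define SKETCH_BLCKSZ 8192u		/* assumption: default block size */
#define SKETCH_XACTS_PER_PAGE (SKETCH_BLCKSZ / sizeof(SketchLsn))

/* Page that holds the CSN entry for an XID. */
static inline uint32_t
sketch_xid_to_page(SketchXid xid)
{
	return (uint32_t) (xid / SKETCH_XACTS_PER_PAGE);
}

/* Slot index of the XID's entry within its page. */
static inline uint32_t
sketch_xid_to_index(SketchXid xid)
{
	return (uint32_t) (xid % SKETCH_XACTS_PER_PAGE);
}

/*
 * Count how many per-page updates a commit needs, using the same loop
 * shape as CSNLogSetCSN(): start on the parent's page, consume every
 * subxid that falls on the current page, then move on to the page of
 * the next unconsumed subxid. subxids must be in XID order.
 */
static int
sketch_count_page_updates(SketchXid xid, const SketchXid *subxids,
						  int nsubxids)
{
	uint32_t	pageno = sketch_xid_to_page(xid);
	int			i = 0;
	int			updates = 0;

	assert(nsubxids >= 0);

	for (;;)
	{
		while (i < nsubxids && sketch_xid_to_page(subxids[i]) == pageno)
			i++;

		updates++;				/* one locked update of this page */
		if (i >= nsubxids)
			break;

		pageno = sketch_xid_to_page(subxids[i]);
	}
	return updates;
}
```

Under these assumptions a page holds 1024 entries, so a commit of XID 1000
with subxids 1001, 1023, 1024, 1025 and 2048 touches three pages, and a
commit without subtransactions touches one.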

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 724122b1bc..6651ba1757 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -583,6 +583,7 @@ collect_visibility_data(Oid relid, bool include_pd)
  *    now perform minimal checking on a standby by always using nextXid, this
  *    approach is better than nothing and will at least catch extremely broken
  *    cases where a xid is in the future.
+ *    XXX KnownAssignedXids is gone.
  * 3. Ignore walsender xmin, because it could go backward if some replication
  *    connections don't use replication slots.
  *
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 889cb955c1..128486e751 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -424,17 +424,6 @@ xact_desc_prepare(StringInfo buf, uint8 info, xl_xact_prepare *xlrec, RepOriginI
 						 timestamptz_to_str(parsed.origin_timestamp));
 }
 
-static void
-xact_desc_assignment(StringInfo buf, xl_xact_assignment *xlrec)
-{
-	int			i;
-
-	appendStringInfoString(buf, "subxacts:");
-
-	for (i = 0; i < xlrec->nsubxacts; i++)
-		appendStringInfo(buf, " %u", xlrec->xsub[i]);
-}
-
 void
 xact_desc(StringInfo buf, XLogReaderState *record)
 {
@@ -462,18 +451,6 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		xact_desc_prepare(buf, XLogRecGetInfo(record), xlrec,
 						  XLogRecGetOrigin(record));
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) rec;
-
-		/*
-		 * Note that we ignore the WAL record's xid, since we're more
-		 * interested in the top-level xid that issued the record and which
-		 * xids are being reported here.
-		 */
-		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
-		xact_desc_assignment(buf, xlrec);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		xl_xact_invals *xlrec = (xl_xact_invals *) rec;
@@ -505,9 +482,6 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ABORT_PREPARED:
 			id = "ABORT_PREPARED";
 			break;
-		case XLOG_XACT_ASSIGNMENT:
-			id = "ASSIGNMENT";
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			id = "INVALIDATION";
 			break;
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db..2520d77c7c 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	clog.o \
 	commit_ts.o \
+	csn_log.o \
 	generic_xlog.o \
 	multixact.o \
 	parallel.o \
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
new file mode 100644
index 0000000000..1188a78c4a
--- /dev/null
+++ b/src/backend/access/transam/csn_log.c
@@ -0,0 +1,474 @@
+/*-----------------------------------------------------------------------------
+ *
+ * csn_log.c
+ *		Track commit record LSNs of finished transactions
+ *
+ * This module provides an SLRU to store the LSN of the commit record of each
+ * transaction. CSN stands for Commit Sequence Number, and in principle we
+ * could use a separate counter that is incremented at every commit. For
+ * simplicity, though, we use the commit record's LSN as the sequence number.
+ *
+ * Like pg_subtrans, this mapping needs to be kept only for XIDs greater than
+ * oldestXmin, and doesn't need to be preserved over crashes.  Also, this is
+ * only needed in hot standby mode, and immediately after exiting hot standby
+ * mode, until all old snapshots taken during standby mode are gone.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_log.c
+ *
+ *-----------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/clog.h"
+#include "access/csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+/*
+ * Defines for CSNLog page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE, and CSNLog segment numbering at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCSNLog (see CSNLogPagePrecedes).
+ */
+
+/* We store the commit CSN for each xid */
+#define CSN_LOG_XACTS_PER_PAGE (BLCKSZ / sizeof(XLogRecPtr))
+
+#define TransactionIdToPage(xid)	((xid) / (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+
+#define PgIndexToTransactionId(pageno, idx) (CSN_LOG_XACTS_PER_PAGE * (pageno) + idx)
+
+
+
+/*
+ * Link to shared-memory data structures for CSNLog control
+ */
+static SlruCtlData CSNLogCtlData;
+#define CsnlogCtl (&CSNLogCtlData)
+
+static int	ZeroCSNLogPage(int pageno);
+static bool CSNLogPagePrecedes(int64 page1, int64 page2);
+static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+								TransactionId *subxids,
+								XLogRecPtr csn, int pageno);
+static void CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn,
+							   int slotno);
+
+
+/*
+ * Record commit LSN of a transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transaction ID for a top level commit.
+ *
+ * subxids is an array of xids of length nsubxids, in logical XID order,
+ * representing subtransactions in the tree of XIDs. In various cases nsubxids
+ * may be zero.
+ *
+ * commitLsn is the LSN of the commit record.  This is currently never called
+ * for aborted transactions.
+ */
+void
+CSNLogSetCSN(TransactionId xid, int nsubxids, TransactionId *subxids,
+			 XLogRecPtr commitLsn)
+{
+	int			pageno;
+	int			i = 0;
+	int			offset = 0;
+
+	Assert(TransactionIdIsValid(xid));
+
+	pageno = TransactionIdToPage(xid);	/* get page of parent */
+	for (;;)
+	{
+		int			num_on_page = 0;
+
+		while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+		{
+			num_on_page++;
+			i++;
+		}
+
+		CSNLogSetPageStatus(xid,
+							num_on_page, subxids + offset,
+							commitLsn, pageno);
+		if (i >= nsubxids)
+			break;
+
+		offset = i;
+		pageno = TransactionIdToPage(subxids[offset]);
+		xid = InvalidTransactionId;
+	}
+}
+
+/*
+ * Record the final state of transaction entries in the CSN log for all
+ * entries on a single page.  Atomic only on this page.
+ *
+ * Otherwise API is same as CSNLogSetCSN()
+ */
+static void
+CSNLogSetPageStatus(TransactionId xid, int nsubxids, TransactionId *subxids,
+					XLogRecPtr commitLsn, int pageno)
+{
+	int			slotno;
+	int			i;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CsnlogCtl, pageno, true, xid);
+
+	/* Subtransactions first, if needed ... */
+	for (i = 0; i < nsubxids; i++)
+	{
+		Assert(CsnlogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+		CSNLogSetCSNInSlot(subxids[i], commitLsn, slotno);
+	}
+
+	/* ... then the main transaction */
+	if (TransactionIdIsValid(xid))
+		CSNLogSetCSNInSlot(xid, commitLsn, slotno);
+
+	CsnlogCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Set the commit LSN (CSN) of a single transaction in the given slot.
+ */
+static void
+CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn, int slotno)
+{
+	int			entryno = TransactionIdToPgIndex(xid);
+	XLogRecPtr *ptr;
+
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+
+	*ptr = csn;
+}
+
+/*
+ * Interrogate the state of a transaction in the log.
+ *
+ * NB: this is a low-level routine and is NOT the preferred entry point
+ * for most uses; TransactionIdGetXidCSN() in csn_snapshot.c is the
+ * intended caller.
+ */
+XLogRecPtr
+CSNLogGetCSNByXid(TransactionId xid)
+{
+	int			pageno = TransactionIdToPage(xid);
+	int			entryno = TransactionIdToPgIndex(xid);
+	int			slotno;
+	XLogRecPtr *ptr;
+	XLogRecPtr	xid_csn;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Can't ask about stuff that might not be around anymore */
+	Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+
+	slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+	xid_csn = *ptr;
+
+	LWLockRelease(SimpleLruGetBankLock(CsnlogCtl, pageno));
+
+	return xid_csn;
+}
+
+/*
+ * Number of shared CSNLog buffers.
+ */
+static Size
+CSNLogShmemBuffers(void)
+{
+	return Min(32, Max(16, NBuffers / 512));
+}
+
+/*
+ * Reserve shared memory for CsnlogCtl.
+ */
+Size
+CSNLogShmemSize(void)
+{
+	// FIXME: skip if not InHotStandby?
+	return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
+}
+
+/*
+ * Initialization of shared memory for CSNLog.
+ */
+void
+CSNLogShmemInit(void)
+{
+	CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
+	SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
+				  "pg_csn", LWTRANCHE_CSN_LOG_BUFFER,
+				  LWTRANCHE_CSN_LOG_SLRU, SYNC_HANDLER_NONE, false);
+	//SlruPagePrecedesUnitTests(CsnlogCtl, SUBTRANS_XACTS_PER_PAGE);
+}
+
+/*
+ * This func must be called ONCE on system install.  It creates the initial
+ * CSNLog segment.  The pg_csn directory is assumed to have been
+ * created by initdb, and CSNLogShmemInit must have been called already.
+ *
+ * Note: it's not really necessary to create the initial segment now,
+ * since slru.c would create it on first write anyway.  But we may as well
+ * do it to be sure the directory is set up correctly.
+ */
+void
+BootStrapCSNLog(void)
+{
+	int			slotno;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, 0);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Create and zero the first page of the CSN log */
+	slotno = ZeroCSNLogPage(0);
+
+	/* Make sure it's written out */
+	SimpleLruWritePage(CsnlogCtl, slotno);
+	Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Initialize (or reinitialize) a page of CSNLog to zeroes.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * The page's SLRU bank lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCSNLogPage(int pageno)
+{
+	return SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * Initialize a page of CSNLog based on pg_xact.
+ *
+ * All completed (committed or aborted) transactions are stamped with 'csn'
+ */
+static void
+InitCSNLogPage(int pageno, TransactionId *xid, TransactionId nextXid, XLogRecPtr csn)
+{
+	XLogRecPtr	dummy;
+	int			slotno;
+
+	slotno = ZeroCSNLogPage(pageno);
+
+	while (*xid < nextXid && TransactionIdToPage(*xid) == pageno)
+	{
+		XidStatus	status = TransactionIdGetStatus(*xid, &dummy);
+
+		if (status == TRANSACTION_STATUS_COMMITTED ||
+			status == TRANSACTION_STATUS_ABORTED)
+			CSNLogSetCSNInSlot(*xid, csn, slotno);
+
+		TransactionIdAdvance(*xid);
+	}
+	/* ZeroCSNLogPage() already marked the page dirty; don't zero it again */
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * after StartupXLOG has initialized ShmemVariableCache->nextXid, and after
+ * initializing the CLOG.
+ *
+ * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
+ * if there are none.
+ *
+ * All transactions that have already completed are marked with 'csn'. ('csn'
+ * is supposed to be older than anything we'll ever need to compare it with.)
+ */
+void
+StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn)
+{
+	TransactionId xid;
+	FullTransactionId nextXid;
+	int			startPage;
+	int			endPage;
+	LWLock	   *prevlock = NULL;
+	LWLock	   *lock;
+
+	/*
+	 * Since we don't expect pg_csn to be valid across crashes, we initialize
+	 * the currently-active page(s) to zeroes during startup. Whenever we
+	 * advance into a new page, ExtendCSNLog will likewise zero the new page
+	 * without regard to whatever was previously on disk.
+	 */
+	startPage = TransactionIdToPage(oldestActiveXID);
+	nextXid = TransamVariables->nextXid;
+	endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid));
+
+	Assert(TransactionIdIsValid(oldestActiveXID));
+	Assert(FullTransactionIdIsValid(nextXid));
+
+	xid = oldestActiveXID;
+	for (;;)
+	{
+		lock = SimpleLruGetBankLock(CsnlogCtl, startPage);
+		if (prevlock != lock)
+		{
+			if (prevlock)
+				LWLockRelease(prevlock);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			prevlock = lock;
+		}
+
+		InitCSNLogPage(startPage, &xid, XidFromFullTransactionId(nextXid), csn);
+		if (startPage == endPage)
+			break;
+
+		startPage++;
+		/* must account for wraparound */
+		if (startPage > TransactionIdToPage(MaxTransactionId))
+			startPage = 0;
+	}
+
+	LWLockRelease(lock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely as a debugging aid.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(false);
+	SimpleLruWriteAll(CsnlogCtl, false);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely to improve the odds that writing of dirty pages is done by
+	 * the checkpoint process and not by backends.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(true);
+	SimpleLruWriteAll(CsnlogCtl, true);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(true);
+}
+
+/*
+ * Make sure that CSNLog has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CSN log page to make room
+ * in shared memory.
+ */
+void
+ExtendCSNLog(TransactionId newestXact)
+{
+	int64		pageno;
+	LWLock	   *lock;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToPgIndex(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToPage(newestXact);
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Zero the page; no WAL record is written for pg_csn */
+	ZeroCSNLogPage(pageno);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Remove all CSNLog segments before the one holding the passed
+ * transaction ID.
+ *
+ * This is normally called during checkpoint, with oldestXact being the
+ * oldest TransactionXmin of any running transaction.
+ */
+void
+TruncateCSNLog(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate. We step
+	 * back one transaction to avoid passing a cutoff page that hasn't been
+	 * created yet in the rare case that oldestXact would be the first item on
+	 * a page and oldestXact == next XID.  In that case, if we didn't subtract
+	 * one, we'd trigger SimpleLruTruncate's wraparound detection.
+	 */
+	TransactionIdRetreat(oldestXact);
+	cutoffPage = TransactionIdToPage(oldestXact);
+
+	SimpleLruTruncate(CsnlogCtl, cutoffPage);
+}
+
+/*
+ * Decide which of two CSNLog page numbers is "older" for truncation
+ * purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic.  However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CSNLogPagePrecedes(int64 page1, int64 page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * CSN_LOG_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId;
+	xid2 = ((TransactionId) page2) * CSN_LOG_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId;
+
+	return TransactionIdPrecedes(xid1, xid2);
+}
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 8a3522557c..cf41df2971 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -2,6 +2,7 @@
 
 backend_sources += files(
   'clog.c',
+  'csn_log.c',
   'commit_ts.c',
   'generic_xlog.c',
   'multixact.c',
diff --git a/src/backend/access/transam/transam.c b/src/backend/access/transam/transam.c
index 75b5325df8..93c4d495e4 100644
--- a/src/backend/access/transam/transam.c
+++ b/src/backend/access/transam/transam.c
@@ -377,6 +377,9 @@ TransactionIdLatest(TransactionId mainxid,
  * Also, because we group transactions on the same clog page to conserve
  * storage, we might return the LSN of a later transaction that falls into
  * the same group.
+ *
+ * XXX: Now that we have the CSN log, should we use that during recovery? Or
+ * rename this function to reduce confusion?
  */
 XLogRecPtr
 TransactionIdGetCommitLSN(TransactionId xid)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 23dd0c6ef6..7e9fc7c535 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1960,20 +1961,13 @@ restoreTwoPhaseData(void)
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return TransamVariables->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
- *
- * If xids_p and nxids_p are not NULL, pointer to a palloc'd array of all
- * top-level xids is stored in *xids_p. The number of entries in the array
- * is returned in *nxids_p.
  */
 TransactionId
-PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
+PrescanPreparedTransactions(void)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
 	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId result = origNextXid;
-	TransactionId *xids = NULL;
-	int			nxids = 0;
-	int			allocsize = 0;
 	int			i;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
@@ -2001,34 +1995,10 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		if (TransactionIdPrecedes(xid, result))
 			result = xid;
 
-		if (xids_p)
-		{
-			if (nxids == allocsize)
-			{
-				if (nxids == 0)
-				{
-					allocsize = 10;
-					xids = palloc(allocsize * sizeof(TransactionId));
-				}
-				else
-				{
-					allocsize = allocsize * 2;
-					xids = repalloc(xids, allocsize * sizeof(TransactionId));
-				}
-			}
-			xids[nxids++] = xid;
-		}
-
 		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 
-	if (xids_p)
-	{
-		*xids_p = xids;
-		*nxids_p = nxids;
-	}
-
 	return result;
 }
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index cfe8c6cf8d..b074423654 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -15,6 +15,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 87700c7c5c..fc611f2860 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -209,7 +210,6 @@ typedef struct TransactionStateData
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;	/* entry-time xact r/o state */
 	bool		startedInRecovery;	/* did we start in recovery? */
-	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		parallelChildXact;	/* is any parent transaction parallel? */
 	bool		chain;			/* start a new block after this one */
@@ -249,13 +249,6 @@ static TransactionStateData TopTransactionStateData = {
 	.topXidLogged = false,
 };
 
-/*
- * unreportedXids holds XIDs of all subtransactions that have not yet been
- * reported in an XLOG_XACT_ASSIGNMENT record.
- */
-static int	nUnreportedXids;
-static TransactionId unreportedXids[PGPROC_MAX_CACHED_SUBXIDS];
-
 static TransactionState CurrentTransactionState = &TopTransactionStateData;
 
 /*
@@ -531,18 +524,6 @@ GetCurrentFullTransactionIdIfAny(void)
 	return CurrentTransactionState->fullTransactionId;
 }
 
-/*
- *	MarkCurrentTransactionIdLoggedIfAny
- *
- * Remember that the current xid - if it is assigned - now has been wal logged.
- */
-void
-MarkCurrentTransactionIdLoggedIfAny(void)
-{
-	if (FullTransactionIdIsValid(CurrentTransactionState->fullTransactionId))
-		CurrentTransactionState->didLogXid = true;
-}
-
 /*
  * IsSubxactTopXidLogPending
  *
@@ -635,7 +616,6 @@ AssignTransactionId(TransactionState s)
 {
 	bool		isSubXact = (s->parent != NULL);
 	ResourceOwner currentOwner;
-	bool		log_unknown_top = false;
 
 	/* Assert that caller didn't screw up */
 	Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -679,20 +659,6 @@ AssignTransactionId(TransactionState s)
 		pfree(parents);
 	}
 
-	/*
-	 * When wal_level=logical, guarantee that a subtransaction's xid can only
-	 * be seen in the WAL stream if its toplevel xid has been logged before.
-	 * If necessary we log an xact_assignment record with fewer than
-	 * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
-	 * for a transaction even though it appears in a WAL record, we just might
-	 * superfluously log something. That can happen when an xid is included
-	 * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
-	 * xl_standby_locks.
-	 */
-	if (isSubXact && XLogLogicalInfoActive() &&
-		!TopTransactionStateData.didLogXid)
-		log_unknown_top = true;
-
 	/*
 	 * Generate a new FullTransactionId and record its xid in PGPROC and
 	 * pg_subtrans.
@@ -728,59 +694,6 @@ AssignTransactionId(TransactionState s)
 	XactLockTableInsert(XidFromFullTransactionId(s->fullTransactionId));
 
 	CurrentResourceOwner = currentOwner;
-
-	/*
-	 * Every PGPROC_MAX_CACHED_SUBXIDS assigned transaction ids within each
-	 * top-level transaction we issue a WAL record for the assignment. We
-	 * include the top-level xid and all the subxids that have not yet been
-	 * reported using XLOG_XACT_ASSIGNMENT records.
-	 *
-	 * This is required to limit the amount of shared memory required in a hot
-	 * standby server to keep track of in-progress XIDs. See notes for
-	 * RecordKnownAssignedTransactionIds().
-	 *
-	 * We don't keep track of the immediate parent of each subxid, only the
-	 * top-level transaction that each subxact belongs to. This is correct in
-	 * recovery only because aborted subtransactions are separately WAL
-	 * logged.
-	 *
-	 * This is correct even for the case where several levels above us didn't
-	 * have an xid assigned as we recursed up to them beforehand.
-	 */
-	if (isSubXact && XLogStandbyInfoActive())
-	{
-		unreportedXids[nUnreportedXids] = XidFromFullTransactionId(s->fullTransactionId);
-		nUnreportedXids++;
-
-		/*
-		 * ensure this test matches similar one in
-		 * RecoverPreparedTransactions()
-		 */
-		if (nUnreportedXids >= PGPROC_MAX_CACHED_SUBXIDS ||
-			log_unknown_top)
-		{
-			xl_xact_assignment xlrec;
-
-			/*
-			 * xtop is always set by now because we recurse up transaction
-			 * stack to the highest unassigned xid and then come back down
-			 */
-			xlrec.xtop = GetTopTransactionId();
-			Assert(TransactionIdIsValid(xlrec.xtop));
-			xlrec.nsubxacts = nUnreportedXids;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, MinSizeOfXactAssignment);
-			XLogRegisterData((char *) unreportedXids,
-							 nUnreportedXids * sizeof(TransactionId));
-
-			(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT);
-
-			nUnreportedXids = 0;
-			/* mark top, not current xact as having been logged */
-			TopTransactionStateData.didLogXid = true;
-		}
-	}
 }
 
 /*
@@ -1470,11 +1383,11 @@ RecordTransactionCommit(void)
 	 * temp tables will be lost anyway, unlogged tables will be truncated and
 	 * HOT pruning will be done again later. (Given the foregoing, you might
 	 * think that it would be unnecessary to emit the XLOG record at all in
-	 * this case, but we don't currently try to do that.  It would certainly
-	 * cause problems at least in Hot Standby mode, where the
-	 * KnownAssignedXids machinery requires tracking every XID assignment.  It
-	 * might be OK to skip it only when wal_level < replica, but for now we
-	 * don't.)
+	 * this case, but we don't currently try to do that.  It might cause
+	 * inefficiencies in Hot Standby mode, if nothing else, where the
+	 * commit/abort records allow advancing the xmin horizon for new
+	 * snapshots. It might be OK to skip it only when wal_level < replica, but
+	 * for now we don't.)
 	 *
 	 * However, if we're doing cleanup of any non-temp rels or committing any
 	 * command that wanted to force sync commit, then we must flush XLOG
@@ -1942,13 +1855,6 @@ AtSubAbort_childXids(void)
 	s->childXids = NULL;
 	s->nChildXids = 0;
 	s->maxChildXids = 0;
-
-	/*
-	 * We could prune the unreportedXids array here. But we don't bother. That
-	 * would potentially reduce number of XLOG_XACT_ASSIGNMENT records but it
-	 * would likely introduce more CPU time into the more common paths, so we
-	 * choose not to do that.
-	 */
 }
 
 /* ----------------------------------------------------------------
@@ -2131,12 +2037,6 @@ StartTransaction(void)
 	currentCommandId = FirstCommandId;
 	currentCommandIdUsed = false;
 
-	/*
-	 * initialize reported xid accounting
-	 */
-	nUnreportedXids = 0;
-	s->didLogXid = false;
-
 	/*
 	 * must initialize resource-management stuff first
 	 */
@@ -6141,7 +6041,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 	TransactionTreeSetCommitTsData(xid, parsed->nsubxacts, parsed->subxacts,
 								   commit_time, origin_id);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/*
 		 * Mark the transaction committed in pg_xact.
@@ -6161,6 +6061,12 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		 */
 		RecordKnownAssignedTransactionIds(max_xid);
 
+		/*
+		 * Mark the CSNLOG first.  The transaction won't become visible to new
+		 * snapshots until the call to ProcArrayRecoveryEndTransaction().
+		 */
+		CSNLogSetCSN(xid, parsed->nsubxacts, parsed->subxacts, lsn);
+
 		/*
 		 * Mark the transaction committed in pg_xact. We use async commit
 		 * protocol during recovery to provide information on database
@@ -6173,9 +6079,9 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		TransactionIdAsyncCommitTree(xid, parsed->nsubxacts, parsed->subxacts, lsn);
 
 		/*
-		 * We must mark clog before we update the ProcArray.
+		 * Make the commit visible to new snapshots in the ProcArray.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ProcArrayRecoveryEndTransaction(max_xid, lsn);
 
 		/*
 		 * Send any cache invalidations attached to the commit. We must
@@ -6281,7 +6187,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 								  parsed->subxacts);
 	AdvanceNextFullTransactionIdPastXid(max_xid);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/* Mark the transaction aborted in pg_xact, no need for async stuff */
 		TransactionIdAbortTree(xid, parsed->nsubxacts, parsed->subxacts);
@@ -6299,13 +6205,15 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 		 */
 		RecordKnownAssignedTransactionIds(max_xid);
 
+		/* Note: we don't need to update the CSN log on abort. */
+
 		/* Mark the transaction aborted in pg_xact, no need for async stuff */
 		TransactionIdAbortTree(xid, parsed->nsubxacts, parsed->subxacts);
 
 		/*
 		 * We must update the ProcArray after we have marked clog.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ProcArrayRecoveryEndTransaction(max_xid, lsn);
 
 		/*
 		 * There are no invalidation messages to send or undo.
@@ -6413,14 +6321,6 @@ xact_redo(XLogReaderState *record)
 					   XLogRecGetOrigin(record));
 		LWLockRelease(TwoPhaseStateLock);
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record);
-
-		if (standbyState >= STANDBY_INITIALIZED)
-			ProcArrayApplyXidAssignment(xlrec->xtop,
-										xlrec->nsubxacts, xlrec->xsub);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9102c8d772..deb2cd1883 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -48,6 +48,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
@@ -950,8 +951,6 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	END_CRIT_SECTION();
 
-	MarkCurrentTransactionIdLoggedIfAny();
-
 	/*
 	 * Mark top transaction id is logged (if needed) so that we should not try
 	 * to log it again with the next WAL record in the current subtransaction.
@@ -5175,6 +5174,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCSNLog();
 	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
@@ -5776,16 +5776,16 @@ StartupXLOG(void)
 		 */
 		if (ArchiveRecoveryRequested && EnableHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
+			FullTransactionId latestCompletedXid;
 
 			ereport(DEBUG1,
 					(errmsg_internal("initializing for hot standby")));
+			InHotStandby = true;
 
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
-				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanPreparedTransactions();
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -5800,39 +5800,17 @@ StartupXLOG(void)
 			 */
 			StartupSUBTRANS(oldestActiveXID);
 
-			/*
-			 * If we're beginning at a shutdown checkpoint, we know that
-			 * nothing was running on the primary at this point. So fake-up an
-			 * empty running-xacts record and use that here and now. Recover
-			 * additional standby state for prepared transactions.
-			 */
-			if (wasShutdown)
-			{
-				RunningTransactionsData running;
-				TransactionId latestCompletedXid;
+			latestCompletedXid = checkPoint.nextXid;
+			FullTransactionIdRetreat(&latestCompletedXid);
+			TransamVariables->latestCompletedXid = latestCompletedXid;
 
-				/* Update pg_subtrans entries for any prepared transactions */
-				StandbyRecoverPreparedTransactions();
+			StartupCSNLog(oldestActiveXID, RedoRecPtr);
 
-				/*
-				 * Construct a RunningTransactions snapshot representing a
-				 * shut down server, with only prepared transactions still
-				 * alive. We're never overflowed at this point because all
-				 * subxids are listed with their parent prepared transactions.
-				 */
-				running.xcnt = nxids;
-				running.subxcnt = 0;
-				running.subxid_status = SUBXIDS_IN_SUBTRANS;
-				running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-				running.oldestRunningXid = oldestActiveXID;
-				latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-				TransactionIdRetreat(latestCompletedXid);
-				Assert(TransactionIdIsNormal(latestCompletedXid));
-				running.latestCompletedXid = latestCompletedXid;
-				running.xids = xids;
-
-				ProcArrayApplyRecoveryInfo(&running);
-			}
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
+
+			/* Update pg_subtrans entries for any prepared transactions */
+			if (wasShutdown)
+				StandbyRecoverPreparedTransactions();
 		}
 
 		/*
@@ -5916,7 +5894,7 @@ StartupXLOG(void)
 	 * This information is not quite needed yet, but it is positioned here so
 	 * as potential problems are detected before any on-disk change is done.
 	 */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanPreparedTransactions();
 
 	/*
 	 * Allow ordinary WAL segment creation before possibly switching to a new
@@ -6082,9 +6060,18 @@ StartupXLOG(void)
 	 * Start up subtrans, if not already done for hot standby.  (commit
 	 * timestamps are started below, if necessary.)
 	 */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
+	{
 		StartupSUBTRANS(oldestActiveXID);
 
+		/*
+		 * TODO: we don't need to update CSN log from now on, but it's still
+		 * required by snapshots that were taken before recovery ended.  We
+		 * just let it be, but it would be nice to truncate it to 0 after all
+		 * the snapshots are gone.
+		 */
+	}
+
 	/*
 	 * Perform end of recovery actions for any SLRUs that need it.
 	 */
@@ -6176,12 +6163,12 @@ StartupXLOG(void)
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
 	 * and after switching SharedRecoveryState to RECOVERY_STATE_DONE so as
-	 * any session building a snapshot will not rely on KnownAssignedXids as
+	 * any session building a snapshot will not rely on the CSN log as
 	 * RecoveryInProgress() would return false at this stage.  This is
 	 * particularly critical for prepared 2PC transactions, that would still
 	 * need to be included in snapshots once recovery has ended.
 	 */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/*
@@ -6953,7 +6940,7 @@ CreateCheckPoint(int flags)
 	 * starting snapshot of locks and transactions.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		checkPoint.oldestActiveXid = GetOldestActiveTransactionId();
+		checkPoint.oldestActiveXid = GetOldestActiveTransactionId(true);
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -7345,7 +7332,10 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
+	{
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+	}
 
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(false);
@@ -7518,6 +7508,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
 	CheckPointCLOG();
+	CheckPointCSNLog();
 	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
@@ -7814,7 +7805,10 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
+	{
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+	}
 
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(true);
@@ -8299,41 +8293,17 @@ xlog_redo(XLogReaderState *record)
 
 		/*
 		 * If we see a shutdown checkpoint, we know that nothing was running
-		 * on the primary at this point. So fake-up an empty running-xacts
-		 * record and use that here and now. Recover additional standby state
-		 * for prepared transactions.
+		 * on the primary at this point, except for prepared transactions.
 		 */
-		if (standbyState >= STANDBY_INITIALIZED)
+		if (InHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
 			TransactionId oldestActiveXID;
-			TransactionId latestCompletedXid;
-			RunningTransactionsData running;
 
-			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanPreparedTransactions();
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
 
 			/* Update pg_subtrans entries for any prepared transactions */
 			StandbyRecoverPreparedTransactions();
-
-			/*
-			 * Construct a RunningTransactions snapshot representing a shut
-			 * down server, with only prepared transactions still alive. We're
-			 * never overflowed at this point because all subxids are listed
-			 * with their parent prepared transactions.
-			 */
-			running.xcnt = nxids;
-			running.subxcnt = 0;
-			running.subxid_status = SUBXIDS_IN_SUBTRANS;
-			running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-			running.oldestRunningXid = oldestActiveXID;
-			latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-			TransactionIdRetreat(latestCompletedXid);
-			Assert(TransactionIdIsNormal(latestCompletedXid));
-			running.latestCompletedXid = latestCompletedXid;
-			running.xids = xids;
-
-			ProcArrayApplyRecoveryInfo(&running);
 		}
 
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
@@ -8397,6 +8367,16 @@ xlog_redo(XLogReaderState *record)
 								  checkPoint.oldestXid))
 			SetTransactionIdLimit(checkPoint.oldestXid,
 								  checkPoint.oldestXidDB);
+
+		/*
+		 * Remember the oldest XID that was running at the time.  Normally,
+		 * all transaction aborts and commits are WAL-logged, so our
+		 * oldestRunningXid value should be up-to-date, but if not, this
+		 * allows us to resynchronize.
+		 */
+		if (InHotStandby)
+			ProcArrayUpdateOldestRunningXid(checkPoint.oldestActiveXid);
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 320b14add1..709756ceba 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1986,10 +1986,9 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	SpinLockRelease(&XLogRecoveryCtl->info_lck);
 
 	/*
-	 * If we are attempting to enter Hot Standby mode, process XIDs we see
+	 * In Hot Standby mode, process XIDs we see
 	 */
-	if (standbyState >= STANDBY_INITIALIZED &&
-		TransactionIdIsValid(record->xl_xid))
+	if (InHotStandby && TransactionIdIsValid(record->xl_xid))
 		RecordKnownAssignedTransactionIds(record->xl_xid);
 
 	/*
@@ -2266,7 +2265,7 @@ CheckRecoveryConsistency(void)
 	 * run? If so, we can tell postmaster that the database is consistent now,
 	 * enabling connections.
 	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY &&
+	if (InHotStandby &&
 		!LocalHotStandbyActive &&
 		reachedConsistency &&
 		IsUnderPostmaster)
@@ -3711,9 +3710,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						elog(LOG, "waiting for WAL to become available at %X/%X",
 							 LSN_FORMAT_ARGS(RecPtr));
 
-						/* Do background tasks that might benefit us later. */
-						KnownAssignedTransactionIdsIdleMaintenance();
-
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,
@@ -3979,9 +3975,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						streaming_reply_sent = true;
 					}
 
-					/* Do any background tasks that might benefit us later. */
-					KnownAssignedTransactionIdsIdleMaintenance();
-
 					/* Update pg_stat_recovery_prefetch before sleeping. */
 					XLogPrefetcherComputeStats(xlogprefetcher);
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5295b85fe0..bf08c60e93 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -50,7 +50,7 @@ bool		ignore_invalid_pages = false;
 bool		InRecovery = false;
 
 /* Are we in Hot Standby mode? Only valid in startup process, see xlogutils.h */
-HotStandbyState standbyState = STANDBY_DISABLED;
+bool		InHotStandby = false;
 
 /*
  * During XLOG replay, we may see XLOG records for incremental updates of
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index ef6f98ebcd..a975865fdd 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -203,7 +203,7 @@ static void
 StartupProcExit(int code, Datum arg)
 {
 	/* Shutdown the recovery environment */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 }
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d687ceee33..caae8f75c2 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -270,14 +270,6 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
-		case XLOG_XACT_ASSIGNMENT:
-
-			/*
-			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here. See
-			 * LogicalDecodingProcessRecord.
-			 */
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			{
 				TransactionId xid;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a6a4da3266..734865ce62 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -27,7 +27,7 @@
  * removed. This is achieved by using the replication slot mechanism.
  *
  * As the percentage of transactions modifying the catalog normally is fairly
- * small in comparisons to ones only manipulating user data, we keep track of
+ * small in comparison to ones only manipulating user data, we keep track of
  * the committed catalog modifying ones inside [xmin, xmax) instead of keeping
  * track of all running transactions like it's done in a normal snapshot. Note
  * that we're generally only looking at transactions that have acquired an
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 10fc18f252..ec8dd26bd7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/subtrans.h"
@@ -124,6 +125,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, XLogRecoveryShmemSize());
 	size = add_size(size, CLOGShmemSize());
+	size = add_size(size, CSNLogShmemSize());
 	size = add_size(size, CommitTsShmemSize());
 	size = add_size(size, SUBTRANSShmemSize());
 	size = add_size(size, TwoPhaseShmemSize());
@@ -289,6 +291,7 @@ CreateOrAttachShmemStructs(void)
 	XLogPrefetchShmemInit();
 	XLogRecoveryShmemInit();
 	CLOGShmemInit();
+	CSNLogShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 36610a1c7e..c82e8d8c43 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -19,20 +19,10 @@
  * myProcLocks lists.  They can be distinguished from regular backend PGPROCs
  * at need by checking for pid == 0.
  *
- * During hot standby, we also keep a list of XIDs representing transactions
- * that are known to be running on the primary (or more precisely, were running
- * as of the current point in the WAL stream).  This list is kept in the
- * KnownAssignedXids array, and is updated by watching the sequence of
- * arriving XIDs.  This is necessary because if we leave those XIDs out of
- * snapshots taken for standby queries, then they will appear to be already
- * complete, leading to MVCC failures.  Note that in hot standby, the PGPROC
- * array represents standby processes, which by definition are not running
- * transactions that have XIDs.
- *
- * It is perhaps possible for a backend on the primary to terminate without
- * writing an abort record for its transaction.  While that shouldn't really
- * happen, it would tie up KnownAssignedXids indefinitely, so we protect
- * ourselves by pruning the array when a valid list of running XIDs arrives.
+ * During hot standby, we don't have PGPROC entries representing transactions
+ * running in the primary.  Instead, each snapshot taken during recovery
+ * contains a Commit-Sequence Number (CSN), which is used to determine which
+ * XIDs the snapshot still considers to be running.
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -47,6 +37,7 @@
 
 #include <signal.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -73,22 +64,8 @@ typedef struct ProcArrayStruct
 	int			numProcs;		/* number of valid procs entries */
 	int			maxProcs;		/* allocated size of procs array */
 
-	/*
-	 * Known assigned XIDs handling
-	 */
-	int			maxKnownAssignedXids;	/* allocated size of array */
-	int			numKnownAssignedXids;	/* current # of valid entries */
-	int			tailKnownAssignedXids;	/* index of oldest valid element */
-	int			headKnownAssignedXids;	/* index of newest element, + 1 */
-
-	/*
-	 * Highest subxid that has been removed from KnownAssignedXids array to
-	 * prevent overflow; or InvalidTransactionId if none.  We track this for
-	 * similar reasons to tracking overflowing cached subxids in PGPROC
-	 * entries.  Must hold exclusive ProcArrayLock to change this, and shared
-	 * lock to read it.
-	 */
-	TransactionId lastOverflowedXid;
+	/* In recovery, oldest XID that could be still running in primary */
+	TransactionId oldest_running_primary_xid;
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
@@ -99,6 +76,21 @@ typedef struct ProcArrayStruct
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
+#define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
+
+/*
+ * TOTAL_MAX_CACHED_SUBXIDS is the total number of XIDs that fits in the proc
+ * array, as top XIDs and in the subxids caches.
+ *
+ * Local data structures are also created in various backends during
+ * GetSnapshotData(), TransactionIdIsInProgress() and
+ * GetRunningTransactionData(). All of the main structures created in those
+ * functions must be identically sized, since we may at times copy the whole
+ * of the data structures around.
+ */
+#define TOTAL_MAX_CACHED_SUBXIDS \
+	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
+
 /*
  * State for the GlobalVisTest* family of functions. Those functions can
  * e.g. be used to decide if a deleted row can be removed without violating
@@ -254,17 +246,6 @@ typedef enum GlobalVisHorizonKind
 	VISHORIZON_TEMP,
 } GlobalVisHorizonKind;
 
-/*
- * Reason codes for KnownAssignedXidsCompress().
- */
-typedef enum KAXCompressReason
-{
-	KAX_NO_SPACE,				/* need to free up space at array end */
-	KAX_PRUNE,					/* we just pruned old entries */
-	KAX_TRANSACTION_END,		/* we just committed/removed some XIDs */
-	KAX_STARTUP_PROCESS_IDLE,	/* startup process is about to sleep */
-} KAXCompressReason;
-
 
 static ProcArrayStruct *procArray;
 
@@ -278,17 +259,8 @@ static TransactionId cachedXidIsNotInProgress = InvalidTransactionId;
 /*
  * Bookkeeping for tracking emulated transactions in recovery
  */
-static TransactionId *KnownAssignedXids;
-static bool *KnownAssignedXidsValid;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
-/*
- * If we're in STANDBY_SNAPSHOT_PENDING state, standbySnapshotPendingXmin is
- * the highest xid that might still be running that we don't have in
- * KnownAssignedXids.
- */
-static TransactionId standbySnapshotPendingXmin;
-
 /*
  * State for visibility checks on different types of relations. See struct
  * GlobalVisState for details. As shared, catalog, normal and temporary
@@ -315,7 +287,7 @@ static long xc_by_my_xact = 0;
 static long xc_by_latest_xid = 0;
 static long xc_by_main_xid = 0;
 static long xc_by_child_xid = 0;
-static long xc_by_known_assigned = 0;
+static long xc_during_recovery = 0;
 static long xc_no_overflow = 0;
 static long xc_slow_answer = 0;
 
@@ -325,7 +297,7 @@ static long xc_slow_answer = 0;
 #define xc_by_latest_xid_inc()		(xc_by_latest_xid++)
 #define xc_by_main_xid_inc()		(xc_by_main_xid++)
 #define xc_by_child_xid_inc()		(xc_by_child_xid++)
-#define xc_by_known_assigned_inc()	(xc_by_known_assigned++)
+#define xc_during_recovery_inc()	(xc_during_recovery++)
 #define xc_no_overflow_inc()		(xc_no_overflow++)
 #define xc_slow_answer_inc()		(xc_slow_answer++)
 
@@ -338,28 +310,12 @@ static void DisplayXidCache(void);
 #define xc_by_latest_xid_inc()		((void) 0)
 #define xc_by_main_xid_inc()		((void) 0)
 #define xc_by_child_xid_inc()		((void) 0)
-#define xc_by_known_assigned_inc()	((void) 0)
+#define xc_during_recovery_inc()	((void) 0)
 #define xc_no_overflow_inc()		((void) 0)
 #define xc_slow_answer_inc()		((void) 0)
 #endif							/* XIDCACHE_DEBUG */
 
-/* Primitives for KnownAssignedXids array handling for standby */
-static void KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock);
-static void KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-								 bool exclusive_lock);
-static bool KnownAssignedXidsSearch(TransactionId xid, bool remove);
-static bool KnownAssignedXidExists(TransactionId xid);
-static void KnownAssignedXidsRemove(TransactionId xid);
-static void KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-										TransactionId *subxids);
-static void KnownAssignedXidsRemovePreceding(TransactionId removeXid);
-static int	KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax);
-static int	KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
-										   TransactionId *xmin,
-										   TransactionId xmax);
-static TransactionId KnownAssignedXidsGetOldestXmin(void);
-static void KnownAssignedXidsDisplay(int trace_level);
-static void KnownAssignedXidsReset(void);
+
 static inline void ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 static void MaintainLatestCompletedXid(TransactionId latestXid);
@@ -383,31 +339,6 @@ ProcArrayShmemSize(void)
 	size = offsetof(ProcArrayStruct, pgprocnos);
 	size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS));
 
-	/*
-	 * During Hot Standby processing we have a data structure called
-	 * KnownAssignedXids, created in shared memory. Local data structures are
-	 * also created in various backends during GetSnapshotData(),
-	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
-	 * main structures created in those functions must be identically sized,
-	 * since we may at times copy the whole of the data structures around. We
-	 * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
-	 *
-	 * Ideally we'd only create this structure if we were actually doing hot
-	 * standby in the current run, but we don't know that yet at the time
-	 * shared memory is being set up.
-	 */
-#define TOTAL_MAX_CACHED_SUBXIDS \
-	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
-
-	if (EnableHotStandby)
-	{
-		size = add_size(size,
-						mul_size(sizeof(TransactionId),
-								 TOTAL_MAX_CACHED_SUBXIDS));
-		size = add_size(size,
-						mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS));
-	}
-
 	return size;
 }
 
@@ -434,31 +365,12 @@ ProcArrayShmemInit(void)
 		 */
 		procArray->numProcs = 0;
 		procArray->maxProcs = PROCARRAY_MAXPROCS;
-		procArray->maxKnownAssignedXids = TOTAL_MAX_CACHED_SUBXIDS;
-		procArray->numKnownAssignedXids = 0;
-		procArray->tailKnownAssignedXids = 0;
-		procArray->headKnownAssignedXids = 0;
-		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
 		TransamVariables->xactCompletionCount = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
-
-	/* Create or attach to the KnownAssignedXids arrays too, if needed */
-	if (EnableHotStandby)
-	{
-		KnownAssignedXids = (TransactionId *)
-			ShmemInitStruct("KnownAssignedXids",
-							mul_size(sizeof(TransactionId),
-									 TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-		KnownAssignedXidsValid = (bool *)
-			ShmemInitStruct("KnownAssignedXidsValid",
-							mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-	}
 }
 
 /*
@@ -1022,355 +934,35 @@ MaintainLatestCompletedXidRecovery(TransactionId latestXid)
 void
 ProcArrayInitRecovery(TransactionId initializedUptoXID)
 {
-	Assert(standbyState == STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsNormal(initializedUptoXID));
 
 	/*
-	 * we set latestObservedXid to the xid SUBTRANS has been initialized up
-	 * to, so we can extend it from that point onwards in
-	 * RecordKnownAssignedTransactionIds, and when we get consistent in
-	 * ProcArrayApplyRecoveryInfo().
+	 * we set latestObservedXid to the xid SUBTRANS and CSN log have been
+	 * initialized up to, so we can extend it from that point onwards whenever
+	 * we observe new XIDs.
 	 */
 	latestObservedXid = initializedUptoXID;
 	TransactionIdRetreat(latestObservedXid);
 }
 
 /*
- * ProcArrayApplyRecoveryInfo -- apply recovery info about xids
- *
- * Takes us through 3 states: Initialized, Pending and Ready.
- * Normal case is to go all the way to Ready straight away, though there
- * are atypical cases where we need to take it in steps.
- *
- * Use the data about running transactions on the primary to create the initial
- * state of KnownAssignedXids. We also use these records to regularly prune
- * KnownAssignedXids because we know it is possible that some transactions
- * with FATAL errors fail to write abort records, which could cause eventual
- * overflow.
- *
- * See comments for LogStandbySnapshot().
+ * Update oldest running XID from a checkpoint record. This allows truncating
+ * SUBTRANS and the CSN log.
  */
 void
-ProcArrayApplyRecoveryInfo(RunningTransactions running)
+ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID)
 {
-	TransactionId *xids;
-	TransactionId advanceNextXid;
-	int			nxids;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-	Assert(TransactionIdIsValid(running->nextXid));
-	Assert(TransactionIdIsValid(running->oldestRunningXid));
-	Assert(TransactionIdIsNormal(running->latestCompletedXid));
-
-	/*
-	 * Remove stale transactions, if any.
-	 */
-	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
-
-	/*
-	 * Adjust TransamVariables->nextXid before StandbyReleaseOldLocks(),
-	 * because we will need it up to date for accessing two-phase transactions
-	 * in StandbyReleaseOldLocks().
-	 */
-	advanceNextXid = running->nextXid;
-	TransactionIdRetreat(advanceNextXid);
-	AdvanceNextFullTransactionIdPastXid(advanceNextXid);
-	Assert(FullTransactionIdIsValid(TransamVariables->nextXid));
-
 	/*
 	 * Remove stale locks, if any.
 	 */
-	StandbyReleaseOldLocks(running->oldestRunningXid);
-
-	/*
-	 * If our snapshot is already valid, nothing else to do...
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		return;
-
-	/*
-	 * If our initial RunningTransactionsData had an overflowed snapshot then
-	 * we knew we were missing some subxids from our snapshot. If we continue
-	 * to see overflowed snapshots then we might never be able to start up, so
-	 * we make another test to see if our snapshot is now valid. We know that
-	 * the missing subxids are equal to or earlier than nextXid. After we
-	 * initialise we continue to apply changes during recovery, so once the
-	 * oldestRunningXid is later than the nextXid from the initial snapshot we
-	 * know that we no longer have missing information and can mark the
-	 * snapshot as valid.
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_PENDING)
-	{
-		/*
-		 * If the snapshot isn't overflowed or if its empty we can reset our
-		 * pending state and use this snapshot instead.
-		 */
-		if (running->subxid_status != SUBXIDS_MISSING || running->xcnt == 0)
-		{
-			/*
-			 * If we have already collected known assigned xids, we need to
-			 * throw them away before we apply the recovery snapshot.
-			 */
-			KnownAssignedXidsReset();
-			standbyState = STANDBY_INITIALIZED;
-		}
-		else
-		{
-			if (TransactionIdPrecedes(standbySnapshotPendingXmin,
-									  running->oldestRunningXid))
-			{
-				standbyState = STANDBY_SNAPSHOT_READY;
-				elog(DEBUG1,
-					 "recovery snapshots are now enabled");
-			}
-			else
-				elog(DEBUG1,
-					 "recovery snapshot waiting for non-overflowed snapshot or "
-					 "until oldest active xid on standby is at least %u (now %u)",
-					 standbySnapshotPendingXmin,
-					 running->oldestRunningXid);
-			return;
-		}
-	}
-
-	Assert(standbyState == STANDBY_INITIALIZED);
-
-	/*
-	 * NB: this can be reached at least twice, so make sure new code can deal
-	 * with that.
-	 */
+	StandbyReleaseOldLocks(oldestRunningXID);
 
-	/*
-	 * Nobody else is running yet, but take locks anyhow
-	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * KnownAssignedXids is sorted so we cannot just add the xids, we have to
-	 * sort them first.
-	 *
-	 * Some of the new xids are top-level xids and some are subtransactions.
-	 * We don't call SubTransSetParent because it doesn't matter yet. If we
-	 * aren't overflowed then all xids will fit in snapshot and so we don't
-	 * need subtrans. If we later overflow, an xid assignment record will add
-	 * xids to subtrans. If RunningTransactionsData is overflowed then we
-	 * don't have enough information to correctly update subtrans anyway.
-	 */
-
-	/*
-	 * Allocate a temporary array to avoid modifying the array passed as
-	 * argument.
-	 */
-	xids = palloc(sizeof(TransactionId) * (running->xcnt + running->subxcnt));
-
-	/*
-	 * Add to the temp array any xids which have not already completed.
-	 */
-	nxids = 0;
-	for (i = 0; i < running->xcnt + running->subxcnt; i++)
-	{
-		TransactionId xid = running->xids[i];
-
-		/*
-		 * The running-xacts snapshot can contain xids that were still visible
-		 * in the procarray when the snapshot was taken, but were already
-		 * WAL-logged as completed. They're not running anymore, so ignore
-		 * them.
-		 */
-		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-			continue;
-
-		xids[nxids++] = xid;
-	}
-
-	if (nxids > 0)
-	{
-		if (procArray->numKnownAssignedXids != 0)
-		{
-			LWLockRelease(ProcArrayLock);
-			elog(ERROR, "KnownAssignedXids is not empty");
-		}
-
-		/*
-		 * Sort the array so that we can add them safely into
-		 * KnownAssignedXids.
-		 *
-		 * We have to sort them logically, because in KnownAssignedXidsAdd we
-		 * call TransactionIdFollowsOrEquals and so on. But we know these XIDs
-		 * come from RUNNING_XACTS, which means there are only normal XIDs
-		 * from the same epoch, so this is safe.
-		 */
-		qsort(xids, nxids, sizeof(TransactionId), xidLogicalComparator);
-
-		/*
-		 * Add the sorted snapshot into KnownAssignedXids.  The running-xacts
-		 * snapshot may include duplicated xids because of prepared
-		 * transactions, so ignore them.
-		 */
-		for (i = 0; i < nxids; i++)
-		{
-			if (i > 0 && TransactionIdEquals(xids[i - 1], xids[i]))
-			{
-				elog(DEBUG1,
-					 "found duplicated transaction %u for KnownAssignedXids insertion",
-					 xids[i]);
-				continue;
-			}
-			KnownAssignedXidsAdd(xids[i], xids[i], true);
-		}
-
-		KnownAssignedXidsDisplay(DEBUG3);
-	}
-
-	pfree(xids);
-
-	/*
-	 * latestObservedXid is at least set to the point where SUBTRANS was
-	 * started up to (cf. ProcArrayInitRecovery()) or to the biggest xid
-	 * RecordKnownAssignedTransactionIds() was called for.  Initialize
-	 * subtrans from thereon, up to nextXid - 1.
-	 *
-	 * We need to duplicate parts of RecordKnownAssignedTransactionId() here,
-	 * because we've just added xids to the known assigned xids machinery that
-	 * haven't gone through RecordKnownAssignedTransactionId().
-	 */
-	Assert(TransactionIdIsNormal(latestObservedXid));
-	TransactionIdAdvance(latestObservedXid);
-	while (TransactionIdPrecedes(latestObservedXid, running->nextXid))
-	{
-		ExtendSUBTRANS(latestObservedXid);
-		TransactionIdAdvance(latestObservedXid);
-	}
-	TransactionIdRetreat(latestObservedXid);	/* = running->nextXid - 1 */
-
-	/* ----------
-	 * Now we've got the running xids we need to set the global values that
-	 * are used to track snapshots as they evolve further.
-	 *
-	 * - latestCompletedXid which will be the xmax for snapshots
-	 * - lastOverflowedXid which shows whether snapshots overflow
-	 * - nextXid
-	 *
-	 * If the snapshot overflowed, then we still initialise with what we know,
-	 * but the recovery snapshot isn't fully valid yet because we know there
-	 * are some subxids missing. We don't know the specific subxids that are
-	 * missing, so conservatively assume the last one is latestObservedXid.
-	 * ----------
-	 */
-	if (running->subxid_status == SUBXIDS_MISSING)
-	{
-		standbyState = STANDBY_SNAPSHOT_PENDING;
-
-		standbySnapshotPendingXmin = latestObservedXid;
-		procArray->lastOverflowedXid = latestObservedXid;
-	}
-	else
-	{
-		standbyState = STANDBY_SNAPSHOT_READY;
-
-		standbySnapshotPendingXmin = InvalidTransactionId;
-
-		/*
-		 * If the 'xids' array didn't include all subtransactions, we have to
-		 * mark any snapshots taken as overflowed.
-		 */
-		if (running->subxid_status == SUBXIDS_IN_SUBTRANS)
-			procArray->lastOverflowedXid = latestObservedXid;
-		else
-		{
-			Assert(running->subxid_status == SUBXIDS_IN_ARRAY);
-			procArray->lastOverflowedXid = InvalidTransactionId;
-		}
-	}
-
-	/*
-	 * If a transaction wrote a commit record in the gap between taking and
-	 * logging the snapshot then latestCompletedXid may already be higher than
-	 * the value from the snapshot, so check before we use the incoming value.
-	 * It also might not yet be set at all.
-	 */
-	MaintainLatestCompletedXidRecovery(running->latestCompletedXid);
-
-	/*
-	 * NB: No need to increment TransamVariables->xactCompletionCount here,
-	 * nobody can see it yet.
-	 */
-
+	procArray->oldest_running_primary_xid = oldestRunningXID;
 	LWLockRelease(ProcArrayLock);
-
-	KnownAssignedXidsDisplay(DEBUG3);
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		elog(DEBUG1, "recovery snapshots are now enabled");
-	else
-		elog(DEBUG1,
-			 "recovery snapshot waiting for non-overflowed snapshot or "
-			 "until oldest active xid on standby is at least %u (now %u)",
-			 standbySnapshotPendingXmin,
-			 running->oldestRunningXid);
 }
 
-/*
- * ProcArrayApplyXidAssignment
- *		Process an XLOG_XACT_ASSIGNMENT WAL record
- */
-void
-ProcArrayApplyXidAssignment(TransactionId topxid,
-							int nsubxids, TransactionId *subxids)
-{
-	TransactionId max_xid;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-
-	max_xid = TransactionIdLatest(topxid, nsubxids, subxids);
-
-	/*
-	 * Mark all the subtransactions as observed.
-	 *
-	 * NOTE: This will fail if the subxid contains too many previously
-	 * unobserved xids to fit into known-assigned-xids. That shouldn't happen
-	 * as the code stands, because xid-assignment records should never contain
-	 * more than PGPROC_MAX_CACHED_SUBXIDS entries.
-	 */
-	RecordKnownAssignedTransactionIds(max_xid);
-
-	/*
-	 * Notice that we update pg_subtrans with the top-level xid, rather than
-	 * the parent xid. This is a difference between normal processing and
-	 * recovery, yet is still correct in all cases. The reason is that
-	 * subtransaction commit is not marked in clog until commit processing, so
-	 * all aborted subtransactions have already been clearly marked in clog.
-	 * As a result we are able to refer directly to the top-level
-	 * transaction's state rather than skipping through all the intermediate
-	 * states in the subtransaction tree. This should be the first time we
-	 * have attempted to SubTransSetParent().
-	 */
-	for (i = 0; i < nsubxids; i++)
-		SubTransSetParent(subxids[i], topxid);
-
-	/* KnownAssignedXids isn't maintained yet, so we're done for now */
-	if (standbyState == STANDBY_INITIALIZED)
-		return;
-
-	/*
-	 * Uses same locking as transaction commit
-	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * Remove subxids from known-assigned-xacts.
-	 */
-	KnownAssignedXidsRemoveTree(InvalidTransactionId, nsubxids, subxids);
-
-	/*
-	 * Advance lastOverflowedXid to be at least the last of these subxids.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, max_xid))
-		procArray->lastOverflowedXid = max_xid;
-
-	LWLockRelease(ProcArrayLock);
-}
 
 /*
  * TransactionIdIsInProgress -- is given transaction running in some backend
@@ -1378,23 +970,24 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * Aside from some shortcuts such as checking RecentXmin and our own Xid,
  * there are four possibilities for finding a running transaction:
  *
- * 1. The given Xid is a main transaction Id.  We will find this out cheaply
+ * 1. In Hot Standby mode, there are no transactions with XIDs active in the
+ * standby. Check pg_xact to see if the transaction is known to have committed
+ * or aborted, otherwise it's considered as running.
+ *
+ * 2. The given Xid is a main transaction Id.  We will find this out cheaply
  * by looking at ProcGlobal->xids.
  *
- * 2. The given Xid is one of the cached subxact Xids in the PGPROC array.
+ * 3. The given Xid is one of the cached subxact Xids in the PGPROC array.
  * We can find this out cheaply too.
  *
- * 3. In Hot Standby mode, we must search the KnownAssignedXids list to see
- * if the Xid is running on the primary.
- *
  * 4. Search the SubTrans tree to find the Xid's topmost parent, and then see
- * if that is running according to ProcGlobal->xids[] or KnownAssignedXids.
+ * if that is running according to ProcGlobal->xids[].
  * This is the slowest way, but sadly it has to be done always if the others
  * failed, unless we see that the cached subxact sets are complete (none have
  * overflowed).
  *
- * ProcArrayLock has to be held while we do 1, 2, 3.  If we save the top Xids
- * while doing 1 and 3, we can release the ProcArrayLock while we do 4.
+ * ProcArrayLock has to be held while we do 2 and 3.  If we save the top Xids
+ * while doing 2 and 3, we can release the ProcArrayLock while we do 4.
  * This buys back some concurrency (and we can't retrieve the main Xids from
  * ProcGlobal->xids[] again anyway; see GetNewTransactionId).
  */
@@ -1435,6 +1028,28 @@ TransactionIdIsInProgress(TransactionId xid)
 		return false;
 	}
 
+	/*
+	 * In hot standby mode, check pg_xact.
+	 *
+	 * With normal non-CSN snapshots, you must be careful to check
+	 * TransactionIdIsInProgress() before checking pg_xact, because a
+	 * transaction is marked as committed before it's removed from PGPROC. But
+	 * during recovery, we now use CSN snapshots so I think that's OK. See the
+	 * "NOTE" at the top of heapam_visibility.c.
+	 *
+	 * During recovery, the XID cannot be our own transaction, and the CSN
+	 * check handles subtransactions too, so we can skip the rest of the
+	 * function.
+	 */
+	if (RecoveryInProgress())
+	{
+		xc_during_recovery_inc();
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+			return false;
+		else
+			return true;
+	}
+
 	/*
 	 * Also, we can handle our own transaction (and subtransactions) without
 	 * any access to shared memory.
@@ -1451,12 +1066,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	 */
 	if (xids == NULL)
 	{
-		/*
-		 * In hot standby mode, reserve enough space to hold all xids in the
-		 * known-assigned list. If we later finish recovery, we no longer need
-		 * the bigger array, but we don't bother to shrink it.
-		 */
-		int			maxxids = RecoveryInProgress() ? TOTAL_MAX_CACHED_SUBXIDS : arrayP->maxProcs;
+		int			maxxids = arrayP->maxProcs;
 
 		xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId));
 		if (xids == NULL)
@@ -1551,33 +1161,6 @@ TransactionIdIsInProgress(TransactionId xid)
 			xids[nxids++] = pxid;
 	}
 
-	/*
-	 * Step 3: in hot standby mode, check the known-assigned-xids list.  XIDs
-	 * in the list must be treated as running.
-	 */
-	if (RecoveryInProgress())
-	{
-		/* none of the PGPROC entries should have XIDs in hot standby mode */
-		Assert(nxids == 0);
-
-		if (KnownAssignedXidExists(xid))
-		{
-			LWLockRelease(ProcArrayLock);
-			xc_by_known_assigned_inc();
-			return true;
-		}
-
-		/*
-		 * If the KnownAssignedXids overflowed, we have to check pg_subtrans
-		 * too.  Fetch all xids from KnownAssignedXids that are lower than
-		 * xid, since if xid is a subtransaction its parent will always have a
-		 * lower value.  Note we will collect both main and subXIDs here, but
-		 * there's no help for it.
-		 */
-		if (TransactionIdPrecedesOrEquals(xid, procArray->lastOverflowedXid))
-			nxids = KnownAssignedXidsGet(xids, xid);
-	}
-
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -1851,8 +1434,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 * can't be tied to a specific database.)
 		 *
 		 * Also, while in recovery we cannot compute an accurate per-database
-		 * horizon, as all xids are managed via the KnownAssignedXids
-		 * machinery.
+		 * horizon, as all xids are managed via the CSN log machinery.
 		 */
 		if (proc->databaseId == MyDatabaseId ||
 			MyDatabaseId == InvalidOid ||
@@ -1865,11 +1447,14 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	}
 
 	/*
-	 * If in recovery fetch oldest xid in KnownAssignedXids, will be applied
-	 * after lock is released.
+	 * If in recovery fetch oldest xid from last checkpoint.
+	 *
+	 * XXX: that can be much older than what we had previously with the
+	 * known-assigned-xids machinery. I think that's OK, given what this
+	 * function is used for during recovery?
 	 */
 	if (in_recovery)
-		kaxmin = KnownAssignedXidsGetOldestXmin();
+		kaxmin = procArray->oldest_running_primary_xid;
 
 	/*
 	 * No other information from shared state is needed, release the lock
@@ -2188,7 +1773,7 @@ GetSnapshotData(Snapshot snapshot)
 	int			mypgxactoff;
 	TransactionId myxid;
 	uint64		curXactCompletionCount;
-
+	XLogRecPtr	csn = InvalidXLogRecPtr;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -2368,27 +1953,8 @@ GetSnapshotData(Snapshot snapshot)
 	else
 	{
 		/*
-		 * We're in hot standby, so get XIDs from KnownAssignedXids.
-		 *
-		 * We store all xids directly into subxip[]. Here's why:
-		 *
-		 * In recovery we don't know which xids are top-level and which are
-		 * subxacts, a design choice that greatly simplifies xid processing.
-		 *
-		 * It seems like we would want to try to put xids into xip[] only, but
-		 * that is fairly small. We would either need to make that bigger or
-		 * to increase the rate at which we WAL-log xid assignment; neither is
-		 * an appealing choice.
-		 *
-		 * We could try to store xids into xip[] first and then into subxip[]
-		 * if there are too many xids. That only works if the snapshot doesn't
-		 * overflow because we do not search subxip[] in that case. A simpler
-		 * way is to just store all xids in the subxip array because this is
-		 * by far the bigger array. We just leave the xip array empty.
-		 *
-		 * Either way we need to change the way XidInMVCCSnapshot() works
-		 * depending upon when the snapshot was taken, or change normal
-		 * snapshot processing so it matches.
+		 * We're in hot standby, so get the current CSN. That's used to
+		 * determine which transactions committed before this snapshot.
 		 *
 		 * Note: It is possible for recovery to end before we finish taking
 		 * the snapshot, and for newly assigned transaction ids to be added to
@@ -2396,14 +1962,17 @@ GetSnapshotData(Snapshot snapshot)
 		 * those newly added transaction ids would be filtered away, so we
 		 * need not be concerned about them.
 		 */
-		subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
-												  xmax);
+		xmin = procArray->oldest_running_primary_xid;
 
-		if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
-			suboverflowed = true;
+		/*
+		 * Take CSN under ProcArrayLock so the snapshot stays synchronized.
+		 * (XXX: not sure that's strictly required.)
+		 * This is what determines which transactions we consider finished and
+		 * which are still in progress.
+		 */
+		csn = TransamVariables->latestCommitLSN;
 	}
 
-
 	/*
 	 * Fetch into local variable while ProcArrayLock is held - the
 	 * LWLockRelease below is a barrier, ensuring this happens inside the
@@ -2519,6 +2088,8 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->lsn = InvalidXLogRecPtr;
 	snapshot->whenTaken = 0;
 
+	snapshot->snapshotCsn = csn;
+
 	return snapshot;
 }
 
@@ -2674,9 +2245,6 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * The returned data structure is statically allocated; caller should not
  * modify it, and must not assume it is valid past the next call.
  *
- * This is never executed during recovery so there is no need to look at
- * KnownAssignedXids.
- *
  * Dummy PGPROCs from prepared transaction are included, meaning that this
  * may return entries with duplicated TransactionId values coming from
  * transaction finishing to prepare.  Nothing is done about duplicated
@@ -2707,6 +2275,7 @@ GetRunningTransactionData(void)
 	int			subcount;
 	bool		suboverflowed;
 
+	/* This is never executed during recovery */
 	Assert(!RecoveryInProgress());
 
 	/*
@@ -2873,15 +2442,16 @@ GetRunningTransactionData(void)
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
- * This is never executed during recovery so there is no need to look at
- * KnownAssignedXids.
+ * If allDbs is false, skip processes attached to other databases.
+ *
+ * This is never executed during recovery.
  *
  * We don't worry about updating other counters, we want to keep this as
  * simple as possible and leave GetSnapshotData() as the primary code for
  * that bookkeeping.
  */
 TransactionId
-GetOldestActiveTransactionId(void)
+GetOldestActiveTransactionId(bool allDbs)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId *other_xids = ProcGlobal->xids;
@@ -2902,11 +2472,13 @@ GetOldestActiveTransactionId(void)
 	LWLockRelease(XidGenLock);
 
 	/*
-	 * Spin over procArray collecting all xids and subxids.
+	 * Spin over procArray checking each xid.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		PGPROC	   *proc = &allProcs[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -2915,6 +2487,9 @@ GetOldestActiveTransactionId(void)
 		if (!TransactionIdIsNormal(xid))
 			continue;
 
+		if (!allDbs && proc->databaseId != MyDatabaseId)
+			continue;
+
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
@@ -2993,8 +2568,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 *
 	 * In recovery we can't lower the safe value besides what we've computed
 	 * above, so we'll have to wait a bit longer there. We unfortunately can
-	 * *not* use KnownAssignedXidsGetOldestXmin() since the KnownAssignedXids
-	 * machinery can miss values and return an older value than is safe.
+	 * *not* use oldest_running_primary_xid since the XID tracking machinery
+	 * can miss values and return an older value than is safe.
 	 */
 	if (!recovery_in_progress)
 	{
@@ -3412,6 +2987,9 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
  * but that would not be true in the case of FATAL errors lagging in array,
  * but we already know those are bogus anyway, so we skip that test.
  *
+ * XXX: KnownAssignedXids is gone so the above comment needs updating. Is
+ * the code still correct? I think so but need to double-check.
+ *
  * If dbOid is valid we skip backends attached to other databases.
  *
  * Be careful to *not* pfree the result from this function. We reuse
@@ -4083,14 +3661,14 @@ static void
 DisplayXidCache(void)
 {
 	fprintf(stderr,
-			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, knownassigned: %ld, nooflo: %ld, slow: %ld\n",
+			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, during_recovery: %ld, nooflo: %ld, slow: %ld\n",
 			xc_by_recent_xmin,
 			xc_by_known_xact,
 			xc_by_my_xact,
 			xc_by_latest_xid,
 			xc_by_main_xid,
 			xc_by_child_xid,
-			xc_by_known_assigned,
+			xc_during_recovery,
 			xc_no_overflow,
 			xc_slow_answer);
 }
@@ -4337,61 +3915,6 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 }
 
 
-/* ----------------------------------------------
- *		KnownAssignedTransactionIds sub-module
- * ----------------------------------------------
- */
-
-/*
- * In Hot Standby mode, we maintain a list of transactions that are (or were)
- * running on the primary at the current point in WAL.  These XIDs must be
- * treated as running by standby transactions, even though they are not in
- * the standby server's PGPROC array.
- *
- * We record all XIDs that we know have been assigned.  That includes all the
- * XIDs seen in WAL records, plus all unobserved XIDs that we can deduce have
- * been assigned.  We can deduce the existence of unobserved XIDs because we
- * know XIDs are assigned in sequence, with no gaps.  The KnownAssignedXids
- * list expands as new XIDs are observed or inferred, and contracts when
- * transaction completion records arrive.
- *
- * During hot standby we do not fret too much about the distinction between
- * top-level XIDs and subtransaction XIDs. We store both together in the
- * KnownAssignedXids list.  In backends, this is copied into snapshots in
- * GetSnapshotData(), taking advantage of the fact that XidInMVCCSnapshot()
- * doesn't care about the distinction either.  Subtransaction XIDs are
- * effectively treated as top-level XIDs and in the typical case pg_subtrans
- * links are *not* maintained (which does not affect visibility).
- *
- * We have room in KnownAssignedXids and in snapshots to hold maxProcs *
- * (1 + PGPROC_MAX_CACHED_SUBXIDS) XIDs, so every primary transaction must
- * report its subtransaction XIDs in a WAL XLOG_XACT_ASSIGNMENT record at
- * least every PGPROC_MAX_CACHED_SUBXIDS.  When we receive one of these
- * records, we mark the subXIDs as children of the top XID in pg_subtrans,
- * and then remove them from KnownAssignedXids.  This prevents overflow of
- * KnownAssignedXids and snapshots, at the cost that status checks for these
- * subXIDs will take a slower path through TransactionIdIsInProgress().
- * This means that KnownAssignedXids is not necessarily complete for subXIDs,
- * though it should be complete for top-level XIDs; this is the same situation
- * that holds with respect to the PGPROC entries in normal running.
- *
- * When we throw away subXIDs from KnownAssignedXids, we need to keep track of
- * that, similarly to tracking overflow of a PGPROC's subxids array.  We do
- * that by remembering the lastOverflowedXid, ie the last thrown-away subXID.
- * As long as that is within the range of interesting XIDs, we have to assume
- * that subXIDs are missing from snapshots.  (Note that subXID overflow occurs
- * on primary when 65th subXID arrives, whereas on standby it occurs when 64th
- * subXID arrives - that is not an error.)
- *
- * Should a backend on primary somehow disappear before it can write an abort
- * record, then we just leave those XIDs in KnownAssignedXids. They actually
- * aborted but we think they were running; the distinction is irrelevant
- * because either way any changes done by the transaction are not visible to
- * backends in the standby.  We prune KnownAssignedXids when
- * XLOG_RUNNING_XACTS arrives, to forestall possible overflow of the
- * array due to such dead XIDs.
- */
-
 /*
  * RecordKnownAssignedTransactionIds
  *		Record the given XID in KnownAssignedXids, as well as any preceding
@@ -4406,7 +3929,7 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 void
 RecordKnownAssignedTransactionIds(TransactionId xid)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsValid(xid));
 	Assert(TransactionIdIsValid(latestObservedXid));
 
@@ -4424,38 +3947,19 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 
 		/*
 		 * Extend subtrans like we do in GetNewTransactionId() during normal
-		 * operation using individual extend steps. Note that we do not need
-		 * to extend clog since its extensions are WAL logged.
-		 *
-		 * This part has to be done regardless of standbyState since we
-		 * immediately start assigning subtransactions to their toplevel
-		 * transactions.
+		 * operation using individual extend steps; extend the CSN log as well.
+		 * Note that we do not need to extend clog since its extensions are WAL
+		 * logged.
 		 */
 		next_expected_xid = latestObservedXid;
 		while (TransactionIdPrecedes(next_expected_xid, xid))
 		{
 			TransactionIdAdvance(next_expected_xid);
 			ExtendSUBTRANS(next_expected_xid);
+			ExtendCSNLog(next_expected_xid);
 		}
 		Assert(next_expected_xid == xid);
 
-		/*
-		 * If the KnownAssignedXids machinery isn't up yet, there's nothing
-		 * more to do since we don't track assigned xids yet.
-		 */
-		if (standbyState <= STANDBY_INITIALIZED)
-		{
-			latestObservedXid = xid;
-			return;
-		}
-
-		/*
-		 * Add (latestObservedXid, xid] onto the KnownAssignedXids array.
-		 */
-		next_expected_xid = latestObservedXid;
-		TransactionIdAdvance(next_expected_xid);
-		KnownAssignedXidsAdd(next_expected_xid, xid, false);
-
 		/*
 		 * Now we can advance latestObservedXid
 		 */
@@ -4467,781 +3971,61 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 }
 
 /*
- * ExpireTreeKnownAssignedTransactionIds
- *		Remove the given XIDs from KnownAssignedXids.
+ * ProcArrayRecoveryEndTransaction
+ *
+ * Called during recovery in analogy with and in place of
+ * ProcArrayEndTransaction(). The transaction becomes visible to any new
+ * snapshots taken after this. 'max_xid' is the highest (sub)XID of the
+ * committed transaction, and 'lsn' is the LSN of the commit record.
  *
- * Called during recovery in analogy with and in place of ProcArrayEndTransaction()
+ * The transaction and all its subtransactions have already been marked as
+ * committed in the CLOG and in the CSNLOG.
  */
 void
-ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
-									  TransactionId *subxids, TransactionId max_xid)
+ProcArrayRecoveryEndTransaction(TransactionId max_xid, XLogRecPtr lsn)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	TransactionId oldest_running_primary_xid;
+
+	Assert(InHotStandby);
+
+	/*
+	 * If this was the oldest XID that was still running, advance it.
+	 * This is important for advancing the global xmin, which avoids
+	 * unnecessary recovery conflicts.
+	 *
+	 * No locking required because this runs in the startup process.
+	 *
+	 * XXX: the caller actually has a list of XIDs that just committed. We
+	 * could save some clog lookups by taking advantage of that list.
+	 */
+	oldest_running_primary_xid = procArray->oldest_running_primary_xid;
+	while (TransactionIdPrecedes(oldest_running_primary_xid, max_xid))
+	{
+		if (!TransactionIdDidCommit(oldest_running_primary_xid) &&
+			!TransactionIdDidAbort(oldest_running_primary_xid))
+		{
+			break;
+		}
+		TransactionIdAdvance(oldest_running_primary_xid);
+	}
+	if (max_xid == oldest_running_primary_xid)
+		TransactionIdAdvance(oldest_running_primary_xid);
 
 	/*
 	 * Uses same locking as transaction commit
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
-	KnownAssignedXidsRemoveTree(xid, nsubxids, subxids);
-
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
 	/* ... and xactCompletionCount */
 	TransamVariables->xactCompletionCount++;
 
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireAllKnownAssignedTransactionIds
- *		Remove all entries in KnownAssignedXids and reset lastOverflowedXid.
- */
-void
-ExpireAllKnownAssignedTransactionIds(void)
-{
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	KnownAssignedXidsRemovePreceding(InvalidTransactionId);
-
-	/*
-	 * Reset lastOverflowedXid.  Currently, lastOverflowedXid has no use after
-	 * the call of this function.  But do this for unification with what
-	 * ExpireOldKnownAssignedTransactionIds() do.
-	 */
-	procArray->lastOverflowedXid = InvalidTransactionId;
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireOldKnownAssignedTransactionIds
- *		Remove KnownAssignedXids entries preceding the given XID and
- *		potentially reset lastOverflowedXid.
- */
-void
-ExpireOldKnownAssignedTransactionIds(TransactionId xid)
-{
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * Reset lastOverflowedXid if we know all transactions that have been
-	 * possibly running are being gone.  Not doing so could cause an incorrect
-	 * lastOverflowedXid value, which makes extra snapshots be marked as
-	 * suboverflowed.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, xid))
-		procArray->lastOverflowedXid = InvalidTransactionId;
-	KnownAssignedXidsRemovePreceding(xid);
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * KnownAssignedTransactionIdsIdleMaintenance
- *		Opportunistically do maintenance work when the startup process
- *		is about to go idle.
- */
-void
-KnownAssignedTransactionIdsIdleMaintenance(void)
-{
-	KnownAssignedXidsCompress(KAX_STARTUP_PROCESS_IDLE, false);
-}
-
-
-/*
- * Private module functions to manipulate KnownAssignedXids
- *
- * There are 5 main uses of the KnownAssignedXids data structure:
- *
- *	* backends taking snapshots - all valid XIDs need to be copied out
- *	* backends seeking to determine presence of a specific XID
- *	* startup process adding new known-assigned XIDs
- *	* startup process removing specific XIDs as transactions end
- *	* startup process pruning array when special WAL records arrive
- *
- * This data structure is known to be a hot spot during Hot Standby, so we
- * go to some lengths to make these operations as efficient and as concurrent
- * as possible.
- *
- * The XIDs are stored in an array in sorted order --- TransactionIdPrecedes
- * order, to be exact --- to allow binary search for specific XIDs.  Note:
- * in general TransactionIdPrecedes would not provide a total order, but
- * we know that the entries present at any instant should not extend across
- * a large enough fraction of XID space to wrap around (the primary would
- * shut down for fear of XID wrap long before that happens).  So it's OK to
- * use TransactionIdPrecedes as a binary-search comparator.
- *
- * It's cheap to maintain the sortedness during insertions, since new known
- * XIDs are always reported in XID order; we just append them at the right.
- *
- * To keep individual deletions cheap, we need to allow gaps in the array.
- * This is implemented by marking array elements as valid or invalid using
- * the parallel boolean array KnownAssignedXidsValid[].  A deletion is done
- * by setting KnownAssignedXidsValid[i] to false, *without* clearing the
- * XID entry itself.  This preserves the property that the XID entries are
- * sorted, so we can do binary searches easily.  Periodically we compress
- * out the unused entries; that's much cheaper than having to compress the
- * array immediately on every deletion.
- *
- * The actually valid items in KnownAssignedXids[] and KnownAssignedXidsValid[]
- * are those with indexes tail <= i < head; items outside this subscript range
- * have unspecified contents.  When head reaches the end of the array, we
- * force compression of unused entries rather than wrapping around, since
- * allowing wraparound would greatly complicate the search logic.  We maintain
- * an explicit tail pointer so that pruning of old XIDs can be done without
- * immediately moving the array contents.  In most cases only a small fraction
- * of the array contains valid entries at any instant.
- *
- * Although only the startup process can ever change the KnownAssignedXids
- * data structure, we still need interlocking so that standby backends will
- * not observe invalid intermediate states.  The convention is that backends
- * must hold shared ProcArrayLock to examine the array.  To remove XIDs from
- * the array, the startup process must hold ProcArrayLock exclusively, for
- * the usual transactional reasons (compare commit/abort of a transaction
- * during normal running).  Compressing unused entries out of the array
- * likewise requires exclusive lock.  To add XIDs to the array, we just insert
- * them into slots to the right of the head pointer and then advance the head
- * pointer.  This doesn't require any lock at all, but on machines with weak
- * memory ordering, we need to be careful that other processors see the array
- * element changes before they see the head pointer change.  We handle this by
- * using memory barriers when reading or writing the head/tail pointers (unless
- * the caller holds ProcArrayLock exclusively).
- *
- * Algorithmic analysis:
- *
- * If we have a maximum of M slots, with N XIDs currently spread across
- * S elements then we have N <= S <= M always.
- *
- *	* Adding a new XID is O(1) and needs no lock (unless compression must
- *		happen)
- *	* Compressing the array is O(S) and requires exclusive lock
- *	* Removing an XID is O(logS) and requires exclusive lock
- *	* Taking a snapshot is O(S) and requires shared lock
- *	* Checking for an XID is O(logS) and requires shared lock
- *
- * In comparison, using a hash table for KnownAssignedXids would mean that
- * taking snapshots would be O(M). If we can maintain S << M then the
- * sorted array technique will deliver significantly faster snapshots.
- * If we try to keep S too small then we will spend too much time compressing,
- * so there is an optimal point for any workload mix. We use a heuristic to
- * decide when to compress the array, though trimming also helps reduce
- * frequency of compressing. The heuristic requires us to track the number of
- * currently valid XIDs in the array (N).  Except in special cases, we'll
- * compress when S >= 2N.  Bounding S at 2N in turn bounds the time for
- * taking a snapshot to be O(N), which it would have to be anyway.
- */
-
-
-/*
- * Compress KnownAssignedXids by shifting valid data down to the start of the
- * array, removing any gaps.
- *
- * A compression step is forced if "reason" is KAX_NO_SPACE, otherwise
- * we do it only if a heuristic indicates it's a good time to do it.
- *
- * Compression requires holding ProcArrayLock in exclusive mode.
- * Caller must pass haveLock = true if it already holds the lock.
- */
-static void
-KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			head,
-				tail,
-				nelements;
-	int			compress_index;
-	int			i;
-
-	/* Counters for compression heuristics */
-	static unsigned int transactionEndsCounter;
-	static TimestampTz lastCompressTs;
-
-	/* Tuning constants */
-#define KAX_COMPRESS_FREQUENCY 128	/* in transactions */
-#define KAX_COMPRESS_IDLE_INTERVAL 1000 /* in ms */
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-	nelements = head - tail;
-
-	/*
-	 * If we can choose whether to compress, use a heuristic to avoid
-	 * compressing too often or not often enough.  "Compress" here simply
-	 * means moving the values to the beginning of the array, so it is not as
-	 * complex or costly as typical data compression algorithms.
-	 */
-	if (nelements == pArray->numKnownAssignedXids)
-	{
-		/*
-		 * When there are no gaps between head and tail, don't bother to
-		 * compress, except in the KAX_NO_SPACE case where we must compress to
-		 * create some space after the head.
-		 */
-		if (reason != KAX_NO_SPACE)
-			return;
-	}
-	else if (reason == KAX_TRANSACTION_END)
-	{
-		/*
-		 * Consider compressing only once every so many commits.  Frequency
-		 * determined by benchmarks.
-		 */
-		if ((transactionEndsCounter++) % KAX_COMPRESS_FREQUENCY != 0)
-			return;
-
-		/*
-		 * Furthermore, compress only if the used part of the array is less
-		 * than 50% full (see comments above).
-		 */
-		if (nelements < 2 * pArray->numKnownAssignedXids)
-			return;
-	}
-	else if (reason == KAX_STARTUP_PROCESS_IDLE)
-	{
-		/*
-		 * We're about to go idle for lack of new WAL, so we might as well
-		 * compress.  But not too often, to avoid ProcArray lock contention
-		 * with readers.
-		 */
-		if (lastCompressTs != 0)
-		{
-			TimestampTz compress_after;
-
-			compress_after = TimestampTzPlusMilliseconds(lastCompressTs,
-														 KAX_COMPRESS_IDLE_INTERVAL);
-			if (GetCurrentTimestamp() < compress_after)
-				return;
-		}
-	}
-
-	/* Need to compress, so get the lock if we don't have it. */
-	if (!haveLock)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * We compress the array by reading the valid values from tail to head,
-	 * re-aligning data to 0th element.
-	 */
-	compress_index = 0;
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			KnownAssignedXids[compress_index] = KnownAssignedXids[i];
-			KnownAssignedXidsValid[compress_index] = true;
-			compress_index++;
-		}
-	}
-	Assert(compress_index == pArray->numKnownAssignedXids);
-
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = compress_index;
-
-	if (!haveLock)
-		LWLockRelease(ProcArrayLock);
-
-	/* Update timestamp for maintenance.  No need to hold lock for this. */
-	lastCompressTs = GetCurrentTimestamp();
-}
-
-/*
- * Add xids into KnownAssignedXids at the head of the array.
- *
- * xids from from_xid to to_xid, inclusive, are added to the array.
- *
- * If exclusive_lock is true then caller already holds ProcArrayLock in
- * exclusive mode, so we need no extra locking here.  Else caller holds no
- * lock, so we need to be sure we maintain sufficient interlocks against
- * concurrent readers.  (Only the startup process ever calls this, so no need
- * to worry about concurrent writers.)
- */
-static void
-KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-					 bool exclusive_lock)
-{
-	ProcArrayStruct *pArray = procArray;
-	TransactionId next_xid;
-	int			head,
-				tail;
-	int			nxids;
-	int			i;
-
-	Assert(TransactionIdPrecedesOrEquals(from_xid, to_xid));
-
-	/*
-	 * Calculate how many array slots we'll need.  Normally this is cheap; in
-	 * the unusual case where the XIDs cross the wrap point, we do it the hard
-	 * way.
-	 */
-	if (to_xid >= from_xid)
-		nxids = to_xid - from_xid + 1;
-	else
-	{
-		nxids = 1;
-		next_xid = from_xid;
-		while (TransactionIdPrecedes(next_xid, to_xid))
-		{
-			nxids++;
-			TransactionIdAdvance(next_xid);
-		}
-	}
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-
-	Assert(head >= 0 && head <= pArray->maxKnownAssignedXids);
-	Assert(tail >= 0 && tail < pArray->maxKnownAssignedXids);
-
-	/*
-	 * Verify that insertions occur in TransactionId sequence.  Note that even
-	 * if the last existing element is marked invalid, it must still have a
-	 * correctly sequenced XID value.
-	 */
-	if (head > tail &&
-		TransactionIdFollowsOrEquals(KnownAssignedXids[head - 1], from_xid))
-	{
-		KnownAssignedXidsDisplay(LOG);
-		elog(ERROR, "out-of-order XID insertion in KnownAssignedXids");
-	}
-
-	/*
-	 * If our xids won't fit in the remaining space, compress out free space
-	 */
-	if (head + nxids > pArray->maxKnownAssignedXids)
-	{
-		KnownAssignedXidsCompress(KAX_NO_SPACE, exclusive_lock);
-
-		head = pArray->headKnownAssignedXids;
-		/* note: we no longer care about the tail pointer */
-
-		/*
-		 * If it still won't fit then we're out of memory
-		 */
-		if (head + nxids > pArray->maxKnownAssignedXids)
-			elog(ERROR, "too many KnownAssignedXids");
-	}
-
-	/* Now we can insert the xids into the space starting at head */
-	next_xid = from_xid;
-	for (i = 0; i < nxids; i++)
-	{
-		KnownAssignedXids[head] = next_xid;
-		KnownAssignedXidsValid[head] = true;
-		TransactionIdAdvance(next_xid);
-		head++;
-	}
-
-	/* Adjust count of number of valid entries */
-	pArray->numKnownAssignedXids += nxids;
-
-	/*
-	 * Now update the head pointer.  We use a write barrier to ensure that
-	 * other processors see the above array updates before they see the head
-	 * pointer change.  The barrier isn't required if we're holding
-	 * ProcArrayLock exclusively.
-	 */
-	if (!exclusive_lock)
-		pg_write_barrier();
-
-	pArray->headKnownAssignedXids = head;
-}
-
-/*
- * KnownAssignedXidsSearch
- *
- * Searches KnownAssignedXids for a specific xid and optionally removes it.
- * Returns true if it was found, false if not.
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- * Exclusive lock must be held for remove = true.
- */
-static bool
-KnownAssignedXidsSearch(TransactionId xid, bool remove)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			first,
-				last;
-	int			head;
-	int			tail;
-	int			result_index = -1;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	/*
-	 * Only the startup process removes entries, so we don't need the read
-	 * barrier in that case.
-	 */
-	if (!remove)
-		pg_read_barrier();		/* pairs with KnownAssignedXidsAdd */
-
-	/*
-	 * Standard binary search.  Note we can ignore the KnownAssignedXidsValid
-	 * array here, since even invalid entries will contain sorted XIDs.
-	 */
-	first = tail;
-	last = head - 1;
-	while (first <= last)
-	{
-		int			mid_index;
-		TransactionId mid_xid;
-
-		mid_index = (first + last) / 2;
-		mid_xid = KnownAssignedXids[mid_index];
-
-		if (xid == mid_xid)
-		{
-			result_index = mid_index;
-			break;
-		}
-		else if (TransactionIdPrecedes(xid, mid_xid))
-			last = mid_index - 1;
-		else
-			first = mid_index + 1;
-	}
-
-	if (result_index < 0)
-		return false;			/* not in array */
-
-	if (!KnownAssignedXidsValid[result_index])
-		return false;			/* in array, but invalid */
-
-	if (remove)
-	{
-		KnownAssignedXidsValid[result_index] = false;
-
-		pArray->numKnownAssignedXids--;
-		Assert(pArray->numKnownAssignedXids >= 0);
-
-		/*
-		 * If we're removing the tail element then advance tail pointer over
-		 * any invalid elements.  This will speed future searches.
-		 */
-		if (result_index == tail)
-		{
-			tail++;
-			while (tail < head && !KnownAssignedXidsValid[tail])
-				tail++;
-			if (tail >= head)
-			{
-				/* Array is empty, so we can reset both pointers */
-				pArray->headKnownAssignedXids = 0;
-				pArray->tailKnownAssignedXids = 0;
-			}
-			else
-			{
-				pArray->tailKnownAssignedXids = tail;
-			}
-		}
-	}
-
-	return true;
-}
-
-/*
- * Is the specified XID present in KnownAssignedXids[]?
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- */
-static bool
-KnownAssignedXidExists(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	return KnownAssignedXidsSearch(xid, false);
-}
-
-/*
- * Remove the specified XID from KnownAssignedXids[].
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemove(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	elog(DEBUG4, "remove KnownAssignedXid %u", xid);
-
-	/*
-	 * Note: we cannot consider it an error to remove an XID that's not
-	 * present.  We intentionally remove subxact IDs while processing
-	 * XLOG_XACT_ASSIGNMENT, to avoid array overflow.  Then those XIDs will be
-	 * removed again when the top-level xact commits or aborts.
-	 *
-	 * It might be possible to track such XIDs to distinguish this case from
-	 * actual errors, but it would be complicated and probably not worth it.
-	 * So, just ignore the search result.
-	 */
-	(void) KnownAssignedXidsSearch(xid, true);
-}
-
-/*
- * KnownAssignedXidsRemoveTree
- *		Remove xid (if it's not InvalidTransactionId) and all the subxids.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-							TransactionId *subxids)
-{
-	int			i;
-
-	if (TransactionIdIsValid(xid))
-		KnownAssignedXidsRemove(xid);
-
-	for (i = 0; i < nsubxids; i++)
-		KnownAssignedXidsRemove(subxids[i]);
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_TRANSACTION_END, true);
-}
-
-/*
- * Prune KnownAssignedXids up to, but *not* including xid. If xid is invalid
- * then clear the whole table.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemovePreceding(TransactionId removeXid)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			count = 0;
-	int			head,
-				tail,
-				i;
-
-	if (!TransactionIdIsValid(removeXid))
-	{
-		elog(DEBUG4, "removing all KnownAssignedXids");
-		pArray->numKnownAssignedXids = 0;
-		pArray->headKnownAssignedXids = pArray->tailKnownAssignedXids = 0;
-		return;
-	}
-
-	elog(DEBUG4, "prune KnownAssignedXids to %u", removeXid);
-
-	/*
-	 * Mark entries invalid starting at the tail.  Since array is sorted, we
-	 * can stop as soon as we reach an entry >= removeXid.
-	 */
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			if (TransactionIdFollowsOrEquals(knownXid, removeXid))
-				break;
-
-			if (!StandbyTransactionIdIsPrepared(knownXid))
-			{
-				KnownAssignedXidsValid[i] = false;
-				count++;
-			}
-		}
-	}
-
-	pArray->numKnownAssignedXids -= count;
-	Assert(pArray->numKnownAssignedXids >= 0);
-
-	/*
-	 * Advance the tail pointer if we've marked the tail item invalid.
-	 */
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-			break;
-	}
-	if (i >= head)
-	{
-		/* Array is empty, so we can reset both pointers */
-		pArray->headKnownAssignedXids = 0;
-		pArray->tailKnownAssignedXids = 0;
-	}
-	else
-	{
-		pArray->tailKnownAssignedXids = i;
-	}
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_PRUNE, true);
-}
-
-/*
- * KnownAssignedXidsGet - Get an array of xids by scanning KnownAssignedXids.
- * We filter out anything >= xmax.
- *
- * Returns the number of XIDs stored into xarray[].  Caller is responsible
- * that array is large enough.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax)
-{
-	TransactionId xtmp = InvalidTransactionId;
-
-	return KnownAssignedXidsGetAndSetXmin(xarray, &xtmp, xmax);
-}
-
-/*
- * KnownAssignedXidsGetAndSetXmin - as KnownAssignedXidsGet, plus
- * we reduce *xmin to the lowest xid value seen if not already lower.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
-							   TransactionId xmax)
-{
-	int			count = 0;
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop. We can stop
-	 * once we reach the initially seen head, since we are certain that an xid
-	 * cannot enter and then leave the array while we hold ProcArrayLock.  We
-	 * might miss newly-added xids, but they should be >= xmax so irrelevant
-	 * anyway.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			/*
-			 * Update xmin if required.  Only the first XID need be checked,
-			 * since the array is sorted.
-			 */
-			if (count == 0 &&
-				TransactionIdPrecedes(knownXid, *xmin))
-				*xmin = knownXid;
-
-			/*
-			 * Filter out anything >= xmax, again relying on sorted property
-			 * of array.
-			 */
-			if (TransactionIdIsValid(xmax) &&
-				TransactionIdFollowsOrEquals(knownXid, xmax))
-				break;
-
-			/* Add knownXid into output array */
-			xarray[count++] = knownXid;
-		}
-	}
-
-	return count;
-}
-
-/*
- * Get oldest XID in the KnownAssignedXids array, or InvalidTransactionId
- * if nothing there.
- */
-static TransactionId
-KnownAssignedXidsGetOldestXmin(void)
-{
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-			return KnownAssignedXids[i];
-	}
-
-	return InvalidTransactionId;
-}
-
-/*
- * Display KnownAssignedXids to provide debug trail
- *
- * Currently this is only called within startup process, so we need no
- * special locking.
- *
- * Note this is pretty expensive, and much of the expense will be incurred
- * even if the elog message will get discarded.  It's not currently called
- * in any performance-critical places, however, so no need to be tenser.
- */
-static void
-KnownAssignedXidsDisplay(int trace_level)
-{
-	ProcArrayStruct *pArray = procArray;
-	StringInfoData buf;
-	int			head,
-				tail,
-				i;
-	int			nxids = 0;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	initStringInfo(&buf);
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			nxids++;
-			appendStringInfo(&buf, "[%d]=%u ", i, KnownAssignedXids[i]);
-		}
-	}
-
-	elog(trace_level, "%d KnownAssignedXids (num=%d tail=%d head=%d) %s",
-		 nxids,
-		 pArray->numKnownAssignedXids,
-		 pArray->tailKnownAssignedXids,
-		 pArray->headKnownAssignedXids,
-		 buf.data);
-
-	pfree(buf.data);
-}
-
-/*
- * KnownAssignedXidsReset
- *		Resets KnownAssignedXids to be empty
- */
-static void
-KnownAssignedXidsReset(void)
-{
-	ProcArrayStruct *pArray = procArray;
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(lsn > TransamVariables->latestCommitLSN);
+	TransamVariables->latestCommitLSN = lsn;
 
-	pArray->numKnownAssignedXids = 0;
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = 0;
+	procArray->oldest_running_primary_xid = oldest_running_primary_xid;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 25267f0f85..e02c9ab842 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -139,8 +139,6 @@ InitRecoveryTransactionEnvironment(void)
 	vxid.procNumber = MyProcNumber;
 	vxid.localTransactionId = GetNextLocalTransactionId();
 	VirtualXactLockTableInsert(vxid);
-
-	standbyState = STANDBY_INITIALIZED;
 }
 
 /*
@@ -168,9 +166,6 @@ ShutdownRecoveryTransactionEnvironment(void)
 	if (RecoveryLockHash == NULL)
 		return;
 
-	/* Mark all tracked in-progress transactions as finished. */
-	ExpireAllKnownAssignedTransactionIds();
-
 	/* Release all locks the tracked transactions were holding */
 	StandbyReleaseAllLocks();
 
@@ -1167,7 +1162,7 @@ standby_redo(XLogReaderState *record)
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
 	/* Do nothing if we're not in hot standby mode */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 		return;
 
 	if (info == XLOG_STANDBY_LOCK)
@@ -1182,18 +1177,21 @@ standby_redo(XLogReaderState *record)
 	}
 	else if (info == XLOG_RUNNING_XACTS)
 	{
+		/*
+		 * XXX: running-xacts records were previously used to update the
+		 * known-assigned xids, but now we need them only for the logical
+		 * replication snapbuilder, and for the pg_stat_report_stat(true)
+		 * call below.
+		 */
 		xl_running_xacts *xlrec = (xl_running_xacts *) XLogRecGetData(record);
-		RunningTransactionsData running;
 
-		running.xcnt = xlrec->xcnt;
-		running.subxcnt = xlrec->subxcnt;
-		running.subxid_status = xlrec->subxid_overflow ? SUBXIDS_MISSING : SUBXIDS_IN_ARRAY;
-		running.nextXid = xlrec->nextXid;
-		running.latestCompletedXid = xlrec->latestCompletedXid;
-		running.oldestRunningXid = xlrec->oldestRunningXid;
-		running.xids = xlrec->xids;
-
-		ProcArrayApplyRecoveryInfo(&running);
+		/*
+		 * Remember the oldest XID that was running at the time. Normally, all
+		 * transaction aborts and commits are WAL-logged, so our
+		 * oldestRunningXid value should be up-to-date, but if not, this
+		 * allows us to resynchronize.
+		 */
+		ProcArrayUpdateOldestRunningXid(xlrec->oldestRunningXid);
 
 		/*
 		 * The startup process currently has no convenient way to schedule
@@ -1224,50 +1222,46 @@ standby_redo(XLogReaderState *record)
  *
  * This is used for Hot Standby as follows:
  *
- * We can move directly to STANDBY_SNAPSHOT_READY at startup if we
- * start from a shutdown checkpoint because we know nothing was running
- * at that time and our recovery snapshot is known empty. In the more
- * typical case of an online checkpoint we need to jump through a few
- * hoops to get a correct recovery snapshot and this requires a two or
- * sometimes a three stage process.
+ * We can enter hot standby mode and start accepting read-only queries
+ * immediately at startup if we start from a shutdown checkpoint, because we
+ * know nothing was running at that time and our recovery snapshot is known
+ * empty. In the more typical case of an online checkpoint, the checkpoint
+ * record doesn't contain all the necessary information about running
+ * transaction state, and we need to jump through a few hoops to get a correct
+ * recovery snapshot.
  *
- * The initial snapshot must contain all running xids and all current
- * AccessExclusiveLocks at a point in time on the standby. Assembling
- * that information while the server is running requires many and
- * various LWLocks, so we choose to derive that information piece by
- * piece and then re-assemble that info on the standby. When that
- * information is fully assembled we move to STANDBY_SNAPSHOT_READY.
+ * The initial snapshot must contain all current AccessExclusiveLocks at a
+ * point in time on the standby. Assembling that information while the server
+ * is running requires many and various LWLocks, so we choose to derive that
+ * information piece by piece and then re-assemble that info on the standby.
  *
- * Since locking on the primary when we derive the information is not
- * strict, we note that there is a time window between the derivation and
- * writing to WAL of the derived information. That allows race conditions
- * that we must resolve, since xids and locks may enter or leave the
- * snapshot during that window. This creates the issue that an xid or
- * lock may start *after* the snapshot has been derived yet *before* the
- * snapshot is logged in the running xacts WAL record. We resolve this by
- * starting to accumulate changes at a point just prior to when we derive
- * the snapshot on the primary, then ignore duplicates when we later apply
- * the snapshot from the running xacts record. This is implemented during
- * CreateCheckPoint() where we use the logical checkpoint location as
- * our starting point and then write the running xacts record immediately
- * before writing the main checkpoint WAL record. Since we always start
- * up from a checkpoint and are immediately at our starting point, we
- * unconditionally move to STANDBY_INITIALIZED. After this point we
- * must do 4 things:
+ * Since locking on the primary when we derive the information is not strict,
+ * there is a time window between the derivation and writing to WAL of the
+ * derived information. That allows race conditions that we must resolve,
+ * since xids and locks may enter or leave the snapshot during that
+ * window. This creates the issue that an xid or lock may start *after* the
+ * snapshot has been derived yet *before* the snapshot is logged in the
+ * running xacts WAL record. We resolve this by starting to accumulate changes
+ * at a point just prior to when we collect the lock information on the
+ * primary, then ignore duplicates when we later apply the snapshot from the
+ * running xacts record. This is implemented during CreateCheckPoint() where
+ * we use the logical checkpoint location as our starting point and then write
+ * the running xacts record immediately before writing the main checkpoint WAL
+ * record. Since we always start up from a checkpoint's redo pointer, we will
+ * always see a running-xacts record before reaching the checkpoint
+ * record, and can immediately enter hot standby mode. After this point we
+ * must do 3 things:
  *	* move shared nextXid forwards as we see new xids
  *	* extend the clog and subtrans with each new xid
- *	* keep track of uncommitted known assigned xids
  *	* keep track of uncommitted AccessExclusiveLocks
  *
- * When we see a commit/abort we must remove known assigned xids and locks
- * from the completing transaction. Attempted removals that cannot locate
- * an entry are expected and must not cause an error when we are in state
- * STANDBY_INITIALIZED. This is implemented in StandbyReleaseLocks() and
- * KnownAssignedXidsRemove().
- *
- * Later, when we apply the running xact data we must be careful to ignore
- * transactions already committed, since those commits raced ahead when
- * making WAL entries.
+ * When we see a commit/abort we must advance oldest_running_primary_xid and
+ * remove locks from the completing transaction. Attempted removals that
+ * cannot locate an entry are expected and must not cause an error until we
+ * have seen the running-xacts record. (We don't throw an error even after
+ * that, because whatever the reason was, after the transaction has completed
+ * the issue has already been resolved anyway.) This is implemented in
+ * StandbyReleaseLocks().
  *
  * For logical decoding only the running xacts information is needed;
  * there's no need to look at the locking information, but it's logged anyway,
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db6ed784ab..60f93a39a4 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -130,6 +130,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_XACT_BUFFER] = "XactBuffer",
 	[LWTRANCHE_COMMITTS_BUFFER] = "CommitTsBuffer",
 	[LWTRANCHE_SUBTRANS_BUFFER] = "SubtransBuffer",
+	[LWTRANCHE_CSN_LOG_BUFFER] = "CsnLogBuffer",
 	[LWTRANCHE_MULTIXACTOFFSET_BUFFER] = "MultiXactOffsetBuffer",
 	[LWTRANCHE_MULTIXACTMEMBER_BUFFER] = "MultiXactMemberBuffer",
 	[LWTRANCHE_NOTIFY_BUFFER] = "NotifyBuffer",
@@ -166,6 +167,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
 	[LWTRANCHE_XACT_SLRU] = "XactSLRU",
 	[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+	[LWTRANCHE_CSN_LOG_SLRU] = "CsnLogSLRU",
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d6..18d7a0ab5b 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -359,6 +359,7 @@ WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 XactBuffer	"Waiting for I/O on a transaction status SLRU buffer."
 CommitTsBuffer	"Waiting for I/O on a commit timestamp SLRU buffer."
 SubtransBuffer	"Waiting for I/O on a sub-transaction SLRU buffer."
+CsnLogBuffer	"Waiting for I/O on a CSN log SLRU buffer."
 MultiXactOffsetBuffer	"Waiting for I/O on a multixact offset SLRU buffer."
 MultiXactMemberBuffer	"Waiting for I/O on a multixact member SLRU buffer."
 NotifyBuffer	"Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index d772544377..ffbfae84b8 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -77,6 +77,8 @@ provider postgresql {
 	probe clog__checkpoint__done(bool);
 	probe subtrans__checkpoint__start(bool);
 	probe subtrans__checkpoint__done(bool);
+	probe csnlog__checkpoint__start(bool);
+	probe csnlog__checkpoint__done(bool);
 	probe multixact__checkpoint__start(bool);
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f2..da82def846 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -48,6 +48,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -201,6 +202,7 @@ typedef struct SerializedSnapshotData
 	CommandId	curcid;
 	TimestampTz whenTaken;
 	XLogRecPtr	lsn;
+	XLogRecPtr	snapshotCsn;
 } SerializedSnapshotData;
 
 /*
@@ -1729,6 +1731,7 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
 	serialized_snapshot.curcid = snapshot->curcid;
 	serialized_snapshot.whenTaken = snapshot->whenTaken;
 	serialized_snapshot.lsn = snapshot->lsn;
+	serialized_snapshot.snapshotCsn = snapshot->snapshotCsn;
 
 	/*
 	 * Ignore the SubXID array if it has overflowed, unless the snapshot was
@@ -1803,6 +1806,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->curcid = serialized_snapshot.curcid;
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
+	snapshot->snapshotCsn = serialized_snapshot.snapshotCsn;
 	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
@@ -1913,36 +1917,11 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 	}
 	else
 	{
-		/*
-		 * In recovery we store all xids in the subxip array because it is by
-		 * far the bigger array, and we mostly don't know which xids are
-		 * top-level and which are subxacts. The xip array is empty.
-		 *
-		 * We start by searching subtrans, if we overflowed.
-		 */
-		if (snapshot->suboverflowed)
-		{
-			/*
-			 * Snapshot overflowed, so convert xid to top-level.  This is safe
-			 * because we eliminated too-old XIDs above.
-			 */
-			xid = SubTransGetTopmostTransaction(xid);
+		XLogRecPtr	csn = CSNLogGetCSNByXid(xid);
 
-			/*
-			 * If xid was indeed a subxact, we might now have an xid < xmin,
-			 * so recheck to avoid an array scan.  No point in rechecking
-			 * xmax.
-			 */
-			if (TransactionIdPrecedes(xid, snapshot->xmin))
-				return false;
-		}
-
-		/*
-		 * We now have either a top-level xid higher than xmin or an
-		 * indeterminate xid. We don't know whether it's top level or subxact
-		 * but it doesn't matter. If it's present, the xid is visible.
-		 */
-		if (pg_lfind32(xid, snapshot->subxip, snapshot->subxcnt))
+		if (csn != InvalidXLogRecPtr && csn <= snapshot->snapshotCsn)
+			return false;
+		else
 			return true;
 	}
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9a91830783..dfe80eaa0d 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -249,7 +249,8 @@ static const char *const subdirs[] = {
 	"pg_xact",
 	"pg_logical",
 	"pg_logical/snapshots",
-	"pg_logical/mappings"
+	"pg_logical/mappings",
+	"pg_csn"
 };
 
 
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
new file mode 100644
index 0000000000..f8cdf573ae
--- /dev/null
+++ b/src/include/access/csn_log.h
@@ -0,0 +1,30 @@
+/*
+ * csn_log.h
+ *
+ * Mapping from XID to commit record's LSN (Commit Sequence Number).
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_log.h
+ */
+#ifndef CSNLOG_H
+#define CSNLOG_H
+
+#include "access/xlog.h"
+#include "utils/snapshot.h"
+
+extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
+						 TransactionId *subxids, XLogRecPtr csn);
+extern XLogRecPtr CSNLogGetCSNByXid(TransactionId xid);
+
+extern Size CSNLogShmemSize(void);
+extern void CSNLogShmemInit(void);
+extern void BootStrapCSNLog(void);
+extern void StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn);
+extern void ShutdownCSNLog(void);
+extern void CheckPointCSNLog(void);
+extern void ExtendCSNLog(TransactionId newestXact);
+extern void TruncateCSNLog(TransactionId oldestXact);
+
+#endif							/* CSNLOG_H */
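
Editorial illustration (not part of the patch): the header above declares a mapping from XID to the LSN of its commit record, and the snapmgr.c hunk earlier in this diff consults that mapping in XidInMVCCSnapshot(). The following is a minimal, self-contained sketch of that rule, assuming a toy in-memory array in place of the SLRU-backed CSN log. Every toy_* name, the ToyXid/ToyLsn types, and the fixed array size are hypothetical and exist only for this example; the one behaviour taken from the patch is the comparison itself: an XID counts as still in progress for a snapshot unless it has a valid CSN that is <= the snapshot's CSN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;		/* stand-in for TransactionId */
typedef uint64_t ToyLsn;		/* stand-in for XLogRecPtr */

#define TOY_INVALID_LSN ((ToyLsn) 0)	/* plays the role of InvalidXLogRecPtr */
#define TOY_MAX_XIDS 64

/* Toy CSN log: indexed by XID, zero-initialized, i.e. "no commit seen". */
static ToyLsn toy_csn_log[TOY_MAX_XIDS];

/* Record the commit LSN for a top-level XID and all of its subxacts. */
static void
toy_set_csn(ToyXid xid, int nsubxids, const ToyXid *subxids, ToyLsn csn)
{
	assert(xid < TOY_MAX_XIDS);
	assert(nsubxids == 0 || subxids != NULL);

	toy_csn_log[xid] = csn;
	for (int i = 0; i < nsubxids; i++)
	{
		assert(subxids[i] < TOY_MAX_XIDS);
		toy_csn_log[subxids[i]] = csn;
	}
}

/* Look up the commit LSN of an XID, or TOY_INVALID_LSN if none is recorded. */
static ToyLsn
toy_get_csn(ToyXid xid)
{
	assert(xid < TOY_MAX_XIDS);
	return toy_csn_log[xid];
}

/*
 * Returns true if the snapshot must treat xid as still in progress, false
 * if xid committed at or before the snapshot's CSN.  Same comparison as
 * the recovery branch of XidInMVCCSnapshot() in the patch.
 */
static bool
toy_xid_in_snapshot(ToyXid xid, ToyLsn snapshot_csn)
{
	ToyLsn		csn = toy_get_csn(xid);

	if (csn != TOY_INVALID_LSN && csn <= snapshot_csn)
		return false;
	return true;
}
```

For example, after recording a commit at LSN 1000 for XID 10 with subxact 11, and a commit at LSN 2000 for XID 12, a snapshot taken at CSN 1500 treats 10 and 11 as committed, while 12 and any XID with no recorded CSN still count as in progress. No per-snapshot XID array is involved, which is why the fixed-size array and subxid-overflow limits do not arise in this scheme.
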
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 28a2d287fd..a7054fe11c 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -238,6 +238,9 @@ typedef struct TransamVariablesData
 	FullTransactionId latestCompletedXid;	/* newest full XID that has
 											 * committed or aborted */
 
+	/* During recovery, LSN of latest replayed commit record */
+	XLogRecPtr	latestCommitLSN;
+
 	/*
 	 * Number of top-level transactions with xids (i.e. which may have
 	 * modified the database) that completed in some form since the start of
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index b85b65c604..58ed0fc038 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -47,8 +47,7 @@ extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
 
-extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
-												 int *nxids_p);
+extern TransactionId PrescanPreparedTransactions(void);
 extern void StandbyRecoverPreparedTransactions(void);
 extern void RecoverPreparedTransactions(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index fb64d7413a..240cbfd417 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -171,7 +171,7 @@ typedef struct SavedTransactionCharacteristics
 #define XLOG_XACT_ABORT				0x20
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
-#define XLOG_XACT_ASSIGNMENT		0x50
+/* 0x50 is unused, was XLOG_XACT_ASSIGNMENT */
 #define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
@@ -215,15 +215,6 @@ typedef struct SavedTransactionCharacteristics
 #define XactCompletionForceSyncCommit(xinfo) \
 	((xinfo & XACT_COMPLETION_FORCE_SYNC_COMMIT) != 0)
 
-typedef struct xl_xact_assignment
-{
-	TransactionId xtop;			/* assigned XID's top-level XID */
-	int			nsubxacts;		/* number of subtransaction XIDs */
-	TransactionId xsub[FLEXIBLE_ARRAY_MEMBER];	/* assigned subxids */
-} xl_xact_assignment;
-
-#define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
-
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -448,7 +439,6 @@ extern FullTransactionId GetTopFullTransactionId(void);
 extern FullTransactionId GetTopFullTransactionIdIfAny(void);
 extern FullTransactionId GetCurrentFullTransactionId(void);
 extern FullTransactionId GetCurrentFullTransactionIdIfAny(void);
-extern void MarkCurrentTransactionIdLoggedIfAny(void);
 extern bool SubTransactionIsActive(SubTransactionId subxid);
 extern CommandId GetCurrentCommandId(bool used);
 extern void SetParallelStartTimestamps(TimestampTz xact_ts, TimestampTz stmt_ts);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 20950ce033..19cb5f33bd 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -27,37 +27,10 @@ extern PGDLLIMPORT bool ignore_invalid_pages;
 extern PGDLLIMPORT bool InRecovery;
 
 /*
- * Like InRecovery, standbyState is only valid in the startup process.
- * In all other processes it will have the value STANDBY_DISABLED (so
- * InHotStandby will read as false).
- *
- * In DISABLED state, we're performing crash recovery or hot standby was
- * disabled in postgresql.conf.
- *
- * In INITIALIZED state, we've run InitRecoveryTransactionEnvironment, but
- * we haven't yet processed a RUNNING_XACTS or shutdown-checkpoint WAL record
- * to initialize our primary-transaction tracking system.
- *
- * When the transaction tracking is initialized, we enter the SNAPSHOT_PENDING
- * state. The tracked information might still be incomplete, so we can't allow
- * connections yet, but redo functions must update the in-memory state when
- * appropriate.
- *
- * In SNAPSHOT_READY mode, we have full knowledge of transactions that are
- * (or were) running on the primary at the current WAL location. Snapshots
- * can be taken, and read-only queries can be run.
+ * Like InRecovery, InHotStandby is only valid in the startup process.
+ * In all other processes it will be false.
  */
-typedef enum
-{
-	STANDBY_DISABLED,
-	STANDBY_INITIALIZED,
-	STANDBY_SNAPSHOT_PENDING,
-	STANDBY_SNAPSHOT_READY,
-} HotStandbyState;
-
-extern PGDLLIMPORT HotStandbyState standbyState;
-
-#define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)
+extern PGDLLIMPORT bool InHotStandby;
 
 
 extern bool XLogHaveInvalidPages(void);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e0..c2156aca12 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -179,6 +179,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_XACT_BUFFER = NUM_INDIVIDUAL_LWLOCKS,
 	LWTRANCHE_COMMITTS_BUFFER,
 	LWTRANCHE_SUBTRANS_BUFFER,
+	LWTRANCHE_CSN_LOG_BUFFER,
 	LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 	LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 	LWTRANCHE_NOTIFY_BUFFER,
@@ -215,6 +216,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_SUBTRANS_SLRU,
 	LWTRANCHE_XACT_SLRU,
 	LWTRANCHE_PARALLEL_VACUUM_DSA,
+	LWTRANCHE_CSN_LOG_SLRU,
 	LWTRANCHE_FIRST_USER_DEFINED,
 }			BuiltinTrancheIds;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 56af0b40b3..de74fce24e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -28,18 +28,11 @@ extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayClearTransaction(PGPROC *proc);
 
+extern void ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID);
 extern void ProcArrayInitRecovery(TransactionId initializedUptoXID);
-extern void ProcArrayApplyRecoveryInfo(RunningTransactions running);
-extern void ProcArrayApplyXidAssignment(TransactionId topxid,
-										int nsubxids, TransactionId *subxids);
 
 extern void RecordKnownAssignedTransactionIds(TransactionId xid);
-extern void ExpireTreeKnownAssignedTransactionIds(TransactionId xid,
-												  int nsubxids, TransactionId *subxids,
-												  TransactionId max_xid);
-extern void ExpireAllKnownAssignedTransactionIds(void);
-extern void ExpireOldKnownAssignedTransactionIds(TransactionId xid);
-extern void KnownAssignedTransactionIdsIdleMaintenance(void);
+extern void ProcArrayRecoveryEndTransaction(TransactionId max_xid, XLogRecPtr lsn);
 
 extern int	GetMaxSnapshotXidCount(void);
 extern int	GetMaxSnapshotSubxidCount(void);
@@ -56,7 +49,7 @@ extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
 extern TransactionId GetOldestNonRemovableTransactionId(Relation rel);
 extern TransactionId GetOldestTransactionIdConsideredRunning(void);
-extern TransactionId GetOldestActiveTransactionId(void);
+extern TransactionId GetOldestActiveTransactionId(bool allDbs);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *xmin, TransactionId *catalog_xmin);
 
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 8d1e31e888..1fda5b06f6 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -181,6 +181,13 @@ typedef struct SnapshotData
 	int32		subxcnt;		/* # of xact ids in subxip[] */
 	bool		suboverflowed;	/* has the subxip array overflowed? */
 
+	/*
+	 * MVCC snapshots taken during recovery use this CSN instead of the xip
+	 * and subxip arrays. Any transactions that committed at or before this
+	 * LSN are considered as visible.
+	 */
+	XLogRecPtr	snapshotCsn;
+
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 	bool		copied;			/* false if it's a static snapshot */
 
-- 
2.39.5

v3-0003-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch (text/x-patch; charset=UTF-8)

From a6c0ecdace574bbdf04cfaa48ef2197f0e7ce185 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 20:01:07 +0300
Subject: [PATCH v3 3/5] Make SnapBuildWaitSnapshot work without
 xl_running_xacts.xids array

SnapBuildWaitSnapshot looped through all the XIDs in the
xl_running_xacts, waiting for them to finish. Change it to grab the
list of running XIDs from the proc array instead. This removes the
last usage of the XIDs array in the xl_running_xacts record, allowing
it to be removed in the next commit.

When SnapBuildWaitSnapshot() is called with running->nextXid as the
'cutoff' point, the new code should wait for exactly the same set of
transactions as before. But when called with initial_xmin_horizon as
the 'cutoff', this might wait for more transactions than before: those
between running->nextXid and initial_xmin_horizon. For example,
imagine that we see a running-xacts record with nextXid 100, and
initial_xmin_horizon is 200. Before, we would wait for all XIDs < 100
to complete, and then log the standby snapshot and proceed, but now we
will wait for all XIDs < 200. I believe that's a good thing, because
we won't actually be able to move to the next state in the snapshot
building until all transactions < 200 have completed. The
running-xacts snapshot that we logged after waiting up to XID 100
would not be useful to us either, if there are still XIDs between 100
and 200 running.

SnapBuildWaitSnapshot() used to do useless work when called in a
standby, because in a standby, there are no XID locks and the
XactLockTableWait() calls returned immediately, even if the XIDs were
in fact still running in the primary. But as the comment says, the
waiting isn't necessary for correctness, so that was harmless. In any
case, stop doing the futile work on a standby.
---
 src/backend/replication/logical/snapbuild.c | 50 ++++++++++++++-------
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 734865ce62..31da0832cc 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -168,7 +168,7 @@ static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, Transaction
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
-static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
+static void SnapBuildWaitSnapshot(TransactionId cutoff);
 
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
@@ -1222,14 +1222,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		NormalTransactionIdPrecedes(running->oldestRunningXid,
 									builder->initial_xmin_horizon))
 	{
+		TransactionId cutoff;
+
 		ereport(DEBUG1,
 				(errmsg_internal("skipping snapshot at %X/%X while building logical decoding snapshot, xmin horizon too low",
 								 LSN_FORMAT_ARGS(lsn)),
 				 errdetail_internal("initial xmin horizon of %u vs the snapshot's %u",
 									builder->initial_xmin_horizon, running->oldestRunningXid)));
 
-
-		SnapBuildWaitSnapshot(running, builder->initial_xmin_horizon);
+		cutoff = builder->initial_xmin_horizon;
+		TransactionIdRetreat(cutoff);
+		SnapBuildWaitSnapshot(cutoff);
 
 		return true;
 	}
@@ -1316,7 +1319,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
 						   running->xcnt, running->nextXid)));
 
-		SnapBuildWaitSnapshot(running, running->nextXid);
+		SnapBuildWaitSnapshot(running->nextXid);
 	}
 
 	/*
@@ -1340,7 +1343,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
 						   running->xcnt, running->nextXid)));
 
-		SnapBuildWaitSnapshot(running, running->nextXid);
+		SnapBuildWaitSnapshot(running->nextXid);
 	}
 
 	/*
@@ -1373,8 +1376,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 }
 
 /* ---
- * Iterate through xids in record, wait for all older than the cutoff to
- * finish.  Then, if possible, log a new xl_running_xacts record.
+ * Wait for all transactions older than or equal to the cutoff to finish.
+ * Then, if possible, log a new xl_running_xacts record.
  *
  * This isn't required for the correctness of decoding, but to:
  * a) allow isolationtester to notice that we're currently waiting for
@@ -1384,13 +1387,31 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
  * ---
  */
 static void
-SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
+SnapBuildWaitSnapshot(TransactionId cutoff)
 {
-	int			off;
+	RunningTransactions running;
+
+	if (RecoveryInProgress())
+	{
+		/*
+		 * During recovery, we have no mechanism for waiting for an XID to
+		 * finish, and we cannot create new running-xacts records either.
+		 */
+		return;
+	}
+
+	running = GetRunningTransactionData();
+
+	/*
+	 * GetRunningTransactionData returns with XidGenLock and ProcArrayLock
+	 * held, but we don't need them.
+	 */
+	LWLockRelease(XidGenLock);
+	LWLockRelease(ProcArrayLock);
 
-	for (off = 0; off < running->xcnt; off++)
+	for (int i = 0; i < running->xcnt; i++)
 	{
-		TransactionId xid = running->xids[off];
+		TransactionId xid = running->xids[i];
 
 		/*
 		 * Upper layers should prevent that we ever need to wait on ourselves.
@@ -1400,7 +1421,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
 		if (TransactionIdIsCurrentTransactionId(xid))
 			elog(ERROR, "waiting for ourselves");
 
-		if (TransactionIdFollows(xid, cutoff))
+		if (TransactionIdFollowsOrEquals(xid, cutoff))
 			continue;
 
 		XactLockTableWait(xid, NULL, NULL, XLTW_None);
@@ -1412,10 +1433,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
 	 * wait for bgwriter or checkpointer to log one.  During recovery we can't
 	 * enforce that, so we'll have to wait.
 	 */
-	if (!RecoveryInProgress())
-	{
-		LogStandbySnapshot();
-	}
+	LogStandbySnapshot();
 }
 
 #define SnapBuildOnDiskConstantSize \
-- 
2.39.5

v3-0004-Remove-the-now-unused-xids-array-from-xl_running_.patch (text/x-patch; charset=UTF-8)

From 1bdcf1ec1080ffdee97acb1461afea6cfa808688 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 16:40:57 +0300
Subject: [PATCH v3 4/5] Remove the now-unused xids array from xl_running_xacts

We still generate running-xacts records, because they are still needed
to initialize the snapshot in logical decoding.
---
 src/backend/access/rmgrdesc/standbydesc.c   | 18 ------------
 src/backend/replication/logical/snapbuild.c |  8 +++---
 src/backend/storage/ipc/standby.c           | 32 +++++----------------
 src/include/storage/standby.h               |  2 --
 src/include/storage/standbydefs.h           | 16 +++++++----
 5 files changed, 21 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index 25f870b187..bde9350b92 100644
--- a/src/backend/access/rmgrdesc/standbydesc.c
+++ b/src/backend/access/rmgrdesc/standbydesc.c
@@ -19,28 +19,10 @@
 static void
 standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
 {
-	int			i;
-
 	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u",
 					 xlrec->nextXid,
 					 xlrec->latestCompletedXid,
 					 xlrec->oldestRunningXid);
-	if (xlrec->xcnt > 0)
-	{
-		appendStringInfo(buf, "; %d xacts:", xlrec->xcnt);
-		for (i = 0; i < xlrec->xcnt; i++)
-			appendStringInfo(buf, " %u", xlrec->xids[i]);
-	}
-
-	if (xlrec->subxid_overflow)
-		appendStringInfoString(buf, "; subxid overflowed");
-
-	if (xlrec->subxcnt > 0)
-	{
-		appendStringInfo(buf, "; %d subxacts:", xlrec->subxcnt);
-		for (i = 0; i < xlrec->subxcnt; i++)
-			appendStringInfo(buf, " %u", xlrec->xids[xlrec->xcnt + i]);
-	}
 }
 
 void
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 31da0832cc..cac3ffe577 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1316,8 +1316,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		ereport(LOG,
 				(errmsg("logical decoding found initial starting point at %X/%X",
 						LSN_FORMAT_ARGS(lsn)),
-				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
-						   running->xcnt, running->nextXid)));
+				 errdetail("Waiting for transactions older than %u to end.",
+						   running->nextXid)));
 
 		SnapBuildWaitSnapshot(running->nextXid);
 	}
@@ -1340,8 +1340,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		ereport(LOG,
 				(errmsg("logical decoding found initial consistent point at %X/%X",
 						LSN_FORMAT_ARGS(lsn)),
-				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
-						   running->xcnt, running->nextXid)));
+				 errdetail("Waiting for transactions older than %u to end.",
+						   running->nextXid)));
 
 		SnapBuildWaitSnapshot(running->nextXid);
 	}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index e02c9ab842..6ed46bed03 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1337,9 +1337,6 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	xl_running_xacts xlrec;
 	XLogRecPtr	recptr;
 
-	xlrec.xcnt = CurrRunningXacts->xcnt;
-	xlrec.subxcnt = CurrRunningXacts->subxcnt;
-	xlrec.subxid_overflow = (CurrRunningXacts->subxid_status != SUBXIDS_IN_ARRAY);
 	xlrec.nextXid = CurrRunningXacts->nextXid;
 	xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid;
 	xlrec.latestCompletedXid = CurrRunningXacts->latestCompletedXid;
@@ -1347,31 +1344,16 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	/* Header */
 	XLogBeginInsert();
 	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
-	XLogRegisterData((char *) (&xlrec), MinSizeOfXactRunningXacts);
-
-	/* array of TransactionIds */
-	if (xlrec.xcnt > 0)
-		XLogRegisterData((char *) CurrRunningXacts->xids,
-						 (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
+	XLogRegisterData((char *) (&xlrec), SizeOfXactRunningXacts);
 
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
-	if (xlrec.subxid_overflow)
-		elog(DEBUG2,
-			 "snapshot of %d running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
-			 CurrRunningXacts->xcnt,
-			 LSN_FORMAT_ARGS(recptr),
-			 CurrRunningXacts->oldestRunningXid,
-			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
-	else
-		elog(DEBUG2,
-			 "snapshot of %d+%d running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
-			 CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
-			 LSN_FORMAT_ARGS(recptr),
-			 CurrRunningXacts->oldestRunningXid,
-			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
+	elog(DEBUG2,
+		 "logging running transaction bounds (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
+		 LSN_FORMAT_ARGS(recptr),
+		 CurrRunningXacts->oldestRunningXid,
+		 CurrRunningXacts->latestCompletedXid,
+		 CurrRunningXacts->nextXid);
 
 	/*
 	 * Ensure running_xacts information is synced to disk not too far in the
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index cce0bc521e..9d5a298a39 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -60,8 +60,6 @@ extern void StandbyReleaseLockTree(TransactionId xid,
 extern void StandbyReleaseAllLocks(void);
 extern void StandbyReleaseOldLocks(TransactionId oldxid);
 
-#define MinSizeOfXactRunningXacts offsetof(xl_running_xacts, xids)
-
 
 /*
  * Declarations for GetRunningTransactionData(). Similar to Snapshots, but
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index fe12f463a8..d858209447 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -42,20 +42,24 @@ typedef struct xl_standby_locks
 } xl_standby_locks;
 
 /*
- * When we write running xact data to WAL, we use this structure.
+ * Data included in an XLOG_RUNNING_XACTS record.
+ *
+ * This used to include a list of running XIDs, hence the name, but nowadays
+ * this only contains the min and max bounds of the transactions that were
+ * running when the record was written.  They are needed to initialize logical
+ * decoding.  They are also used in hot standby to prune information about old
+ * running transactions, in case the primary didn't write a COMMIT/ABORT
+ * record for some reason.
  */
 typedef struct xl_running_xacts
 {
-	int			xcnt;			/* # of xact ids in xids[] */
-	int			subxcnt;		/* # of subxact ids in xids[] */
-	bool		subxid_overflow;	/* snapshot overflowed, subxids missing */
 	TransactionId nextXid;		/* xid from TransamVariables->nextXid */
 	TransactionId oldestRunningXid; /* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
-
-	TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
 } xl_running_xacts;
 
+#define SizeOfXactRunningXacts sizeof(xl_running_xacts)
+
 /*
  * Invalidations for standby, currently only when transactions without an
  * assigned xid commit.
-- 
2.39.5

v3-0005-Add-a-small-cache-to-Snapshot-to-avoid-CSN-lookup.patch (text/x-patch; charset=UTF-8)

From a068adaa11068c7a0fd7d44a04a257c7801fd945 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 18:15:49 +0300
Subject: [PATCH v3 5/5] Add a small cache to Snapshot to avoid CSN lookups

Keep the status of a few recently-looked up XIDs cached in the
SnapshotData. This avoids having to go to the CSN log in the common case
that the same XIDs are looked up over and over again.
---
 src/backend/utils/time/snapmgr.c | 28 +++++++++++++++++++++++++++-
 src/include/utils/snapshot.h     |  4 ++++
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index da82def846..e2b65e0dd5 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1807,6 +1807,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
 	snapshot->snapshotCsn = serialized_snapshot.snapshotCsn;
+	memset(snapshot->visible_cache, 0, sizeof(snapshot->visible_cache));
 	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
@@ -1917,12 +1918,37 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 	}
 	else
 	{
-		XLogRecPtr	csn = CSNLogGetCSNByXid(xid);
+		XLogRecPtr	csn;
 
+		/* see if we have this cached */
+		for (int i = 0; i < VISIBLE_CACHE_XACTS; i++)
+		{
+			if (snapshot->visible_cache[i] == xid)
+				return true;
+		}
+		for (int i = 0; i < VISIBLE_CACHE_XACTS; i++)
+		{
+			if (snapshot->invisible_cache[i] == xid)
+				return false;
+		}
+
+		csn = CSNLogGetCSNByXid(xid);
 		if (csn != InvalidXLogRecPtr && csn <= snapshot->snapshotCsn)
+		{
+			static uint8 last = 0;
+
+			snapshot->invisible_cache[last % VISIBLE_CACHE_XACTS] = xid;
+			last++;
 			return false;
+		}
 		else
+		{
+			static uint8 last = 0;
+
+			snapshot->visible_cache[last % VISIBLE_CACHE_XACTS] = xid;
+			last++;
 			return true;
+		}
 	}
 
 	return false;
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 1fda5b06f6..88cfce2ffe 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -188,6 +188,10 @@ typedef struct SnapshotData
 	 */
 	XLogRecPtr	snapshotCsn;
 
+#define VISIBLE_CACHE_XACTS 4
+	TransactionId visible_cache[VISIBLE_CACHE_XACTS];
+	TransactionId invisible_cache[VISIBLE_CACHE_XACTS];
+
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 	bool		copied;			/* false if it's a static snapshot */
 
-- 
2.39.5

Heikki Linnakangas

hlinnaka@iki.fi

about 1 year ago

In reply to: Heikki Linnakangas (#7)

5 attachment(s)

Re: CSN snapshots in hot standby

On 21/10/2024 20:32, Heikki Linnakangas wrote:

On 24/09/2024 21:08, Andres Freund wrote:

I'd like to see some numbers for a workload with many overlapping
top-level transactions. In contrast to 2) HEAD wouldn't need to do
subtrans lookups, whereas this patch would need to do csn lookups. And a
four entry cache probably wouldn't help very much.

I spent some more time on the tests. Here is a better set of adversarial
tests, which hit the worst case scenarios for this patch.

All the test scenarios have this high-level shape:

1. Create a table with 100000 rows, and VACUUM FREEZE it.

2. In the primary, open transactions or subtransactions, and DELETE all
rows using the different (sub)transactions, to set the xmax of every row
in the test table. Leave the transactions open.

3. In the standby, run SELECT COUNT(*) over all rows in the table, and
measure how long it takes.

The difference between the test scenarios is in the pattern of xmax
values, i.e. how many transactions or subtransactions were used. All the
rows are visible, the performance differences come just from how
expensive the visibility checks are in different cases.

First, the results on 'master' without patches (smaller is better):

few-xacts:                 0.0041 s / iteration
many-xacts:                0.0042 s / iteration
many-xacts-wide-apart:     0.0042 s / iteration
few-subxacts:              0.0042 s / iteration
many-subxacts:             0.0073 s / iteration
many-subxacts-wide-apart: 0.10   s / iteration

So even on master, there are significant differences depending on
whether the sub-XIDs fit in the in-memory caches, or if you need to do
lookups in pg_subtrans. That's not surprising. Note how bad the
"many-subxacts-wide-apart" scenario is, though. It's over 20x slower
than the best case scenario! I was a little taken aback by that. More on
that later.

Descriptions of the test scenarios:

few-xacts: The xmax values on the rows cycle through four different
XIDs, like this: 1001, 1002, 1003, 1004, 1001, 1002, 1003, 1004, ...

many-xacts: like 'few-xacts', but cycle through 100 different XIDs.

many-xacts-wide-apart: like 'many-xacts', but the XIDs used are spread
out, so that there are 1000 unrelated committed XIDs in between each XID
used in the test table. I.e. "1000, 2000, 3000, 4000, 5000, ...". On
'master', the spacing makes no difference in the 'many-xacts-wide-apart'
test, but in the 'many-subxacts-wide-apart' variant it does: it makes
the XIDs fall on different SLRU pages, so that there are not enough SLRU
buffers to hold them all.

few-subxacts, many-subxacts, many-subxacts-wide-apart: Same tests, but
instead of using different top-level XIDs, all the XIDs are
subtransactions belonging to a single top-level XID.
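
To illustrate why the spacing matters, here is a rough sketch in C that
counts how many distinct SLRU pages a set of ascending XIDs touches. It is
not code from the patch: the function name is hypothetical, and the
entries-per-page values are assumptions (2048 corresponds to 4-byte
pg_subtrans entries on an 8 kB page; 1024 assumes one 8-byte CSN per XID).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the distinct SLRU pages touched by `nxids` XIDs that start at
 * `first_xid` and are `stride` apart.  The XIDs are ascending, so it is
 * enough to count how often the page number changes.
 *
 * entries_per_page is an assumption supplied by the caller, not a value
 * taken from the patch.
 */
static int
count_slru_pages(uint32_t first_xid, int nxids, uint32_t stride,
                 uint32_t entries_per_page)
{
    int      npages = 0;
    uint32_t last_page = 0;

    assert(entries_per_page > 0);

    for (int i = 0; i < nxids; i++)
    {
        uint32_t xid = first_xid + (uint32_t) i * stride;
        uint32_t page = xid / entries_per_page;

        if (i == 0 || page != last_page)
        {
            npages++;
            last_page = page;
        }
    }
    return npages;
}
```

Under these assumptions, 100 consecutive XIDs starting at 1000 sit on a
single page, while 100 XIDs spaced 1000 apart touch 49 pages at 2048
entries per page and 98 pages at 1024 entries per page. Whether that
exceeds the available SLRU buffers depends on the configuration.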

Now, with the patch (the unpatched numbers are repeated here for
comparison):

                           master     patched
few-xacts:                 0.0041      0.0040 s / iteration
many-xacts:                0.0042      0.0053 s / iteration
many-xacts-wide-apart:     0.0042      0.17   s / iteration
few-subxacts:              0.0042      0.0040 s / iteration
many-subxacts:             0.0073      0.0052 s / iteration
many-subxacts-wide-apart: 0.10        0.22   s / iteration

So when the 4-element cache is effective, in the 'few-xacts' case, the
patch performs well. In the 'many-xacts' case, it needs to perform CSN
lookups, making it a little slower. The 'many-xacts-wide-apart' case
regresses badly, showing the same SLRU thrashing effect on CSN lookups
as the 'many-subxacts-wide-apart' case does on 'master' on pg_subtrans
lookups.
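
The difference between 'few-xacts' and 'many-xacts' can be reproduced with
a small simulation. This is a sketch, not the patch's code: it models a
single 4-slot cache with round-robin replacement (the same replacement
pattern as the `last % VISIBLE_CACHE_XACTS` indexing in the v3 patch), and
the function name and the starting XID of 1001 are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SLOTS 4

/*
 * Simulate `nlookups` visibility checks that cycle through `ndistinct`
 * XIDs, using a CACHE_SLOTS-entry cache with round-robin replacement.
 * Returns the number of lookups that missed the cache, i.e. that would
 * have had to consult the CSN log.
 */
static int
count_cache_misses(int ndistinct, int nlookups)
{
    uint32_t cache[CACHE_SLOTS] = {0};  /* 0 stands for an empty slot */
    unsigned next_slot = 0;
    int      misses = 0;

    assert(ndistinct > 0);

    for (int i = 0; i < nlookups; i++)
    {
        uint32_t xid = 1001 + (uint32_t) (i % ndistinct);
        int      hit = 0;

        for (int s = 0; s < CACHE_SLOTS; s++)
        {
            if (cache[s] == xid)
            {
                hit = 1;
                break;
            }
        }

        if (!hit)
        {
            /* overwrite the oldest slot, round-robin */
            cache[next_slot % CACHE_SLOTS] = xid;
            next_slot++;
            misses++;
        }
    }
    return misses;
}
```

With 4 distinct XIDs, only the first 4 of 100000 lookups miss. With 100
distinct XIDs, every entry is evicted before it comes around again, so all
100000 lookups miss and the cache is pure overhead.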

Here's another version, which replaces the small 4-element cache with a
cache with no size limit. It's implemented as a radix tree and entries
are never removed, so it can grow to hold the status of all XIDs between
the snapshot's xmin and xmax at most.
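
As a rough illustration of the 2-bits-per-XID packing, the sketch below
stores a status for each XID in 64-bit words. It is not the patch's
implementation: a flat array stands in for the radix tree's values, and
the status encoding and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define XIDS_PER_WORD 32            /* 64 bits / 2 bits per XID */

enum xid_status                     /* hypothetical encoding */
{
    XID_UNKNOWN = 0,                /* not looked up yet */
    XID_IN_SNAPSHOT = 1,            /* still in progress for this snapshot */
    XID_NOT_IN_SNAPSHOT = 2         /* committed at or before the snapshot CSN */
};

/* Store a 2-bit status for `xid` in a word array that starts at `base_xid`. */
static void
status_set(uint64_t *words, uint32_t base_xid, uint32_t xid,
           enum xid_status status)
{
    uint32_t off = xid - base_xid;
    unsigned shift = (off % XIDS_PER_WORD) * 2;

    assert(xid >= base_xid);

    words[off / XIDS_PER_WORD] &= ~(UINT64_C(3) << shift);     /* clear old */
    words[off / XIDS_PER_WORD] |= (uint64_t) status << shift;  /* set new */
}

/* Fetch the cached status; XID_UNKNOWN means a CSN log lookup is needed. */
static enum xid_status
status_get(const uint64_t *words, uint32_t base_xid, uint32_t xid)
{
    uint32_t off = xid - base_xid;
    unsigned shift = (off % XIDS_PER_WORD) * 2;

    assert(xid >= base_xid);

    return (enum xid_status) ((words[off / XIDS_PER_WORD] >> shift) & 3);
}
```

A caller would check `status_get()` first and fall back to the CSN log
only on `XID_UNKNOWN`, then record the answer with `status_set()`. Because
entries are never removed, each XID costs at most one CSN log lookup per
snapshot.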

This new cache solves the performance issue with the earlier tests:

                           master     patched
few-xacts:                 0.0041      0.0041 s / iteration
many-xacts:                0.0042      0.0042 s / iteration
many-xacts-wide-apart:     0.0043      0.0045 s / iteration
few-subxacts:              0.0043      0.0042 s / iteration
many-subxacts:             0.0076      0.0042 s / iteration
many-subxacts-wide-apart: 0.11        0.0070 s / iteration

The new cache also elides the slow pg_subtrans lookups that make the
many-subxacts-wide-apart case slow on 'master', which is nice.

I added two tests to the test suite:

                               master     patched
insert-all-different-xids:     0.00027    0.00019 s / iteration
insert-all-different-subxids:  0.00023    0.00020 s / iteration

insert-all-different-xids: Open 1000 connections, insert one row in
each, and leave the transactions open. In the replica, select all the
rows.

insert-all-different-subxids: The same, but with 1 transaction with 1000
subxids.

The point of these new tests is to test the scenario where the cache
doesn't help and just adds overhead, because each XID is looked up only
once. Seems to be fine. Surprisingly good actually; I'll do some more
profiling on that to understand why it's even faster than 'master'.

Now the downside of this new cache: Since it has no size limit, if you
keep looking up different XIDs, it will keep growing until it holds all
the XIDs between the snapshot's xmin and xmax. That can take a lot of
memory in the worst case. A radix tree is pretty memory efficient, but
holding, say, 1 billion XIDs would probably take something like 500 MB
of RAM (the radix tree stores 64-bit words with 2 bits per XID, plus the
radix tree nodes). That's per snapshot, so if you have a lot of
connections, maybe even with multiple snapshots each, that can add up.
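
The raw payload part of that estimate is simple arithmetic, sketched
below. The function name is hypothetical, and the result counts only the
2-bit entries themselves; the radix tree node overhead that makes up the
rest of the estimate is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

#define STATUS_BITS_PER_XID 2

/*
 * Bytes needed to hold a 2-bit status for each of `nxids` XIDs, rounded
 * up to whole bytes.  This is a lower bound on the cache size: it ignores
 * radix tree nodes and allocator overhead.
 */
static uint64_t
cache_payload_bytes(uint64_t nxids)
{
    assert(nxids <= UINT64_MAX / 8);    /* keep the multiplication safe */

    return (nxids * STATUS_BITS_PER_XID + 7) / 8;
}
```

For 1 billion XIDs this gives 250,000,000 bytes, so the raw status bits
account for about half of the estimate above.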

I'm inclined to accept that memory usage. If we wanted to limit the size
of the cache, we would need to choose a policy on how to truncate it
(delete random nodes?), what the limit should be etc. But I think it'd
be rare to hit those cases in practice. If you have a transaction that
is one billion XIDs old running in the primary, you probably have bigger
problems already.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

0001-XXX-add-perf-test.patch (text/x-patch; charset=UTF-8)

From 7b63d43eea7afa7cda8d96b5b8ff40ec0c83e630 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 21 Oct 2024 14:07:38 +0300
Subject: [PATCH 1/5] XXX: add perf test

This is not intended to be merged. But it's been useful to have this
in the tree for some quick perf testing during development.

To run it, I've used:

(cd build-release && ninja &&  rm -rf tmp_install && meson test --suite setup --suite test_misc; grep TEST testrun/test_misc/000_csn_perf/log/regress_log_000_csn_perf )

It runs the other test_misc tests concurrently, but they finish a lot
faster so they don't affect the results much.
---
 src/test/modules/test_misc/meson.build       |   1 +
 src/test/modules/test_misc/t/000_csn_perf.pl | 337 +++++++++++++++++++
 2 files changed, 338 insertions(+)
 create mode 100644 src/test/modules/test_misc/t/000_csn_perf.pl

diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 283ffa751aa..e55e80af54e 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -9,6 +9,7 @@ tests += {
        'enable_injection_points': get_option('injection_points') ? 'yes' : 'no',
     },
     'tests': [
+      't/000_csn_perf.pl',
       't/001_constraint_validation.pl',
       't/002_tablespace.pl',
       't/003_check_guc.pl',
diff --git a/src/test/modules/test_misc/t/000_csn_perf.pl b/src/test/modules/test_misc/t/000_csn_perf.pl
new file mode 100644
index 00000000000..3915878a407
--- /dev/null
+++ b/src/test/modules/test_misc/t/000_csn_perf.pl
@@ -0,0 +1,337 @@
+
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Measure the cost of visibility checks in queries on a hot standby
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(time);
+
+my $duration = 15; # seconds
+my $miniterations = 3;
+
+# Initialize a test cluster
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init();
+# Enable streaming replication and allow enough connections for the tests
+$primary->append_conf('postgresql.conf', 'max_wal_senders = 5');
+$primary->append_conf('postgresql.conf', 'wal_level=replica');
+$primary->append_conf('postgresql.conf', 'max_connections = 1005');
+$primary->start;
+$primary->backup('bkp');
+
+my $replica = PostgreSQL::Test::Cluster->new('replica');
+$replica->init_from_backup($primary, 'bkp', has_streaming => 1);
+$replica->append_conf('postgresql.conf', "shared_buffers='1 GB'");
+$replica->start;
+
+sub wait_catchup
+{
+	my ($primary, $replica) = @_;
+	
+	my $primary_lsn =
+	  $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+	my $caughtup_query =
+	  "SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()";
+	$replica->poll_query_until('postgres', $caughtup_query)
+	  or die "Timed out while waiting for standby to catch up";
+}
+
+sub repeat_and_time_sql
+{
+  	my ($name, $node, $sql) = @_;
+
+	my $session =  $node->background_psql('postgres', on_error_die => 1);
+	$session->query_safe("SET max_parallel_workers_per_gather=0");
+
+	my $iterations = 0;
+
+	my $now;
+	my $elapsed;
+    my $begin_time = time();
+	while (1) {
+		$session->query_safe($sql);
+		$now = time();
+		$iterations = $iterations + 1;
+
+		$elapsed = $now - $begin_time;
+		if ($elapsed > $duration && $iterations >= $miniterations) {
+			last;
+		}
+	}
+
+	my $periter = $elapsed / $iterations;
+
+	pass ("TEST $name: $elapsed s, $iterations iterations, $periter s / iteration");
+}
+
+
+$primary->safe_psql('postgres', "CREATE TABLE little (i int);");
+$primary->safe_psql('postgres', "INSERT INTO little VALUES (1);");
+
+sub consume_xids
+{
+	my ($node) = @_;
+
+	my $session = $node->background_psql('postgres', on_error_die => 1);
+	for(my $i = 0; $i < 20; $i++) {
+		$session->query_safe(q{do $$
+  begin
+    for i in 1..50 loop
+      begin
+        DELETE from little;
+        perform 1 / 0;
+      exception
+        when division_by_zero then perform 0 /* do nothing */;
+        when others then raise 'fail: %', sqlerrm;
+      end;
+    end loop;
+  end
+$$;});
+	}
+	$session->quit;
+}
+
+# TEST few-xacts
+#
+# Cycle through 4 different top-level XIDs
+#
+# 1001, 1002, 1003, 1004, 1001, 1002, 1003, 1004, ...
+#
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 4;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session = $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("few-xacts", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST many-xacts
+#
+# like few-xacts, but we cycle through 100 different XIDs instead of 4.
+#
+# 1001, 1002, 1003, ... 1100, 1001, 1002, 1003, ... 1100  ....
+#
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 100;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session = $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-xacts", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST many-xacts-wide-apart
+#
+# like many-xacts, but the XIDs are more spread out, so that they don't fit in the
+# SLRU caches.
+#
+# 1000, 2000, 3000, 4000, ....
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 100;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session = $primary->background_psql('postgres', on_error_die => 1);
+
+		consume_xids($primary);
+
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-xacts-wide-apart", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: few-subxacts
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session = $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 4;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("few-subxacts", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+
+# TEST: many-subxacts
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session = $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-subxacts", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: many-subxacts-wide-apart
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session = $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		consume_xids($primary);
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-subxacts-wide-apart", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: insert-all-different-xids
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+
+	my @primary_sessions = ();
+	my $num_connections = 1000;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session = $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("INSERT INTO tbl VALUES ($i)");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("insert-all-different-xids", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: insert-all-different-subxids
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session = $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i; INSERT INTO tbl VALUES($i); release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("insert-all-different-subxids", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+done_testing();
-- 
2.39.5

0002-Use-CSN-snapshots-during-Hot-Standby.patch (text/x-patch; charset=UTF-8)

From 6518ecc0bafa06c872ed5309584a3345e5835a3c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 20:26:40 +0300
Subject: [PATCH 2/5] Use CSN snapshots during Hot Standby

Replace the known-assigned-XIDs mechanism with a CSN log. The CSN log
(pg_csn) tracks the commit LSN of each transaction, when replaying the
WAL on a standby. It's only used on the standby, and is initialized
from scratch at server startup like pg_subtrans.

Based on 0001-CSN-base-snapshot.patch from
https://www.postgresql.org/message-id/2020081009525213277261%40highgo.ca.
This patch has a long lineage: various CSN patches have been posted
with parts from Stas Kelvich, Movead Li, Ants Aasma, Heikki
Linnakangas, and Alexander Kuzmenkov.
---
 contrib/pg_visibility/pg_visibility.c         |    1 +
 src/backend/access/rmgrdesc/xactdesc.c        |   26 -
 src/backend/access/transam/Makefile           |    1 +
 src/backend/access/transam/csn_log.c          |  474 ++++++
 src/backend/access/transam/meson.build        |    1 +
 src/backend/access/transam/transam.c          |    3 +
 src/backend/access/transam/twophase.c         |   34 +-
 src/backend/access/transam/varsup.c           |    1 +
 src/backend/access/transam/xact.c             |  138 +-
 src/backend/access/transam/xlog.c             |  118 +-
 src/backend/access/transam/xlogrecovery.c     |   13 +-
 src/backend/access/transam/xlogutils.c        |    2 +-
 src/backend/postmaster/startup.c              |    2 +-
 src/backend/replication/logical/decode.c      |    8 -
 src/backend/replication/logical/snapbuild.c   |    2 +-
 src/backend/storage/ipc/ipci.c                |    3 +
 src/backend/storage/ipc/procarray.c           | 1512 ++---------------
 src/backend/storage/ipc/standby.c             |  102 +-
 src/backend/storage/lmgr/lwlock.c             |    2 +
 .../utils/activity/wait_event_names.txt       |    1 +
 src/backend/utils/probes.d                    |    2 +
 src/backend/utils/time/snapmgr.c              |   37 +-
 src/bin/initdb/initdb.c                       |    3 +-
 src/include/access/csn_log.h                  |   30 +
 src/include/access/transam.h                  |    3 +
 src/include/access/twophase.h                 |    3 +-
 src/include/access/xact.h                     |   12 +-
 src/include/access/xlogutils.h                |   33 +-
 src/include/storage/lwlock.h                  |    2 +
 src/include/storage/procarray.h               |   13 +-
 src/include/utils/snapshot.h                  |    7 +
 31 files changed, 821 insertions(+), 1768 deletions(-)
 create mode 100644 src/backend/access/transam/csn_log.c
 create mode 100644 src/include/access/csn_log.h
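
As a rough illustration of the idea in the commit message, a visibility check against a CSN log can be sketched as below. This is a toy model, not code from the patch: the names (ToyXid, toy_csn_log, toy_replay_commit, toy_xid_visible), the fixed-size in-memory array, and the "commit LSN <= snapshot LSN" comparison are all assumptions made for illustration. The patch itself keeps the CSNs in an SLRU (pg_csn) and has its own snapshot representation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;		/* stand-in for TransactionId */
typedef uint64_t ToyLsn;		/* stand-in for XLogRecPtr */

#define TOY_INVALID_LSN ((ToyLsn) 0)
#define TOY_MAX_XIDS 1024

/* Toy CSN log: one commit LSN per XID; zero means "no commit replayed yet". */
static ToyLsn toy_csn_log[TOY_MAX_XIDS];

/*
 * Replay of a commit record found at commit_lsn: stamp the subtransactions
 * first, then the top-level XID, all with the same LSN.
 */
void
toy_replay_commit(ToyXid xid, const ToyXid *subxids, int nsubxids,
				  ToyLsn commit_lsn)
{
	assert(xid < TOY_MAX_XIDS);

	for (int i = 0; i < nsubxids; i++)
	{
		assert(subxids[i] < TOY_MAX_XIDS);
		toy_csn_log[subxids[i]] = commit_lsn;
	}
	toy_csn_log[xid] = commit_lsn;
}

/*
 * A snapshot is just an LSN in this model.  An XID is visible if its commit
 * record was replayed at or before that LSN (assumed comparison rule).
 */
bool
toy_xid_visible(ToyXid xid, ToyLsn snapshot_csn)
{
	ToyLsn		csn;

	assert(xid < TOY_MAX_XIDS);
	csn = toy_csn_log[xid];

	return csn != TOY_INVALID_LSN && csn <= snapshot_csn;
}
```

In this model a snapshot is a single LSN, so there is no array of in-progress XIDs to size and nothing special to do when a transaction has many subtransactions.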

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 5d0deaba61e..7905a91412a 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -581,6 +581,7 @@ collect_visibility_data(Oid relid, bool include_pd)
  *    now perform minimal checking on a standby by always using nextXid, this
  *    approach is better than nothing and will at least catch extremely broken
  *    cases where a xid is in the future.
+ *    XXX KnownAssignedXids is gone.
  * 3. Ignore walsender xmin, because it could go backward if some replication
  *    connections don't use replication slots.
  *
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 889cb955c18..128486e751e 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -424,17 +424,6 @@ xact_desc_prepare(StringInfo buf, uint8 info, xl_xact_prepare *xlrec, RepOriginI
 						 timestamptz_to_str(parsed.origin_timestamp));
 }
 
-static void
-xact_desc_assignment(StringInfo buf, xl_xact_assignment *xlrec)
-{
-	int			i;
-
-	appendStringInfoString(buf, "subxacts:");
-
-	for (i = 0; i < xlrec->nsubxacts; i++)
-		appendStringInfo(buf, " %u", xlrec->xsub[i]);
-}
-
 void
 xact_desc(StringInfo buf, XLogReaderState *record)
 {
@@ -462,18 +451,6 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		xact_desc_prepare(buf, XLogRecGetInfo(record), xlrec,
 						  XLogRecGetOrigin(record));
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) rec;
-
-		/*
-		 * Note that we ignore the WAL record's xid, since we're more
-		 * interested in the top-level xid that issued the record and which
-		 * xids are being reported here.
-		 */
-		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
-		xact_desc_assignment(buf, xlrec);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		xl_xact_invals *xlrec = (xl_xact_invals *) rec;
@@ -505,9 +482,6 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ABORT_PREPARED:
 			id = "ABORT_PREPARED";
 			break;
-		case XLOG_XACT_ASSIGNMENT:
-			id = "ASSIGNMENT";
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			id = "INVALIDATION";
 			break;
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index a32f473e0a2..cb54d999587 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	clog.o \
 	commit_ts.o \
+	csn_log.o \
 	generic_xlog.o \
 	multixact.o \
 	parallel.o \
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
new file mode 100644
index 00000000000..1188a78c4a8
--- /dev/null
+++ b/src/backend/access/transam/csn_log.c
@@ -0,0 +1,474 @@
+/*-----------------------------------------------------------------------------
+ *
+ * csn_log.c
+ *		Track commit record LSNs of finished transactions
+ *
+ * This module provides an SLRU to store the LSN of the commit record of each
+ * transaction. CSN stands for Commit Sequence Number, and in principle we
+ * could use a separate counter that is incremented at every commit. For
+ * simplicity, though, we use the commit record's LSN as the sequence number.
+ *
+ * Like pg_subtrans, this mapping needs to be kept only for XIDs greater than
+ * oldestXmin, and doesn't need to be preserved over crashes.  Also, this is
+ * only needed in hot standby mode, and immediately after exiting hot standby
+ * mode, until all old snapshots taken during standby mode are gone.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_log.c
+ *
+ *-----------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/clog.h"
+#include "access/csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+/*
+ * Defines for CSNLog page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE, and CSNLog segment numbering at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCSNLog (see CSNLogPagePrecedes).
+ */
+
+/* We store the commit CSN for each xid */
+#define CSN_LOG_XACTS_PER_PAGE (BLCKSZ / sizeof(XLogRecPtr))
+
+#define TransactionIdToPage(xid)	((xid) / (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+
+#define PgIndexToTransactionId(pageno, idx) (CSN_LOG_XACTS_PER_PAGE * (pageno) + (idx))
+
+
+
+/*
+ * Link to shared-memory data structures for CSNLog control
+ */
+static SlruCtlData CSNLogCtlData;
+#define CsnlogCtl (&CSNLogCtlData)
+
+static int	ZeroCSNLogPage(int pageno);
+static bool CSNLogPagePrecedes(int64 page1, int64 page2);
+static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+								TransactionId *subxids,
+								XLogRecPtr csn, int pageno);
+static void CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn,
+							   int slotno);
+
+
+/*
+ * Record commit LSN of a transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transaction ID for a top level commit.
+ *
+ * subxids is an array of xids of length nsubxids, in logical XID order,
+ * representing subtransactions in the tree of XIDs. In various cases nsubxids
+ * may be zero.
+ *
+ * commitLsn is the LSN of the commit record.  This is currently never called
+ * for aborted transactions.
+ */
+void
+CSNLogSetCSN(TransactionId xid, int nsubxids, TransactionId *subxids,
+			 XLogRecPtr commitLsn)
+{
+	int			pageno;
+	int			i = 0;
+	int			offset = 0;
+
+	Assert(TransactionIdIsValid(xid));
+
+	pageno = TransactionIdToPage(xid);	/* get page of parent */
+	for (;;)
+	{
+		int			num_on_page = 0;
+
+		while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+		{
+			num_on_page++;
+			i++;
+		}
+
+		CSNLogSetPageStatus(xid,
+							num_on_page, subxids + offset,
+							commitLsn, pageno);
+		if (i >= nsubxids)
+			break;
+
+		offset = i;
+		pageno = TransactionIdToPage(subxids[offset]);
+		xid = InvalidTransactionId;
+	}
+}
+
+/*
+ * Record the final state of transaction entries in the CSN log for all
+ * entries on a single page.  Atomic only on this page.
+ *
+ * Otherwise API is same as CSNLogSetCSN()
+ */
+static void
+CSNLogSetPageStatus(TransactionId xid, int nsubxids, TransactionId *subxids,
+					XLogRecPtr commitLsn, int pageno)
+{
+	int			slotno;
+	int			i;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CsnlogCtl, pageno, true, xid);
+
+	/* Subtransactions first, if needed ... */
+	for (i = 0; i < nsubxids; i++)
+	{
+		Assert(CsnlogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+		CSNLogSetCSNInSlot(subxids[i], commitLsn, slotno);
+	}
+
+	/* ... then the main transaction */
+	if (TransactionIdIsValid(xid))
+		CSNLogSetCSNInSlot(xid, commitLsn, slotno);
+
+	CsnlogCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Sets the commit LSN (CSN) of a single transaction in the given slot.
+ */
+static void
+CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn, int slotno)
+{
+	int			entryno = TransactionIdToPgIndex(xid);
+	XLogRecPtr *ptr;
+
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+
+	*ptr = csn;
+}
+
+/*
+ * Interrogate the state of a transaction in the log.
+ *
+ * NB: this is a low-level routine and is NOT the preferred entry point
+ * for most uses; TransactionIdGetXidCSN() in csn_snapshot.c is the
+ * intended caller.
+ */
+XLogRecPtr
+CSNLogGetCSNByXid(TransactionId xid)
+{
+	int			pageno = TransactionIdToPage(xid);
+	int			entryno = TransactionIdToPgIndex(xid);
+	int			slotno;
+	XLogRecPtr *ptr;
+	XLogRecPtr	xid_csn;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Can't ask about stuff that might not be around anymore */
+	Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+
+	slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+	xid_csn = *ptr;
+
+	LWLockRelease(SimpleLruGetBankLock(CsnlogCtl, pageno));
+
+	return xid_csn;
+}
+
+/*
+ * Number of shared CSNLog buffers.
+ */
+static Size
+CSNLogShmemBuffers(void)
+{
+	return Min(32, Max(16, NBuffers / 512));
+}
+
+/*
+ * Reserve shared memory for CsnlogCtl.
+ */
+Size
+CSNLogShmemSize(void)
+{
+	/* FIXME: skip if not InHotStandby? */
+	return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
+}
+
+/*
+ * Initialization of shared memory for CSNLog.
+ */
+void
+CSNLogShmemInit(void)
+{
+	CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
+	SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
+				  "pg_csn", LWTRANCHE_CSN_LOG_BUFFER,
+				  LWTRANCHE_CSN_LOG_SLRU, SYNC_HANDLER_NONE, false);
+	//SlruPagePrecedesUnitTests(CsnlogCtl, SUBTRANS_XACTS_PER_PAGE);
+}
+
+/*
+ * This func must be called ONCE on system install.  It creates the initial
+ * CSNLog segment.  The pg_csn directory is assumed to have been
+ * created by initdb, and CSNLogShmemInit must have been called already.
+ *
+ * Note: it's not really necessary to create the initial segment now,
+ * since slru.c would create it on first write anyway.  But we may as well
+ * do it to be sure the directory is set up correctly.
+ */
+void
+BootStrapCSNLog(void)
+{
+	int			slotno;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, 0);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Create and zero the first page of the CSN log */
+	slotno = ZeroCSNLogPage(0);
+
+	/* Make sure it's written out */
+	SimpleLruWritePage(CsnlogCtl, slotno);
+	Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Initialize (or reinitialize) a page of CSNLog to zeroes.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCSNLogPage(int pageno)
+{
+	return SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * Initialize a page of CSNLog based on pg_xact.
+ *
+ * All completed transactions, committed or aborted, are stamped with 'csn'
+ */
+static void
+InitCSNLogPage(int pageno, TransactionId *xid, TransactionId nextXid, XLogRecPtr csn)
+{
+	XLogRecPtr	dummy;
+	int			slotno;
+
+	slotno = ZeroCSNLogPage(pageno);
+
+	while (*xid < nextXid && TransactionIdToPage(*xid) == pageno)
+	{
+		XidStatus	status = TransactionIdGetStatus(*xid, &dummy);
+
+		if (status == TRANSACTION_STATUS_COMMITTED ||
+			status == TRANSACTION_STATUS_ABORTED)
+			CSNLogSetCSNInSlot(*xid, csn, slotno);
+
+		TransactionIdAdvance(*xid);
+	}
+	/* ZeroCSNLogPage() already marked the page dirty; don't zero it again */
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * after StartupXLOG has initialized TransamVariables->nextXid, and after
+ * initializing the CLOG.
+ *
+ * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
+ * if there are none.
+ *
+ * All transactions that have already completed are marked with 'csn'. ('csn'
+ * is supposed to be older than anything we'll ever need to compare with.)
+ */
+void
+StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn)
+{
+	TransactionId xid;
+	FullTransactionId nextXid;
+	int			startPage;
+	int			endPage;
+	LWLock	   *prevlock = NULL;
+	LWLock	   *lock;
+
+	/*
+	 * Since we don't expect pg_csn to be valid across crashes, we initialize
+	 * the currently-active page(s) to zeroes during startup. Whenever we
+	 * advance into a new page, ExtendCSNLog will likewise zero the new page
+	 * without regard to whatever was previously on disk.
+	 */
+	startPage = TransactionIdToPage(oldestActiveXID);
+	nextXid = TransamVariables->nextXid;
+	endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid));
+
+	Assert(TransactionIdIsValid(oldestActiveXID));
+	Assert(FullTransactionIdIsValid(nextXid));
+
+	xid = oldestActiveXID;
+	for (;;)
+	{
+		lock = SimpleLruGetBankLock(CsnlogCtl, startPage);
+		if (prevlock != lock)
+		{
+			if (prevlock)
+				LWLockRelease(prevlock);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			prevlock = lock;
+		}
+
+		InitCSNLogPage(startPage, &xid, XidFromFullTransactionId(nextXid), csn);
+		if (startPage == endPage)
+			break;
+
+		startPage++;
+		/* must account for wraparound */
+		if (startPage > TransactionIdToPage(MaxTransactionId))
+			startPage = 0;
+	}
+
+	LWLockRelease(lock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely as a debugging aid.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(false);
+	SimpleLruWriteAll(CsnlogCtl, false);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely to improve the odds that writing of dirty pages is done by
+	 * the checkpoint process and not by backends.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(true);
+	SimpleLruWriteAll(CsnlogCtl, true);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(true);
+}
+
+/*
+ * Make sure that CSNLog has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CSN log page to make room
+ * in shared memory.
+ */
+void
+ExtendCSNLog(TransactionId newestXact)
+{
+	int64		pageno;
+	LWLock	   *lock;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToPgIndex(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToPage(newestXact);
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Zero the page; pg_csn is not WAL-logged */
+	ZeroCSNLogPage(pageno);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Remove all CSNLog segments before the one holding the passed
+ * transaction ID.
+ *
+ * This is normally called during checkpoint, with oldestXact being the
+ * oldest TransactionXmin of any running transaction.
+ */
+void
+TruncateCSNLog(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate. We step
+	 * back one transaction to avoid passing a cutoff page that hasn't been
+	 * created yet in the rare case that oldestXact would be the first item on
+	 * a page and oldestXact == next XID.  In that case, if we didn't subtract
+	 * one, we'd trigger SimpleLruTruncate's wraparound detection.
+	 */
+	TransactionIdRetreat(oldestXact);
+	cutoffPage = TransactionIdToPage(oldestXact);
+
+	SimpleLruTruncate(CsnlogCtl, cutoffPage);
+}
+
+/*
+ * Decide which of two CSNLog page numbers is "older" for truncation
+ * purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic.  However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CSNLogPagePrecedes(int64 page1, int64 page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * CSN_LOG_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId;
+	xid2 = ((TransactionId) page2) * CSN_LOG_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId;
+
+	return TransactionIdPrecedes(xid1, xid2);
+}
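
The page arithmetic in csn_log.c above can be checked with a small standalone sketch. It assumes the default BLCKSZ of 8192 and an 8-byte XLogRecPtr, which gives 1024 entries per page. The demo_* names are hypothetical stand-ins for the patch's macros and for CSNLogPagePrecedes(), and the comparison mirrors the modulo-2^32 rule that TransactionIdPrecedes() applies to normal XIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed defaults: BLCKSZ = 8192 and sizeof(XLogRecPtr) = 8. */
#define DEMO_BLCKSZ 8192
#define DEMO_XACTS_PER_PAGE (DEMO_BLCKSZ / sizeof(uint64_t))	/* 1024 */
#define DEMO_FIRST_NORMAL_XID 3

/* Page of the CSN log that holds the entry for xid. */
int64_t
demo_xid_to_page(uint32_t xid)
{
	return xid / (uint32_t) DEMO_XACTS_PER_PAGE;
}

/* Index of xid's entry within its page. */
uint32_t
demo_xid_to_index(uint32_t xid)
{
	return xid % (uint32_t) DEMO_XACTS_PER_PAGE;
}

/* Byte offset of xid's entry within its page buffer. */
size_t
demo_xid_to_offset(uint32_t xid)
{
	return demo_xid_to_index(xid) * sizeof(uint64_t);
}

/*
 * Wraparound-aware "page1 is older than page2".  Both pages are mapped to
 * an XID offset by the first normal XID, so that neither is a special XID,
 * and then compared modulo 2^32.
 */
bool
demo_page_precedes(int64_t page1, int64_t page2)
{
	uint32_t	xid1 = (uint32_t) page1 * (uint32_t) DEMO_XACTS_PER_PAGE
		+ DEMO_FIRST_NORMAL_XID;
	uint32_t	xid2 = (uint32_t) page2 * (uint32_t) DEMO_XACTS_PER_PAGE
		+ DEMO_FIRST_NORMAL_XID;

	assert(xid1 >= DEMO_FIRST_NORMAL_XID && xid2 >= DEMO_FIRST_NORMAL_XID);

	return (int32_t) (xid1 - xid2) < 0;
}
```

For example, XID 100000 lands on page 97 (97 * 1024 = 99328) at index 672, which is byte offset 5376 within the page. The last page before wraparound (4194303) compares as older than page 0, which is the case the offset by FirstNormalTransactionId is meant to handle cleanly.
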
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 91d258f9df1..763f1ce44f0 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -2,6 +2,7 @@
 
 backend_sources += files(
   'clog.c',
+  'csn_log.c',
   'commit_ts.c',
   'generic_xlog.c',
   'multixact.c',
diff --git a/src/backend/access/transam/transam.c b/src/backend/access/transam/transam.c
index 75b5325df8b..93c4d495e4b 100644
--- a/src/backend/access/transam/transam.c
+++ b/src/backend/access/transam/transam.c
@@ -377,6 +377,9 @@ TransactionIdLatest(TransactionId mainxid,
  * Also, because we group transactions on the same clog page to conserve
  * storage, we might return the LSN of a later transaction that falls into
  * the same group.
+ *
+ * XXX: Now that we have the CSN log, should we use that during recovery? Or
+ * rename this function to reduce confusion?
  */
 XLogRecPtr
 TransactionIdGetCommitLSN(TransactionId xid)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 23dd0c6ef6e..7e9fc7c5355 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1960,20 +1961,13 @@ restoreTwoPhaseData(void)
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return TransamVariables->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
- *
- * If xids_p and nxids_p are not NULL, pointer to a palloc'd array of all
- * top-level xids is stored in *xids_p. The number of entries in the array
- * is returned in *nxids_p.
  */
 TransactionId
-PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
+PrescanPreparedTransactions(void)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
 	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId result = origNextXid;
-	TransactionId *xids = NULL;
-	int			nxids = 0;
-	int			allocsize = 0;
 	int			i;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
@@ -2001,34 +1995,10 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		if (TransactionIdPrecedes(xid, result))
 			result = xid;
 
-		if (xids_p)
-		{
-			if (nxids == allocsize)
-			{
-				if (nxids == 0)
-				{
-					allocsize = 10;
-					xids = palloc(allocsize * sizeof(TransactionId));
-				}
-				else
-				{
-					allocsize = allocsize * 2;
-					xids = repalloc(xids, allocsize * sizeof(TransactionId));
-				}
-			}
-			xids[nxids++] = xid;
-		}
-
 		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 
-	if (xids_p)
-	{
-		*xids_p = xids;
-		*nxids_p = nxids;
-	}
-
 	return result;
 }
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index cfe8c6cf8dc..b0744236541 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -15,6 +15,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 004f7e10e55..a0dbb6d281c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -210,7 +211,6 @@ typedef struct TransactionStateData
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;	/* entry-time xact r/o state */
 	bool		startedInRecovery;	/* did we start in recovery? */
-	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		parallelChildXact;	/* is any parent transaction parallel? */
 	bool		chain;			/* start a new block after this one */
@@ -250,13 +250,6 @@ static TransactionStateData TopTransactionStateData = {
 	.topXidLogged = false,
 };
 
-/*
- * unreportedXids holds XIDs of all subtransactions that have not yet been
- * reported in an XLOG_XACT_ASSIGNMENT record.
- */
-static int	nUnreportedXids;
-static TransactionId unreportedXids[PGPROC_MAX_CACHED_SUBXIDS];
-
 static TransactionState CurrentTransactionState = &TopTransactionStateData;
 
 /*
@@ -532,18 +525,6 @@ GetCurrentFullTransactionIdIfAny(void)
 	return CurrentTransactionState->fullTransactionId;
 }
 
-/*
- *	MarkCurrentTransactionIdLoggedIfAny
- *
- * Remember that the current xid - if it is assigned - now has been wal logged.
- */
-void
-MarkCurrentTransactionIdLoggedIfAny(void)
-{
-	if (FullTransactionIdIsValid(CurrentTransactionState->fullTransactionId))
-		CurrentTransactionState->didLogXid = true;
-}
-
 /*
  * IsSubxactTopXidLogPending
  *
@@ -636,7 +617,6 @@ AssignTransactionId(TransactionState s)
 {
 	bool		isSubXact = (s->parent != NULL);
 	ResourceOwner currentOwner;
-	bool		log_unknown_top = false;
 
 	/* Assert that caller didn't screw up */
 	Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -680,20 +660,6 @@ AssignTransactionId(TransactionState s)
 		pfree(parents);
 	}
 
-	/*
-	 * When wal_level=logical, guarantee that a subtransaction's xid can only
-	 * be seen in the WAL stream if its toplevel xid has been logged before.
-	 * If necessary we log an xact_assignment record with fewer than
-	 * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
-	 * for a transaction even though it appears in a WAL record, we just might
-	 * superfluously log something. That can happen when an xid is included
-	 * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
-	 * xl_standby_locks.
-	 */
-	if (isSubXact && XLogLogicalInfoActive() &&
-		!TopTransactionStateData.didLogXid)
-		log_unknown_top = true;
-
 	/*
 	 * Generate a new FullTransactionId and record its xid in PGPROC and
 	 * pg_subtrans.
@@ -729,59 +695,6 @@ AssignTransactionId(TransactionState s)
 	XactLockTableInsert(XidFromFullTransactionId(s->fullTransactionId));
 
 	CurrentResourceOwner = currentOwner;
-
-	/*
-	 * Every PGPROC_MAX_CACHED_SUBXIDS assigned transaction ids within each
-	 * top-level transaction we issue a WAL record for the assignment. We
-	 * include the top-level xid and all the subxids that have not yet been
-	 * reported using XLOG_XACT_ASSIGNMENT records.
-	 *
-	 * This is required to limit the amount of shared memory required in a hot
-	 * standby server to keep track of in-progress XIDs. See notes for
-	 * RecordKnownAssignedTransactionIds().
-	 *
-	 * We don't keep track of the immediate parent of each subxid, only the
-	 * top-level transaction that each subxact belongs to. This is correct in
-	 * recovery only because aborted subtransactions are separately WAL
-	 * logged.
-	 *
-	 * This is correct even for the case where several levels above us didn't
-	 * have an xid assigned as we recursed up to them beforehand.
-	 */
-	if (isSubXact && XLogStandbyInfoActive())
-	{
-		unreportedXids[nUnreportedXids] = XidFromFullTransactionId(s->fullTransactionId);
-		nUnreportedXids++;
-
-		/*
-		 * ensure this test matches similar one in
-		 * RecoverPreparedTransactions()
-		 */
-		if (nUnreportedXids >= PGPROC_MAX_CACHED_SUBXIDS ||
-			log_unknown_top)
-		{
-			xl_xact_assignment xlrec;
-
-			/*
-			 * xtop is always set by now because we recurse up transaction
-			 * stack to the highest unassigned xid and then come back down
-			 */
-			xlrec.xtop = GetTopTransactionId();
-			Assert(TransactionIdIsValid(xlrec.xtop));
-			xlrec.nsubxacts = nUnreportedXids;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, MinSizeOfXactAssignment);
-			XLogRegisterData((char *) unreportedXids,
-							 nUnreportedXids * sizeof(TransactionId));
-
-			(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT);
-
-			nUnreportedXids = 0;
-			/* mark top, not current xact as having been logged */
-			TopTransactionStateData.didLogXid = true;
-		}
-	}
 }
 
 /*
@@ -1481,11 +1394,11 @@ RecordTransactionCommit(void)
 	 * temp tables will be lost anyway, unlogged tables will be truncated and
 	 * HOT pruning will be done again later. (Given the foregoing, you might
 	 * think that it would be unnecessary to emit the XLOG record at all in
-	 * this case, but we don't currently try to do that.  It would certainly
-	 * cause problems at least in Hot Standby mode, where the
-	 * KnownAssignedXids machinery requires tracking every XID assignment.  It
-	 * might be OK to skip it only when wal_level < replica, but for now we
-	 * don't.)
+	 * this case, but we don't currently try to do that.  It might cause
+	 * inefficiencies in Hot Standby mode, if nothing else, where the
+	 * commit/abort records allow advancing the xmin horizon for new
+	 * snapshots. It might be OK to skip it only when wal_level < replica, but
+	 * for now we don't.)
 	 *
 	 * However, if we're doing cleanup of any non-temp rels or committing any
 	 * command that wanted to force sync commit, then we must flush XLOG
@@ -1953,13 +1866,6 @@ AtSubAbort_childXids(void)
 	s->childXids = NULL;
 	s->nChildXids = 0;
 	s->maxChildXids = 0;
-
-	/*
-	 * We could prune the unreportedXids array here. But we don't bother. That
-	 * would potentially reduce number of XLOG_XACT_ASSIGNMENT records but it
-	 * would likely introduce more CPU time into the more common paths, so we
-	 * choose not to do that.
-	 */
 }
 
 /* ----------------------------------------------------------------
@@ -2142,12 +2048,6 @@ StartTransaction(void)
 	currentCommandId = FirstCommandId;
 	currentCommandIdUsed = false;
 
-	/*
-	 * initialize reported xid accounting
-	 */
-	nUnreportedXids = 0;
-	s->didLogXid = false;
-
 	/*
 	 * must initialize resource-management stuff first
 	 */
@@ -6161,7 +6061,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 	TransactionTreeSetCommitTsData(xid, parsed->nsubxacts, parsed->subxacts,
 								   commit_time, origin_id);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/*
 		 * Mark the transaction committed in pg_xact.
@@ -6181,6 +6081,12 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		 */
 		RecordKnownAssignedTransactionIds(max_xid);
 
+		/*
+		 * Mark the CSNLOG first.  The transaction won't become visible to new
+		 * snapshots until the call to ProcArrayRecoveryEndTransaction().
+		 */
+		CSNLogSetCSN(xid, parsed->nsubxacts, parsed->subxacts, lsn);
+
 		/*
 		 * Mark the transaction committed in pg_xact. We use async commit
 		 * protocol during recovery to provide information on database
@@ -6193,9 +6099,9 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		TransactionIdAsyncCommitTree(xid, parsed->nsubxacts, parsed->subxacts, lsn);
 
 		/*
-		 * We must mark clog before we update the ProcArray.
+		 * Make the commit visible to new snapshots in the ProcArray.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ProcArrayRecoveryEndTransaction(max_xid, lsn);
 
 		/*
 		 * Send any cache invalidations attached to the commit. We must
@@ -6301,7 +6207,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 								  parsed->subxacts);
 	AdvanceNextFullTransactionIdPastXid(max_xid);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/* Mark the transaction aborted in pg_xact, no need for async stuff */
 		TransactionIdAbortTree(xid, parsed->nsubxacts, parsed->subxacts);
@@ -6319,13 +6225,15 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 		 */
 		RecordKnownAssignedTransactionIds(max_xid);
 
+		/* Note: we don't need to update the CSN log on abort. */
+
 		/* Mark the transaction aborted in pg_xact, no need for async stuff */
 		TransactionIdAbortTree(xid, parsed->nsubxacts, parsed->subxacts);
 
 		/*
 		 * We must update the ProcArray after we have marked clog.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ProcArrayRecoveryEndTransaction(max_xid, lsn);
 
 		/*
 		 * There are no invalidation messages to send or undo.
@@ -6433,14 +6341,6 @@ xact_redo(XLogReaderState *record)
 					   XLogRecGetOrigin(record));
 		LWLockRelease(TwoPhaseStateLock);
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record);
-
-		if (standbyState >= STANDBY_INITIALIZED)
-			ProcArrayApplyXidAssignment(xlrec->xtop,
-										xlrec->nsubxacts, xlrec->xsub);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3ecaf181392..0739e049934 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -48,6 +48,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
@@ -952,8 +953,6 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	END_CRIT_SECTION();
 
-	MarkCurrentTransactionIdLoggedIfAny();
-
 	/*
 	 * Mark top transaction id is logged (if needed) so that we should not try
 	 * to log it again with the next WAL record in the current subtransaction.
@@ -5177,6 +5176,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCSNLog();
 	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
@@ -5778,16 +5778,16 @@ StartupXLOG(void)
 		 */
 		if (ArchiveRecoveryRequested && EnableHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
+			FullTransactionId latestCompletedXid;
 
 			ereport(DEBUG1,
 					(errmsg_internal("initializing for hot standby")));
+			InHotStandby = true;
 
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
-				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanPreparedTransactions();
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -5802,39 +5802,17 @@ StartupXLOG(void)
 			 */
 			StartupSUBTRANS(oldestActiveXID);
 
-			/*
-			 * If we're beginning at a shutdown checkpoint, we know that
-			 * nothing was running on the primary at this point. So fake-up an
-			 * empty running-xacts record and use that here and now. Recover
-			 * additional standby state for prepared transactions.
-			 */
-			if (wasShutdown)
-			{
-				RunningTransactionsData running;
-				TransactionId latestCompletedXid;
+			latestCompletedXid = checkPoint.nextXid;
+			FullTransactionIdRetreat(&latestCompletedXid);
+			TransamVariables->latestCompletedXid = latestCompletedXid;
 
-				/* Update pg_subtrans entries for any prepared transactions */
-				StandbyRecoverPreparedTransactions();
+			StartupCSNLog(oldestActiveXID, RedoRecPtr);
 
-				/*
-				 * Construct a RunningTransactions snapshot representing a
-				 * shut down server, with only prepared transactions still
-				 * alive. We're never overflowed at this point because all
-				 * subxids are listed with their parent prepared transactions.
-				 */
-				running.xcnt = nxids;
-				running.subxcnt = 0;
-				running.subxid_status = SUBXIDS_IN_SUBTRANS;
-				running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-				running.oldestRunningXid = oldestActiveXID;
-				latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-				TransactionIdRetreat(latestCompletedXid);
-				Assert(TransactionIdIsNormal(latestCompletedXid));
-				running.latestCompletedXid = latestCompletedXid;
-				running.xids = xids;
-
-				ProcArrayApplyRecoveryInfo(&running);
-			}
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
+
+			/* Update pg_subtrans entries for any prepared transactions */
+			if (wasShutdown)
+				StandbyRecoverPreparedTransactions();
 		}
 
 		/*
@@ -5918,7 +5896,7 @@ StartupXLOG(void)
 	 * This information is not quite needed yet, but it is positioned here so
 	 * as potential problems are detected before any on-disk change is done.
 	 */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanPreparedTransactions();
 
 	/*
 	 * Allow ordinary WAL segment creation before possibly switching to a new
@@ -6084,9 +6062,18 @@ StartupXLOG(void)
 	 * Start up subtrans, if not already done for hot standby.  (commit
 	 * timestamps are started below, if necessary.)
 	 */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
+	{
 		StartupSUBTRANS(oldestActiveXID);
 
+		/*
+		 * TODO: we don't need to update the CSN log from now on, but it's
+		 * still required by snapshots that were taken before recovery ended.
+		 * We just let it be, but it would be nice to truncate it to 0 after
+		 * all the snapshots are gone.
+		 */
+	}
+
 	/*
 	 * Perform end of recovery actions for any SLRUs that need it.
 	 */
@@ -6178,12 +6165,12 @@ StartupXLOG(void)
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
 	 * and after switching SharedRecoveryState to RECOVERY_STATE_DONE so as
-	 * any session building a snapshot will not rely on KnownAssignedXids as
+	 * any session building a snapshot will not rely on the CSN log as
 	 * RecoveryInProgress() would return false at this stage.  This is
 	 * particularly critical for prepared 2PC transactions, that would still
 	 * need to be included in snapshots once recovery has ended.
 	 */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/*
@@ -6955,7 +6942,7 @@ CreateCheckPoint(int flags)
 	 * starting snapshot of locks and transactions.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		checkPoint.oldestActiveXid = GetOldestActiveTransactionId();
+		checkPoint.oldestActiveXid = GetOldestActiveTransactionId(true);
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -7347,7 +7334,10 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
+	{
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+	}
 
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(false);
@@ -7520,6 +7510,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
 	CheckPointCLOG();
+	CheckPointCSNLog();
 	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
@@ -7816,7 +7807,10 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
+	{
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+	}
 
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(true);
@@ -8301,41 +8295,17 @@ xlog_redo(XLogReaderState *record)
 
 		/*
 		 * If we see a shutdown checkpoint, we know that nothing was running
-		 * on the primary at this point. So fake-up an empty running-xacts
-		 * record and use that here and now. Recover additional standby state
-		 * for prepared transactions.
+		 * on the primary at this point, except for prepared transactions.
 		 */
-		if (standbyState >= STANDBY_INITIALIZED)
+		if (InHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
 			TransactionId oldestActiveXID;
-			TransactionId latestCompletedXid;
-			RunningTransactionsData running;
 
-			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanPreparedTransactions();
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
 
 			/* Update pg_subtrans entries for any prepared transactions */
 			StandbyRecoverPreparedTransactions();
-
-			/*
-			 * Construct a RunningTransactions snapshot representing a shut
-			 * down server, with only prepared transactions still alive. We're
-			 * never overflowed at this point because all subxids are listed
-			 * with their parent prepared transactions.
-			 */
-			running.xcnt = nxids;
-			running.subxcnt = 0;
-			running.subxid_status = SUBXIDS_IN_SUBTRANS;
-			running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-			running.oldestRunningXid = oldestActiveXID;
-			latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-			TransactionIdRetreat(latestCompletedXid);
-			Assert(TransactionIdIsNormal(latestCompletedXid));
-			running.latestCompletedXid = latestCompletedXid;
-			running.xids = xids;
-
-			ProcArrayApplyRecoveryInfo(&running);
 		}
 
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
@@ -8399,6 +8369,16 @@ xlog_redo(XLogReaderState *record)
 								  checkPoint.oldestXid))
 			SetTransactionIdLimit(checkPoint.oldestXid,
 								  checkPoint.oldestXidDB);
+
+		/*
+		 * Remember the oldest XID that was running at the time.  Normally,
+		 * all transaction aborts and commits are WAL-logged, so our
+		 * oldestRunningXid value should be up-to-date, but if not, this
+		 * allows us to resynchronize.
+		 */
+		if (InHotStandby)
+			ProcArrayUpdateOldestRunningXid(checkPoint.oldestActiveXid);
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 31caa49d6c3..65c15449a0a 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1986,10 +1986,9 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	SpinLockRelease(&XLogRecoveryCtl->info_lck);
 
 	/*
-	 * If we are attempting to enter Hot Standby mode, process XIDs we see
+	 * In Hot Standby mode, process XIDs we see
 	 */
-	if (standbyState >= STANDBY_INITIALIZED &&
-		TransactionIdIsValid(record->xl_xid))
+	if (InHotStandby && TransactionIdIsValid(record->xl_xid))
 		RecordKnownAssignedTransactionIds(record->xl_xid);
 
 	/*
@@ -2266,7 +2265,7 @@ CheckRecoveryConsistency(void)
 	 * run? If so, we can tell postmaster that the database is consistent now,
 	 * enabling connections.
 	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY &&
+	if (InHotStandby &&
 		!LocalHotStandbyActive &&
 		reachedConsistency &&
 		IsUnderPostmaster)
@@ -3711,9 +3710,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						elog(LOG, "waiting for WAL to become available at %X/%X",
 							 LSN_FORMAT_ARGS(RecPtr));
 
-						/* Do background tasks that might benefit us later. */
-						KnownAssignedTransactionIdsIdleMaintenance();
-
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,
@@ -3979,9 +3975,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						streaming_reply_sent = true;
 					}
 
-					/* Do any background tasks that might benefit us later. */
-					KnownAssignedTransactionIdsIdleMaintenance();
-
 					/* Update pg_stat_recovery_prefetch before sleeping. */
 					XLogPrefetcherComputeStats(xlogprefetcher);
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5295b85fe07..bf08c60e93a 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -50,7 +50,7 @@ bool		ignore_invalid_pages = false;
 bool		InRecovery = false;
 
 /* Are we in Hot Standby mode? Only valid in startup process, see xlogutils.h */
-HotStandbyState standbyState = STANDBY_DISABLED;
+bool		InHotStandby = false;
 
 /*
  * During XLOG replay, we may see XLOG records for incremental updates of
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index ef6f98ebcd7..a975865fdd9 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -203,7 +203,7 @@ static void
 StartupProcExit(int code, Datum arg)
 {
 	/* Shutdown the recovery environment */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 }
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index e73576ad12f..c4f9feed649 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -270,14 +270,6 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
-		case XLOG_XACT_ASSIGNMENT:
-
-			/*
-			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here. See
-			 * LogicalDecodingProcessRecord.
-			 */
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			{
 				TransactionId xid;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a6a4da32668..734865ce621 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -27,7 +27,7 @@
  * removed. This is achieved by using the replication slot mechanism.
  *
  * As the percentage of transactions modifying the catalog normally is fairly
- * small in comparisons to ones only manipulating user data, we keep track of
+ * small in comparison to ones only manipulating user data, we keep track of
  * the committed catalog modifying ones inside [xmin, xmax) instead of keeping
  * track of all running transactions like it's done in a normal snapshot. Note
  * that we're generally only looking at transactions that have acquired an
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index d68aa29d93e..932acf385cc 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/subtrans.h"
@@ -122,6 +123,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, XLogRecoveryShmemSize());
 	size = add_size(size, CLOGShmemSize());
+	size = add_size(size, CSNLogShmemSize());
 	size = add_size(size, CommitTsShmemSize());
 	size = add_size(size, SUBTRANSShmemSize());
 	size = add_size(size, TwoPhaseShmemSize());
@@ -287,6 +289,7 @@ CreateOrAttachShmemStructs(void)
 	XLogPrefetchShmemInit();
 	XLogRecoveryShmemInit();
 	CLOGShmemInit();
+	CSNLogShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 36610a1c7e7..c82e8d8c438 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -19,20 +19,10 @@
  * myProcLocks lists.  They can be distinguished from regular backend PGPROCs
  * at need by checking for pid == 0.
  *
- * During hot standby, we also keep a list of XIDs representing transactions
- * that are known to be running on the primary (or more precisely, were running
- * as of the current point in the WAL stream).  This list is kept in the
- * KnownAssignedXids array, and is updated by watching the sequence of
- * arriving XIDs.  This is necessary because if we leave those XIDs out of
- * snapshots taken for standby queries, then they will appear to be already
- * complete, leading to MVCC failures.  Note that in hot standby, the PGPROC
- * array represents standby processes, which by definition are not running
- * transactions that have XIDs.
- *
- * It is perhaps possible for a backend on the primary to terminate without
- * writing an abort record for its transaction.  While that shouldn't really
- * happen, it would tie up KnownAssignedXids indefinitely, so we protect
- * ourselves by pruning the array when a valid list of running XIDs arrives.
+ * During hot standby, we don't have PGPROC entries representing transactions
+ * running in the primary.  Instead, each snapshot taken during recovery
+ * contains a Commit-Sequence Number (CSN), which is used to determine which
+ * XIDs the snapshot still considers to be running.
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -47,6 +37,7 @@
 
 #include <signal.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -73,22 +64,8 @@ typedef struct ProcArrayStruct
 	int			numProcs;		/* number of valid procs entries */
 	int			maxProcs;		/* allocated size of procs array */
 
-	/*
-	 * Known assigned XIDs handling
-	 */
-	int			maxKnownAssignedXids;	/* allocated size of array */
-	int			numKnownAssignedXids;	/* current # of valid entries */
-	int			tailKnownAssignedXids;	/* index of oldest valid element */
-	int			headKnownAssignedXids;	/* index of newest element, + 1 */
-
-	/*
-	 * Highest subxid that has been removed from KnownAssignedXids array to
-	 * prevent overflow; or InvalidTransactionId if none.  We track this for
-	 * similar reasons to tracking overflowing cached subxids in PGPROC
-	 * entries.  Must hold exclusive ProcArrayLock to change this, and shared
-	 * lock to read it.
-	 */
-	TransactionId lastOverflowedXid;
+	/* In recovery, oldest XID that could be still running in primary */
+	TransactionId oldest_running_primary_xid;
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
@@ -99,6 +76,21 @@ typedef struct ProcArrayStruct
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
+#define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
+
+/*
+ * TOTAL_MAX_CACHED_SUBXIDS is the total number of XIDs that fits in the proc
+ * array, as top XIDs and in the subxids caches.
+ *
+ * Local data structures are also created in various backends during
+ * GetSnapshotData(), TransactionIdIsInProgress() and
+ * GetRunningTransactionData(). All of the main structures created in those
+ * functions must be identically sized, since we may at times copy the whole
+ * of the data structures around.
+ */
+#define TOTAL_MAX_CACHED_SUBXIDS \
+	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
+
 /*
  * State for the GlobalVisTest* family of functions. Those functions can
  * e.g. be used to decide if a deleted row can be removed without violating
@@ -254,17 +246,6 @@ typedef enum GlobalVisHorizonKind
 	VISHORIZON_TEMP,
 } GlobalVisHorizonKind;
 
-/*
- * Reason codes for KnownAssignedXidsCompress().
- */
-typedef enum KAXCompressReason
-{
-	KAX_NO_SPACE,				/* need to free up space at array end */
-	KAX_PRUNE,					/* we just pruned old entries */
-	KAX_TRANSACTION_END,		/* we just committed/removed some XIDs */
-	KAX_STARTUP_PROCESS_IDLE,	/* startup process is about to sleep */
-} KAXCompressReason;
-
 
 static ProcArrayStruct *procArray;
 
@@ -278,17 +259,8 @@ static TransactionId cachedXidIsNotInProgress = InvalidTransactionId;
 /*
  * Bookkeeping for tracking emulated transactions in recovery
  */
-static TransactionId *KnownAssignedXids;
-static bool *KnownAssignedXidsValid;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
-/*
- * If we're in STANDBY_SNAPSHOT_PENDING state, standbySnapshotPendingXmin is
- * the highest xid that might still be running that we don't have in
- * KnownAssignedXids.
- */
-static TransactionId standbySnapshotPendingXmin;
-
 /*
  * State for visibility checks on different types of relations. See struct
  * GlobalVisState for details. As shared, catalog, normal and temporary
@@ -315,7 +287,7 @@ static long xc_by_my_xact = 0;
 static long xc_by_latest_xid = 0;
 static long xc_by_main_xid = 0;
 static long xc_by_child_xid = 0;
-static long xc_by_known_assigned = 0;
+static long xc_during_recovery = 0;
 static long xc_no_overflow = 0;
 static long xc_slow_answer = 0;
 
@@ -325,7 +297,7 @@ static long xc_slow_answer = 0;
 #define xc_by_latest_xid_inc()		(xc_by_latest_xid++)
 #define xc_by_main_xid_inc()		(xc_by_main_xid++)
 #define xc_by_child_xid_inc()		(xc_by_child_xid++)
-#define xc_by_known_assigned_inc()	(xc_by_known_assigned++)
+#define xc_during_recovery_inc()	(xc_during_recovery++)
 #define xc_no_overflow_inc()		(xc_no_overflow++)
 #define xc_slow_answer_inc()		(xc_slow_answer++)
 
@@ -338,28 +310,12 @@ static void DisplayXidCache(void);
 #define xc_by_latest_xid_inc()		((void) 0)
 #define xc_by_main_xid_inc()		((void) 0)
 #define xc_by_child_xid_inc()		((void) 0)
-#define xc_by_known_assigned_inc()	((void) 0)
+#define xc_during_recovery_inc()	((void) 0)
 #define xc_no_overflow_inc()		((void) 0)
 #define xc_slow_answer_inc()		((void) 0)
 #endif							/* XIDCACHE_DEBUG */
 
-/* Primitives for KnownAssignedXids array handling for standby */
-static void KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock);
-static void KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-								 bool exclusive_lock);
-static bool KnownAssignedXidsSearch(TransactionId xid, bool remove);
-static bool KnownAssignedXidExists(TransactionId xid);
-static void KnownAssignedXidsRemove(TransactionId xid);
-static void KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-										TransactionId *subxids);
-static void KnownAssignedXidsRemovePreceding(TransactionId removeXid);
-static int	KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax);
-static int	KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
-										   TransactionId *xmin,
-										   TransactionId xmax);
-static TransactionId KnownAssignedXidsGetOldestXmin(void);
-static void KnownAssignedXidsDisplay(int trace_level);
-static void KnownAssignedXidsReset(void);
+
 static inline void ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 static void MaintainLatestCompletedXid(TransactionId latestXid);
@@ -383,31 +339,6 @@ ProcArrayShmemSize(void)
 	size = offsetof(ProcArrayStruct, pgprocnos);
 	size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS));
 
-	/*
-	 * During Hot Standby processing we have a data structure called
-	 * KnownAssignedXids, created in shared memory. Local data structures are
-	 * also created in various backends during GetSnapshotData(),
-	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
-	 * main structures created in those functions must be identically sized,
-	 * since we may at times copy the whole of the data structures around. We
-	 * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
-	 *
-	 * Ideally we'd only create this structure if we were actually doing hot
-	 * standby in the current run, but we don't know that yet at the time
-	 * shared memory is being set up.
-	 */
-#define TOTAL_MAX_CACHED_SUBXIDS \
-	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
-
-	if (EnableHotStandby)
-	{
-		size = add_size(size,
-						mul_size(sizeof(TransactionId),
-								 TOTAL_MAX_CACHED_SUBXIDS));
-		size = add_size(size,
-						mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS));
-	}
-
 	return size;
 }
 
@@ -434,31 +365,12 @@ ProcArrayShmemInit(void)
 		 */
 		procArray->numProcs = 0;
 		procArray->maxProcs = PROCARRAY_MAXPROCS;
-		procArray->maxKnownAssignedXids = TOTAL_MAX_CACHED_SUBXIDS;
-		procArray->numKnownAssignedXids = 0;
-		procArray->tailKnownAssignedXids = 0;
-		procArray->headKnownAssignedXids = 0;
-		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
 		TransamVariables->xactCompletionCount = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
-
-	/* Create or attach to the KnownAssignedXids arrays too, if needed */
-	if (EnableHotStandby)
-	{
-		KnownAssignedXids = (TransactionId *)
-			ShmemInitStruct("KnownAssignedXids",
-							mul_size(sizeof(TransactionId),
-									 TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-		KnownAssignedXidsValid = (bool *)
-			ShmemInitStruct("KnownAssignedXidsValid",
-							mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-	}
 }
 
 /*
@@ -1022,355 +934,35 @@ MaintainLatestCompletedXidRecovery(TransactionId latestXid)
 void
 ProcArrayInitRecovery(TransactionId initializedUptoXID)
 {
-	Assert(standbyState == STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsNormal(initializedUptoXID));
 
 	/*
-	 * we set latestObservedXid to the xid SUBTRANS has been initialized up
-	 * to, so we can extend it from that point onwards in
-	 * RecordKnownAssignedTransactionIds, and when we get consistent in
-	 * ProcArrayApplyRecoveryInfo().
+	 * we set latestObservedXid to the xid SUBTRANS and CSN log have been
+	 * initialized up to, so we can extend it from that point onwards whenever
+	 * we observe new XIDs.
 	 */
 	latestObservedXid = initializedUptoXID;
 	TransactionIdRetreat(latestObservedXid);
 }
 
 /*
- * ProcArrayApplyRecoveryInfo -- apply recovery info about xids
- *
- * Takes us through 3 states: Initialized, Pending and Ready.
- * Normal case is to go all the way to Ready straight away, though there
- * are atypical cases where we need to take it in steps.
- *
- * Use the data about running transactions on the primary to create the initial
- * state of KnownAssignedXids. We also use these records to regularly prune
- * KnownAssignedXids because we know it is possible that some transactions
- * with FATAL errors fail to write abort records, which could cause eventual
- * overflow.
- *
- * See comments for LogStandbySnapshot().
+ * Update oldest running XID from a checkpoint record. This allows truncating
+ * SUBTRANS and the CSN log.
  */
 void
-ProcArrayApplyRecoveryInfo(RunningTransactions running)
+ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID)
 {
-	TransactionId *xids;
-	TransactionId advanceNextXid;
-	int			nxids;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-	Assert(TransactionIdIsValid(running->nextXid));
-	Assert(TransactionIdIsValid(running->oldestRunningXid));
-	Assert(TransactionIdIsNormal(running->latestCompletedXid));
-
-	/*
-	 * Remove stale transactions, if any.
-	 */
-	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
-
-	/*
-	 * Adjust TransamVariables->nextXid before StandbyReleaseOldLocks(),
-	 * because we will need it up to date for accessing two-phase transactions
-	 * in StandbyReleaseOldLocks().
-	 */
-	advanceNextXid = running->nextXid;
-	TransactionIdRetreat(advanceNextXid);
-	AdvanceNextFullTransactionIdPastXid(advanceNextXid);
-	Assert(FullTransactionIdIsValid(TransamVariables->nextXid));
-
 	/*
 	 * Remove stale locks, if any.
 	 */
-	StandbyReleaseOldLocks(running->oldestRunningXid);
-
-	/*
-	 * If our snapshot is already valid, nothing else to do...
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		return;
-
-	/*
-	 * If our initial RunningTransactionsData had an overflowed snapshot then
-	 * we knew we were missing some subxids from our snapshot. If we continue
-	 * to see overflowed snapshots then we might never be able to start up, so
-	 * we make another test to see if our snapshot is now valid. We know that
-	 * the missing subxids are equal to or earlier than nextXid. After we
-	 * initialise we continue to apply changes during recovery, so once the
-	 * oldestRunningXid is later than the nextXid from the initial snapshot we
-	 * know that we no longer have missing information and can mark the
-	 * snapshot as valid.
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_PENDING)
-	{
-		/*
-		 * If the snapshot isn't overflowed or if its empty we can reset our
-		 * pending state and use this snapshot instead.
-		 */
-		if (running->subxid_status != SUBXIDS_MISSING || running->xcnt == 0)
-		{
-			/*
-			 * If we have already collected known assigned xids, we need to
-			 * throw them away before we apply the recovery snapshot.
-			 */
-			KnownAssignedXidsReset();
-			standbyState = STANDBY_INITIALIZED;
-		}
-		else
-		{
-			if (TransactionIdPrecedes(standbySnapshotPendingXmin,
-									  running->oldestRunningXid))
-			{
-				standbyState = STANDBY_SNAPSHOT_READY;
-				elog(DEBUG1,
-					 "recovery snapshots are now enabled");
-			}
-			else
-				elog(DEBUG1,
-					 "recovery snapshot waiting for non-overflowed snapshot or "
-					 "until oldest active xid on standby is at least %u (now %u)",
-					 standbySnapshotPendingXmin,
-					 running->oldestRunningXid);
-			return;
-		}
-	}
-
-	Assert(standbyState == STANDBY_INITIALIZED);
-
-	/*
-	 * NB: this can be reached at least twice, so make sure new code can deal
-	 * with that.
-	 */
+	StandbyReleaseOldLocks(oldestRunningXID);
 
-	/*
-	 * Nobody else is running yet, but take locks anyhow
-	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * KnownAssignedXids is sorted so we cannot just add the xids, we have to
-	 * sort them first.
-	 *
-	 * Some of the new xids are top-level xids and some are subtransactions.
-	 * We don't call SubTransSetParent because it doesn't matter yet. If we
-	 * aren't overflowed then all xids will fit in snapshot and so we don't
-	 * need subtrans. If we later overflow, an xid assignment record will add
-	 * xids to subtrans. If RunningTransactionsData is overflowed then we
-	 * don't have enough information to correctly update subtrans anyway.
-	 */
-
-	/*
-	 * Allocate a temporary array to avoid modifying the array passed as
-	 * argument.
-	 */
-	xids = palloc(sizeof(TransactionId) * (running->xcnt + running->subxcnt));
-
-	/*
-	 * Add to the temp array any xids which have not already completed.
-	 */
-	nxids = 0;
-	for (i = 0; i < running->xcnt + running->subxcnt; i++)
-	{
-		TransactionId xid = running->xids[i];
-
-		/*
-		 * The running-xacts snapshot can contain xids that were still visible
-		 * in the procarray when the snapshot was taken, but were already
-		 * WAL-logged as completed. They're not running anymore, so ignore
-		 * them.
-		 */
-		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-			continue;
-
-		xids[nxids++] = xid;
-	}
-
-	if (nxids > 0)
-	{
-		if (procArray->numKnownAssignedXids != 0)
-		{
-			LWLockRelease(ProcArrayLock);
-			elog(ERROR, "KnownAssignedXids is not empty");
-		}
-
-		/*
-		 * Sort the array so that we can add them safely into
-		 * KnownAssignedXids.
-		 *
-		 * We have to sort them logically, because in KnownAssignedXidsAdd we
-		 * call TransactionIdFollowsOrEquals and so on. But we know these XIDs
-		 * come from RUNNING_XACTS, which means there are only normal XIDs
-		 * from the same epoch, so this is safe.
-		 */
-		qsort(xids, nxids, sizeof(TransactionId), xidLogicalComparator);
-
-		/*
-		 * Add the sorted snapshot into KnownAssignedXids.  The running-xacts
-		 * snapshot may include duplicated xids because of prepared
-		 * transactions, so ignore them.
-		 */
-		for (i = 0; i < nxids; i++)
-		{
-			if (i > 0 && TransactionIdEquals(xids[i - 1], xids[i]))
-			{
-				elog(DEBUG1,
-					 "found duplicated transaction %u for KnownAssignedXids insertion",
-					 xids[i]);
-				continue;
-			}
-			KnownAssignedXidsAdd(xids[i], xids[i], true);
-		}
-
-		KnownAssignedXidsDisplay(DEBUG3);
-	}
-
-	pfree(xids);
-
-	/*
-	 * latestObservedXid is at least set to the point where SUBTRANS was
-	 * started up to (cf. ProcArrayInitRecovery()) or to the biggest xid
-	 * RecordKnownAssignedTransactionIds() was called for.  Initialize
-	 * subtrans from thereon, up to nextXid - 1.
-	 *
-	 * We need to duplicate parts of RecordKnownAssignedTransactionId() here,
-	 * because we've just added xids to the known assigned xids machinery that
-	 * haven't gone through RecordKnownAssignedTransactionId().
-	 */
-	Assert(TransactionIdIsNormal(latestObservedXid));
-	TransactionIdAdvance(latestObservedXid);
-	while (TransactionIdPrecedes(latestObservedXid, running->nextXid))
-	{
-		ExtendSUBTRANS(latestObservedXid);
-		TransactionIdAdvance(latestObservedXid);
-	}
-	TransactionIdRetreat(latestObservedXid);	/* = running->nextXid - 1 */
-
-	/* ----------
-	 * Now we've got the running xids we need to set the global values that
-	 * are used to track snapshots as they evolve further.
-	 *
-	 * - latestCompletedXid which will be the xmax for snapshots
-	 * - lastOverflowedXid which shows whether snapshots overflow
-	 * - nextXid
-	 *
-	 * If the snapshot overflowed, then we still initialise with what we know,
-	 * but the recovery snapshot isn't fully valid yet because we know there
-	 * are some subxids missing. We don't know the specific subxids that are
-	 * missing, so conservatively assume the last one is latestObservedXid.
-	 * ----------
-	 */
-	if (running->subxid_status == SUBXIDS_MISSING)
-	{
-		standbyState = STANDBY_SNAPSHOT_PENDING;
-
-		standbySnapshotPendingXmin = latestObservedXid;
-		procArray->lastOverflowedXid = latestObservedXid;
-	}
-	else
-	{
-		standbyState = STANDBY_SNAPSHOT_READY;
-
-		standbySnapshotPendingXmin = InvalidTransactionId;
-
-		/*
-		 * If the 'xids' array didn't include all subtransactions, we have to
-		 * mark any snapshots taken as overflowed.
-		 */
-		if (running->subxid_status == SUBXIDS_IN_SUBTRANS)
-			procArray->lastOverflowedXid = latestObservedXid;
-		else
-		{
-			Assert(running->subxid_status == SUBXIDS_IN_ARRAY);
-			procArray->lastOverflowedXid = InvalidTransactionId;
-		}
-	}
-
-	/*
-	 * If a transaction wrote a commit record in the gap between taking and
-	 * logging the snapshot then latestCompletedXid may already be higher than
-	 * the value from the snapshot, so check before we use the incoming value.
-	 * It also might not yet be set at all.
-	 */
-	MaintainLatestCompletedXidRecovery(running->latestCompletedXid);
-
-	/*
-	 * NB: No need to increment TransamVariables->xactCompletionCount here,
-	 * nobody can see it yet.
-	 */
-
+	procArray->oldest_running_primary_xid = oldestRunningXID;
 	LWLockRelease(ProcArrayLock);
-
-	KnownAssignedXidsDisplay(DEBUG3);
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		elog(DEBUG1, "recovery snapshots are now enabled");
-	else
-		elog(DEBUG1,
-			 "recovery snapshot waiting for non-overflowed snapshot or "
-			 "until oldest active xid on standby is at least %u (now %u)",
-			 standbySnapshotPendingXmin,
-			 running->oldestRunningXid);
 }
 
-/*
- * ProcArrayApplyXidAssignment
- *		Process an XLOG_XACT_ASSIGNMENT WAL record
- */
-void
-ProcArrayApplyXidAssignment(TransactionId topxid,
-							int nsubxids, TransactionId *subxids)
-{
-	TransactionId max_xid;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-
-	max_xid = TransactionIdLatest(topxid, nsubxids, subxids);
-
-	/*
-	 * Mark all the subtransactions as observed.
-	 *
-	 * NOTE: This will fail if the subxid contains too many previously
-	 * unobserved xids to fit into known-assigned-xids. That shouldn't happen
-	 * as the code stands, because xid-assignment records should never contain
-	 * more than PGPROC_MAX_CACHED_SUBXIDS entries.
-	 */
-	RecordKnownAssignedTransactionIds(max_xid);
-
-	/*
-	 * Notice that we update pg_subtrans with the top-level xid, rather than
-	 * the parent xid. This is a difference between normal processing and
-	 * recovery, yet is still correct in all cases. The reason is that
-	 * subtransaction commit is not marked in clog until commit processing, so
-	 * all aborted subtransactions have already been clearly marked in clog.
-	 * As a result we are able to refer directly to the top-level
-	 * transaction's state rather than skipping through all the intermediate
-	 * states in the subtransaction tree. This should be the first time we
-	 * have attempted to SubTransSetParent().
-	 */
-	for (i = 0; i < nsubxids; i++)
-		SubTransSetParent(subxids[i], topxid);
-
-	/* KnownAssignedXids isn't maintained yet, so we're done for now */
-	if (standbyState == STANDBY_INITIALIZED)
-		return;
-
-	/*
-	 * Uses same locking as transaction commit
-	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * Remove subxids from known-assigned-xacts.
-	 */
-	KnownAssignedXidsRemoveTree(InvalidTransactionId, nsubxids, subxids);
-
-	/*
-	 * Advance lastOverflowedXid to be at least the last of these subxids.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, max_xid))
-		procArray->lastOverflowedXid = max_xid;
-
-	LWLockRelease(ProcArrayLock);
-}
 
 /*
  * TransactionIdIsInProgress -- is given transaction running in some backend
@@ -1378,23 +970,24 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * Aside from some shortcuts such as checking RecentXmin and our own Xid,
  * there are four possibilities for finding a running transaction:
  *
- * 1. The given Xid is a main transaction Id.  We will find this out cheaply
+ * 1. In Hot Standby mode, there are no transactions with XIDs active in the
+ * standby. Check pg_xact to see if the transaction is known to have committed
+ * or aborted; otherwise it is considered to be running.
+ *
+ * 2. The given Xid is a main transaction Id.  We will find this out cheaply
  * by looking at ProcGlobal->xids.
  *
- * 2. The given Xid is one of the cached subxact Xids in the PGPROC array.
+ * 3. The given Xid is one of the cached subxact Xids in the PGPROC array.
  * We can find this out cheaply too.
  *
- * 3. In Hot Standby mode, we must search the KnownAssignedXids list to see
- * if the Xid is running on the primary.
- *
  * 4. Search the SubTrans tree to find the Xid's topmost parent, and then see
- * if that is running according to ProcGlobal->xids[] or KnownAssignedXids.
+ * if that is running according to ProcGlobal->xids[].
  * This is the slowest way, but sadly it has to be done always if the others
  * failed, unless we see that the cached subxact sets are complete (none have
  * overflowed).
  *
- * ProcArrayLock has to be held while we do 1, 2, 3.  If we save the top Xids
- * while doing 1 and 3, we can release the ProcArrayLock while we do 4.
+ * ProcArrayLock has to be held while we do 2 and 3.  If we save the top Xids
+ * while doing 2 and 3, we can release the ProcArrayLock while we do 4.
  * This buys back some concurrency (and we can't retrieve the main Xids from
  * ProcGlobal->xids[] again anyway; see GetNewTransactionId).
  */
@@ -1435,6 +1028,28 @@ TransactionIdIsInProgress(TransactionId xid)
 		return false;
 	}
 
+	/*
+	 * In hot standby mode, check pg_xact.
+	 *
+	 * With normal non-CSN snapshots, you must be careful to check
+	 * TransactionIdIsInProgress() before checking pg_xact, because a
+	 * transaction is marked as committed before it's removed from PGPROC. But
+	 * during recovery, we now use CSN snapshots so I think that's OK. See the
+	 * "NOTE" at the top of heapam_visibility.c.
+	 *
+	 * During recovery, the XID cannot be our own transaction, and the CSN
+	 * check handles subtransactions too, so we can skip the rest of the
+	 * function.
+	 */
+	if (RecoveryInProgress())
+	{
+		xc_during_recovery_inc();
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+			return false;
+		else
+			return true;
+	}
+
 	/*
 	 * Also, we can handle our own transaction (and subtransactions) without
 	 * any access to shared memory.
@@ -1451,12 +1066,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	 */
 	if (xids == NULL)
 	{
-		/*
-		 * In hot standby mode, reserve enough space to hold all xids in the
-		 * known-assigned list. If we later finish recovery, we no longer need
-		 * the bigger array, but we don't bother to shrink it.
-		 */
-		int			maxxids = RecoveryInProgress() ? TOTAL_MAX_CACHED_SUBXIDS : arrayP->maxProcs;
+		int			maxxids = arrayP->maxProcs;
 
 		xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId));
 		if (xids == NULL)
@@ -1551,33 +1161,6 @@ TransactionIdIsInProgress(TransactionId xid)
 			xids[nxids++] = pxid;
 	}
 
-	/*
-	 * Step 3: in hot standby mode, check the known-assigned-xids list.  XIDs
-	 * in the list must be treated as running.
-	 */
-	if (RecoveryInProgress())
-	{
-		/* none of the PGPROC entries should have XIDs in hot standby mode */
-		Assert(nxids == 0);
-
-		if (KnownAssignedXidExists(xid))
-		{
-			LWLockRelease(ProcArrayLock);
-			xc_by_known_assigned_inc();
-			return true;
-		}
-
-		/*
-		 * If the KnownAssignedXids overflowed, we have to check pg_subtrans
-		 * too.  Fetch all xids from KnownAssignedXids that are lower than
-		 * xid, since if xid is a subtransaction its parent will always have a
-		 * lower value.  Note we will collect both main and subXIDs here, but
-		 * there's no help for it.
-		 */
-		if (TransactionIdPrecedesOrEquals(xid, procArray->lastOverflowedXid))
-			nxids = KnownAssignedXidsGet(xids, xid);
-	}
-
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -1851,8 +1434,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 * can't be tied to a specific database.)
 		 *
 		 * Also, while in recovery we cannot compute an accurate per-database
-		 * horizon, as all xids are managed via the KnownAssignedXids
-		 * machinery.
+		 * horizon, as all xids are managed via the CSN log machinery.
 		 */
 		if (proc->databaseId == MyDatabaseId ||
 			MyDatabaseId == InvalidOid ||
@@ -1865,11 +1447,14 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	}
 
 	/*
-	 * If in recovery fetch oldest xid in KnownAssignedXids, will be applied
-	 * after lock is released.
+	 * If in recovery, fetch the oldest xid from the last checkpoint.
+	 *
+	 * XXX: that can be much older than what we had previously with the
+	 * known-assigned-xids machinery. I think that's OK, given what this
+	 * function is used for during recovery?
 	 */
 	if (in_recovery)
-		kaxmin = KnownAssignedXidsGetOldestXmin();
+		kaxmin = procArray->oldest_running_primary_xid;
 
 	/*
 	 * No other information from shared state is needed, release the lock
@@ -2188,7 +1773,7 @@ GetSnapshotData(Snapshot snapshot)
 	int			mypgxactoff;
 	TransactionId myxid;
 	uint64		curXactCompletionCount;
-
+	XLogRecPtr	csn = InvalidXLogRecPtr;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -2368,27 +1953,8 @@ GetSnapshotData(Snapshot snapshot)
 	else
 	{
 		/*
-		 * We're in hot standby, so get XIDs from KnownAssignedXids.
-		 *
-		 * We store all xids directly into subxip[]. Here's why:
-		 *
-		 * In recovery we don't know which xids are top-level and which are
-		 * subxacts, a design choice that greatly simplifies xid processing.
-		 *
-		 * It seems like we would want to try to put xids into xip[] only, but
-		 * that is fairly small. We would either need to make that bigger or
-		 * to increase the rate at which we WAL-log xid assignment; neither is
-		 * an appealing choice.
-		 *
-		 * We could try to store xids into xip[] first and then into subxip[]
-		 * if there are too many xids. That only works if the snapshot doesn't
-		 * overflow because we do not search subxip[] in that case. A simpler
-		 * way is to just store all xids in the subxip array because this is
-		 * by far the bigger array. We just leave the xip array empty.
-		 *
-		 * Either way we need to change the way XidInMVCCSnapshot() works
-		 * depending upon when the snapshot was taken, or change normal
-		 * snapshot processing so it matches.
+		 * We're in hot standby, so get the current CSN. That's used to
+		 * determine which transactions committed before this snapshot.
 		 *
 		 * Note: It is possible for recovery to end before we finish taking
 		 * the snapshot, and for newly assigned transaction ids to be added to
@@ -2396,14 +1962,17 @@ GetSnapshotData(Snapshot snapshot)
 		 * those newly added transaction ids would be filtered away, so we
 		 * need not be concerned about them.
 		 */
-		subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
-												  xmax);
+		xmin = procArray->oldest_running_primary_xid;
 
-		if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
-			suboverflowed = true;
+		/*
+		 * Take CSN under ProcArrayLock so the snapshot stays synchronized.
+		 * (XXX: not sure that's strictly required.)
+		 * This is what determines which transactions we consider finished and
+		 * which are still in progress.
+		 */
+		csn = TransamVariables->latestCommitLSN;
 	}
 
-
 	/*
 	 * Fetch into local variable while ProcArrayLock is held - the
 	 * LWLockRelease below is a barrier, ensuring this happens inside the
@@ -2519,6 +2088,8 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->lsn = InvalidXLogRecPtr;
 	snapshot->whenTaken = 0;
 
+	snapshot->snapshotCsn = csn;
+
 	return snapshot;
 }
 
@@ -2674,9 +2245,6 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * The returned data structure is statically allocated; caller should not
  * modify it, and must not assume it is valid past the next call.
  *
- * This is never executed during recovery so there is no need to look at
- * KnownAssignedXids.
- *
  * Dummy PGPROCs from prepared transaction are included, meaning that this
  * may return entries with duplicated TransactionId values coming from
  * transaction finishing to prepare.  Nothing is done about duplicated
@@ -2707,6 +2275,7 @@ GetRunningTransactionData(void)
 	int			subcount;
 	bool		suboverflowed;
 
+	/* This is never executed during recovery */
 	Assert(!RecoveryInProgress());
 
 	/*
@@ -2873,15 +2442,16 @@ GetRunningTransactionData(void)
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
- * This is never executed during recovery so there is no need to look at
- * KnownAssignedXids.
+ * If allDbs is false, skip processes attached to other databases.
+ *
+ * This is never executed during recovery.
  *
  * We don't worry about updating other counters, we want to keep this as
  * simple as possible and leave GetSnapshotData() as the primary code for
  * that bookkeeping.
  */
 TransactionId
-GetOldestActiveTransactionId(void)
+GetOldestActiveTransactionId(bool allDbs)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId *other_xids = ProcGlobal->xids;
@@ -2902,11 +2472,13 @@ GetOldestActiveTransactionId(void)
 	LWLockRelease(XidGenLock);
 
 	/*
-	 * Spin over procArray collecting all xids and subxids.
+	 * Spin over procArray checking each xid.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		PGPROC	   *proc = &allProcs[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -2915,6 +2487,9 @@ GetOldestActiveTransactionId(void)
 		if (!TransactionIdIsNormal(xid))
 			continue;
 
+		if (!allDbs && proc->databaseId != MyDatabaseId)
+			continue;
+
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
@@ -2993,8 +2568,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 *
 	 * In recovery we can't lower the safe value besides what we've computed
 	 * above, so we'll have to wait a bit longer there. We unfortunately can
-	 * *not* use KnownAssignedXidsGetOldestXmin() since the KnownAssignedXids
-	 * machinery can miss values and return an older value than is safe.
+	 * *not* use oldest_running_primary_xid since the XID tracking machinery
+	 * can miss values and return an older value than is safe.
 	 */
 	if (!recovery_in_progress)
 	{
@@ -3412,6 +2987,9 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
  * but that would not be true in the case of FATAL errors lagging in array,
  * but we already know those are bogus anyway, so we skip that test.
  *
+ * XXX: KnownAssignedXids is gone so the above comment needs updating. Is
+ * the code still correct? I think so but need to double-check.
+ *
  * If dbOid is valid we skip backends attached to other databases.
  *
  * Be careful to *not* pfree the result from this function. We reuse
@@ -4083,14 +3661,14 @@ static void
 DisplayXidCache(void)
 {
 	fprintf(stderr,
-			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, knownassigned: %ld, nooflo: %ld, slow: %ld\n",
+			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, during_recovery: %ld, nooflo: %ld, slow: %ld\n",
 			xc_by_recent_xmin,
 			xc_by_known_xact,
 			xc_by_my_xact,
 			xc_by_latest_xid,
 			xc_by_main_xid,
 			xc_by_child_xid,
-			xc_by_known_assigned,
+			xc_during_recovery,
 			xc_no_overflow,
 			xc_slow_answer);
 }
@@ -4337,61 +3915,6 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 }
 
 
-/* ----------------------------------------------
- *		KnownAssignedTransactionIds sub-module
- * ----------------------------------------------
- */
-
-/*
- * In Hot Standby mode, we maintain a list of transactions that are (or were)
- * running on the primary at the current point in WAL.  These XIDs must be
- * treated as running by standby transactions, even though they are not in
- * the standby server's PGPROC array.
- *
- * We record all XIDs that we know have been assigned.  That includes all the
- * XIDs seen in WAL records, plus all unobserved XIDs that we can deduce have
- * been assigned.  We can deduce the existence of unobserved XIDs because we
- * know XIDs are assigned in sequence, with no gaps.  The KnownAssignedXids
- * list expands as new XIDs are observed or inferred, and contracts when
- * transaction completion records arrive.
- *
- * During hot standby we do not fret too much about the distinction between
- * top-level XIDs and subtransaction XIDs. We store both together in the
- * KnownAssignedXids list.  In backends, this is copied into snapshots in
- * GetSnapshotData(), taking advantage of the fact that XidInMVCCSnapshot()
- * doesn't care about the distinction either.  Subtransaction XIDs are
- * effectively treated as top-level XIDs and in the typical case pg_subtrans
- * links are *not* maintained (which does not affect visibility).
- *
- * We have room in KnownAssignedXids and in snapshots to hold maxProcs *
- * (1 + PGPROC_MAX_CACHED_SUBXIDS) XIDs, so every primary transaction must
- * report its subtransaction XIDs in a WAL XLOG_XACT_ASSIGNMENT record at
- * least every PGPROC_MAX_CACHED_SUBXIDS.  When we receive one of these
- * records, we mark the subXIDs as children of the top XID in pg_subtrans,
- * and then remove them from KnownAssignedXids.  This prevents overflow of
- * KnownAssignedXids and snapshots, at the cost that status checks for these
- * subXIDs will take a slower path through TransactionIdIsInProgress().
- * This means that KnownAssignedXids is not necessarily complete for subXIDs,
- * though it should be complete for top-level XIDs; this is the same situation
- * that holds with respect to the PGPROC entries in normal running.
- *
- * When we throw away subXIDs from KnownAssignedXids, we need to keep track of
- * that, similarly to tracking overflow of a PGPROC's subxids array.  We do
- * that by remembering the lastOverflowedXid, ie the last thrown-away subXID.
- * As long as that is within the range of interesting XIDs, we have to assume
- * that subXIDs are missing from snapshots.  (Note that subXID overflow occurs
- * on primary when 65th subXID arrives, whereas on standby it occurs when 64th
- * subXID arrives - that is not an error.)
- *
- * Should a backend on primary somehow disappear before it can write an abort
- * record, then we just leave those XIDs in KnownAssignedXids. They actually
- * aborted but we think they were running; the distinction is irrelevant
- * because either way any changes done by the transaction are not visible to
- * backends in the standby.  We prune KnownAssignedXids when
- * XLOG_RUNNING_XACTS arrives, to forestall possible overflow of the
- * array due to such dead XIDs.
- */
-
 /*
  * RecordKnownAssignedTransactionIds
  *		Record the given XID in KnownAssignedXids, as well as any preceding
@@ -4406,7 +3929,7 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 void
 RecordKnownAssignedTransactionIds(TransactionId xid)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsValid(xid));
 	Assert(TransactionIdIsValid(latestObservedXid));
 
@@ -4424,38 +3947,19 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 
 		/*
 		 * Extend subtrans like we do in GetNewTransactionId() during normal
-		 * operation using individual extend steps. Note that we do not need
-		 * to extend clog since its extensions are WAL logged.
-		 *
-		 * This part has to be done regardless of standbyState since we
-		 * immediately start assigning subtransactions to their toplevel
-		 * transactions.
+		 * operation using individual extend steps, and extend the CSN log
+		 * likewise. Note that we do not need to extend clog since its
+		 * extensions are WAL logged.
 		 */
 		next_expected_xid = latestObservedXid;
 		while (TransactionIdPrecedes(next_expected_xid, xid))
 		{
 			TransactionIdAdvance(next_expected_xid);
 			ExtendSUBTRANS(next_expected_xid);
+			ExtendCSNLog(next_expected_xid);
 		}
 		Assert(next_expected_xid == xid);
 
-		/*
-		 * If the KnownAssignedXids machinery isn't up yet, there's nothing
-		 * more to do since we don't track assigned xids yet.
-		 */
-		if (standbyState <= STANDBY_INITIALIZED)
-		{
-			latestObservedXid = xid;
-			return;
-		}
-
-		/*
-		 * Add (latestObservedXid, xid] onto the KnownAssignedXids array.
-		 */
-		next_expected_xid = latestObservedXid;
-		TransactionIdAdvance(next_expected_xid);
-		KnownAssignedXidsAdd(next_expected_xid, xid, false);
-
 		/*
 		 * Now we can advance latestObservedXid
 		 */
@@ -4467,781 +3971,61 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 }
 
 /*
- * ExpireTreeKnownAssignedTransactionIds
- *		Remove the given XIDs from KnownAssignedXids.
+ * ProcArrayRecoveryEndTransaction
+ *
+ * Called during recovery in analogy with and in place of
+ * ProcArrayEndTransaction(). The transaction becomes visible to any new
+ * snapshots taken after this. 'max_xid' is the highest (sub)XID of the
+ * committed transaction, and 'lsn' is the LSN of the commit record.
  *
- * Called during recovery in analogy with and in place of ProcArrayEndTransaction()
+ * The transaction and all its subtransactions have already been marked as
+ * committed in the CLOG and in the CSNLOG.
  */
 void
-ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
-									  TransactionId *subxids, TransactionId max_xid)
+ProcArrayRecoveryEndTransaction(TransactionId max_xid, XLogRecPtr lsn)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	TransactionId oldest_running_primary_xid;
+
+	Assert(InHotStandby);
+
+	/*
+	 * If this was the oldest XID that was still running, advance it.
+	 * This is important for advancing the global xmin, which avoids
+	 * unnecessary recovery conflicts.
+	 *
+	 * No locking required because this runs in the startup process.
+	 *
+	 * XXX: the caller actually has a list of XIDs that just committed. We
+	 * could save some clog lookups by taking advantage of that list.
+	 */
+	oldest_running_primary_xid = procArray->oldest_running_primary_xid;
+	while (TransactionIdPrecedes(oldest_running_primary_xid, max_xid))
+	{
+		if (!TransactionIdDidCommit(oldest_running_primary_xid) &&
+			!TransactionIdDidAbort(oldest_running_primary_xid))
+		{
+			break;
+		}
+		TransactionIdAdvance(oldest_running_primary_xid);
+	}
+	if (max_xid == oldest_running_primary_xid)
+		TransactionIdAdvance(oldest_running_primary_xid);
 
 	/*
 	 * Uses same locking as transaction commit
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
-	KnownAssignedXidsRemoveTree(xid, nsubxids, subxids);
-
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
 	/* ... and xactCompletionCount */
 	TransamVariables->xactCompletionCount++;
 
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireAllKnownAssignedTransactionIds
- *		Remove all entries in KnownAssignedXids and reset lastOverflowedXid.
- */
-void
-ExpireAllKnownAssignedTransactionIds(void)
-{
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	KnownAssignedXidsRemovePreceding(InvalidTransactionId);
-
-	/*
-	 * Reset lastOverflowedXid.  Currently, lastOverflowedXid has no use after
-	 * the call of this function.  But do this for unification with what
-	 * ExpireOldKnownAssignedTransactionIds() do.
-	 */
-	procArray->lastOverflowedXid = InvalidTransactionId;
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireOldKnownAssignedTransactionIds
- *		Remove KnownAssignedXids entries preceding the given XID and
- *		potentially reset lastOverflowedXid.
- */
-void
-ExpireOldKnownAssignedTransactionIds(TransactionId xid)
-{
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * Reset lastOverflowedXid if we know all transactions that have been
-	 * possibly running are being gone.  Not doing so could cause an incorrect
-	 * lastOverflowedXid value, which makes extra snapshots be marked as
-	 * suboverflowed.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, xid))
-		procArray->lastOverflowedXid = InvalidTransactionId;
-	KnownAssignedXidsRemovePreceding(xid);
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * KnownAssignedTransactionIdsIdleMaintenance
- *		Opportunistically do maintenance work when the startup process
- *		is about to go idle.
- */
-void
-KnownAssignedTransactionIdsIdleMaintenance(void)
-{
-	KnownAssignedXidsCompress(KAX_STARTUP_PROCESS_IDLE, false);
-}
-
-
-/*
- * Private module functions to manipulate KnownAssignedXids
- *
- * There are 5 main uses of the KnownAssignedXids data structure:
- *
- *	* backends taking snapshots - all valid XIDs need to be copied out
- *	* backends seeking to determine presence of a specific XID
- *	* startup process adding new known-assigned XIDs
- *	* startup process removing specific XIDs as transactions end
- *	* startup process pruning array when special WAL records arrive
- *
- * This data structure is known to be a hot spot during Hot Standby, so we
- * go to some lengths to make these operations as efficient and as concurrent
- * as possible.
- *
- * The XIDs are stored in an array in sorted order --- TransactionIdPrecedes
- * order, to be exact --- to allow binary search for specific XIDs.  Note:
- * in general TransactionIdPrecedes would not provide a total order, but
- * we know that the entries present at any instant should not extend across
- * a large enough fraction of XID space to wrap around (the primary would
- * shut down for fear of XID wrap long before that happens).  So it's OK to
- * use TransactionIdPrecedes as a binary-search comparator.
- *
- * It's cheap to maintain the sortedness during insertions, since new known
- * XIDs are always reported in XID order; we just append them at the right.
- *
- * To keep individual deletions cheap, we need to allow gaps in the array.
- * This is implemented by marking array elements as valid or invalid using
- * the parallel boolean array KnownAssignedXidsValid[].  A deletion is done
- * by setting KnownAssignedXidsValid[i] to false, *without* clearing the
- * XID entry itself.  This preserves the property that the XID entries are
- * sorted, so we can do binary searches easily.  Periodically we compress
- * out the unused entries; that's much cheaper than having to compress the
- * array immediately on every deletion.
- *
- * The actually valid items in KnownAssignedXids[] and KnownAssignedXidsValid[]
- * are those with indexes tail <= i < head; items outside this subscript range
- * have unspecified contents.  When head reaches the end of the array, we
- * force compression of unused entries rather than wrapping around, since
- * allowing wraparound would greatly complicate the search logic.  We maintain
- * an explicit tail pointer so that pruning of old XIDs can be done without
- * immediately moving the array contents.  In most cases only a small fraction
- * of the array contains valid entries at any instant.
- *
- * Although only the startup process can ever change the KnownAssignedXids
- * data structure, we still need interlocking so that standby backends will
- * not observe invalid intermediate states.  The convention is that backends
- * must hold shared ProcArrayLock to examine the array.  To remove XIDs from
- * the array, the startup process must hold ProcArrayLock exclusively, for
- * the usual transactional reasons (compare commit/abort of a transaction
- * during normal running).  Compressing unused entries out of the array
- * likewise requires exclusive lock.  To add XIDs to the array, we just insert
- * them into slots to the right of the head pointer and then advance the head
- * pointer.  This doesn't require any lock at all, but on machines with weak
- * memory ordering, we need to be careful that other processors see the array
- * element changes before they see the head pointer change.  We handle this by
- * using memory barriers when reading or writing the head/tail pointers (unless
- * the caller holds ProcArrayLock exclusively).
- *
- * Algorithmic analysis:
- *
- * If we have a maximum of M slots, with N XIDs currently spread across
- * S elements then we have N <= S <= M always.
- *
- *	* Adding a new XID is O(1) and needs no lock (unless compression must
- *		happen)
- *	* Compressing the array is O(S) and requires exclusive lock
- *	* Removing an XID is O(logS) and requires exclusive lock
- *	* Taking a snapshot is O(S) and requires shared lock
- *	* Checking for an XID is O(logS) and requires shared lock
- *
- * In comparison, using a hash table for KnownAssignedXids would mean that
- * taking snapshots would be O(M). If we can maintain S << M then the
- * sorted array technique will deliver significantly faster snapshots.
- * If we try to keep S too small then we will spend too much time compressing,
- * so there is an optimal point for any workload mix. We use a heuristic to
- * decide when to compress the array, though trimming also helps reduce
- * frequency of compressing. The heuristic requires us to track the number of
- * currently valid XIDs in the array (N).  Except in special cases, we'll
- * compress when S >= 2N.  Bounding S at 2N in turn bounds the time for
- * taking a snapshot to be O(N), which it would have to be anyway.
- */
-
-
-/*
- * Compress KnownAssignedXids by shifting valid data down to the start of the
- * array, removing any gaps.
- *
- * A compression step is forced if "reason" is KAX_NO_SPACE, otherwise
- * we do it only if a heuristic indicates it's a good time to do it.
- *
- * Compression requires holding ProcArrayLock in exclusive mode.
- * Caller must pass haveLock = true if it already holds the lock.
- */
-static void
-KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			head,
-				tail,
-				nelements;
-	int			compress_index;
-	int			i;
-
-	/* Counters for compression heuristics */
-	static unsigned int transactionEndsCounter;
-	static TimestampTz lastCompressTs;
-
-	/* Tuning constants */
-#define KAX_COMPRESS_FREQUENCY 128	/* in transactions */
-#define KAX_COMPRESS_IDLE_INTERVAL 1000 /* in ms */
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-	nelements = head - tail;
-
-	/*
-	 * If we can choose whether to compress, use a heuristic to avoid
-	 * compressing too often or not often enough.  "Compress" here simply
-	 * means moving the values to the beginning of the array, so it is not as
-	 * complex or costly as typical data compression algorithms.
-	 */
-	if (nelements == pArray->numKnownAssignedXids)
-	{
-		/*
-		 * When there are no gaps between head and tail, don't bother to
-		 * compress, except in the KAX_NO_SPACE case where we must compress to
-		 * create some space after the head.
-		 */
-		if (reason != KAX_NO_SPACE)
-			return;
-	}
-	else if (reason == KAX_TRANSACTION_END)
-	{
-		/*
-		 * Consider compressing only once every so many commits.  Frequency
-		 * determined by benchmarks.
-		 */
-		if ((transactionEndsCounter++) % KAX_COMPRESS_FREQUENCY != 0)
-			return;
-
-		/*
-		 * Furthermore, compress only if the used part of the array is less
-		 * than 50% full (see comments above).
-		 */
-		if (nelements < 2 * pArray->numKnownAssignedXids)
-			return;
-	}
-	else if (reason == KAX_STARTUP_PROCESS_IDLE)
-	{
-		/*
-		 * We're about to go idle for lack of new WAL, so we might as well
-		 * compress.  But not too often, to avoid ProcArray lock contention
-		 * with readers.
-		 */
-		if (lastCompressTs != 0)
-		{
-			TimestampTz compress_after;
-
-			compress_after = TimestampTzPlusMilliseconds(lastCompressTs,
-														 KAX_COMPRESS_IDLE_INTERVAL);
-			if (GetCurrentTimestamp() < compress_after)
-				return;
-		}
-	}
-
-	/* Need to compress, so get the lock if we don't have it. */
-	if (!haveLock)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * We compress the array by reading the valid values from tail to head,
-	 * re-aligning data to 0th element.
-	 */
-	compress_index = 0;
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			KnownAssignedXids[compress_index] = KnownAssignedXids[i];
-			KnownAssignedXidsValid[compress_index] = true;
-			compress_index++;
-		}
-	}
-	Assert(compress_index == pArray->numKnownAssignedXids);
-
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = compress_index;
-
-	if (!haveLock)
-		LWLockRelease(ProcArrayLock);
-
-	/* Update timestamp for maintenance.  No need to hold lock for this. */
-	lastCompressTs = GetCurrentTimestamp();
-}
-
-/*
- * Add xids into KnownAssignedXids at the head of the array.
- *
- * xids from from_xid to to_xid, inclusive, are added to the array.
- *
- * If exclusive_lock is true then caller already holds ProcArrayLock in
- * exclusive mode, so we need no extra locking here.  Else caller holds no
- * lock, so we need to be sure we maintain sufficient interlocks against
- * concurrent readers.  (Only the startup process ever calls this, so no need
- * to worry about concurrent writers.)
- */
-static void
-KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-					 bool exclusive_lock)
-{
-	ProcArrayStruct *pArray = procArray;
-	TransactionId next_xid;
-	int			head,
-				tail;
-	int			nxids;
-	int			i;
-
-	Assert(TransactionIdPrecedesOrEquals(from_xid, to_xid));
-
-	/*
-	 * Calculate how many array slots we'll need.  Normally this is cheap; in
-	 * the unusual case where the XIDs cross the wrap point, we do it the hard
-	 * way.
-	 */
-	if (to_xid >= from_xid)
-		nxids = to_xid - from_xid + 1;
-	else
-	{
-		nxids = 1;
-		next_xid = from_xid;
-		while (TransactionIdPrecedes(next_xid, to_xid))
-		{
-			nxids++;
-			TransactionIdAdvance(next_xid);
-		}
-	}
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-
-	Assert(head >= 0 && head <= pArray->maxKnownAssignedXids);
-	Assert(tail >= 0 && tail < pArray->maxKnownAssignedXids);
-
-	/*
-	 * Verify that insertions occur in TransactionId sequence.  Note that even
-	 * if the last existing element is marked invalid, it must still have a
-	 * correctly sequenced XID value.
-	 */
-	if (head > tail &&
-		TransactionIdFollowsOrEquals(KnownAssignedXids[head - 1], from_xid))
-	{
-		KnownAssignedXidsDisplay(LOG);
-		elog(ERROR, "out-of-order XID insertion in KnownAssignedXids");
-	}
-
-	/*
-	 * If our xids won't fit in the remaining space, compress out free space
-	 */
-	if (head + nxids > pArray->maxKnownAssignedXids)
-	{
-		KnownAssignedXidsCompress(KAX_NO_SPACE, exclusive_lock);
-
-		head = pArray->headKnownAssignedXids;
-		/* note: we no longer care about the tail pointer */
-
-		/*
-		 * If it still won't fit then we're out of memory
-		 */
-		if (head + nxids > pArray->maxKnownAssignedXids)
-			elog(ERROR, "too many KnownAssignedXids");
-	}
-
-	/* Now we can insert the xids into the space starting at head */
-	next_xid = from_xid;
-	for (i = 0; i < nxids; i++)
-	{
-		KnownAssignedXids[head] = next_xid;
-		KnownAssignedXidsValid[head] = true;
-		TransactionIdAdvance(next_xid);
-		head++;
-	}
-
-	/* Adjust count of number of valid entries */
-	pArray->numKnownAssignedXids += nxids;
-
-	/*
-	 * Now update the head pointer.  We use a write barrier to ensure that
-	 * other processors see the above array updates before they see the head
-	 * pointer change.  The barrier isn't required if we're holding
-	 * ProcArrayLock exclusively.
-	 */
-	if (!exclusive_lock)
-		pg_write_barrier();
-
-	pArray->headKnownAssignedXids = head;
-}
-
-/*
- * KnownAssignedXidsSearch
- *
- * Searches KnownAssignedXids for a specific xid and optionally removes it.
- * Returns true if it was found, false if not.
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- * Exclusive lock must be held for remove = true.
- */
-static bool
-KnownAssignedXidsSearch(TransactionId xid, bool remove)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			first,
-				last;
-	int			head;
-	int			tail;
-	int			result_index = -1;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	/*
-	 * Only the startup process removes entries, so we don't need the read
-	 * barrier in that case.
-	 */
-	if (!remove)
-		pg_read_barrier();		/* pairs with KnownAssignedXidsAdd */
-
-	/*
-	 * Standard binary search.  Note we can ignore the KnownAssignedXidsValid
-	 * array here, since even invalid entries will contain sorted XIDs.
-	 */
-	first = tail;
-	last = head - 1;
-	while (first <= last)
-	{
-		int			mid_index;
-		TransactionId mid_xid;
-
-		mid_index = (first + last) / 2;
-		mid_xid = KnownAssignedXids[mid_index];
-
-		if (xid == mid_xid)
-		{
-			result_index = mid_index;
-			break;
-		}
-		else if (TransactionIdPrecedes(xid, mid_xid))
-			last = mid_index - 1;
-		else
-			first = mid_index + 1;
-	}
-
-	if (result_index < 0)
-		return false;			/* not in array */
-
-	if (!KnownAssignedXidsValid[result_index])
-		return false;			/* in array, but invalid */
-
-	if (remove)
-	{
-		KnownAssignedXidsValid[result_index] = false;
-
-		pArray->numKnownAssignedXids--;
-		Assert(pArray->numKnownAssignedXids >= 0);
-
-		/*
-		 * If we're removing the tail element then advance tail pointer over
-		 * any invalid elements.  This will speed future searches.
-		 */
-		if (result_index == tail)
-		{
-			tail++;
-			while (tail < head && !KnownAssignedXidsValid[tail])
-				tail++;
-			if (tail >= head)
-			{
-				/* Array is empty, so we can reset both pointers */
-				pArray->headKnownAssignedXids = 0;
-				pArray->tailKnownAssignedXids = 0;
-			}
-			else
-			{
-				pArray->tailKnownAssignedXids = tail;
-			}
-		}
-	}
-
-	return true;
-}
-
-/*
- * Is the specified XID present in KnownAssignedXids[]?
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- */
-static bool
-KnownAssignedXidExists(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	return KnownAssignedXidsSearch(xid, false);
-}
-
-/*
- * Remove the specified XID from KnownAssignedXids[].
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemove(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	elog(DEBUG4, "remove KnownAssignedXid %u", xid);
-
-	/*
-	 * Note: we cannot consider it an error to remove an XID that's not
-	 * present.  We intentionally remove subxact IDs while processing
-	 * XLOG_XACT_ASSIGNMENT, to avoid array overflow.  Then those XIDs will be
-	 * removed again when the top-level xact commits or aborts.
-	 *
-	 * It might be possible to track such XIDs to distinguish this case from
-	 * actual errors, but it would be complicated and probably not worth it.
-	 * So, just ignore the search result.
-	 */
-	(void) KnownAssignedXidsSearch(xid, true);
-}
-
-/*
- * KnownAssignedXidsRemoveTree
- *		Remove xid (if it's not InvalidTransactionId) and all the subxids.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-							TransactionId *subxids)
-{
-	int			i;
-
-	if (TransactionIdIsValid(xid))
-		KnownAssignedXidsRemove(xid);
-
-	for (i = 0; i < nsubxids; i++)
-		KnownAssignedXidsRemove(subxids[i]);
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_TRANSACTION_END, true);
-}
-
-/*
- * Prune KnownAssignedXids up to, but *not* including xid. If xid is invalid
- * then clear the whole table.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemovePreceding(TransactionId removeXid)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			count = 0;
-	int			head,
-				tail,
-				i;
-
-	if (!TransactionIdIsValid(removeXid))
-	{
-		elog(DEBUG4, "removing all KnownAssignedXids");
-		pArray->numKnownAssignedXids = 0;
-		pArray->headKnownAssignedXids = pArray->tailKnownAssignedXids = 0;
-		return;
-	}
-
-	elog(DEBUG4, "prune KnownAssignedXids to %u", removeXid);
-
-	/*
-	 * Mark entries invalid starting at the tail.  Since array is sorted, we
-	 * can stop as soon as we reach an entry >= removeXid.
-	 */
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			if (TransactionIdFollowsOrEquals(knownXid, removeXid))
-				break;
-
-			if (!StandbyTransactionIdIsPrepared(knownXid))
-			{
-				KnownAssignedXidsValid[i] = false;
-				count++;
-			}
-		}
-	}
-
-	pArray->numKnownAssignedXids -= count;
-	Assert(pArray->numKnownAssignedXids >= 0);
-
-	/*
-	 * Advance the tail pointer if we've marked the tail item invalid.
-	 */
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-			break;
-	}
-	if (i >= head)
-	{
-		/* Array is empty, so we can reset both pointers */
-		pArray->headKnownAssignedXids = 0;
-		pArray->tailKnownAssignedXids = 0;
-	}
-	else
-	{
-		pArray->tailKnownAssignedXids = i;
-	}
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_PRUNE, true);
-}
-
-/*
- * KnownAssignedXidsGet - Get an array of xids by scanning KnownAssignedXids.
- * We filter out anything >= xmax.
- *
- * Returns the number of XIDs stored into xarray[].  Caller is responsible
- * that array is large enough.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax)
-{
-	TransactionId xtmp = InvalidTransactionId;
-
-	return KnownAssignedXidsGetAndSetXmin(xarray, &xtmp, xmax);
-}
-
-/*
- * KnownAssignedXidsGetAndSetXmin - as KnownAssignedXidsGet, plus
- * we reduce *xmin to the lowest xid value seen if not already lower.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
-							   TransactionId xmax)
-{
-	int			count = 0;
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop. We can stop
-	 * once we reach the initially seen head, since we are certain that an xid
-	 * cannot enter and then leave the array while we hold ProcArrayLock.  We
-	 * might miss newly-added xids, but they should be >= xmax so irrelevant
-	 * anyway.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			/*
-			 * Update xmin if required.  Only the first XID need be checked,
-			 * since the array is sorted.
-			 */
-			if (count == 0 &&
-				TransactionIdPrecedes(knownXid, *xmin))
-				*xmin = knownXid;
-
-			/*
-			 * Filter out anything >= xmax, again relying on sorted property
-			 * of array.
-			 */
-			if (TransactionIdIsValid(xmax) &&
-				TransactionIdFollowsOrEquals(knownXid, xmax))
-				break;
-
-			/* Add knownXid into output array */
-			xarray[count++] = knownXid;
-		}
-	}
-
-	return count;
-}
-
-/*
- * Get oldest XID in the KnownAssignedXids array, or InvalidTransactionId
- * if nothing there.
- */
-static TransactionId
-KnownAssignedXidsGetOldestXmin(void)
-{
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-			return KnownAssignedXids[i];
-	}
-
-	return InvalidTransactionId;
-}
-
-/*
- * Display KnownAssignedXids to provide debug trail
- *
- * Currently this is only called within startup process, so we need no
- * special locking.
- *
- * Note this is pretty expensive, and much of the expense will be incurred
- * even if the elog message will get discarded.  It's not currently called
- * in any performance-critical places, however, so no need to be tenser.
- */
-static void
-KnownAssignedXidsDisplay(int trace_level)
-{
-	ProcArrayStruct *pArray = procArray;
-	StringInfoData buf;
-	int			head,
-				tail,
-				i;
-	int			nxids = 0;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	initStringInfo(&buf);
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			nxids++;
-			appendStringInfo(&buf, "[%d]=%u ", i, KnownAssignedXids[i]);
-		}
-	}
-
-	elog(trace_level, "%d KnownAssignedXids (num=%d tail=%d head=%d) %s",
-		 nxids,
-		 pArray->numKnownAssignedXids,
-		 pArray->tailKnownAssignedXids,
-		 pArray->headKnownAssignedXids,
-		 buf.data);
-
-	pfree(buf.data);
-}
-
-/*
- * KnownAssignedXidsReset
- *		Resets KnownAssignedXids to be empty
- */
-static void
-KnownAssignedXidsReset(void)
-{
-	ProcArrayStruct *pArray = procArray;
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(lsn > TransamVariables->latestCommitLSN);
+	TransamVariables->latestCommitLSN = lsn;
 
-	pArray->numKnownAssignedXids = 0;
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = 0;
+	procArray->oldest_running_primary_xid = oldest_running_primary_xid;
 
 	LWLockRelease(ProcArrayLock);
 }
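
Review note: the advance step in ProcArrayRecoveryEndTransaction() can be modelled outside the server. The following is a minimal standalone sketch, not the patch's code. `Xid`, `xid_precedes()`, `xid_advance()`, `advance_oldest_running()` and `demo_is_finished()` are hypothetical stand-ins for TransactionId, TransactionIdPrecedes(), TransactionIdAdvance() and the clog lookups. The modulo-2^32 comparison is a choice made for this sketch, so that the walk still works when the XID counter wraps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for TransactionId. */
typedef uint32_t Xid;

/* XIDs 0..2 are reserved in PostgreSQL; normal XIDs start at 3. */
#define FIRST_NORMAL_XID ((Xid) 3)

/* Callback that reports whether an XID has committed or aborted. */
typedef bool (*XidFinishedFn) (Xid xid);

/* Wraparound-aware "a precedes b" for normal XIDs (modulo-2^32 order). */
static bool
xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/* Step to the next XID, skipping the reserved values after wraparound. */
static Xid
xid_advance(Xid xid)
{
	xid++;
	if (xid < FIRST_NORMAL_XID)
		xid = FIRST_NORMAL_XID;
	return xid;
}

/*
 * Walk forward from 'oldest' past every finished XID, stopping at the first
 * one that is still running or at 'max_xid'.  'max_xid' is the XID that has
 * just finished, so if we reach it we step past it as well.
 */
static Xid
advance_oldest_running(Xid oldest, Xid max_xid, XidFinishedFn is_finished)
{
	assert(max_xid >= FIRST_NORMAL_XID);

	while (xid_precedes(oldest, max_xid))
	{
		if (!is_finished(oldest))
			break;
		oldest = xid_advance(oldest);
	}
	if (oldest == max_xid)
		oldest = xid_advance(oldest);
	return oldest;
}

/* Toy transaction status: every XID except 104 has finished. */
static bool
demo_is_finished(Xid xid)
{
	return xid != 104;
}
```

With the toy status function, starting at 100 and finishing 103 moves the oldest running XID to 104, and a later commit of 110 leaves it at 104 because that transaction is still running.
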
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 25267f0f85d..e02c9ab842d 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -139,8 +139,6 @@ InitRecoveryTransactionEnvironment(void)
 	vxid.procNumber = MyProcNumber;
 	vxid.localTransactionId = GetNextLocalTransactionId();
 	VirtualXactLockTableInsert(vxid);
-
-	standbyState = STANDBY_INITIALIZED;
 }
 
 /*
@@ -168,9 +166,6 @@ ShutdownRecoveryTransactionEnvironment(void)
 	if (RecoveryLockHash == NULL)
 		return;
 
-	/* Mark all tracked in-progress transactions as finished. */
-	ExpireAllKnownAssignedTransactionIds();
-
 	/* Release all locks the tracked transactions were holding */
 	StandbyReleaseAllLocks();
 
@@ -1167,7 +1162,7 @@ standby_redo(XLogReaderState *record)
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
 	/* Do nothing if we're not in hot standby mode */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 		return;
 
 	if (info == XLOG_STANDBY_LOCK)
@@ -1182,18 +1177,21 @@ standby_redo(XLogReaderState *record)
 	}
 	else if (info == XLOG_RUNNING_XACTS)
 	{
+		/*
+		 * XXX: running-xacts records were previously used to update the
+		 * known-assigned XIDs, but now we only need them for the logical
+		 * replication snapbuilder, and for the pgstat_report_stat(true) call
+		 * below.
+		 */
 		xl_running_xacts *xlrec = (xl_running_xacts *) XLogRecGetData(record);
-		RunningTransactionsData running;
 
-		running.xcnt = xlrec->xcnt;
-		running.subxcnt = xlrec->subxcnt;
-		running.subxid_status = xlrec->subxid_overflow ? SUBXIDS_MISSING : SUBXIDS_IN_ARRAY;
-		running.nextXid = xlrec->nextXid;
-		running.latestCompletedXid = xlrec->latestCompletedXid;
-		running.oldestRunningXid = xlrec->oldestRunningXid;
-		running.xids = xlrec->xids;
-
-		ProcArrayApplyRecoveryInfo(&running);
+		/*
+		 * Remember the oldest XID that was running at the time. Normally, all
+		 * transaction aborts and commits are WAL-logged, so our
+		 * oldestRunningXid value should be up-to-date, but if not, this
+		 * allows us to resynchronize.
+		 */
+		ProcArrayUpdateOldestRunningXid(xlrec->oldestRunningXid);
 
 		/*
 		 * The startup process currently has no convenient way to schedule
@@ -1224,50 +1222,46 @@ standby_redo(XLogReaderState *record)
  *
  * This is used for Hot Standby as follows:
  *
- * We can move directly to STANDBY_SNAPSHOT_READY at startup if we
- * start from a shutdown checkpoint because we know nothing was running
- * at that time and our recovery snapshot is known empty. In the more
- * typical case of an online checkpoint we need to jump through a few
- * hoops to get a correct recovery snapshot and this requires a two or
- * sometimes a three stage process.
+ * We can enter hot standby mode and start accepting read-only queries
+ * immediately at startup if we start from a shutdown checkpoint, because we
+ * know nothing was running at that time and our recovery snapshot is known
+ * empty. In the more typical case of an online checkpoint, the checkpoint
+ * record doesn't contain all the necessary information about running
+ * transaction state, and we need to jump through a few hoops to get a correct
+ * recovery snapshot.
  *
- * The initial snapshot must contain all running xids and all current
- * AccessExclusiveLocks at a point in time on the standby. Assembling
- * that information while the server is running requires many and
- * various LWLocks, so we choose to derive that information piece by
- * piece and then re-assemble that info on the standby. When that
- * information is fully assembled we move to STANDBY_SNAPSHOT_READY.
+ * The initial snapshot must contain all current AccessExclusiveLocks at a
+ * point in time on the standby. Assembling that information while the server
+ * is running requires many and various LWLocks, so we choose to derive that
+ * information piece by piece and then re-assemble that info on the standby.
  *
- * Since locking on the primary when we derive the information is not
- * strict, we note that there is a time window between the derivation and
- * writing to WAL of the derived information. That allows race conditions
- * that we must resolve, since xids and locks may enter or leave the
- * snapshot during that window. This creates the issue that an xid or
- * lock may start *after* the snapshot has been derived yet *before* the
- * snapshot is logged in the running xacts WAL record. We resolve this by
- * starting to accumulate changes at a point just prior to when we derive
- * the snapshot on the primary, then ignore duplicates when we later apply
- * the snapshot from the running xacts record. This is implemented during
- * CreateCheckPoint() where we use the logical checkpoint location as
- * our starting point and then write the running xacts record immediately
- * before writing the main checkpoint WAL record. Since we always start
- * up from a checkpoint and are immediately at our starting point, we
- * unconditionally move to STANDBY_INITIALIZED. After this point we
- * must do 4 things:
+ * Since locking on the primary when we derive the information is not strict,
+ * there is a time window between the derivation and writing to WAL of the
+ * derived information. That allows race conditions that we must resolve,
+ * since xids and locks may enter or leave the snapshot during that
+ * window. This creates the issue that an xid or lock may start *after* the
+ * snapshot has been derived yet *before* the snapshot is logged in the
+ * running xacts WAL record. We resolve this by starting to accumulate changes
+ * at a point just prior to when we collect the lock information on the
+ * primary, then ignore duplicates when we later apply the snapshot from the
+ * running xacts record. This is implemented during CreateCheckPoint() where
+ * we use the logical checkpoint location as our starting point and then write
+ * the running xacts record immediately before writing the main checkpoint WAL
+ * record. Since we always start up from a checkpoint's redo pointer, we will
+ * always see a running-xacts record before reaching the checkpoint
+ * record, and can immediately enter hot standby mode. After this point we
+ * must do 3 things:
  *	* move shared nextXid forwards as we see new xids
  *	* extend the clog and subtrans with each new xid
- *	* keep track of uncommitted known assigned xids
  *	* keep track of uncommitted AccessExclusiveLocks
  *
- * When we see a commit/abort we must remove known assigned xids and locks
- * from the completing transaction. Attempted removals that cannot locate
- * an entry are expected and must not cause an error when we are in state
- * STANDBY_INITIALIZED. This is implemented in StandbyReleaseLocks() and
- * KnownAssignedXidsRemove().
- *
- * Later, when we apply the running xact data we must be careful to ignore
- * transactions already committed, since those commits raced ahead when
- * making WAL entries.
+ * When we see a commit/abort we must advance oldest_running_primary_xid and
+ * remove locks from the completing transaction. Attempted removals that
+ * cannot locate an entry are expected and must not cause an error until we
+ * have seen the running-xacts record. (We don't throw an error even after
+ * that, because whatever the reason was, after the transaction has completed
+ * the issue has already been resolved anyway.) This is implemented in
+ * StandbyReleaseLocks().
  *
  * For logical decoding only the running xacts information is needed;
  * there's no need to look at the locking information, but it's logged anyway,
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db6ed784ab3..60f93a39a47 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -130,6 +130,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_XACT_BUFFER] = "XactBuffer",
 	[LWTRANCHE_COMMITTS_BUFFER] = "CommitTsBuffer",
 	[LWTRANCHE_SUBTRANS_BUFFER] = "SubtransBuffer",
+	[LWTRANCHE_CSN_LOG_BUFFER] = "CsnLogBuffer",
 	[LWTRANCHE_MULTIXACTOFFSET_BUFFER] = "MultiXactOffsetBuffer",
 	[LWTRANCHE_MULTIXACTMEMBER_BUFFER] = "MultiXactMemberBuffer",
 	[LWTRANCHE_NOTIFY_BUFFER] = "NotifyBuffer",
@@ -166,6 +167,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
 	[LWTRANCHE_XACT_SLRU] = "XactSLRU",
 	[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+	[LWTRANCHE_CSN_LOG_SLRU] = "CsnLogSLRU",
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d6f..18d7a0ab5bf 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -359,6 +359,7 @@ WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 XactBuffer	"Waiting for I/O on a transaction status SLRU buffer."
 CommitTsBuffer	"Waiting for I/O on a commit timestamp SLRU buffer."
 SubtransBuffer	"Waiting for I/O on a sub-transaction SLRU buffer."
+CsnLogBuffer	"Waiting for I/O on a CSN log SLRU buffer."
 MultiXactOffsetBuffer	"Waiting for I/O on a multixact offset SLRU buffer."
 MultiXactMemberBuffer	"Waiting for I/O on a multixact member SLRU buffer."
 NotifyBuffer	"Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index d7725443774..ffbfae84b80 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -77,6 +77,8 @@ provider postgresql {
 	probe clog__checkpoint__done(bool);
 	probe subtrans__checkpoint__start(bool);
 	probe subtrans__checkpoint__done(bool);
+	probe csnlog__checkpoint__start(bool);
+	probe csnlog__checkpoint__done(bool);
 	probe multixact__checkpoint__start(bool);
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f20..da82def8461 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -48,6 +48,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -201,6 +202,7 @@ typedef struct SerializedSnapshotData
 	CommandId	curcid;
 	TimestampTz whenTaken;
 	XLogRecPtr	lsn;
+	XLogRecPtr	snapshotCsn;
 } SerializedSnapshotData;
 
 /*
@@ -1729,6 +1731,7 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
 	serialized_snapshot.curcid = snapshot->curcid;
 	serialized_snapshot.whenTaken = snapshot->whenTaken;
 	serialized_snapshot.lsn = snapshot->lsn;
+	serialized_snapshot.snapshotCsn = snapshot->snapshotCsn;
 
 	/*
 	 * Ignore the SubXID array if it has overflowed, unless the snapshot was
@@ -1803,6 +1806,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->curcid = serialized_snapshot.curcid;
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
+	snapshot->snapshotCsn = serialized_snapshot.snapshotCsn;
 	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
@@ -1913,36 +1917,11 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 	}
 	else
 	{
-		/*
-		 * In recovery we store all xids in the subxip array because it is by
-		 * far the bigger array, and we mostly don't know which xids are
-		 * top-level and which are subxacts. The xip array is empty.
-		 *
-		 * We start by searching subtrans, if we overflowed.
-		 */
-		if (snapshot->suboverflowed)
-		{
-			/*
-			 * Snapshot overflowed, so convert xid to top-level.  This is safe
-			 * because we eliminated too-old XIDs above.
-			 */
-			xid = SubTransGetTopmostTransaction(xid);
+		XLogRecPtr	csn = CSNLogGetCSNByXid(xid);
 
-			/*
-			 * If xid was indeed a subxact, we might now have an xid < xmin,
-			 * so recheck to avoid an array scan.  No point in rechecking
-			 * xmax.
-			 */
-			if (TransactionIdPrecedes(xid, snapshot->xmin))
-				return false;
-		}
-
-		/*
-		 * We now have either a top-level xid higher than xmin or an
-		 * indeterminate xid. We don't know whether it's top level or subxact
-		 * but it doesn't matter. If it's present, the xid is visible.
-		 */
-		if (pg_lfind32(xid, snapshot->subxip, snapshot->subxcnt))
+		if (csn != InvalidXLogRecPtr && csn <= snapshot->snapshotCsn)
+			return false;
+		else
 			return true;
 	}
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9a91830783e..dfe80eaa0dd 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -249,7 +249,8 @@ static const char *const subdirs[] = {
 	"pg_xact",
 	"pg_logical",
 	"pg_logical/snapshots",
-	"pg_logical/mappings"
+	"pg_logical/mappings",
+	"pg_csn"
 };
 
 
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
new file mode 100644
index 00000000000..f8cdf573aef
--- /dev/null
+++ b/src/include/access/csn_log.h
@@ -0,0 +1,30 @@
+/*
+ * csn_log.h
+ *
+ * Mapping from XID to commit record's LSN (Commit Sequence Number).
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_log.h
+ */
+#ifndef CSNLOG_H
+#define CSNLOG_H
+
+#include "access/xlog.h"
+#include "utils/snapshot.h"
+
+extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
+						 TransactionId *subxids, XLogRecPtr csn);
+extern XLogRecPtr CSNLogGetCSNByXid(TransactionId xid);
+
+extern Size CSNLogShmemSize(void);
+extern void CSNLogShmemInit(void);
+extern void BootStrapCSNLog(void);
+extern void StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn);
+extern void ShutdownCSNLog(void);
+extern void CheckPointCSNLog(void);
+extern void ExtendCSNLog(TransactionId newestXact);
+extern void TruncateCSNLog(TransactionId oldestXact);
+
+#endif							/* CSNLOG_H */
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 28a2d287fd5..a7054fe11cd 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -238,6 +238,9 @@ typedef struct TransamVariablesData
 	FullTransactionId latestCompletedXid;	/* newest full XID that has
 											 * committed or aborted */
 
+	/* During recovery, LSN of latest replayed commit record */
+	XLogRecPtr	latestCommitLSN;
+
 	/*
 	 * Number of top-level transactions with xids (i.e. which may have
 	 * modified the database) that completed in some form since the start of
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index b85b65c604e..58ed0fc038b 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -47,8 +47,7 @@ extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
 
-extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
-												 int *nxids_p);
+extern TransactionId PrescanPreparedTransactions(void);
 extern void StandbyRecoverPreparedTransactions(void);
 extern void RecoverPreparedTransactions(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index fb64d7413a2..240cbfd4170 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -171,7 +171,7 @@ typedef struct SavedTransactionCharacteristics
 #define XLOG_XACT_ABORT				0x20
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
-#define XLOG_XACT_ASSIGNMENT		0x50
+/* 0x50 is unused, was XLOG_XACT_ASSIGNMENT */
 #define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
@@ -215,15 +215,6 @@ typedef struct SavedTransactionCharacteristics
 #define XactCompletionForceSyncCommit(xinfo) \
 	((xinfo & XACT_COMPLETION_FORCE_SYNC_COMMIT) != 0)
 
-typedef struct xl_xact_assignment
-{
-	TransactionId xtop;			/* assigned XID's top-level XID */
-	int			nsubxacts;		/* number of subtransaction XIDs */
-	TransactionId xsub[FLEXIBLE_ARRAY_MEMBER];	/* assigned subxids */
-} xl_xact_assignment;
-
-#define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
-
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -448,7 +439,6 @@ extern FullTransactionId GetTopFullTransactionId(void);
 extern FullTransactionId GetTopFullTransactionIdIfAny(void);
 extern FullTransactionId GetCurrentFullTransactionId(void);
 extern FullTransactionId GetCurrentFullTransactionIdIfAny(void);
-extern void MarkCurrentTransactionIdLoggedIfAny(void);
 extern bool SubTransactionIsActive(SubTransactionId subxid);
 extern CommandId GetCurrentCommandId(bool used);
 extern void SetParallelStartTimestamps(TimestampTz xact_ts, TimestampTz stmt_ts);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 20950ce0336..19cb5f33bd5 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -27,37 +27,10 @@ extern PGDLLIMPORT bool ignore_invalid_pages;
 extern PGDLLIMPORT bool InRecovery;
 
 /*
- * Like InRecovery, standbyState is only valid in the startup process.
- * In all other processes it will have the value STANDBY_DISABLED (so
- * InHotStandby will read as false).
- *
- * In DISABLED state, we're performing crash recovery or hot standby was
- * disabled in postgresql.conf.
- *
- * In INITIALIZED state, we've run InitRecoveryTransactionEnvironment, but
- * we haven't yet processed a RUNNING_XACTS or shutdown-checkpoint WAL record
- * to initialize our primary-transaction tracking system.
- *
- * When the transaction tracking is initialized, we enter the SNAPSHOT_PENDING
- * state. The tracked information might still be incomplete, so we can't allow
- * connections yet, but redo functions must update the in-memory state when
- * appropriate.
- *
- * In SNAPSHOT_READY mode, we have full knowledge of transactions that are
- * (or were) running on the primary at the current WAL location. Snapshots
- * can be taken, and read-only queries can be run.
+ * Like InRecovery, InHotStandby is only valid in the startup process.
+ * In all other processes it will be false.
  */
-typedef enum
-{
-	STANDBY_DISABLED,
-	STANDBY_INITIALIZED,
-	STANDBY_SNAPSHOT_PENDING,
-	STANDBY_SNAPSHOT_READY,
-} HotStandbyState;
-
-extern PGDLLIMPORT HotStandbyState standbyState;
-
-#define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)
+extern PGDLLIMPORT bool InHotStandby;
 
 
 extern bool XLogHaveInvalidPages(void);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e09..c2156aca12d 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -179,6 +179,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_XACT_BUFFER = NUM_INDIVIDUAL_LWLOCKS,
 	LWTRANCHE_COMMITTS_BUFFER,
 	LWTRANCHE_SUBTRANS_BUFFER,
+	LWTRANCHE_CSN_LOG_BUFFER,
 	LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 	LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 	LWTRANCHE_NOTIFY_BUFFER,
@@ -215,6 +216,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_SUBTRANS_SLRU,
 	LWTRANCHE_XACT_SLRU,
 	LWTRANCHE_PARALLEL_VACUUM_DSA,
+	LWTRANCHE_CSN_LOG_SLRU,
 	LWTRANCHE_FIRST_USER_DEFINED,
 }			BuiltinTrancheIds;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 56af0b40b32..de74fce24e4 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -28,18 +28,11 @@ extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayClearTransaction(PGPROC *proc);
 
+extern void ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID);
 extern void ProcArrayInitRecovery(TransactionId initializedUptoXID);
-extern void ProcArrayApplyRecoveryInfo(RunningTransactions running);
-extern void ProcArrayApplyXidAssignment(TransactionId topxid,
-										int nsubxids, TransactionId *subxids);
 
 extern void RecordKnownAssignedTransactionIds(TransactionId xid);
-extern void ExpireTreeKnownAssignedTransactionIds(TransactionId xid,
-												  int nsubxids, TransactionId *subxids,
-												  TransactionId max_xid);
-extern void ExpireAllKnownAssignedTransactionIds(void);
-extern void ExpireOldKnownAssignedTransactionIds(TransactionId xid);
-extern void KnownAssignedTransactionIdsIdleMaintenance(void);
+extern void ProcArrayRecoveryEndTransaction(TransactionId max_xid, XLogRecPtr lsn);
 
 extern int	GetMaxSnapshotXidCount(void);
 extern int	GetMaxSnapshotSubxidCount(void);
@@ -56,7 +49,7 @@ extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
 extern TransactionId GetOldestNonRemovableTransactionId(Relation rel);
 extern TransactionId GetOldestTransactionIdConsideredRunning(void);
-extern TransactionId GetOldestActiveTransactionId(void);
+extern TransactionId GetOldestActiveTransactionId(bool allDbs);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *xmin, TransactionId *catalog_xmin);
 
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 8d1e31e888e..1fda5b06f67 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -181,6 +181,13 @@ typedef struct SnapshotData
 	int32		subxcnt;		/* # of xact ids in subxip[] */
 	bool		suboverflowed;	/* has the subxip array overflowed? */
 
+	/*
+	 * MVCC snapshots taken during recovery use this CSN instead of the xip
+	 * and subxip arrays. Any transactions that committed at or before this
+	 * LSN are considered as visible.
+	 */
+	XLogRecPtr	snapshotCsn;
+
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 	bool		copied;			/* false if it's a static snapshot */
 
-- 
2.39.5

0003-Make-SnapBuildWaitSnapshot-work-without-xl_running_x.patch (text/x-patch; charset=UTF-8)

From a89122bfb0bba7c73b13139a0aa56b8757d898b7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 20:01:07 +0300
Subject: [PATCH 3/5] Make SnapBuildWaitSnapshot work without
 xl_running_xacts.xids array

SnapBuildWaitSnapshot looped through all the XIDs in the
xl_running_xacts, waiting for them to finish. Change it to grab the
list of running XIDs from the proc array instead. This removes the
last usage of the XIDs array in the xl_running_xacts record, allowing
it to be removed in the next commit.

When SnapBuildWaitSnapshot() is called with running->nextXid as the
'cutoff' point, the new code should wait for exactly the same set of
transactions as before. But when called with initial_xmin_horizon as
the 'cutoff', this might wait for more transactions than before: those
between running->nextXid and initial_xmin_horizon. For example,
imagine that we see a running-xacts record with nextXid 100, and
initial_xmin_horizon is 200. Before, we would wait for all XIDs < 100
to complete, and then log the standby snapshot and proceed, but now we
will wait for all XIDs < 200. I believe that's a good thing, because
we won't actually be able to move to the next state in the snapshot
building until all transactions < 200 have completed. The
running-xacts snapshot that we logged after waiting up to XID 100
would not be useful to us either, if there are still XIDs between 100
and 200 running.

SnapBuildWaitSnapshot() used to do useless work when called in a
standby, because in a standby, there are no XID locks and the
XactLockTableWait() calls returned immediately, even if the XIDs were
in fact still running in the primary. But as the comment says, the
waiting isn't necessary for correctness, so that was harmless. In any
case, stop doing the futile work on a standby.
---
 src/backend/replication/logical/snapbuild.c | 50 ++++++++++++++-------
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 734865ce621..31da0832cc3 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -168,7 +168,7 @@ static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, Transaction
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
-static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
+static void SnapBuildWaitSnapshot(TransactionId cutoff);
 
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
@@ -1222,14 +1222,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		NormalTransactionIdPrecedes(running->oldestRunningXid,
 									builder->initial_xmin_horizon))
 	{
+		TransactionId cutoff;
+
 		ereport(DEBUG1,
 				(errmsg_internal("skipping snapshot at %X/%X while building logical decoding snapshot, xmin horizon too low",
 								 LSN_FORMAT_ARGS(lsn)),
 				 errdetail_internal("initial xmin horizon of %u vs the snapshot's %u",
 									builder->initial_xmin_horizon, running->oldestRunningXid)));
 
-
-		SnapBuildWaitSnapshot(running, builder->initial_xmin_horizon);
+		cutoff = builder->initial_xmin_horizon;
+		TransactionIdRetreat(cutoff);
+		SnapBuildWaitSnapshot(cutoff);
 
 		return true;
 	}
@@ -1316,7 +1319,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
 						   running->xcnt, running->nextXid)));
 
-		SnapBuildWaitSnapshot(running, running->nextXid);
+		SnapBuildWaitSnapshot(running->nextXid);
 	}
 
 	/*
@@ -1340,7 +1343,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
 						   running->xcnt, running->nextXid)));
 
-		SnapBuildWaitSnapshot(running, running->nextXid);
+		SnapBuildWaitSnapshot(running->nextXid);
 	}
 
 	/*
@@ -1373,8 +1376,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 }
 
 /* ---
- * Iterate through xids in record, wait for all older than the cutoff to
- * finish.  Then, if possible, log a new xl_running_xacts record.
+ * Wait for all transactions older than or equal to the cutoff to finish.
+ * Then, if possible, log a new xl_running_xacts record.
  *
  * This isn't required for the correctness of decoding, but to:
  * a) allow isolationtester to notice that we're currently waiting for
@@ -1384,13 +1387,31 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
  * ---
  */
 static void
-SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
+SnapBuildWaitSnapshot(TransactionId cutoff)
 {
-	int			off;
+	RunningTransactions running;
+
+	if (RecoveryInProgress())
+	{
+		/*
+		 * During recovery, we have no mechanism for waiting for an XID to
+		 * finish, and we cannot create new running-xacts records either.
+		 */
+		return;
+	}
+
+	running = GetRunningTransactionData();
+
+	/*
+	 * GetRunningTransactionData returns with XidGenLock and ProcArrayLock
+	 * held, but we don't need them.
+	 */
+	LWLockRelease(XidGenLock);
+	LWLockRelease(ProcArrayLock);
 
-	for (off = 0; off < running->xcnt; off++)
+	for (int i = 0; i < running->xcnt; i++)
 	{
-		TransactionId xid = running->xids[off];
+		TransactionId xid = running->xids[i];
 
 		/*
 		 * Upper layers should prevent that we ever need to wait on ourselves.
@@ -1400,7 +1421,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
 		if (TransactionIdIsCurrentTransactionId(xid))
 			elog(ERROR, "waiting for ourselves");
 
-		if (TransactionIdFollows(xid, cutoff))
+		if (TransactionIdFollowsOrEquals(xid, cutoff))
 			continue;
 
 		XactLockTableWait(xid, NULL, NULL, XLTW_None);
@@ -1412,10 +1433,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
 	 * wait for bgwriter or checkpointer to log one.  During recovery we can't
 	 * enforce that, so we'll have to wait.
 	 */
-	if (!RecoveryInProgress())
-	{
-		LogStandbySnapshot();
-	}
+	LogStandbySnapshot();
 }
 
 #define SnapBuildOnDiskConstantSize \
-- 
2.39.5

0004-Remove-the-now-unused-xids-array-from-xl_running_xac.patch (text/x-patch; charset=UTF-8)

From aae52b88ce67adc0261ebfeafb8496ed9e88d240 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 16:40:57 +0300
Subject: [PATCH 4/5] Remove the now-unused xids array from xl_running_xacts

We still generate running-xacts records, because they are still needed
to initialize the snapshot in logical decoding.
---
 src/backend/access/rmgrdesc/standbydesc.c   | 18 ------------
 src/backend/replication/logical/snapbuild.c |  8 +++---
 src/backend/storage/ipc/standby.c           | 32 +++++----------------
 src/include/storage/standby.h               |  2 --
 src/include/storage/standbydefs.h           | 16 +++++++----
 5 files changed, 21 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index 32e509a4006..99f08beb4a8 100644
--- a/src/backend/access/rmgrdesc/standbydesc.c
+++ b/src/backend/access/rmgrdesc/standbydesc.c
@@ -19,28 +19,10 @@
 static void
 standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
 {
-	int			i;
-
 	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u",
 					 xlrec->nextXid,
 					 xlrec->latestCompletedXid,
 					 xlrec->oldestRunningXid);
-	if (xlrec->xcnt > 0)
-	{
-		appendStringInfo(buf, "; %d xacts:", xlrec->xcnt);
-		for (i = 0; i < xlrec->xcnt; i++)
-			appendStringInfo(buf, " %u", xlrec->xids[i]);
-	}
-
-	if (xlrec->subxid_overflow)
-		appendStringInfoString(buf, "; subxid overflowed");
-
-	if (xlrec->subxcnt > 0)
-	{
-		appendStringInfo(buf, "; %d subxacts:", xlrec->subxcnt);
-		for (i = 0; i < xlrec->subxcnt; i++)
-			appendStringInfo(buf, " %u", xlrec->xids[xlrec->xcnt + i]);
-	}
 }
 
 void
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 31da0832cc3..cac3ffe577e 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1316,8 +1316,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		ereport(LOG,
 				(errmsg("logical decoding found initial starting point at %X/%X",
 						LSN_FORMAT_ARGS(lsn)),
-				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
-						   running->xcnt, running->nextXid)));
+				 errdetail("Waiting for transactions older than %u to end.",
+						   running->nextXid)));
 
 		SnapBuildWaitSnapshot(running->nextXid);
 	}
@@ -1340,8 +1340,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		ereport(LOG,
 				(errmsg("logical decoding found initial consistent point at %X/%X",
 						LSN_FORMAT_ARGS(lsn)),
-				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
-						   running->xcnt, running->nextXid)));
+				 errdetail("Waiting for transactions older than %u to end.",
+						   running->nextXid)));
 
 		SnapBuildWaitSnapshot(running->nextXid);
 	}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index e02c9ab842d..6ed46bed033 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1337,9 +1337,6 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	xl_running_xacts xlrec;
 	XLogRecPtr	recptr;
 
-	xlrec.xcnt = CurrRunningXacts->xcnt;
-	xlrec.subxcnt = CurrRunningXacts->subxcnt;
-	xlrec.subxid_overflow = (CurrRunningXacts->subxid_status != SUBXIDS_IN_ARRAY);
 	xlrec.nextXid = CurrRunningXacts->nextXid;
 	xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid;
 	xlrec.latestCompletedXid = CurrRunningXacts->latestCompletedXid;
@@ -1347,31 +1344,16 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	/* Header */
 	XLogBeginInsert();
 	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
-	XLogRegisterData((char *) (&xlrec), MinSizeOfXactRunningXacts);
-
-	/* array of TransactionIds */
-	if (xlrec.xcnt > 0)
-		XLogRegisterData((char *) CurrRunningXacts->xids,
-						 (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
+	XLogRegisterData((char *) (&xlrec), SizeOfXactRunningXacts);
 
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
-	if (xlrec.subxid_overflow)
-		elog(DEBUG2,
-			 "snapshot of %d running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
-			 CurrRunningXacts->xcnt,
-			 LSN_FORMAT_ARGS(recptr),
-			 CurrRunningXacts->oldestRunningXid,
-			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
-	else
-		elog(DEBUG2,
-			 "snapshot of %d+%d running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
-			 CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
-			 LSN_FORMAT_ARGS(recptr),
-			 CurrRunningXacts->oldestRunningXid,
-			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
+	elog(DEBUG2,
+		 "logging running transaction bounds (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
+		 LSN_FORMAT_ARGS(recptr),
+		 CurrRunningXacts->oldestRunningXid,
+		 CurrRunningXacts->latestCompletedXid,
+		 CurrRunningXacts->nextXid);
 
 	/*
 	 * Ensure running_xacts information is synced to disk not too far in the
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index cce0bc521e7..9d5a298a392 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -60,8 +60,6 @@ extern void StandbyReleaseLockTree(TransactionId xid,
 extern void StandbyReleaseAllLocks(void);
 extern void StandbyReleaseOldLocks(TransactionId oldxid);
 
-#define MinSizeOfXactRunningXacts offsetof(xl_running_xacts, xids)
-
 
 /*
  * Declarations for GetRunningTransactionData(). Similar to Snapshots, but
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index fe12f463a86..d8582094472 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -42,20 +42,24 @@ typedef struct xl_standby_locks
 } xl_standby_locks;
 
 /*
- * When we write running xact data to WAL, we use this structure.
+ * Data included in an XLOG_RUNNING_XACTS record.
+ *
+ * This used to include a list of running XIDs, hence the name, but nowadays
+ * this only contains the min and max bounds of the transactions that were
+ * running when the record was written.  They are needed to initialize logical
+ * decoding.  They are also used in hot standby to prune information about old
+ * running transactions, in case the primary didn't write a COMMIT/ABORT
+ * record for some reason.
  */
 typedef struct xl_running_xacts
 {
-	int			xcnt;			/* # of xact ids in xids[] */
-	int			subxcnt;		/* # of subxact ids in xids[] */
-	bool		subxid_overflow;	/* snapshot overflowed, subxids missing */
 	TransactionId nextXid;		/* xid from TransamVariables->nextXid */
 	TransactionId oldestRunningXid; /* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
-
-	TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
 } xl_running_xacts;
 
+#define SizeOfXactRunningXacts sizeof(xl_running_xacts)
+
 /*
  * Invalidations for standby, currently only when transactions without an
  * assigned xid commit.
-- 
2.39.5

0005-Add-a-cache-to-Snapshot-to-avoid-repeated-CSN-lookup.patch (text/x-patch; charset=UTF-8)

From 055cb55f1f0729bbe20aef98a64a6e7e9a9cd839 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 29 Oct 2024 15:54:13 +0200
Subject: [PATCH 5/5] Add a cache to Snapshot to avoid repeated CSN lookups

Cache the status of all XIDs that have been looked up in the CSN log
in the SnapshotData. This avoids having to go to the CSN log in the
common case that the same XIDs are looked up over and over again.
---
 src/backend/utils/time/snapmgr.c | 84 +++++++++++++++++++++++++++++---
 src/include/utils/snapshot.h     |  8 +++
 2 files changed, 86 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index da82def8461..04a55e25f64 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -66,6 +66,35 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Define a radix tree implementation to cache CSN lookups in a snapshot.
+ *
+ * We need only one bit of information for each XID stored in the cache: was
+ * the XID running or not.  However, the radix tree implementation uses 8
+ * bytes for each entry (on 64-bit machines) even if the value type is smaller
+ * than that.  To reduce memory usage, we use uint64 as the value type, and
+ * store multiple XIDs in each value.
+ *
+ * The 64-bit value word holds two bits for each XID: whether the XID is
+ * present in the cache or not, and if it's present, whether it's considered
+ * as in-progress by the snapshot or not.  So each entry in the radix tree
+ * holds the status for 32 XIDs.
+ */
+#define RT_PREFIX inprogress_cache
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define INPROGRESS_CACHE_BITS 2
+#define INPROGRESS_CACHE_XIDS_PER_WORD 32
+
+#define INPROGRESS_CACHE_XID_IS_CACHED(word, slotno) \
+	((((word) & (UINT64CONST(1) << (slotno)))) != 0)
+
+#define INPROGRESS_CACHE_XID_IS_IN_PROGRESS(word, slotno) \
+	((((word) & (UINT64CONST(1) << ((slotno) + 1)))) != 0)
 
 /*
  * CurrentSnapshot points to the only snapshot taken in transaction-snapshot
@@ -595,6 +624,12 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->copied = true;
 	newsnap->snapXactCompletionCount = 0;
 
+	/*
+	 * TODO: If we had a separate reference count on the cache, we could share
+	 * it between the copies.
+	 */
+	newsnap->inprogress_cache = NULL;
+
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
 	{
@@ -635,6 +670,8 @@ FreeSnapshot(Snapshot snapshot)
 	Assert(snapshot->active_count == 0);
 	Assert(snapshot->copied);
 
+	if (snapshot->inprogress_cache)
+		inprogress_cache_free(snapshot->inprogress_cache);
 	pfree(snapshot);
 }
 
@@ -1807,6 +1844,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
 	snapshot->snapshotCsn = serialized_snapshot.snapshotCsn;
+	snapshot->inprogress_cache = NULL;
 	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
@@ -1907,8 +1945,7 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 			 * If xid was indeed a subxact, we might now have an xid < xmin,
 			 * so recheck to avoid an array scan.  No point in rechecking
 			 * xmax.
-			 */
-			if (TransactionIdPrecedes(xid, snapshot->xmin))
+			 */			if (TransactionIdPrecedes(xid, snapshot->xmin))
 				return false;
 		}
 
@@ -1917,12 +1954,47 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 	}
 	else
 	{
-		XLogRecPtr	csn = CSNLogGetCSNByXid(xid);
+		XLogRecPtr	csn;
+		bool		inprogress;
+		uint64	   *cache_entry = NULL;
+		uint64		cache_word = 0;
+		uint64		wordno = xid / INPROGRESS_CACHE_XIDS_PER_WORD;
+		uint64		slotno = (xid % INPROGRESS_CACHE_XIDS_PER_WORD) * INPROGRESS_CACHE_BITS;
+
+		/* Check cache first */
+		if (snapshot->inprogress_cache)
+		{
+			cache_entry = inprogress_cache_find(snapshot->inprogress_cache, wordno);
+			if (cache_entry)
+			{
+				cache_word = *cache_entry;
+				if (INPROGRESS_CACHE_XID_IS_CACHED(cache_word, slotno))
+					return INPROGRESS_CACHE_XID_IS_IN_PROGRESS(cache_word, slotno);
+			}
+		}
+
+		/* Not found in cache, look up the CSN */
+		csn = CSNLogGetCSNByXid(xid);
+		inprogress = (csn == InvalidXLogRecPtr || csn > snapshot->snapshotCsn);
+
+		/* Update the cache word, and store it back to the radix tree */
+		cache_word |= (UINT64CONST(1) << slotno); /* cached */
+		if (inprogress)
+			cache_word |= (UINT64CONST(1) << (slotno + 1)); /* in-progress */
 
-		if (csn != InvalidXLogRecPtr && csn <= snapshot->snapshotCsn)
-			return false;
+		if (!snapshot->inprogress_cache)
+		{
+			MemoryContext cache_cxt = AllocSetContextCreate(TopTransactionContext,
+															"snapshot inprogress cache context",
+															ALLOCSET_DEFAULT_SIZES);
+			snapshot->inprogress_cache = inprogress_cache_create(cache_cxt);
+		}
+		if (cache_entry)
+			*cache_entry = cache_word;
 		else
-			return true;
+			inprogress_cache_set(snapshot->inprogress_cache, wordno, &cache_word);
+
+		return inprogress;
 	}
 
 	return false;
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 1fda5b06f67..9fb07f82e4f 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -122,6 +122,8 @@ typedef struct SnapshotData *Snapshot;
 
 #define InvalidSnapshot		((Snapshot) NULL)
 
+struct inprogress_cache_radix_tree;		/* private to snapmgr.c */
+
 /*
  * Struct representing all kind of possible snapshots.
  *
@@ -188,6 +190,12 @@ typedef struct SnapshotData
 	 */
 	XLogRecPtr	snapshotCsn;
 
+	/*
+	 * Cache of XIDs known to be running or not according to the
+	 * snapshot. Used in snapshots taken during recovery.
+	 */
+	struct inprogress_cache_radix_tree *inprogress_cache;
+
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 	bool		copied;			/* false if it's a static snapshot */
 
-- 
2.39.5

Heikki Linnakangas

hlinnaka@iki.fi

about 1 year ago

In reply to: Heikki Linnakangas (#8)

Re: CSN snapshots in hot standby

On 29/10/2024 18:33, Heikki Linnakangas wrote:

I added two tests to the test suite:
                               master     patched
insert-all-different-xids:     0.00027    0.00019 s / iteration
insert-all-different-subxids:  0.00023    0.00020 s / iteration

insert-all-different-xids: Open 1000 connections, insert one row in
each, and leave the transactions open. In the replica, select all the rows.

insert-all-different-subxids: The same, but with 1 transaction with 1000
subxids.

The point of these new tests is to test the scenario where the cache
doesn't help and just adds overhead, because each XID is looked up only
once. Seems to be fine. Surprisingly good actually; I'll do some more
profiling on that to understand why it's even faster than 'master'.

Ok, I did some profiling and it makes sense:

In the insert-all-different-xids test on 'master', we spend about 60% of
CPU time in XidInMVCCSnapshot(), doing pg_lfind32() over the subxip
array. We should probably sort the array and use a binary search if it's
large or something...
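
To make the "sort the array and use a binary search" idea concrete, here is a minimal sketch. It is not code from PostgreSQL or from this patch set: the TransactionId typedef is a stand-in, and sort_xids() and xid_in_sorted_array() are hypothetical names. Plain numeric ordering is enough here, because the search only tests for membership, so XID wraparound does not matter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

/* qsort comparator: plain numeric order, wraparound is irrelevant here */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	return (xa > xb) - (xa < xb);
}

/* Hypothetical helper: sort the array once, e.g. when the snapshot is built */
static void
sort_xids(TransactionId *xids, size_t n)
{
	assert(xids != NULL || n == 0);
	if (n > 1)
		qsort(xids, n, sizeof(TransactionId), xid_cmp);
}

/* Hypothetical helper: O(log n) membership test on the sorted array */
static bool
xid_in_sorted_array(TransactionId xid, const TransactionId *xids, size_t n)
{
	size_t		lo = 0;
	size_t		hi = n;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (xids[mid] == xid)
			return true;
		if (xids[mid] < xid)
			lo = mid + 1;
		else
			hi = mid;
	}
	return false;
}
```

Whether this beats the linear search in practice would depend on the array size, so it would probably only make sense above some threshold.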

With these patches, instead of the pg_lfind32() over subxip array, we
perform one CSN SLRU lookup instead, and the page is cached. There's
locking overhead etc. with that, but it's still cheaper than the
pg_lfind32().

In the insert-all-different-subxids test on 'master', the subxip array
is overflowed, so we call SubTransGetTopmostTransaction() on each XID.
That performs two pg_subtrans lookups for each XID, first for the
subxid, then for the parent. With these patches, we perform just one
SLRU lookup, in pg_csnlog, which is faster.

Now the downside of this new cache: Since it has no size limit, if you
keep looking up different XIDs, it will keep growing until it holds all
the XIDs between the snapshot's xmin and xmax. That can take a lot of
memory in the worst case. Radix tree is pretty memory efficient, but
holding, say 1 billion XIDs would probably take something like 500 MB of
RAM (the radix tree stores 64-bit words with 2 bits per XID, plus the
radix tree nodes). That's per snapshot, so if you have a lot of
connections, maybe even with multiple snapshots each, that can add up.
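
As a rough illustration of where that estimate comes from, the sketch below packs two bits per XID (a "cached" bit and an "in-progress" bit) into 64-bit words and computes the raw payload size. The function names and the exact bit layout are assumptions made for this example; the real macros are in the patch. The radix tree nodes come on top of the raw figure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_XID	2	/* assumed layout: bit 0 = cached, bit 1 = in-progress */
#define XIDS_PER_WORD	(64 / BITS_PER_XID)

/* Which 64-bit word holds this key? */
static uint64_t
word_number(uint32_t key)
{
	return key / XIDS_PER_WORD;
}

/* Bit offset of this key's two bits within its word */
static unsigned
bit_offset(uint32_t key)
{
	return (key % XIDS_PER_WORD) * BITS_PER_XID;
}

/* Return 'word' with the answer for 'key' recorded in it */
static uint64_t
word_remember(uint64_t word, uint32_t key, bool inprogress)
{
	unsigned	shift = bit_offset(key);

	assert(shift + 1 < 64);
	word |= UINT64_C(1) << shift;	/* 64-bit constant, shift can be up to 63 */
	if (inprogress)
		word |= UINT64_C(1) << (shift + 1);
	return word;
}

static bool
word_is_cached(uint64_t word, uint32_t key)
{
	return (word >> bit_offset(key)) & 1;
}

static bool
word_is_inprogress(uint64_t word, uint32_t key)
{
	return (word >> (bit_offset(key) + 1)) & 1;
}

/* Raw payload bytes needed for 'nxids' XIDs, excluding radix tree nodes */
static uint64_t
raw_cache_bytes(uint64_t nxids)
{
	uint64_t	nwords = (nxids + XIDS_PER_WORD - 1) / XIDS_PER_WORD;

	return nwords * sizeof(uint64_t);
}
```

With this layout, raw_cache_bytes() for one billion XIDs comes to 250 MB of words alone; the rest of the ~500 MB guess is radix tree nodes and allocator overhead.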

I'm inclined to accept that memory usage. If we wanted to limit the size
of the cache, we would need to choose a policy on how to truncate it
(delete random nodes?), what the limit should be etc. But I think it'd
be rare to hit those cases in practice. If you have a one billion XID
old transaction running in the primary, you probably have bigger
problems already.

I'd love to hear some thoughts on this caching behavior. Is it
acceptable to let the cache grow, potentially to very large sizes in the
worst cases? Or do we need to make it more complicated and implement
some eviction policy?

--
Heikki Linnakangas
Neon (https://neon.tech)

#10

John Naylor

johncnaylorls@gmail.com

about 1 year ago

In reply to: Heikki Linnakangas (#8)

Re: CSN snapshots in hot standby

On Tue, Oct 29, 2024 at 11:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

                        master    patched
few-xacts:              0.0041    0.0041 s / iteration
many-xacts:             0.0042    0.0042 s / iteration
many-xacts-wide-apart:  0.0043    0.0045 s / iteration

Hi Heikki,

I have some thoughts about behavior of the cache that might not be
apparent in this test:

The tree is only as tall as need be to store the highest non-zero
byte. On a newly initialized cluster, the current txid is small. The
first two test cases here will result in a tree with height of 2. The
last one will have a height of 3, and its runtime looks a bit higher,
although that could be just noise or touching more cache lines. It
might be worth it to try a test run while forcing the upper byte of
the keys to be non-zero (something like "key | (1<<30)"), so that the
tree always has a height of 4. That would match real-world conditions
more closely. If need be, there are a couple things we can do to
optimize node dispatch and touch fewer cache lines.
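
A simplified model of the height argument, for illustration only: assume one tree level per 8-bit chunk of the key and that the XID itself is the key. The real radix tree may count levels differently, and the patch may derive its key from the XID rather than use it directly. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 8-bit chunks needed to represent 'key' (at least 1) */
static int
key_significant_bytes(uint64_t key)
{
	int			nbytes = 1;

	while (key > 0xFF)
	{
		key >>= 8;
		nbytes++;
	}
	assert(nbytes >= 1 && nbytes <= 8);
	return nbytes;
}

/* The suggested trick: set a bit in the upper byte of a 32-bit key */
static uint64_t
force_upper_byte(uint32_t key)
{
	return (uint64_t) key | (UINT64_C(1) << 30);
}
```

Under these assumptions an XID around 1000 needs 2 chunks and one around 100000 needs 3, which lines up with the heights mentioned above, and any key passed through force_upper_byte() needs 4.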

I added two tests to the test suite:
                               master     patched
insert-all-different-xids:     0.00027    0.00019 s / iteration
insert-all-different-subxids:  0.00023    0.00020 s / iteration

The point of these new tests is to test the scenario where the cache
doesn't help and just adds overhead, because each XID is looked up only
once. Seems to be fine. Surprisingly good actually; I'll do some more
profiling on that to understand why it's even faster than 'master'.

These tests use a sequential scan. For things like primary key
lookups, I wonder if the overhead of creating and destroying the
tree's memory contexts for the (not used again) cache would be
noticeable. If so, it wouldn't be too difficult to teach radix tree to
create the larger contexts lazily.

Now the downside of this new cache: Since it has no size limit, if you
keep looking up different XIDs, it will keep growing until it holds all
the XIDs between the snapshot's xmin and xmax. That can take a lot of
memory in the worst case. Radix tree is pretty memory efficient, but
holding, say 1 billion XIDs would probably take something like 500 MB of
RAM (the radix tree stores 64-bit words with 2 bits per XID, plus the
radix tree nodes). That's per snapshot, so if you have a lot of
connections, maybe even with multiple snapshots each, that can add up.

I'm inclined to accept that memory usage. If we wanted to limit the size
of the cache, we would need to choose a policy on how to truncate it
(delete random nodes?), what the limit should be etc. But I think it'd
be rare to hit those cases in practice. If you have a one billion XID
old transaction running in the primary, you probably have bigger
problems already.

I don't have a good sense of whether it needs a limit or not, but if
we decide to add one as a precaution, maybe it's enough to just blow
the cache away when reaching some limit? Being smarter than that would
need some work.
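
The "blow the cache away at some limit" policy could be as small as a counter next to the cache. The sketch below is only an assumption about how that might look: CacheBudget and budget_add_word() are hypothetical names, and the actual discarding of the tree (for example by resetting its memory context) is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct CacheBudget
{
	size_t		nwords;		/* words currently held by the cache */
	size_t		max_words;	/* limit before the cache is discarded */
	unsigned	nresets;	/* how many times we started over */
} CacheBudget;

/*
 * Account for one new cache word.  Returns true if the caller must first
 * throw away the existing cache contents and start with an empty cache.
 */
static bool
budget_add_word(CacheBudget *b)
{
	bool		must_reset = false;

	assert(b->max_words > 0);
	if (b->nwords >= b->max_words)
	{
		b->nwords = 0;
		b->nresets++;
		must_reset = true;
	}
	b->nwords++;
	return must_reset;
}
```

The caller would invoke budget_add_word() only when inserting a word that is not yet in the tree; updating an existing word costs no extra memory.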

--
John Naylor
Amazon Web Services

#11

Heikki Linnakangas

hlinnaka@iki.fi

about 1 year ago

In reply to: John Naylor (#10)

5 attachment(s)

Re: CSN snapshots in hot standby

On 20/11/2024 15:33, John Naylor wrote:

The tree is only as tall as need be to store the highest non-zero
byte. On a newly initialized cluster, the current txid is small. The
first two test cases here will result in a tree with height of 2. The
last one will have a height of 3, and its runtime looks a bit higher,
although that could be just noise or touching more cache lines. It
might be worth it to try a test run while forcing the upper byte of
the keys to be non-zero (something like "key | (1<<30)"), so that the
tree always has a height of 4. That would match real-world conditions
more closely. If need be, there are a couple things we can do to
optimize node dispatch and touch fewer cache lines.

Good point. In some quick testing with the 'few-xacts' test, the
difference between the worst case and best case is about 10%. That can
be avoided very easily: instead of using the xid as the key, use (xmax -
xid). That way, the highest key used is the distance between the
snapshot's xmin and xmax rather than the absolute xid values.
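
A minimal sketch of that key computation, assuming the cache is only consulted for XIDs in the range xmin <= xid < xmax (in the modulo-2^32 sense). The function name is hypothetical, and the patch may further reduce the key, for example to a word number.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

/*
 * Hypothetical helper: map an XID to a small cache key.  Unsigned
 * subtraction is well-defined across XID wraparound, so the key is simply
 * the distance from xmax, and is never larger than xmax - xmin.
 */
static uint32_t
inprogress_cache_key(TransactionId xmin, TransactionId xmax, TransactionId xid)
{
	uint32_t	key = xmax - xid;

	assert(key >= 1 && key <= (uint32_t) (xmax - xmin));
	return key;
}
```

For example, with xmax = 3000001000 and xid = 3000000990 the key is 10, even though both XIDs have a non-zero upper byte.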

I added two tests to the test suite:
                               master     patched
insert-all-different-xids:     0.00027    0.00019 s / iteration
insert-all-different-subxids:  0.00023    0.00020 s / iteration

The point of these new tests is to test the scenario where the cache
doesn't help and just adds overhead, because each XID is looked up only
once. Seems to be fine. Surprisingly good actually; I'll do some more
profiling on that to understand why it's even faster than 'master'.

These tests use a sequential scan. For things like primary key
lookups, I wonder if the overhead of creating and destroying the
tree's memory contexts for the (not used again) cache would be
noticeable. If so, it wouldn't be too difficult to teach radix tree to
create the larger contexts lazily.

I played with that a little, but it doesn't make much difference. It is
measurable when you look at the profile with 'perf': XidInMVCCSnapshot()
takes about 4% of CPU time in the worst case, where you perform a very
simple SeqScan over a table with just one row, and you perform only one
visibility check in each scan. I could squeeze that down to around 3% by
allocating the contexts lazily. But that's not significant in the grand
scheme of things. Might still be worth doing, but it's not a blocker for
this work, it can be discussed separately.

I did find one weird thing that makes a big difference: I originally
used AllocSetContextCreate(..., ALLOCSET_DEFAULT_SIZES) for the radix
tree's memory context. With that, XidInMVCCSnapshot() takes about 19% of
the CPU time in that test. When I changed that to ALLOCSET_SMALL_SIZES,
it falls down to the 4% figure. And weirdly enough, in both cases the time
seems to be spent in the malloc() call from SlabContextCreate(), not
AllocSetContextCreate(). I think doing this particular mix of large and
small allocations with malloc() somehow poisons its free list or
something. So this is probably heavily dependent on the malloc()
implementation. In any case, ALLOCSET_SMALL_SIZES is clearly a better
choice here, even without that effect.

One way to eliminate all that would be to start with a tiny cache with
e.g. 4 elements with linear probing, and switch to the radix tree only
when that fills up. But it doesn't seem worth the extra code right now.
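
For reference, such a tiny cache could look roughly like the sketch below. It is not part of the patch set; TinyCache and its functions are hypothetical, and the migration to the radix tree when tiny_insert() reports that the cache is full is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TINY_CACHE_SIZE 4

typedef struct TinyCache
{
	uint32_t	keys[TINY_CACHE_SIZE];
	bool		inprogress[TINY_CACHE_SIZE];
	bool		used[TINY_CACHE_SIZE];
	int			nused;
} TinyCache;

/*
 * Linear probing: return the slot holding 'key', or the first free slot on
 * its probe path, or -1 if the cache is full and the key is absent.
 */
static int
tiny_find_slot(const TinyCache *c, uint32_t key)
{
	for (uint32_t i = 0; i < TINY_CACHE_SIZE; i++)
	{
		int			slot = (int) ((key + i) % TINY_CACHE_SIZE);

		if (!c->used[slot] || c->keys[slot] == key)
			return slot;
	}
	return -1;
}

/* Returns true and sets *inprogress if 'key' is cached */
static bool
tiny_lookup(const TinyCache *c, uint32_t key, bool *inprogress)
{
	int			slot = tiny_find_slot(c, key);

	if (slot < 0 || !c->used[slot])
		return false;
	*inprogress = c->inprogress[slot];
	return true;
}

/* Returns false if the cache is full; the caller would then switch over */
static bool
tiny_insert(TinyCache *c, uint32_t key, bool inprogress)
{
	int			slot = tiny_find_slot(c, key);

	if (slot < 0)
		return false;
	if (!c->used[slot])
	{
		c->used[slot] = true;
		c->keys[slot] = key;
		c->nused++;
	}
	c->inprogress[slot] = inprogress;
	assert(c->nused <= TINY_CACHE_SIZE);
	return true;
}
```

Entries are never deleted individually, which keeps the probing logic trivial: a free slot on the probe path always means the key is absent.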

Here's a new version with those small changes:
- use xmax - xid as the cache key
- use ALLOCSET_SMALL_SIZES for the radix tree's memory context

Thanks for looking into this!

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v5-0001-XXX-add-perf-test.patch (text/x-patch; charset=UTF-8)

From afd633f498ed986da6cadae8a0c9dd6575399ef7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 21 Oct 2024 14:07:38 +0300
Subject: [PATCH v5 1/5] XXX: add perf test

This is not intended to be merged. But it's been useful to have this
in the tree for some quick perf testing during development.

To run it, I've used:

(cd build-release && ninja &&  rm -rf tmp_install && meson test --suite setup --suite test_misc; grep TEST testrun/test_misc/000_csn_perf/log/regress_log_000_csn_perf )

It runs the other test_misc tests concurrently, but they finish a lot
faster so they don't affect the results much.
---
 src/test/modules/test_misc/meson.build       |   1 +
 src/test/modules/test_misc/t/000_csn_perf.pl | 337 +++++++++++++++++++
 2 files changed, 338 insertions(+)
 create mode 100644 src/test/modules/test_misc/t/000_csn_perf.pl

diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 283ffa751a..e55e80af54 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -9,6 +9,7 @@ tests += {
        'enable_injection_points': get_option('injection_points') ? 'yes' : 'no',
     },
     'tests': [
+      't/000_csn_perf.pl',
       't/001_constraint_validation.pl',
       't/002_tablespace.pl',
       't/003_check_guc.pl',
diff --git a/src/test/modules/test_misc/t/000_csn_perf.pl b/src/test/modules/test_misc/t/000_csn_perf.pl
new file mode 100644
index 0000000000..3915878a40
--- /dev/null
+++ b/src/test/modules/test_misc/t/000_csn_perf.pl
@@ -0,0 +1,337 @@
+
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Measure the cost of snapshot visibility checks on a hot standby
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(time);
+
+my $duration = 15; # seconds
+my $miniterations = 3;
+
+# Initialize a test cluster
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init();
+# Allow streaming replication and enough connections for the tests below
+$primary->append_conf('postgresql.conf', 'max_wal_senders = 5');
+$primary->append_conf('postgresql.conf', 'wal_level=replica');
+$primary->append_conf('postgresql.conf', 'max_connections = 1005');
+$primary->start;
+$primary->backup('bkp');
+
+my $replica = PostgreSQL::Test::Cluster->new('replica');
+$replica->init_from_backup($primary, 'bkp', has_streaming => 1);
+$replica->append_conf('postgresql.conf', "shared_buffers='1 GB'");
+$replica->start;
+
+sub wait_catchup
+{
+	my ($primary, $replica) = @_;
+	
+	my $primary_lsn =
+	  $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+	my $caughtup_query =
+	  "SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()";
+	$replica->poll_query_until('postgres', $caughtup_query)
+	  or die "Timed out while waiting for standby to catch up";
+}
+
+sub repeat_and_time_sql
+{
+  	my ($name, $node, $sql) = @_;
+
+	my $session =  $node->background_psql('postgres', on_error_die => 1);
+	$session->query_safe("SET max_parallel_workers_per_gather=0");
+
+	my $iterations = 0;
+
+	my $now;
+	my $elapsed;
+    my $begin_time = time();
+	while (1) {
+		$session->query_safe($sql);
+		$now = time();
+		$iterations = $iterations + 1;
+
+		$elapsed = $now - $begin_time;
+		if ($elapsed > $duration && $iterations >= $miniterations) {
+			last;
+		}
+	}
+
+	my $periter = $elapsed / $iterations;
+
+	pass ("TEST $name: $elapsed s, $iterations iterations, $periter s / iteration");
+}
+
+
+$primary->safe_psql('postgres', "CREATE TABLE little (i int);");
+$primary->safe_psql('postgres', "INSERT INTO little VALUES (1);");
+
+sub consume_xids
+{
+	my ($node) = @_;
+
+	my $session = $node->background_psql('postgres', on_error_die => 1);
+	for(my $i = 0; $i < 20; $i++) {
+		$session->query_safe(q{do $$
+  begin
+    for i in 1..50 loop
+      begin
+        DELETE from little;
+        perform 1 / 0;
+      exception
+        when division_by_zero then perform 0 /* do nothing */;
+        when others then raise 'fail: %', sqlerrm;
+      end;
+    end loop;
+  end
+$$;});
+	}
+	$session->quit;
+}
+
+# TEST few-xacts
+#
+# Cycle through 4 different top-level XIDs
+#
+# 1001, 1002, 1003, 1004, 1001, 1002, 1003, 1004, ...
+#
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 4;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("few-xacts", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST many-xacts
+#
+# like few-xacts, but we cycle through 100 different XIDs instead of 4.
+#
+# 1001, 1002, 1003, ... 1100, 1001, 1002, 1003, ... 1100  ....
+#
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 100;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-xacts", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST many-xacts-wide-apart
+#
+# like many-xacts, but the XIDs are more spread out, so that they don't fit in the
+# SLRU caches.
+#
+# 1000, 2000, 3000, 4000, ....
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 100;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+
+		consume_xids($primary);
+
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-xacts-wide-apart", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: few-subxacts
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 4;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("few-subxacts", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+
+# TEST: many-subxacts
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-subxacts", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: many-subxacts-wide-apart
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		consume_xids($primary);
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-subxacts-wide-apart", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: insert-all-different-xids
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+
+	my @primary_sessions = ();
+	my $num_connections = 1000;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("INSERT INTO tbl VALUES ($i)");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("insert-all-different-xids", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: insert-all-different-subxids
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i; INSERT INTO tbl VALUES($i); release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("insert-all-different-subxids", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+done_testing();
-- 
2.39.5

v5-0002-Use-CSN-snapshots-during-Hot-Standby.patch (text/x-patch; charset=UTF-8)

From e58543b92bcbfc0061964341cecf93fa813dc77e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 20:26:40 +0300
Subject: [PATCH v5 2/5] Use CSN snapshots during Hot Standby

Replace the known-assigned-XIDs mechanism with a CSN log. The CSN log
(pg_csn) tracks the commit LSN of each transaction, when replaying the
WAL on a standby. It's only used on the standby, and is initialized
from scratch at server startup like pg_subtrans.

Based on 0001-CSN-base-snapshot.patch from
https://www.postgresql.org/message-id/2020081009525213277261%40highgo.ca.
This patch has a long lineage, various CSN patches have been posted
with parts from Stas Kelvich, Movead Li, Ants Aasma, Heikki
Linnakangas, Alexander Kuzmenkov
---
 contrib/pg_visibility/pg_visibility.c         |    1 +
 src/backend/access/rmgrdesc/xactdesc.c        |   26 -
 src/backend/access/transam/Makefile           |    1 +
 src/backend/access/transam/csn_log.c          |  474 ++++++
 src/backend/access/transam/meson.build        |    1 +
 src/backend/access/transam/transam.c          |    3 +
 src/backend/access/transam/twophase.c         |   34 +-
 src/backend/access/transam/varsup.c           |    1 +
 src/backend/access/transam/xact.c             |  138 +-
 src/backend/access/transam/xlog.c             |  118 +-
 src/backend/access/transam/xlogrecovery.c     |   13 +-
 src/backend/access/transam/xlogutils.c        |    2 +-
 src/backend/postmaster/startup.c              |    2 +-
 src/backend/replication/logical/decode.c      |    8 -
 src/backend/replication/logical/snapbuild.c   |    2 +-
 src/backend/storage/ipc/ipci.c                |    3 +
 src/backend/storage/ipc/procarray.c           | 1512 ++---------------
 src/backend/storage/ipc/standby.c             |  102 +-
 src/backend/storage/lmgr/lwlock.c             |    2 +
 .../utils/activity/wait_event_names.txt       |    1 +
 src/backend/utils/probes.d                    |    2 +
 src/backend/utils/time/snapmgr.c              |   37 +-
 src/bin/initdb/initdb.c                       |    3 +-
 src/include/access/csn_log.h                  |   30 +
 src/include/access/transam.h                  |    3 +
 src/include/access/twophase.h                 |    3 +-
 src/include/access/xact.h                     |   12 +-
 src/include/access/xlogutils.h                |   33 +-
 src/include/storage/lwlock.h                  |    2 +
 src/include/storage/procarray.h               |   13 +-
 src/include/utils/snapshot.h                  |    7 +
 31 files changed, 821 insertions(+), 1768 deletions(-)
 create mode 100644 src/backend/access/transam/csn_log.c
 create mode 100644 src/include/access/csn_log.h

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 5d0deaba61..7905a91412 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -581,6 +581,7 @@ collect_visibility_data(Oid relid, bool include_pd)
  *    now perform minimal checking on a standby by always using nextXid, this
  *    approach is better than nothing and will at least catch extremely broken
  *    cases where a xid is in the future.
+ *    XXX KnownAssignedXids is gone.
  * 3. Ignore walsender xmin, because it could go backward if some replication
  *    connections don't use replication slots.
  *
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 889cb955c1..128486e751 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -424,17 +424,6 @@ xact_desc_prepare(StringInfo buf, uint8 info, xl_xact_prepare *xlrec, RepOriginI
 						 timestamptz_to_str(parsed.origin_timestamp));
 }
 
-static void
-xact_desc_assignment(StringInfo buf, xl_xact_assignment *xlrec)
-{
-	int			i;
-
-	appendStringInfoString(buf, "subxacts:");
-
-	for (i = 0; i < xlrec->nsubxacts; i++)
-		appendStringInfo(buf, " %u", xlrec->xsub[i]);
-}
-
 void
 xact_desc(StringInfo buf, XLogReaderState *record)
 {
@@ -462,18 +451,6 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		xact_desc_prepare(buf, XLogRecGetInfo(record), xlrec,
 						  XLogRecGetOrigin(record));
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) rec;
-
-		/*
-		 * Note that we ignore the WAL record's xid, since we're more
-		 * interested in the top-level xid that issued the record and which
-		 * xids are being reported here.
-		 */
-		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
-		xact_desc_assignment(buf, xlrec);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		xl_xact_invals *xlrec = (xl_xact_invals *) rec;
@@ -505,9 +482,6 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ABORT_PREPARED:
 			id = "ABORT_PREPARED";
 			break;
-		case XLOG_XACT_ASSIGNMENT:
-			id = "ASSIGNMENT";
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			id = "INVALIDATION";
 			break;
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db..2520d77c7c 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	clog.o \
 	commit_ts.o \
+	csn_log.o \
 	generic_xlog.o \
 	multixact.o \
 	parallel.o \
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
new file mode 100644
index 0000000000..1188a78c4a
--- /dev/null
+++ b/src/backend/access/transam/csn_log.c
@@ -0,0 +1,474 @@
+/*-----------------------------------------------------------------------------
+ *
+ * csn_log.c
+ *		Track commit record LSNs of finished transactions
+ *
+ * This module provides an SLRU to store the LSN of the commit record of each
+ * transaction. CSN stands for Commit Sequence Number, and in principle we
+ * could use a separate counter that is incremented at every commit. For
+ * simplicity, though, we use the commit record's LSN as the sequence number.
+ *
+ * Like pg_subtrans, this mapping needs to be kept only for XIDs greater than
+ * oldestXmin, and doesn't need to be preserved over crashes.  Also, this is
+ * only needed in hot standby mode, and immediately after exiting hot standby
+ * mode, until all old snapshots taken during standby mode are gone.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_log.c
+ *
+ *-----------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/clog.h"
+#include "access/csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+/*
+ * Defines for CSNLog page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE, and CSNLog segment numbering at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCSNLog (see CSNLogPagePrecedes).
+ */
+
+/* We store the commit CSN for each xid */
+#define CSN_LOG_XACTS_PER_PAGE (BLCKSZ / sizeof(XLogRecPtr))
+
+#define TransactionIdToPage(xid)	((xid) / (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+
+#define PgIndexToTransactionId(pageno, idx) (CSN_LOG_XACTS_PER_PAGE * (pageno) + idx)
+
+
+
+/*
+ * Link to shared-memory data structures for CSNLog control
+ */
+static SlruCtlData CSNLogCtlData;
+#define CsnlogCtl (&CSNLogCtlData)
+
+static int	ZeroCSNLogPage(int pageno);
+static bool CSNLogPagePrecedes(int64 page1, int64 page2);
+static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+								TransactionId *subxids,
+								XLogRecPtr csn, int pageno);
+static void CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn,
+							   int slotno);
+
+
+/*
+ * Record commit LSN of a transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transaction ID for a top level commit.
+ *
+ * subxids is an array of xids of length nsubxids, in logical XID order,
+ * representing subtransactions in the tree of XIDs. In various cases nsubxids
+ * may be zero.
+ *
+ * commitLsn is the LSN of the commit record.  This is currently never called
+ * for aborted transactions.
+ */
+void
+CSNLogSetCSN(TransactionId xid, int nsubxids, TransactionId *subxids,
+			 XLogRecPtr commitLsn)
+{
+	int			pageno;
+	int			i = 0;
+	int			offset = 0;
+
+	Assert(TransactionIdIsValid(xid));
+
+	pageno = TransactionIdToPage(xid);	/* get page of parent */
+	for (;;)
+	{
+		int			num_on_page = 0;
+
+		while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+		{
+			num_on_page++;
+			i++;
+		}
+
+		CSNLogSetPageStatus(xid,
+							num_on_page, subxids + offset,
+							commitLsn, pageno);
+		if (i >= nsubxids)
+			break;
+
+		offset = i;
+		pageno = TransactionIdToPage(subxids[offset]);
+		xid = InvalidTransactionId;
+	}
+}
+
+/*
+ * Record the final state of transaction entries in the CSN log for all
+ * entries on a single page.  Atomic only on this page.
+ *
+ * Otherwise API is same as CSNLogSetCSN()
+ */
+static void
+CSNLogSetPageStatus(TransactionId xid, int nsubxids, TransactionId *subxids,
+					XLogRecPtr commitLsn, int pageno)
+{
+	int			slotno;
+	int			i;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CsnlogCtl, pageno, true, xid);
+
+	/* Subtransactions first, if needed ... */
+	for (i = 0; i < nsubxids; i++)
+	{
+		Assert(CsnlogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+		CSNLogSetCSNInSlot(subxids[i], commitLsn, slotno);
+	}
+
+	/* ... then the main transaction */
+	if (TransactionIdIsValid(xid))
+		CSNLogSetCSNInSlot(xid, commitLsn, slotno);
+
+	CsnlogCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Set the CSN of a single transaction in the page held in slot 'slotno'.
+ */
+static void
+CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn, int slotno)
+{
+	int			entryno = TransactionIdToPgIndex(xid);
+	XLogRecPtr *ptr;
+
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+
+	*ptr = csn;
+}
+
+/*
+ * Interrogate the state of a transaction in the log.
+ *
+ * NB: this is a low-level routine and is NOT the preferred entry point
+ * for most uses; TransactionIdGetXidCSN() in csn_snapshot.c is the
+ * intended caller.
+ */
+XLogRecPtr
+CSNLogGetCSNByXid(TransactionId xid)
+{
+	int			pageno = TransactionIdToPage(xid);
+	int			entryno = TransactionIdToPgIndex(xid);
+	int			slotno;
+	XLogRecPtr *ptr;
+	XLogRecPtr	xid_csn;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Can't ask about stuff that might not be around anymore */
+	Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+
+	slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+	xid_csn = *ptr;
+
+	LWLockRelease(SimpleLruGetBankLock(CsnlogCtl, pageno));
+
+	return xid_csn;
+}
+
+/*
+ * Number of shared CSNLog buffers.
+ */
+static Size
+CSNLogShmemBuffers(void)
+{
+	return Min(32, Max(16, NBuffers / 512));
+}
+
+/*
+ * Reserve shared memory for CsnlogCtl.
+ */
+Size
+CSNLogShmemSize(void)
+{
+	/* FIXME: skip if not InHotStandby? */
+	return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
+}
+
+/*
+ * Initialization of shared memory for CSNLog.
+ */
+void
+CSNLogShmemInit(void)
+{
+	CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
+	SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
+				  "pg_csn", LWTRANCHE_CSN_LOG_BUFFER,
+				  LWTRANCHE_CSN_LOG_SLRU, SYNC_HANDLER_NONE, false);
+	/* SlruPagePrecedesUnitTests(CsnlogCtl, CSN_LOG_XACTS_PER_PAGE); */
+}
+
+/*
+ * This func must be called ONCE on system install.  It creates the initial
+ * CSNLog segment.  The pg_csn directory is assumed to have been
+ * created by initdb, and CSNLogShmemInit must have been called already.
+ *
+ * Note: it's not really necessary to create the initial segment now,
+ * since slru.c would create it on first write anyway.  But we may as well
+ * do it to be sure the directory is set up correctly.
+ */
+void
+BootStrapCSNLog(void)
+{
+	int			slotno;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, 0);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Create and zero the first page of the CSN log */
+	slotno = ZeroCSNLogPage(0);
+
+	/* Make sure it's written out */
+	SimpleLruWritePage(CsnlogCtl, slotno);
+	Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Initialize (or reinitialize) a page of CSNLog to zeroes.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * The SLRU bank lock for the page must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCSNLogPage(int pageno)
+{
+	return SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * Initialize a page of CSNLog based on pg_xact.
+ *
+ * All completed (committed or aborted) transactions are stamped with 'csn'
+ */
+static void
+InitCSNLogPage(int pageno, TransactionId *xid, TransactionId nextXid, XLogRecPtr csn)
+{
+	XLogRecPtr	dummy;
+	int			slotno;
+
+	slotno = ZeroCSNLogPage(pageno);
+
+	while (*xid < nextXid && TransactionIdToPage(*xid) == pageno)
+	{
+		XidStatus	status = TransactionIdGetStatus(*xid, &dummy);
+
+		if (status == TRANSACTION_STATUS_COMMITTED ||
+			status == TRANSACTION_STATUS_ABORTED)
+			CSNLogSetCSNInSlot(*xid, csn, slotno);
+
+		TransactionIdAdvance(*xid);
+	}
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * after StartupXLOG has initialized ShmemVariableCache->nextXid, and after
+ * initializing the CLOG.
+ *
+ * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
+ * if there are none.
+ *
+ * All transactions that have already completed are marked with 'csn'. ('csn'
+ * is supposed to be older than anything we'll ever need to compare with.)
+ */
+void
+StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn)
+{
+	TransactionId xid;
+	FullTransactionId nextXid;
+	int			startPage;
+	int			endPage;
+	LWLock	   *prevlock = NULL;
+	LWLock	   *lock;
+
+	/*
+	 * Since we don't expect pg_csn to be valid across crashes, we reinitialize
+	 * the currently-active page(s) from pg_xact during startup. Whenever we
+	 * advance into a new page, ExtendCSNLog will zero the new page
+	 * without regard to whatever was previously on disk.
+	 */
+	startPage = TransactionIdToPage(oldestActiveXID);
+	nextXid = TransamVariables->nextXid;
+	endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid));
+
+	Assert(TransactionIdIsValid(oldestActiveXID));
+	Assert(FullTransactionIdIsValid(nextXid));
+
+	xid = oldestActiveXID;
+	for (;;)
+	{
+		lock = SimpleLruGetBankLock(CsnlogCtl, startPage);
+		if (prevlock != lock)
+		{
+			if (prevlock)
+				LWLockRelease(prevlock);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			prevlock = lock;
+		}
+
+		InitCSNLogPage(startPage, &xid, XidFromFullTransactionId(nextXid), csn);
+		if (startPage == endPage)
+			break;
+
+		startPage++;
+		/* must account for wraparound */
+		if (startPage > TransactionIdToPage(MaxTransactionId))
+			startPage = 0;
+	}
+
+	LWLockRelease(lock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely as a debugging aid.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(false);
+	SimpleLruWriteAll(CsnlogCtl, false);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely to improve the odds that writing of dirty pages is done by
+	 * the checkpoint process and not by backends.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(true);
+	SimpleLruWriteAll(CsnlogCtl, true);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(true);
+}
+
+/*
+ * Make sure that CSNLog has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CSN log or xlog page to make room
+ * in shared memory.
+ */
+void
+ExtendCSNLog(TransactionId newestXact)
+{
+	int64		pageno;
+	LWLock	   *lock;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToPgIndex(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToPage(newestXact);
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Zero the page */
+	ZeroCSNLogPage(pageno);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Remove all CSNLog segments before the one holding the passed
+ * transaction ID.
+ *
+ * This is normally called during checkpoint, with oldestXact being the
+ * oldest TransactionXmin of any running transaction.
+ */
+void
+TruncateCSNLog(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate. We step
+	 * back one transaction to avoid passing a cutoff page that hasn't been
+	 * created yet in the rare case that oldestXact would be the first item on
+	 * a page and oldestXact == next XID.  In that case, if we didn't subtract
+	 * one, we'd trigger SimpleLruTruncate's wraparound detection.
+	 */
+	TransactionIdRetreat(oldestXact);
+	cutoffPage = TransactionIdToPage(oldestXact);
+
+	SimpleLruTruncate(CsnlogCtl, cutoffPage);
+}
+
+/*
+ * Decide which of two CSNLog page numbers is "older" for truncation
+ * purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic.  However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CSNLogPagePrecedes(int64 page1, int64 page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * CSN_LOG_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId;
+	xid2 = ((TransactionId) page2) * CSN_LOG_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId;
+
+	return TransactionIdPrecedes(xid1, xid2);
+}
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 8a3522557c..cf41df2971 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -2,6 +2,7 @@
 
 backend_sources += files(
   'clog.c',
+  'csn_log.c',
   'commit_ts.c',
   'generic_xlog.c',
   'multixact.c',
diff --git a/src/backend/access/transam/transam.c b/src/backend/access/transam/transam.c
index 75b5325df8..93c4d495e4 100644
--- a/src/backend/access/transam/transam.c
+++ b/src/backend/access/transam/transam.c
@@ -377,6 +377,9 @@ TransactionIdLatest(TransactionId mainxid,
  * Also, because we group transactions on the same clog page to conserve
  * storage, we might return the LSN of a later transaction that falls into
  * the same group.
+ *
+ * XXX: Now that we have the CSN-log, should we use that during recovery? Or
+ * rename this function to reduce confusion?
  */
 XLogRecPtr
 TransactionIdGetCommitLSN(TransactionId xid)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 49be1df91c..8729ce2054 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1959,20 +1960,13 @@ restoreTwoPhaseData(void)
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return TransamVariables->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
- *
- * If xids_p and nxids_p are not NULL, pointer to a palloc'd array of all
- * top-level xids is stored in *xids_p. The number of entries in the array
- * is returned in *nxids_p.
  */
 TransactionId
-PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
+PrescanPreparedTransactions(void)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
 	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId result = origNextXid;
-	TransactionId *xids = NULL;
-	int			nxids = 0;
-	int			allocsize = 0;
 	int			i;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
@@ -2000,34 +1994,10 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		if (TransactionIdPrecedes(xid, result))
 			result = xid;
 
-		if (xids_p)
-		{
-			if (nxids == allocsize)
-			{
-				if (nxids == 0)
-				{
-					allocsize = 10;
-					xids = palloc(allocsize * sizeof(TransactionId));
-				}
-				else
-				{
-					allocsize = allocsize * 2;
-					xids = repalloc(xids, allocsize * sizeof(TransactionId));
-				}
-			}
-			xids[nxids++] = xid;
-		}
-
 		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 
-	if (xids_p)
-	{
-		*xids_p = xids;
-		*nxids_p = nxids;
-	}
-
 	return result;
 }
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index cfe8c6cf8d..b074423654 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -15,6 +15,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1eccb78ddc..cab9edc48b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -209,7 +210,6 @@ typedef struct TransactionStateData
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;	/* entry-time xact r/o state */
 	bool		startedInRecovery;	/* did we start in recovery? */
-	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		parallelChildXact;	/* is any parent transaction parallel? */
 	bool		chain;			/* start a new block after this one */
@@ -249,13 +249,6 @@ static TransactionStateData TopTransactionStateData = {
 	.topXidLogged = false,
 };
 
-/*
- * unreportedXids holds XIDs of all subtransactions that have not yet been
- * reported in an XLOG_XACT_ASSIGNMENT record.
- */
-static int	nUnreportedXids;
-static TransactionId unreportedXids[PGPROC_MAX_CACHED_SUBXIDS];
-
 static TransactionState CurrentTransactionState = &TopTransactionStateData;
 
 /*
@@ -531,18 +524,6 @@ GetCurrentFullTransactionIdIfAny(void)
 	return CurrentTransactionState->fullTransactionId;
 }
 
-/*
- *	MarkCurrentTransactionIdLoggedIfAny
- *
- * Remember that the current xid - if it is assigned - now has been wal logged.
- */
-void
-MarkCurrentTransactionIdLoggedIfAny(void)
-{
-	if (FullTransactionIdIsValid(CurrentTransactionState->fullTransactionId))
-		CurrentTransactionState->didLogXid = true;
-}
-
 /*
  * IsSubxactTopXidLogPending
  *
@@ -635,7 +616,6 @@ AssignTransactionId(TransactionState s)
 {
 	bool		isSubXact = (s->parent != NULL);
 	ResourceOwner currentOwner;
-	bool		log_unknown_top = false;
 
 	/* Assert that caller didn't screw up */
 	Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -679,20 +659,6 @@ AssignTransactionId(TransactionState s)
 		pfree(parents);
 	}
 
-	/*
-	 * When wal_level=logical, guarantee that a subtransaction's xid can only
-	 * be seen in the WAL stream if its toplevel xid has been logged before.
-	 * If necessary we log an xact_assignment record with fewer than
-	 * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
-	 * for a transaction even though it appears in a WAL record, we just might
-	 * superfluously log something. That can happen when an xid is included
-	 * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
-	 * xl_standby_locks.
-	 */
-	if (isSubXact && XLogLogicalInfoActive() &&
-		!TopTransactionStateData.didLogXid)
-		log_unknown_top = true;
-
 	/*
 	 * Generate a new FullTransactionId and record its xid in PGPROC and
 	 * pg_subtrans.
@@ -728,59 +694,6 @@ AssignTransactionId(TransactionState s)
 	XactLockTableInsert(XidFromFullTransactionId(s->fullTransactionId));
 
 	CurrentResourceOwner = currentOwner;
-
-	/*
-	 * Every PGPROC_MAX_CACHED_SUBXIDS assigned transaction ids within each
-	 * top-level transaction we issue a WAL record for the assignment. We
-	 * include the top-level xid and all the subxids that have not yet been
-	 * reported using XLOG_XACT_ASSIGNMENT records.
-	 *
-	 * This is required to limit the amount of shared memory required in a hot
-	 * standby server to keep track of in-progress XIDs. See notes for
-	 * RecordKnownAssignedTransactionIds().
-	 *
-	 * We don't keep track of the immediate parent of each subxid, only the
-	 * top-level transaction that each subxact belongs to. This is correct in
-	 * recovery only because aborted subtransactions are separately WAL
-	 * logged.
-	 *
-	 * This is correct even for the case where several levels above us didn't
-	 * have an xid assigned as we recursed up to them beforehand.
-	 */
-	if (isSubXact && XLogStandbyInfoActive())
-	{
-		unreportedXids[nUnreportedXids] = XidFromFullTransactionId(s->fullTransactionId);
-		nUnreportedXids++;
-
-		/*
-		 * ensure this test matches similar one in
-		 * RecoverPreparedTransactions()
-		 */
-		if (nUnreportedXids >= PGPROC_MAX_CACHED_SUBXIDS ||
-			log_unknown_top)
-		{
-			xl_xact_assignment xlrec;
-
-			/*
-			 * xtop is always set by now because we recurse up transaction
-			 * stack to the highest unassigned xid and then come back down
-			 */
-			xlrec.xtop = GetTopTransactionId();
-			Assert(TransactionIdIsValid(xlrec.xtop));
-			xlrec.nsubxacts = nUnreportedXids;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, MinSizeOfXactAssignment);
-			XLogRegisterData((char *) unreportedXids,
-							 nUnreportedXids * sizeof(TransactionId));
-
-			(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT);
-
-			nUnreportedXids = 0;
-			/* mark top, not current xact as having been logged */
-			TopTransactionStateData.didLogXid = true;
-		}
-	}
 }
 
 /*
@@ -1480,11 +1393,11 @@ RecordTransactionCommit(void)
 	 * temp tables will be lost anyway, unlogged tables will be truncated and
 	 * HOT pruning will be done again later. (Given the foregoing, you might
 	 * think that it would be unnecessary to emit the XLOG record at all in
-	 * this case, but we don't currently try to do that.  It would certainly
-	 * cause problems at least in Hot Standby mode, where the
-	 * KnownAssignedXids machinery requires tracking every XID assignment.  It
-	 * might be OK to skip it only when wal_level < replica, but for now we
-	 * don't.)
+	 * this case, but we don't currently try to do that.  It might cause
+	 * inefficiencies in Hot Standby mode, if nothing else, where the
+	 * commit/abort records allow advancing the xmin horizon for new
+	 * snapshots. It might be OK to skip it only when wal_level < replica, but
+	 * for now we don't.)
 	 *
 	 * However, if we're doing cleanup of any non-temp rels or committing any
 	 * command that wanted to force sync commit, then we must flush XLOG
@@ -1952,13 +1865,6 @@ AtSubAbort_childXids(void)
 	s->childXids = NULL;
 	s->nChildXids = 0;
 	s->maxChildXids = 0;
-
-	/*
-	 * We could prune the unreportedXids array here. But we don't bother. That
-	 * would potentially reduce number of XLOG_XACT_ASSIGNMENT records but it
-	 * would likely introduce more CPU time into the more common paths, so we
-	 * choose not to do that.
-	 */
 }
 
 /* ----------------------------------------------------------------
@@ -2141,12 +2047,6 @@ StartTransaction(void)
 	currentCommandId = FirstCommandId;
 	currentCommandIdUsed = false;
 
-	/*
-	 * initialize reported xid accounting
-	 */
-	nUnreportedXids = 0;
-	s->didLogXid = false;
-
 	/*
 	 * must initialize resource-management stuff first
 	 */
@@ -6142,7 +6042,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 	TransactionTreeSetCommitTsData(xid, parsed->nsubxacts, parsed->subxacts,
 								   commit_time, origin_id);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/*
 		 * Mark the transaction committed in pg_xact.
@@ -6162,6 +6062,12 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		 */
 		RecordKnownAssignedTransactionIds(max_xid);
 
+		/*
+		 * Mark the CSNLOG first.  The transaction won't become visible to new
+		 * snapshots until the call to ProcArrayRecoveryEndTransaction().
+		 */
+		CSNLogSetCSN(xid, parsed->nsubxacts, parsed->subxacts, lsn);
+
 		/*
 		 * Mark the transaction committed in pg_xact. We use async commit
 		 * protocol during recovery to provide information on database
@@ -6174,9 +6080,9 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		TransactionIdAsyncCommitTree(xid, parsed->nsubxacts, parsed->subxacts, lsn);
 
 		/*
-		 * We must mark clog before we update the ProcArray.
+		 * Make the commit visible to new snapshots in the ProcArray.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ProcArrayRecoveryEndTransaction(max_xid, lsn);
 
 		/*
 		 * Send any cache invalidations attached to the commit. We must
@@ -6282,7 +6188,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 								  parsed->subxacts);
 	AdvanceNextFullTransactionIdPastXid(max_xid);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/* Mark the transaction aborted in pg_xact, no need for async stuff */
 		TransactionIdAbortTree(xid, parsed->nsubxacts, parsed->subxacts);
@@ -6300,13 +6206,15 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 		 */
 		RecordKnownAssignedTransactionIds(max_xid);
 
+		/* Note: we don't need to update the CSN log on abort. */
+
 		/* Mark the transaction aborted in pg_xact, no need for async stuff */
 		TransactionIdAbortTree(xid, parsed->nsubxacts, parsed->subxacts);
 
 		/*
 		 * We must update the ProcArray after we have marked clog.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ProcArrayRecoveryEndTransaction(max_xid, lsn);
 
 		/*
 		 * There are no invalidation messages to send or undo.
@@ -6414,14 +6322,6 @@ xact_redo(XLogReaderState *record)
 					   XLogRecGetOrigin(record));
 		LWLockRelease(TwoPhaseStateLock);
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record);
-
-		if (standbyState >= STANDBY_INITIALIZED)
-			ProcArrayApplyXidAssignment(xlrec->xtop,
-										xlrec->nsubxacts, xlrec->xsub);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bca..a3ba04fbc8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -48,6 +48,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
@@ -951,8 +952,6 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	END_CRIT_SECTION();
 
-	MarkCurrentTransactionIdLoggedIfAny();
-
 	/*
 	 * Mark top transaction id is logged (if needed) so that we should not try
 	 * to log it again with the next WAL record in the current subtransaction.
@@ -5182,6 +5181,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCSNLog();
 	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
@@ -5783,16 +5783,16 @@ StartupXLOG(void)
 		 */
 		if (ArchiveRecoveryRequested && EnableHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
+			FullTransactionId latestCompletedXid;
 
 			ereport(DEBUG1,
 					(errmsg_internal("initializing for hot standby")));
+			InHotStandby = true;
 
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
-				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanPreparedTransactions();
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -5807,39 +5807,17 @@ StartupXLOG(void)
 			 */
 			StartupSUBTRANS(oldestActiveXID);
 
-			/*
-			 * If we're beginning at a shutdown checkpoint, we know that
-			 * nothing was running on the primary at this point. So fake-up an
-			 * empty running-xacts record and use that here and now. Recover
-			 * additional standby state for prepared transactions.
-			 */
-			if (wasShutdown)
-			{
-				RunningTransactionsData running;
-				TransactionId latestCompletedXid;
+			latestCompletedXid = checkPoint.nextXid;
+			FullTransactionIdRetreat(&latestCompletedXid);
+			TransamVariables->latestCompletedXid = latestCompletedXid;
 
-				/* Update pg_subtrans entries for any prepared transactions */
-				StandbyRecoverPreparedTransactions();
+			StartupCSNLog(oldestActiveXID, RedoRecPtr);
 
-				/*
-				 * Construct a RunningTransactions snapshot representing a
-				 * shut down server, with only prepared transactions still
-				 * alive. We're never overflowed at this point because all
-				 * subxids are listed with their parent prepared transactions.
-				 */
-				running.xcnt = nxids;
-				running.subxcnt = 0;
-				running.subxid_status = SUBXIDS_IN_SUBTRANS;
-				running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-				running.oldestRunningXid = oldestActiveXID;
-				latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-				TransactionIdRetreat(latestCompletedXid);
-				Assert(TransactionIdIsNormal(latestCompletedXid));
-				running.latestCompletedXid = latestCompletedXid;
-				running.xids = xids;
-
-				ProcArrayApplyRecoveryInfo(&running);
-			}
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
+
+			/* Update pg_subtrans entries for any prepared transactions */
+			if (wasShutdown)
+				StandbyRecoverPreparedTransactions();
 		}
 
 		/*
@@ -5923,7 +5901,7 @@ StartupXLOG(void)
 	 * This information is not quite needed yet, but it is positioned here so
 	 * as potential problems are detected before any on-disk change is done.
 	 */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanPreparedTransactions();
 
 	/*
 	 * Allow ordinary WAL segment creation before possibly switching to a new
@@ -6089,9 +6067,18 @@ StartupXLOG(void)
 	 * Start up subtrans, if not already done for hot standby.  (commit
 	 * timestamps are started below, if necessary.)
 	 */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
+	{
 		StartupSUBTRANS(oldestActiveXID);
 
+		/*
+		 * TODO: we don't need to update CSN log from now on, but it's still
+		 * required by snapshots that were taken before recovery ended.  We
+		 * just let it be, but it would be nice to truncate it to 0 after all
+		 * the snapshots are gone.
+		 */
+	}
+
 	/*
 	 * Perform end of recovery actions for any SLRUs that need it.
 	 */
@@ -6177,12 +6164,12 @@ StartupXLOG(void)
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
 	 * and after switching SharedRecoveryState to RECOVERY_STATE_DONE so as
-	 * any session building a snapshot will not rely on KnownAssignedXids as
+	 * any session building a snapshot will not rely on the CSN log as
 	 * RecoveryInProgress() would return false at this stage.  This is
 	 * particularly critical for prepared 2PC transactions, that would still
 	 * need to be included in snapshots once recovery has ended.
 	 */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/*
@@ -6954,7 +6941,7 @@ CreateCheckPoint(int flags)
 	 * starting snapshot of locks and transactions.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		checkPoint.oldestActiveXid = GetOldestActiveTransactionId();
+		checkPoint.oldestActiveXid = GetOldestActiveTransactionId(true);
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -7346,7 +7333,10 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
+	{
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+	}
 
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(false);
@@ -7519,6 +7509,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
 	CheckPointCLOG();
+	CheckPointCSNLog();
 	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
@@ -7815,7 +7806,10 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
+	{
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+	}
 
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(true);
@@ -8300,41 +8294,17 @@ xlog_redo(XLogReaderState *record)
 
 		/*
 		 * If we see a shutdown checkpoint, we know that nothing was running
-		 * on the primary at this point. So fake-up an empty running-xacts
-		 * record and use that here and now. Recover additional standby state
-		 * for prepared transactions.
+		 * on the primary at this point, except for prepared transactions.
 		 */
-		if (standbyState >= STANDBY_INITIALIZED)
+		if (InHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
 			TransactionId oldestActiveXID;
-			TransactionId latestCompletedXid;
-			RunningTransactionsData running;
 
-			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanPreparedTransactions();
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
 
 			/* Update pg_subtrans entries for any prepared transactions */
 			StandbyRecoverPreparedTransactions();
-
-			/*
-			 * Construct a RunningTransactions snapshot representing a shut
-			 * down server, with only prepared transactions still alive. We're
-			 * never overflowed at this point because all subxids are listed
-			 * with their parent prepared transactions.
-			 */
-			running.xcnt = nxids;
-			running.subxcnt = 0;
-			running.subxid_status = SUBXIDS_IN_SUBTRANS;
-			running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-			running.oldestRunningXid = oldestActiveXID;
-			latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-			TransactionIdRetreat(latestCompletedXid);
-			Assert(TransactionIdIsNormal(latestCompletedXid));
-			running.latestCompletedXid = latestCompletedXid;
-			running.xids = xids;
-
-			ProcArrayApplyRecoveryInfo(&running);
 		}
 
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
@@ -8398,6 +8368,16 @@ xlog_redo(XLogReaderState *record)
 								  checkPoint.oldestXid))
 			SetTransactionIdLimit(checkPoint.oldestXid,
 								  checkPoint.oldestXidDB);
+
+		/*
+		 * Remember the oldest XID that was running at the time.  Normally,
+		 * all transaction aborts and commits are WAL-logged, so our
+		 * oldestRunningXid value should be up-to-date, but if not, this
+		 * allows us to resynchronize.
+		 */
+		if (InHotStandby)
+			ProcArrayUpdateOldestRunningXid(checkPoint.oldestActiveXid);
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c6994b7828..7b2475e4e2 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1975,10 +1975,9 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	SpinLockRelease(&XLogRecoveryCtl->info_lck);
 
 	/*
-	 * If we are attempting to enter Hot Standby mode, process XIDs we see
+	 * In Hot Standby mode, process XIDs we see
 	 */
-	if (standbyState >= STANDBY_INITIALIZED &&
-		TransactionIdIsValid(record->xl_xid))
+	if (InHotStandby && TransactionIdIsValid(record->xl_xid))
 		RecordKnownAssignedTransactionIds(record->xl_xid);
 
 	/*
@@ -2255,7 +2254,7 @@ CheckRecoveryConsistency(void)
 	 * run? If so, we can tell postmaster that the database is consistent now,
 	 * enabling connections.
 	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY &&
+	if (InHotStandby &&
 		!LocalHotStandbyActive &&
 		reachedConsistency &&
 		IsUnderPostmaster)
@@ -3700,9 +3699,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						elog(LOG, "waiting for WAL to become available at %X/%X",
 							 LSN_FORMAT_ARGS(RecPtr));
 
-						/* Do background tasks that might benefit us later. */
-						KnownAssignedTransactionIdsIdleMaintenance();
-
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,
@@ -3968,9 +3964,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						streaming_reply_sent = true;
 					}
 
-					/* Do any background tasks that might benefit us later. */
-					KnownAssignedTransactionIdsIdleMaintenance();
-
 					/* Update pg_stat_recovery_prefetch before sleeping. */
 					XLogPrefetcherComputeStats(xlogprefetcher);
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5295b85fe0..bf08c60e93 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -50,7 +50,7 @@ bool		ignore_invalid_pages = false;
 bool		InRecovery = false;
 
 /* Are we in Hot Standby mode? Only valid in startup process, see xlogutils.h */
-HotStandbyState standbyState = STANDBY_DISABLED;
+bool		InHotStandby = false;
 
 /*
  * During XLOG replay, we may see XLOG records for incremental updates of
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index ef6f98ebcd..a975865fdd 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -203,7 +203,7 @@ static void
 StartupProcExit(int code, Datum arg)
 {
 	/* Shutdown the recovery environment */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 }
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index e73576ad12..c4f9feed64 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -270,14 +270,6 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
-		case XLOG_XACT_ASSIGNMENT:
-
-			/*
-			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here. See
-			 * LogicalDecodingProcessRecord.
-			 */
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			{
 				TransactionId xid;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a6a4da3266..734865ce62 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -27,7 +27,7 @@
  * removed. This is achieved by using the replication slot mechanism.
  *
  * As the percentage of transactions modifying the catalog normally is fairly
- * small in comparisons to ones only manipulating user data, we keep track of
+ * small in comparison to ones only manipulating user data, we keep track of
  * the committed catalog modifying ones inside [xmin, xmax) instead of keeping
  * track of all running transactions like it's done in a normal snapshot. Note
  * that we're generally only looking at transactions that have acquired an
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 7783ba854f..49c2ced2d4 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/subtrans.h"
@@ -121,6 +122,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, XLogRecoveryShmemSize());
 	size = add_size(size, CLOGShmemSize());
+	size = add_size(size, CSNLogShmemSize());
 	size = add_size(size, CommitTsShmemSize());
 	size = add_size(size, SUBTRANSShmemSize());
 	size = add_size(size, TwoPhaseShmemSize());
@@ -285,6 +287,7 @@ CreateOrAttachShmemStructs(void)
 	XLogPrefetchShmemInit();
 	XLogRecoveryShmemInit();
 	CLOGShmemInit();
+	CSNLogShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 36610a1c7e..c82e8d8c43 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -19,20 +19,10 @@
  * myProcLocks lists.  They can be distinguished from regular backend PGPROCs
  * at need by checking for pid == 0.
  *
- * During hot standby, we also keep a list of XIDs representing transactions
- * that are known to be running on the primary (or more precisely, were running
- * as of the current point in the WAL stream).  This list is kept in the
- * KnownAssignedXids array, and is updated by watching the sequence of
- * arriving XIDs.  This is necessary because if we leave those XIDs out of
- * snapshots taken for standby queries, then they will appear to be already
- * complete, leading to MVCC failures.  Note that in hot standby, the PGPROC
- * array represents standby processes, which by definition are not running
- * transactions that have XIDs.
- *
- * It is perhaps possible for a backend on the primary to terminate without
- * writing an abort record for its transaction.  While that shouldn't really
- * happen, it would tie up KnownAssignedXids indefinitely, so we protect
- * ourselves by pruning the array when a valid list of running XIDs arrives.
+ * During hot standby, we don't have PGPROC entries representing transactions
+ * running in the primary.  Instead, each snapshot taken during recovery
+ * contains a Commit-Sequence Number (CSN), which is used to determine which
+ * XIDs the snapshot still considers to be running.
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -47,6 +37,7 @@
 
 #include <signal.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -73,22 +64,8 @@ typedef struct ProcArrayStruct
 	int			numProcs;		/* number of valid procs entries */
 	int			maxProcs;		/* allocated size of procs array */
 
-	/*
-	 * Known assigned XIDs handling
-	 */
-	int			maxKnownAssignedXids;	/* allocated size of array */
-	int			numKnownAssignedXids;	/* current # of valid entries */
-	int			tailKnownAssignedXids;	/* index of oldest valid element */
-	int			headKnownAssignedXids;	/* index of newest element, + 1 */
-
-	/*
-	 * Highest subxid that has been removed from KnownAssignedXids array to
-	 * prevent overflow; or InvalidTransactionId if none.  We track this for
-	 * similar reasons to tracking overflowing cached subxids in PGPROC
-	 * entries.  Must hold exclusive ProcArrayLock to change this, and shared
-	 * lock to read it.
-	 */
-	TransactionId lastOverflowedXid;
+	/* In recovery, oldest XID that could still be running in the primary */
+	TransactionId oldest_running_primary_xid;
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
@@ -99,6 +76,21 @@ typedef struct ProcArrayStruct
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
+#define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
+
+/*
+ * TOTAL_MAX_CACHED_SUBXIDS is the total number of XIDs that fit in the proc
+ * array, as top XIDs and in the subxids caches.
+ *
+ * Local data structures are also created in various backends during
+ * GetSnapshotData(), TransactionIdIsInProgress() and
+ * GetRunningTransactionData(). All of the main structures created in those
+ * functions must be identically sized, since we may at times copy the whole
+ * of the data structures around.
+ */
+#define TOTAL_MAX_CACHED_SUBXIDS \
+	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
+
 /*
  * State for the GlobalVisTest* family of functions. Those functions can
  * e.g. be used to decide if a deleted row can be removed without violating
@@ -254,17 +246,6 @@ typedef enum GlobalVisHorizonKind
 	VISHORIZON_TEMP,
 } GlobalVisHorizonKind;
 
-/*
- * Reason codes for KnownAssignedXidsCompress().
- */
-typedef enum KAXCompressReason
-{
-	KAX_NO_SPACE,				/* need to free up space at array end */
-	KAX_PRUNE,					/* we just pruned old entries */
-	KAX_TRANSACTION_END,		/* we just committed/removed some XIDs */
-	KAX_STARTUP_PROCESS_IDLE,	/* startup process is about to sleep */
-} KAXCompressReason;
-
 
 static ProcArrayStruct *procArray;
 
@@ -278,17 +259,8 @@ static TransactionId cachedXidIsNotInProgress = InvalidTransactionId;
 /*
  * Bookkeeping for tracking emulated transactions in recovery
  */
-static TransactionId *KnownAssignedXids;
-static bool *KnownAssignedXidsValid;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
-/*
- * If we're in STANDBY_SNAPSHOT_PENDING state, standbySnapshotPendingXmin is
- * the highest xid that might still be running that we don't have in
- * KnownAssignedXids.
- */
-static TransactionId standbySnapshotPendingXmin;
-
 /*
  * State for visibility checks on different types of relations. See struct
  * GlobalVisState for details. As shared, catalog, normal and temporary
@@ -315,7 +287,7 @@ static long xc_by_my_xact = 0;
 static long xc_by_latest_xid = 0;
 static long xc_by_main_xid = 0;
 static long xc_by_child_xid = 0;
-static long xc_by_known_assigned = 0;
+static long xc_during_recovery = 0;
 static long xc_no_overflow = 0;
 static long xc_slow_answer = 0;
 
@@ -325,7 +297,7 @@ static long xc_slow_answer = 0;
 #define xc_by_latest_xid_inc()		(xc_by_latest_xid++)
 #define xc_by_main_xid_inc()		(xc_by_main_xid++)
 #define xc_by_child_xid_inc()		(xc_by_child_xid++)
-#define xc_by_known_assigned_inc()	(xc_by_known_assigned++)
+#define xc_during_recovery_inc()	(xc_during_recovery++)
 #define xc_no_overflow_inc()		(xc_no_overflow++)
 #define xc_slow_answer_inc()		(xc_slow_answer++)
 
@@ -338,28 +310,12 @@ static void DisplayXidCache(void);
 #define xc_by_latest_xid_inc()		((void) 0)
 #define xc_by_main_xid_inc()		((void) 0)
 #define xc_by_child_xid_inc()		((void) 0)
-#define xc_by_known_assigned_inc()	((void) 0)
+#define xc_during_recovery_inc()	((void) 0)
 #define xc_no_overflow_inc()		((void) 0)
 #define xc_slow_answer_inc()		((void) 0)
 #endif							/* XIDCACHE_DEBUG */
 
-/* Primitives for KnownAssignedXids array handling for standby */
-static void KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock);
-static void KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-								 bool exclusive_lock);
-static bool KnownAssignedXidsSearch(TransactionId xid, bool remove);
-static bool KnownAssignedXidExists(TransactionId xid);
-static void KnownAssignedXidsRemove(TransactionId xid);
-static void KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-										TransactionId *subxids);
-static void KnownAssignedXidsRemovePreceding(TransactionId removeXid);
-static int	KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax);
-static int	KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
-										   TransactionId *xmin,
-										   TransactionId xmax);
-static TransactionId KnownAssignedXidsGetOldestXmin(void);
-static void KnownAssignedXidsDisplay(int trace_level);
-static void KnownAssignedXidsReset(void);
+
 static inline void ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 static void MaintainLatestCompletedXid(TransactionId latestXid);
@@ -383,31 +339,6 @@ ProcArrayShmemSize(void)
 	size = offsetof(ProcArrayStruct, pgprocnos);
 	size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS));
 
-	/*
-	 * During Hot Standby processing we have a data structure called
-	 * KnownAssignedXids, created in shared memory. Local data structures are
-	 * also created in various backends during GetSnapshotData(),
-	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
-	 * main structures created in those functions must be identically sized,
-	 * since we may at times copy the whole of the data structures around. We
-	 * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
-	 *
-	 * Ideally we'd only create this structure if we were actually doing hot
-	 * standby in the current run, but we don't know that yet at the time
-	 * shared memory is being set up.
-	 */
-#define TOTAL_MAX_CACHED_SUBXIDS \
-	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
-
-	if (EnableHotStandby)
-	{
-		size = add_size(size,
-						mul_size(sizeof(TransactionId),
-								 TOTAL_MAX_CACHED_SUBXIDS));
-		size = add_size(size,
-						mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS));
-	}
-
 	return size;
 }
 
@@ -434,31 +365,12 @@ ProcArrayShmemInit(void)
 		 */
 		procArray->numProcs = 0;
 		procArray->maxProcs = PROCARRAY_MAXPROCS;
-		procArray->maxKnownAssignedXids = TOTAL_MAX_CACHED_SUBXIDS;
-		procArray->numKnownAssignedXids = 0;
-		procArray->tailKnownAssignedXids = 0;
-		procArray->headKnownAssignedXids = 0;
-		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
 		TransamVariables->xactCompletionCount = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
-
-	/* Create or attach to the KnownAssignedXids arrays too, if needed */
-	if (EnableHotStandby)
-	{
-		KnownAssignedXids = (TransactionId *)
-			ShmemInitStruct("KnownAssignedXids",
-							mul_size(sizeof(TransactionId),
-									 TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-		KnownAssignedXidsValid = (bool *)
-			ShmemInitStruct("KnownAssignedXidsValid",
-							mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-	}
 }
 
 /*
@@ -1022,355 +934,35 @@ MaintainLatestCompletedXidRecovery(TransactionId latestXid)
 void
 ProcArrayInitRecovery(TransactionId initializedUptoXID)
 {
-	Assert(standbyState == STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsNormal(initializedUptoXID));
 
 	/*
-	 * we set latestObservedXid to the xid SUBTRANS has been initialized up
-	 * to, so we can extend it from that point onwards in
-	 * RecordKnownAssignedTransactionIds, and when we get consistent in
-	 * ProcArrayApplyRecoveryInfo().
+	 * we set latestObservedXid to the xid SUBTRANS and CSN log have been
+	 * initialized up to, so we can extend it from that point onwards whenever
+	 * we observe new XIDs.
 	 */
 	latestObservedXid = initializedUptoXID;
 	TransactionIdRetreat(latestObservedXid);
 }
 
 /*
- * ProcArrayApplyRecoveryInfo -- apply recovery info about xids
- *
- * Takes us through 3 states: Initialized, Pending and Ready.
- * Normal case is to go all the way to Ready straight away, though there
- * are atypical cases where we need to take it in steps.
- *
- * Use the data about running transactions on the primary to create the initial
- * state of KnownAssignedXids. We also use these records to regularly prune
- * KnownAssignedXids because we know it is possible that some transactions
- * with FATAL errors fail to write abort records, which could cause eventual
- * overflow.
- *
- * See comments for LogStandbySnapshot().
+ * Update the oldest running XID from a checkpoint record.  This allows
+ * truncating SUBTRANS and the CSN log.
  */
 void
-ProcArrayApplyRecoveryInfo(RunningTransactions running)
+ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID)
 {
-	TransactionId *xids;
-	TransactionId advanceNextXid;
-	int			nxids;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-	Assert(TransactionIdIsValid(running->nextXid));
-	Assert(TransactionIdIsValid(running->oldestRunningXid));
-	Assert(TransactionIdIsNormal(running->latestCompletedXid));
-
-	/*
-	 * Remove stale transactions, if any.
-	 */
-	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
-
-	/*
-	 * Adjust TransamVariables->nextXid before StandbyReleaseOldLocks(),
-	 * because we will need it up to date for accessing two-phase transactions
-	 * in StandbyReleaseOldLocks().
-	 */
-	advanceNextXid = running->nextXid;
-	TransactionIdRetreat(advanceNextXid);
-	AdvanceNextFullTransactionIdPastXid(advanceNextXid);
-	Assert(FullTransactionIdIsValid(TransamVariables->nextXid));
-
 	/*
 	 * Remove stale locks, if any.
 	 */
-	StandbyReleaseOldLocks(running->oldestRunningXid);
-
-	/*
-	 * If our snapshot is already valid, nothing else to do...
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		return;
-
-	/*
-	 * If our initial RunningTransactionsData had an overflowed snapshot then
-	 * we knew we were missing some subxids from our snapshot. If we continue
-	 * to see overflowed snapshots then we might never be able to start up, so
-	 * we make another test to see if our snapshot is now valid. We know that
-	 * the missing subxids are equal to or earlier than nextXid. After we
-	 * initialise we continue to apply changes during recovery, so once the
-	 * oldestRunningXid is later than the nextXid from the initial snapshot we
-	 * know that we no longer have missing information and can mark the
-	 * snapshot as valid.
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_PENDING)
-	{
-		/*
-		 * If the snapshot isn't overflowed or if its empty we can reset our
-		 * pending state and use this snapshot instead.
-		 */
-		if (running->subxid_status != SUBXIDS_MISSING || running->xcnt == 0)
-		{
-			/*
-			 * If we have already collected known assigned xids, we need to
-			 * throw them away before we apply the recovery snapshot.
-			 */
-			KnownAssignedXidsReset();
-			standbyState = STANDBY_INITIALIZED;
-		}
-		else
-		{
-			if (TransactionIdPrecedes(standbySnapshotPendingXmin,
-									  running->oldestRunningXid))
-			{
-				standbyState = STANDBY_SNAPSHOT_READY;
-				elog(DEBUG1,
-					 "recovery snapshots are now enabled");
-			}
-			else
-				elog(DEBUG1,
-					 "recovery snapshot waiting for non-overflowed snapshot or "
-					 "until oldest active xid on standby is at least %u (now %u)",
-					 standbySnapshotPendingXmin,
-					 running->oldestRunningXid);
-			return;
-		}
-	}
-
-	Assert(standbyState == STANDBY_INITIALIZED);
-
-	/*
-	 * NB: this can be reached at least twice, so make sure new code can deal
-	 * with that.
-	 */
+	StandbyReleaseOldLocks(oldestRunningXID);
 
-	/*
-	 * Nobody else is running yet, but take locks anyhow
-	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * KnownAssignedXids is sorted so we cannot just add the xids, we have to
-	 * sort them first.
-	 *
-	 * Some of the new xids are top-level xids and some are subtransactions.
-	 * We don't call SubTransSetParent because it doesn't matter yet. If we
-	 * aren't overflowed then all xids will fit in snapshot and so we don't
-	 * need subtrans. If we later overflow, an xid assignment record will add
-	 * xids to subtrans. If RunningTransactionsData is overflowed then we
-	 * don't have enough information to correctly update subtrans anyway.
-	 */
-
-	/*
-	 * Allocate a temporary array to avoid modifying the array passed as
-	 * argument.
-	 */
-	xids = palloc(sizeof(TransactionId) * (running->xcnt + running->subxcnt));
-
-	/*
-	 * Add to the temp array any xids which have not already completed.
-	 */
-	nxids = 0;
-	for (i = 0; i < running->xcnt + running->subxcnt; i++)
-	{
-		TransactionId xid = running->xids[i];
-
-		/*
-		 * The running-xacts snapshot can contain xids that were still visible
-		 * in the procarray when the snapshot was taken, but were already
-		 * WAL-logged as completed. They're not running anymore, so ignore
-		 * them.
-		 */
-		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-			continue;
-
-		xids[nxids++] = xid;
-	}
-
-	if (nxids > 0)
-	{
-		if (procArray->numKnownAssignedXids != 0)
-		{
-			LWLockRelease(ProcArrayLock);
-			elog(ERROR, "KnownAssignedXids is not empty");
-		}
-
-		/*
-		 * Sort the array so that we can add them safely into
-		 * KnownAssignedXids.
-		 *
-		 * We have to sort them logically, because in KnownAssignedXidsAdd we
-		 * call TransactionIdFollowsOrEquals and so on. But we know these XIDs
-		 * come from RUNNING_XACTS, which means there are only normal XIDs
-		 * from the same epoch, so this is safe.
-		 */
-		qsort(xids, nxids, sizeof(TransactionId), xidLogicalComparator);
-
-		/*
-		 * Add the sorted snapshot into KnownAssignedXids.  The running-xacts
-		 * snapshot may include duplicated xids because of prepared
-		 * transactions, so ignore them.
-		 */
-		for (i = 0; i < nxids; i++)
-		{
-			if (i > 0 && TransactionIdEquals(xids[i - 1], xids[i]))
-			{
-				elog(DEBUG1,
-					 "found duplicated transaction %u for KnownAssignedXids insertion",
-					 xids[i]);
-				continue;
-			}
-			KnownAssignedXidsAdd(xids[i], xids[i], true);
-		}
-
-		KnownAssignedXidsDisplay(DEBUG3);
-	}
-
-	pfree(xids);
-
-	/*
-	 * latestObservedXid is at least set to the point where SUBTRANS was
-	 * started up to (cf. ProcArrayInitRecovery()) or to the biggest xid
-	 * RecordKnownAssignedTransactionIds() was called for.  Initialize
-	 * subtrans from thereon, up to nextXid - 1.
-	 *
-	 * We need to duplicate parts of RecordKnownAssignedTransactionId() here,
-	 * because we've just added xids to the known assigned xids machinery that
-	 * haven't gone through RecordKnownAssignedTransactionId().
-	 */
-	Assert(TransactionIdIsNormal(latestObservedXid));
-	TransactionIdAdvance(latestObservedXid);
-	while (TransactionIdPrecedes(latestObservedXid, running->nextXid))
-	{
-		ExtendSUBTRANS(latestObservedXid);
-		TransactionIdAdvance(latestObservedXid);
-	}
-	TransactionIdRetreat(latestObservedXid);	/* = running->nextXid - 1 */
-
-	/* ----------
-	 * Now we've got the running xids we need to set the global values that
-	 * are used to track snapshots as they evolve further.
-	 *
-	 * - latestCompletedXid which will be the xmax for snapshots
-	 * - lastOverflowedXid which shows whether snapshots overflow
-	 * - nextXid
-	 *
-	 * If the snapshot overflowed, then we still initialise with what we know,
-	 * but the recovery snapshot isn't fully valid yet because we know there
-	 * are some subxids missing. We don't know the specific subxids that are
-	 * missing, so conservatively assume the last one is latestObservedXid.
-	 * ----------
-	 */
-	if (running->subxid_status == SUBXIDS_MISSING)
-	{
-		standbyState = STANDBY_SNAPSHOT_PENDING;
-
-		standbySnapshotPendingXmin = latestObservedXid;
-		procArray->lastOverflowedXid = latestObservedXid;
-	}
-	else
-	{
-		standbyState = STANDBY_SNAPSHOT_READY;
-
-		standbySnapshotPendingXmin = InvalidTransactionId;
-
-		/*
-		 * If the 'xids' array didn't include all subtransactions, we have to
-		 * mark any snapshots taken as overflowed.
-		 */
-		if (running->subxid_status == SUBXIDS_IN_SUBTRANS)
-			procArray->lastOverflowedXid = latestObservedXid;
-		else
-		{
-			Assert(running->subxid_status == SUBXIDS_IN_ARRAY);
-			procArray->lastOverflowedXid = InvalidTransactionId;
-		}
-	}
-
-	/*
-	 * If a transaction wrote a commit record in the gap between taking and
-	 * logging the snapshot then latestCompletedXid may already be higher than
-	 * the value from the snapshot, so check before we use the incoming value.
-	 * It also might not yet be set at all.
-	 */
-	MaintainLatestCompletedXidRecovery(running->latestCompletedXid);
-
-	/*
-	 * NB: No need to increment TransamVariables->xactCompletionCount here,
-	 * nobody can see it yet.
-	 */
-
+	procArray->oldest_running_primary_xid = oldestRunningXID;
 	LWLockRelease(ProcArrayLock);
-
-	KnownAssignedXidsDisplay(DEBUG3);
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		elog(DEBUG1, "recovery snapshots are now enabled");
-	else
-		elog(DEBUG1,
-			 "recovery snapshot waiting for non-overflowed snapshot or "
-			 "until oldest active xid on standby is at least %u (now %u)",
-			 standbySnapshotPendingXmin,
-			 running->oldestRunningXid);
 }
 
-/*
- * ProcArrayApplyXidAssignment
- *		Process an XLOG_XACT_ASSIGNMENT WAL record
- */
-void
-ProcArrayApplyXidAssignment(TransactionId topxid,
-							int nsubxids, TransactionId *subxids)
-{
-	TransactionId max_xid;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-
-	max_xid = TransactionIdLatest(topxid, nsubxids, subxids);
-
-	/*
-	 * Mark all the subtransactions as observed.
-	 *
-	 * NOTE: This will fail if the subxid contains too many previously
-	 * unobserved xids to fit into known-assigned-xids. That shouldn't happen
-	 * as the code stands, because xid-assignment records should never contain
-	 * more than PGPROC_MAX_CACHED_SUBXIDS entries.
-	 */
-	RecordKnownAssignedTransactionIds(max_xid);
-
-	/*
-	 * Notice that we update pg_subtrans with the top-level xid, rather than
-	 * the parent xid. This is a difference between normal processing and
-	 * recovery, yet is still correct in all cases. The reason is that
-	 * subtransaction commit is not marked in clog until commit processing, so
-	 * all aborted subtransactions have already been clearly marked in clog.
-	 * As a result we are able to refer directly to the top-level
-	 * transaction's state rather than skipping through all the intermediate
-	 * states in the subtransaction tree. This should be the first time we
-	 * have attempted to SubTransSetParent().
-	 */
-	for (i = 0; i < nsubxids; i++)
-		SubTransSetParent(subxids[i], topxid);
-
-	/* KnownAssignedXids isn't maintained yet, so we're done for now */
-	if (standbyState == STANDBY_INITIALIZED)
-		return;
-
-	/*
-	 * Uses same locking as transaction commit
-	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * Remove subxids from known-assigned-xacts.
-	 */
-	KnownAssignedXidsRemoveTree(InvalidTransactionId, nsubxids, subxids);
-
-	/*
-	 * Advance lastOverflowedXid to be at least the last of these subxids.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, max_xid))
-		procArray->lastOverflowedXid = max_xid;
-
-	LWLockRelease(ProcArrayLock);
-}
 
 /*
  * TransactionIdIsInProgress -- is given transaction running in some backend
@@ -1378,23 +970,24 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * Aside from some shortcuts such as checking RecentXmin and our own Xid,
  * there are four possibilities for finding a running transaction:
  *
- * 1. The given Xid is a main transaction Id.  We will find this out cheaply
+ * 1. In Hot Standby mode, there are no transactions with XIDs active in the
+ * standby. Check pg_xact to see if the transaction is known to have committed
+ * or aborted; otherwise it is considered to be running.
+ *
+ * 2. The given Xid is a main transaction Id.  We will find this out cheaply
  * by looking at ProcGlobal->xids.
  *
- * 2. The given Xid is one of the cached subxact Xids in the PGPROC array.
+ * 3. The given Xid is one of the cached subxact Xids in the PGPROC array.
  * We can find this out cheaply too.
  *
- * 3. In Hot Standby mode, we must search the KnownAssignedXids list to see
- * if the Xid is running on the primary.
- *
  * 4. Search the SubTrans tree to find the Xid's topmost parent, and then see
- * if that is running according to ProcGlobal->xids[] or KnownAssignedXids.
+ * if that is running according to ProcGlobal->xids[].
  * This is the slowest way, but sadly it has to be done always if the others
  * failed, unless we see that the cached subxact sets are complete (none have
  * overflowed).
  *
- * ProcArrayLock has to be held while we do 1, 2, 3.  If we save the top Xids
- * while doing 1 and 3, we can release the ProcArrayLock while we do 4.
+ * ProcArrayLock has to be held while we do 2 and 3.  If we save the top Xids
+ * while doing 2 and 3, we can release the ProcArrayLock while we do 4.
  * This buys back some concurrency (and we can't retrieve the main Xids from
  * ProcGlobal->xids[] again anyway; see GetNewTransactionId).
  */
@@ -1435,6 +1028,28 @@ TransactionIdIsInProgress(TransactionId xid)
 		return false;
 	}
 
+	/*
+	 * In hot standby mode, check pg_xact.
+	 *
+	 * With normal non-CSN snapshots, you must be careful to check
+	 * TransactionIdIsInProgress() before checking pg_xact, because a
+	 * transaction is marked as committed before it's removed from PGPROC. But
+	 * during recovery, we now use CSN snapshots so I think that's OK. See the
+	 * "NOTE" at the top of heapam_visibility.c.
+	 *
+	 * During recovery, the XID cannot be our own transaction, and the CSN
+	 * check handles subtransactions too, so we can skip the rest of the
+	 * function.
+	 */
+	if (RecoveryInProgress())
+	{
+		xc_during_recovery_inc();
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+			return false;
+		else
+			return true;
+	}
+
 	/*
 	 * Also, we can handle our own transaction (and subtransactions) without
 	 * any access to shared memory.
@@ -1451,12 +1066,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	 */
 	if (xids == NULL)
 	{
-		/*
-		 * In hot standby mode, reserve enough space to hold all xids in the
-		 * known-assigned list. If we later finish recovery, we no longer need
-		 * the bigger array, but we don't bother to shrink it.
-		 */
-		int			maxxids = RecoveryInProgress() ? TOTAL_MAX_CACHED_SUBXIDS : arrayP->maxProcs;
+		int			maxxids = arrayP->maxProcs;
 
 		xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId));
 		if (xids == NULL)
@@ -1551,33 +1161,6 @@ TransactionIdIsInProgress(TransactionId xid)
 			xids[nxids++] = pxid;
 	}
 
-	/*
-	 * Step 3: in hot standby mode, check the known-assigned-xids list.  XIDs
-	 * in the list must be treated as running.
-	 */
-	if (RecoveryInProgress())
-	{
-		/* none of the PGPROC entries should have XIDs in hot standby mode */
-		Assert(nxids == 0);
-
-		if (KnownAssignedXidExists(xid))
-		{
-			LWLockRelease(ProcArrayLock);
-			xc_by_known_assigned_inc();
-			return true;
-		}
-
-		/*
-		 * If the KnownAssignedXids overflowed, we have to check pg_subtrans
-		 * too.  Fetch all xids from KnownAssignedXids that are lower than
-		 * xid, since if xid is a subtransaction its parent will always have a
-		 * lower value.  Note we will collect both main and subXIDs here, but
-		 * there's no help for it.
-		 */
-		if (TransactionIdPrecedesOrEquals(xid, procArray->lastOverflowedXid))
-			nxids = KnownAssignedXidsGet(xids, xid);
-	}
-
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -1851,8 +1434,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 * can't be tied to a specific database.)
 		 *
 		 * Also, while in recovery we cannot compute an accurate per-database
-		 * horizon, as all xids are managed via the KnownAssignedXids
-		 * machinery.
+		 * horizon, as all xids are managed via the CSN log machinery.
 		 */
 		if (proc->databaseId == MyDatabaseId ||
 			MyDatabaseId == InvalidOid ||
@@ -1865,11 +1447,14 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	}
 
 	/*
-	 * If in recovery fetch oldest xid in KnownAssignedXids, will be applied
-	 * after lock is released.
+	 * If in recovery, fetch the oldest xid from the last checkpoint.
+	 *
+	 * XXX: that can be much older than what we had previously with the
+	 * known-assigned-xids machinery. I think that's OK, given what this
+	 * function is used for during recovery?
 	 */
 	if (in_recovery)
-		kaxmin = KnownAssignedXidsGetOldestXmin();
+		kaxmin = procArray->oldest_running_primary_xid;
 
 	/*
 	 * No other information from shared state is needed, release the lock
@@ -2188,7 +1773,7 @@ GetSnapshotData(Snapshot snapshot)
 	int			mypgxactoff;
 	TransactionId myxid;
 	uint64		curXactCompletionCount;
-
+	XLogRecPtr	csn = InvalidXLogRecPtr;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -2368,27 +1953,8 @@ GetSnapshotData(Snapshot snapshot)
 	else
 	{
 		/*
-		 * We're in hot standby, so get XIDs from KnownAssignedXids.
-		 *
-		 * We store all xids directly into subxip[]. Here's why:
-		 *
-		 * In recovery we don't know which xids are top-level and which are
-		 * subxacts, a design choice that greatly simplifies xid processing.
-		 *
-		 * It seems like we would want to try to put xids into xip[] only, but
-		 * that is fairly small. We would either need to make that bigger or
-		 * to increase the rate at which we WAL-log xid assignment; neither is
-		 * an appealing choice.
-		 *
-		 * We could try to store xids into xip[] first and then into subxip[]
-		 * if there are too many xids. That only works if the snapshot doesn't
-		 * overflow because we do not search subxip[] in that case. A simpler
-		 * way is to just store all xids in the subxip array because this is
-		 * by far the bigger array. We just leave the xip array empty.
-		 *
-		 * Either way we need to change the way XidInMVCCSnapshot() works
-		 * depending upon when the snapshot was taken, or change normal
-		 * snapshot processing so it matches.
+		 * We're in hot standby, so get the current CSN. That's used to
+		 * determine which transactions committed before this snapshot.
 		 *
 		 * Note: It is possible for recovery to end before we finish taking
 		 * the snapshot, and for newly assigned transaction ids to be added to
@@ -2396,14 +1962,17 @@ GetSnapshotData(Snapshot snapshot)
 		 * those newly added transaction ids would be filtered away, so we
 		 * need not be concerned about them.
 		 */
-		subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
-												  xmax);
+		xmin = procArray->oldest_running_primary_xid;
 
-		if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
-			suboverflowed = true;
+		/*
+		 * Take CSN under ProcArrayLock so the snapshot stays synchronized.
+		 * (XXX: not sure that's strictly required.)
+		 * This is what determines which transactions we consider finished and
+		 * which are still in progress.
+		 */
+		csn = TransamVariables->latestCommitLSN;
 	}
 
-
 	/*
 	 * Fetch into local variable while ProcArrayLock is held - the
 	 * LWLockRelease below is a barrier, ensuring this happens inside the
@@ -2519,6 +2088,8 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->lsn = InvalidXLogRecPtr;
 	snapshot->whenTaken = 0;
 
+	snapshot->snapshotCsn = csn;
+
 	return snapshot;
 }
 
@@ -2674,9 +2245,6 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * The returned data structure is statically allocated; caller should not
  * modify it, and must not assume it is valid past the next call.
  *
- * This is never executed during recovery so there is no need to look at
- * KnownAssignedXids.
- *
  * Dummy PGPROCs from prepared transaction are included, meaning that this
  * may return entries with duplicated TransactionId values coming from
  * transaction finishing to prepare.  Nothing is done about duplicated
@@ -2707,6 +2275,7 @@ GetRunningTransactionData(void)
 	int			subcount;
 	bool		suboverflowed;
 
+	/* This is never executed during recovery */
 	Assert(!RecoveryInProgress());
 
 	/*
@@ -2873,15 +2442,16 @@ GetRunningTransactionData(void)
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
- * This is never executed during recovery so there is no need to look at
- * KnownAssignedXids.
+ * If allDbs is false, skip processes attached to other databases.
+ *
+ * This is never executed during recovery.
  *
  * We don't worry about updating other counters, we want to keep this as
  * simple as possible and leave GetSnapshotData() as the primary code for
  * that bookkeeping.
  */
 TransactionId
-GetOldestActiveTransactionId(void)
+GetOldestActiveTransactionId(bool allDbs)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId *other_xids = ProcGlobal->xids;
@@ -2902,11 +2472,13 @@ GetOldestActiveTransactionId(void)
 	LWLockRelease(XidGenLock);
 
 	/*
-	 * Spin over procArray collecting all xids and subxids.
+	 * Spin over procArray checking each xid.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		PGPROC	   *proc = &allProcs[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -2915,6 +2487,9 @@ GetOldestActiveTransactionId(void)
 		if (!TransactionIdIsNormal(xid))
 			continue;
 
+		if (!allDbs && proc->databaseId != MyDatabaseId)
+			continue;
+
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
@@ -2993,8 +2568,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 *
 	 * In recovery we can't lower the safe value besides what we've computed
 	 * above, so we'll have to wait a bit longer there. We unfortunately can
-	 * *not* use KnownAssignedXidsGetOldestXmin() since the KnownAssignedXids
-	 * machinery can miss values and return an older value than is safe.
+	 * *not* use oldest_running_primary_xid since the XID tracking machinery
+	 * can miss values and return an older value than is safe.
 	 */
 	if (!recovery_in_progress)
 	{
@@ -3412,6 +2987,9 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
  * but that would not be true in the case of FATAL errors lagging in array,
  * but we already know those are bogus anyway, so we skip that test.
  *
+ * XXX: KnownAssignedXids is gone so the above comment needs updating. Is
+ * the code still correct? I think so but need to double-check.
+ *
  * If dbOid is valid we skip backends attached to other databases.
  *
  * Be careful to *not* pfree the result from this function. We reuse
@@ -4083,14 +3661,14 @@ static void
 DisplayXidCache(void)
 {
 	fprintf(stderr,
-			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, knownassigned: %ld, nooflo: %ld, slow: %ld\n",
+			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, during_recovery: %ld, nooflo: %ld, slow: %ld\n",
 			xc_by_recent_xmin,
 			xc_by_known_xact,
 			xc_by_my_xact,
 			xc_by_latest_xid,
 			xc_by_main_xid,
 			xc_by_child_xid,
-			xc_by_known_assigned,
+			xc_during_recovery,
 			xc_no_overflow,
 			xc_slow_answer);
 }
@@ -4337,61 +3915,6 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 }
 
 
-/* ----------------------------------------------
- *		KnownAssignedTransactionIds sub-module
- * ----------------------------------------------
- */
-
-/*
- * In Hot Standby mode, we maintain a list of transactions that are (or were)
- * running on the primary at the current point in WAL.  These XIDs must be
- * treated as running by standby transactions, even though they are not in
- * the standby server's PGPROC array.
- *
- * We record all XIDs that we know have been assigned.  That includes all the
- * XIDs seen in WAL records, plus all unobserved XIDs that we can deduce have
- * been assigned.  We can deduce the existence of unobserved XIDs because we
- * know XIDs are assigned in sequence, with no gaps.  The KnownAssignedXids
- * list expands as new XIDs are observed or inferred, and contracts when
- * transaction completion records arrive.
- *
- * During hot standby we do not fret too much about the distinction between
- * top-level XIDs and subtransaction XIDs. We store both together in the
- * KnownAssignedXids list.  In backends, this is copied into snapshots in
- * GetSnapshotData(), taking advantage of the fact that XidInMVCCSnapshot()
- * doesn't care about the distinction either.  Subtransaction XIDs are
- * effectively treated as top-level XIDs and in the typical case pg_subtrans
- * links are *not* maintained (which does not affect visibility).
- *
- * We have room in KnownAssignedXids and in snapshots to hold maxProcs *
- * (1 + PGPROC_MAX_CACHED_SUBXIDS) XIDs, so every primary transaction must
- * report its subtransaction XIDs in a WAL XLOG_XACT_ASSIGNMENT record at
- * least every PGPROC_MAX_CACHED_SUBXIDS.  When we receive one of these
- * records, we mark the subXIDs as children of the top XID in pg_subtrans,
- * and then remove them from KnownAssignedXids.  This prevents overflow of
- * KnownAssignedXids and snapshots, at the cost that status checks for these
- * subXIDs will take a slower path through TransactionIdIsInProgress().
- * This means that KnownAssignedXids is not necessarily complete for subXIDs,
- * though it should be complete for top-level XIDs; this is the same situation
- * that holds with respect to the PGPROC entries in normal running.
- *
- * When we throw away subXIDs from KnownAssignedXids, we need to keep track of
- * that, similarly to tracking overflow of a PGPROC's subxids array.  We do
- * that by remembering the lastOverflowedXid, ie the last thrown-away subXID.
- * As long as that is within the range of interesting XIDs, we have to assume
- * that subXIDs are missing from snapshots.  (Note that subXID overflow occurs
- * on primary when 65th subXID arrives, whereas on standby it occurs when 64th
- * subXID arrives - that is not an error.)
- *
- * Should a backend on primary somehow disappear before it can write an abort
- * record, then we just leave those XIDs in KnownAssignedXids. They actually
- * aborted but we think they were running; the distinction is irrelevant
- * because either way any changes done by the transaction are not visible to
- * backends in the standby.  We prune KnownAssignedXids when
- * XLOG_RUNNING_XACTS arrives, to forestall possible overflow of the
- * array due to such dead XIDs.
- */
-
 /*
  * RecordKnownAssignedTransactionIds
  *		Record the given XID in KnownAssignedXids, as well as any preceding
@@ -4406,7 +3929,7 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 void
 RecordKnownAssignedTransactionIds(TransactionId xid)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsValid(xid));
 	Assert(TransactionIdIsValid(latestObservedXid));
 
@@ -4424,38 +3947,19 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 
 		/*
 		 * Extend subtrans like we do in GetNewTransactionId() during normal
-		 * operation using individual extend steps. Note that we do not need
-		 * to extend clog since its extensions are WAL logged.
-		 *
-		 * This part has to be done regardless of standbyState since we
-		 * immediately start assigning subtransactions to their toplevel
-		 * transactions.
+		 * operation using individual extend steps, and the CSN log too. Note
+		 * that we do not need to extend clog since its extensions are WAL
+		 * logged.
 		 */
 		next_expected_xid = latestObservedXid;
 		while (TransactionIdPrecedes(next_expected_xid, xid))
 		{
 			TransactionIdAdvance(next_expected_xid);
 			ExtendSUBTRANS(next_expected_xid);
+			ExtendCSNLog(next_expected_xid);
 		}
 		Assert(next_expected_xid == xid);
 
-		/*
-		 * If the KnownAssignedXids machinery isn't up yet, there's nothing
-		 * more to do since we don't track assigned xids yet.
-		 */
-		if (standbyState <= STANDBY_INITIALIZED)
-		{
-			latestObservedXid = xid;
-			return;
-		}
-
-		/*
-		 * Add (latestObservedXid, xid] onto the KnownAssignedXids array.
-		 */
-		next_expected_xid = latestObservedXid;
-		TransactionIdAdvance(next_expected_xid);
-		KnownAssignedXidsAdd(next_expected_xid, xid, false);
-
 		/*
 		 * Now we can advance latestObservedXid
 		 */
@@ -4467,781 +3971,61 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 }
 
 /*
- * ExpireTreeKnownAssignedTransactionIds
- *		Remove the given XIDs from KnownAssignedXids.
+ * ProcArrayRecoveryEndTransaction
+ *
+ * Called during recovery in analogy with and in place of
+ * ProcArrayEndTransaction(). The transaction becomes visible to any new
+ * snapshots taken after this. 'max_xid' is the highest (sub)XID of the
+ * committed transaction, and 'lsn' is the LSN of the commit record.
  *
- * Called during recovery in analogy with and in place of ProcArrayEndTransaction()
+ * The transaction and all its subtransactions have already been marked as
+ * committed in the CLOG and in the CSNLOG.
  */
 void
-ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
-									  TransactionId *subxids, TransactionId max_xid)
+ProcArrayRecoveryEndTransaction(TransactionId max_xid, XLogRecPtr lsn)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	TransactionId oldest_running_primary_xid;
+
+	Assert(InHotStandby);
+
+	/*
+	 * If this was the oldest XID that was still running, advance it.
+	 * This is important for advancing the global xmin, which avoids
+	 * unnecessary recovery conflicts.
+	 *
+	 * No locking required because this runs in the startup process.
+	 *
+	 * XXX: the caller actually has a list of XIDs that just committed. We
+	 * could save some clog lookups by taking advantage of that list.
+	 */
+	oldest_running_primary_xid = procArray->oldest_running_primary_xid;
+	while (TransactionIdPrecedes(oldest_running_primary_xid, max_xid))
+	{
+		if (!TransactionIdDidCommit(oldest_running_primary_xid) &&
+			!TransactionIdDidAbort(oldest_running_primary_xid))
+		{
+			break;
+		}
+		TransactionIdAdvance(oldest_running_primary_xid);
+	}
+	if (max_xid == oldest_running_primary_xid)
+		TransactionIdAdvance(oldest_running_primary_xid);
 
 	/*
 	 * Uses same locking as transaction commit
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
-	KnownAssignedXidsRemoveTree(xid, nsubxids, subxids);
-
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
 	/* ... and xactCompletionCount */
 	TransamVariables->xactCompletionCount++;
 
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireAllKnownAssignedTransactionIds
- *		Remove all entries in KnownAssignedXids and reset lastOverflowedXid.
- */
-void
-ExpireAllKnownAssignedTransactionIds(void)
-{
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	KnownAssignedXidsRemovePreceding(InvalidTransactionId);
-
-	/*
-	 * Reset lastOverflowedXid.  Currently, lastOverflowedXid has no use after
-	 * the call of this function.  But do this for unification with what
-	 * ExpireOldKnownAssignedTransactionIds() do.
-	 */
-	procArray->lastOverflowedXid = InvalidTransactionId;
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireOldKnownAssignedTransactionIds
- *		Remove KnownAssignedXids entries preceding the given XID and
- *		potentially reset lastOverflowedXid.
- */
-void
-ExpireOldKnownAssignedTransactionIds(TransactionId xid)
-{
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * Reset lastOverflowedXid if we know all transactions that have been
-	 * possibly running are being gone.  Not doing so could cause an incorrect
-	 * lastOverflowedXid value, which makes extra snapshots be marked as
-	 * suboverflowed.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, xid))
-		procArray->lastOverflowedXid = InvalidTransactionId;
-	KnownAssignedXidsRemovePreceding(xid);
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * KnownAssignedTransactionIdsIdleMaintenance
- *		Opportunistically do maintenance work when the startup process
- *		is about to go idle.
- */
-void
-KnownAssignedTransactionIdsIdleMaintenance(void)
-{
-	KnownAssignedXidsCompress(KAX_STARTUP_PROCESS_IDLE, false);
-}
-
-
-/*
- * Private module functions to manipulate KnownAssignedXids
- *
- * There are 5 main uses of the KnownAssignedXids data structure:
- *
- *	* backends taking snapshots - all valid XIDs need to be copied out
- *	* backends seeking to determine presence of a specific XID
- *	* startup process adding new known-assigned XIDs
- *	* startup process removing specific XIDs as transactions end
- *	* startup process pruning array when special WAL records arrive
- *
- * This data structure is known to be a hot spot during Hot Standby, so we
- * go to some lengths to make these operations as efficient and as concurrent
- * as possible.
- *
- * The XIDs are stored in an array in sorted order --- TransactionIdPrecedes
- * order, to be exact --- to allow binary search for specific XIDs.  Note:
- * in general TransactionIdPrecedes would not provide a total order, but
- * we know that the entries present at any instant should not extend across
- * a large enough fraction of XID space to wrap around (the primary would
- * shut down for fear of XID wrap long before that happens).  So it's OK to
- * use TransactionIdPrecedes as a binary-search comparator.
- *
- * It's cheap to maintain the sortedness during insertions, since new known
- * XIDs are always reported in XID order; we just append them at the right.
- *
- * To keep individual deletions cheap, we need to allow gaps in the array.
- * This is implemented by marking array elements as valid or invalid using
- * the parallel boolean array KnownAssignedXidsValid[].  A deletion is done
- * by setting KnownAssignedXidsValid[i] to false, *without* clearing the
- * XID entry itself.  This preserves the property that the XID entries are
- * sorted, so we can do binary searches easily.  Periodically we compress
- * out the unused entries; that's much cheaper than having to compress the
- * array immediately on every deletion.
- *
- * The actually valid items in KnownAssignedXids[] and KnownAssignedXidsValid[]
- * are those with indexes tail <= i < head; items outside this subscript range
- * have unspecified contents.  When head reaches the end of the array, we
- * force compression of unused entries rather than wrapping around, since
- * allowing wraparound would greatly complicate the search logic.  We maintain
- * an explicit tail pointer so that pruning of old XIDs can be done without
- * immediately moving the array contents.  In most cases only a small fraction
- * of the array contains valid entries at any instant.
- *
- * Although only the startup process can ever change the KnownAssignedXids
- * data structure, we still need interlocking so that standby backends will
- * not observe invalid intermediate states.  The convention is that backends
- * must hold shared ProcArrayLock to examine the array.  To remove XIDs from
- * the array, the startup process must hold ProcArrayLock exclusively, for
- * the usual transactional reasons (compare commit/abort of a transaction
- * during normal running).  Compressing unused entries out of the array
- * likewise requires exclusive lock.  To add XIDs to the array, we just insert
- * them into slots to the right of the head pointer and then advance the head
- * pointer.  This doesn't require any lock at all, but on machines with weak
- * memory ordering, we need to be careful that other processors see the array
- * element changes before they see the head pointer change.  We handle this by
- * using memory barriers when reading or writing the head/tail pointers (unless
- * the caller holds ProcArrayLock exclusively).
- *
- * Algorithmic analysis:
- *
- * If we have a maximum of M slots, with N XIDs currently spread across
- * S elements then we have N <= S <= M always.
- *
- *	* Adding a new XID is O(1) and needs no lock (unless compression must
- *		happen)
- *	* Compressing the array is O(S) and requires exclusive lock
- *	* Removing an XID is O(logS) and requires exclusive lock
- *	* Taking a snapshot is O(S) and requires shared lock
- *	* Checking for an XID is O(logS) and requires shared lock
- *
- * In comparison, using a hash table for KnownAssignedXids would mean that
- * taking snapshots would be O(M). If we can maintain S << M then the
- * sorted array technique will deliver significantly faster snapshots.
- * If we try to keep S too small then we will spend too much time compressing,
- * so there is an optimal point for any workload mix. We use a heuristic to
- * decide when to compress the array, though trimming also helps reduce
- * frequency of compressing. The heuristic requires us to track the number of
- * currently valid XIDs in the array (N).  Except in special cases, we'll
- * compress when S >= 2N.  Bounding S at 2N in turn bounds the time for
- * taking a snapshot to be O(N), which it would have to be anyway.
- */
-
-
-/*
- * Compress KnownAssignedXids by shifting valid data down to the start of the
- * array, removing any gaps.
- *
- * A compression step is forced if "reason" is KAX_NO_SPACE, otherwise
- * we do it only if a heuristic indicates it's a good time to do it.
- *
- * Compression requires holding ProcArrayLock in exclusive mode.
- * Caller must pass haveLock = true if it already holds the lock.
- */
-static void
-KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			head,
-				tail,
-				nelements;
-	int			compress_index;
-	int			i;
-
-	/* Counters for compression heuristics */
-	static unsigned int transactionEndsCounter;
-	static TimestampTz lastCompressTs;
-
-	/* Tuning constants */
-#define KAX_COMPRESS_FREQUENCY 128	/* in transactions */
-#define KAX_COMPRESS_IDLE_INTERVAL 1000 /* in ms */
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-	nelements = head - tail;
-
-	/*
-	 * If we can choose whether to compress, use a heuristic to avoid
-	 * compressing too often or not often enough.  "Compress" here simply
-	 * means moving the values to the beginning of the array, so it is not as
-	 * complex or costly as typical data compression algorithms.
-	 */
-	if (nelements == pArray->numKnownAssignedXids)
-	{
-		/*
-		 * When there are no gaps between head and tail, don't bother to
-		 * compress, except in the KAX_NO_SPACE case where we must compress to
-		 * create some space after the head.
-		 */
-		if (reason != KAX_NO_SPACE)
-			return;
-	}
-	else if (reason == KAX_TRANSACTION_END)
-	{
-		/*
-		 * Consider compressing only once every so many commits.  Frequency
-		 * determined by benchmarks.
-		 */
-		if ((transactionEndsCounter++) % KAX_COMPRESS_FREQUENCY != 0)
-			return;
-
-		/*
-		 * Furthermore, compress only if the used part of the array is less
-		 * than 50% full (see comments above).
-		 */
-		if (nelements < 2 * pArray->numKnownAssignedXids)
-			return;
-	}
-	else if (reason == KAX_STARTUP_PROCESS_IDLE)
-	{
-		/*
-		 * We're about to go idle for lack of new WAL, so we might as well
-		 * compress.  But not too often, to avoid ProcArray lock contention
-		 * with readers.
-		 */
-		if (lastCompressTs != 0)
-		{
-			TimestampTz compress_after;
-
-			compress_after = TimestampTzPlusMilliseconds(lastCompressTs,
-														 KAX_COMPRESS_IDLE_INTERVAL);
-			if (GetCurrentTimestamp() < compress_after)
-				return;
-		}
-	}
-
-	/* Need to compress, so get the lock if we don't have it. */
-	if (!haveLock)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * We compress the array by reading the valid values from tail to head,
-	 * re-aligning data to 0th element.
-	 */
-	compress_index = 0;
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			KnownAssignedXids[compress_index] = KnownAssignedXids[i];
-			KnownAssignedXidsValid[compress_index] = true;
-			compress_index++;
-		}
-	}
-	Assert(compress_index == pArray->numKnownAssignedXids);
-
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = compress_index;
-
-	if (!haveLock)
-		LWLockRelease(ProcArrayLock);
-
-	/* Update timestamp for maintenance.  No need to hold lock for this. */
-	lastCompressTs = GetCurrentTimestamp();
-}
-
-/*
- * Add xids into KnownAssignedXids at the head of the array.
- *
- * xids from from_xid to to_xid, inclusive, are added to the array.
- *
- * If exclusive_lock is true then caller already holds ProcArrayLock in
- * exclusive mode, so we need no extra locking here.  Else caller holds no
- * lock, so we need to be sure we maintain sufficient interlocks against
- * concurrent readers.  (Only the startup process ever calls this, so no need
- * to worry about concurrent writers.)
- */
-static void
-KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-					 bool exclusive_lock)
-{
-	ProcArrayStruct *pArray = procArray;
-	TransactionId next_xid;
-	int			head,
-				tail;
-	int			nxids;
-	int			i;
-
-	Assert(TransactionIdPrecedesOrEquals(from_xid, to_xid));
-
-	/*
-	 * Calculate how many array slots we'll need.  Normally this is cheap; in
-	 * the unusual case where the XIDs cross the wrap point, we do it the hard
-	 * way.
-	 */
-	if (to_xid >= from_xid)
-		nxids = to_xid - from_xid + 1;
-	else
-	{
-		nxids = 1;
-		next_xid = from_xid;
-		while (TransactionIdPrecedes(next_xid, to_xid))
-		{
-			nxids++;
-			TransactionIdAdvance(next_xid);
-		}
-	}
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-
-	Assert(head >= 0 && head <= pArray->maxKnownAssignedXids);
-	Assert(tail >= 0 && tail < pArray->maxKnownAssignedXids);
-
-	/*
-	 * Verify that insertions occur in TransactionId sequence.  Note that even
-	 * if the last existing element is marked invalid, it must still have a
-	 * correctly sequenced XID value.
-	 */
-	if (head > tail &&
-		TransactionIdFollowsOrEquals(KnownAssignedXids[head - 1], from_xid))
-	{
-		KnownAssignedXidsDisplay(LOG);
-		elog(ERROR, "out-of-order XID insertion in KnownAssignedXids");
-	}
-
-	/*
-	 * If our xids won't fit in the remaining space, compress out free space
-	 */
-	if (head + nxids > pArray->maxKnownAssignedXids)
-	{
-		KnownAssignedXidsCompress(KAX_NO_SPACE, exclusive_lock);
-
-		head = pArray->headKnownAssignedXids;
-		/* note: we no longer care about the tail pointer */
-
-		/*
-		 * If it still won't fit then we're out of memory
-		 */
-		if (head + nxids > pArray->maxKnownAssignedXids)
-			elog(ERROR, "too many KnownAssignedXids");
-	}
-
-	/* Now we can insert the xids into the space starting at head */
-	next_xid = from_xid;
-	for (i = 0; i < nxids; i++)
-	{
-		KnownAssignedXids[head] = next_xid;
-		KnownAssignedXidsValid[head] = true;
-		TransactionIdAdvance(next_xid);
-		head++;
-	}
-
-	/* Adjust count of number of valid entries */
-	pArray->numKnownAssignedXids += nxids;
-
-	/*
-	 * Now update the head pointer.  We use a write barrier to ensure that
-	 * other processors see the above array updates before they see the head
-	 * pointer change.  The barrier isn't required if we're holding
-	 * ProcArrayLock exclusively.
-	 */
-	if (!exclusive_lock)
-		pg_write_barrier();
-
-	pArray->headKnownAssignedXids = head;
-}
-
-/*
- * KnownAssignedXidsSearch
- *
- * Searches KnownAssignedXids for a specific xid and optionally removes it.
- * Returns true if it was found, false if not.
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- * Exclusive lock must be held for remove = true.
- */
-static bool
-KnownAssignedXidsSearch(TransactionId xid, bool remove)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			first,
-				last;
-	int			head;
-	int			tail;
-	int			result_index = -1;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	/*
-	 * Only the startup process removes entries, so we don't need the read
-	 * barrier in that case.
-	 */
-	if (!remove)
-		pg_read_barrier();		/* pairs with KnownAssignedXidsAdd */
-
-	/*
-	 * Standard binary search.  Note we can ignore the KnownAssignedXidsValid
-	 * array here, since even invalid entries will contain sorted XIDs.
-	 */
-	first = tail;
-	last = head - 1;
-	while (first <= last)
-	{
-		int			mid_index;
-		TransactionId mid_xid;
-
-		mid_index = (first + last) / 2;
-		mid_xid = KnownAssignedXids[mid_index];
-
-		if (xid == mid_xid)
-		{
-			result_index = mid_index;
-			break;
-		}
-		else if (TransactionIdPrecedes(xid, mid_xid))
-			last = mid_index - 1;
-		else
-			first = mid_index + 1;
-	}
-
-	if (result_index < 0)
-		return false;			/* not in array */
-
-	if (!KnownAssignedXidsValid[result_index])
-		return false;			/* in array, but invalid */
-
-	if (remove)
-	{
-		KnownAssignedXidsValid[result_index] = false;
-
-		pArray->numKnownAssignedXids--;
-		Assert(pArray->numKnownAssignedXids >= 0);
-
-		/*
-		 * If we're removing the tail element then advance tail pointer over
-		 * any invalid elements.  This will speed future searches.
-		 */
-		if (result_index == tail)
-		{
-			tail++;
-			while (tail < head && !KnownAssignedXidsValid[tail])
-				tail++;
-			if (tail >= head)
-			{
-				/* Array is empty, so we can reset both pointers */
-				pArray->headKnownAssignedXids = 0;
-				pArray->tailKnownAssignedXids = 0;
-			}
-			else
-			{
-				pArray->tailKnownAssignedXids = tail;
-			}
-		}
-	}
-
-	return true;
-}
-
-/*
- * Is the specified XID present in KnownAssignedXids[]?
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- */
-static bool
-KnownAssignedXidExists(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	return KnownAssignedXidsSearch(xid, false);
-}
-
-/*
- * Remove the specified XID from KnownAssignedXids[].
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemove(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	elog(DEBUG4, "remove KnownAssignedXid %u", xid);
-
-	/*
-	 * Note: we cannot consider it an error to remove an XID that's not
-	 * present.  We intentionally remove subxact IDs while processing
-	 * XLOG_XACT_ASSIGNMENT, to avoid array overflow.  Then those XIDs will be
-	 * removed again when the top-level xact commits or aborts.
-	 *
-	 * It might be possible to track such XIDs to distinguish this case from
-	 * actual errors, but it would be complicated and probably not worth it.
-	 * So, just ignore the search result.
-	 */
-	(void) KnownAssignedXidsSearch(xid, true);
-}
-
-/*
- * KnownAssignedXidsRemoveTree
- *		Remove xid (if it's not InvalidTransactionId) and all the subxids.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-							TransactionId *subxids)
-{
-	int			i;
-
-	if (TransactionIdIsValid(xid))
-		KnownAssignedXidsRemove(xid);
-
-	for (i = 0; i < nsubxids; i++)
-		KnownAssignedXidsRemove(subxids[i]);
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_TRANSACTION_END, true);
-}
-
-/*
- * Prune KnownAssignedXids up to, but *not* including xid. If xid is invalid
- * then clear the whole table.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemovePreceding(TransactionId removeXid)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			count = 0;
-	int			head,
-				tail,
-				i;
-
-	if (!TransactionIdIsValid(removeXid))
-	{
-		elog(DEBUG4, "removing all KnownAssignedXids");
-		pArray->numKnownAssignedXids = 0;
-		pArray->headKnownAssignedXids = pArray->tailKnownAssignedXids = 0;
-		return;
-	}
-
-	elog(DEBUG4, "prune KnownAssignedXids to %u", removeXid);
-
-	/*
-	 * Mark entries invalid starting at the tail.  Since array is sorted, we
-	 * can stop as soon as we reach an entry >= removeXid.
-	 */
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			if (TransactionIdFollowsOrEquals(knownXid, removeXid))
-				break;
-
-			if (!StandbyTransactionIdIsPrepared(knownXid))
-			{
-				KnownAssignedXidsValid[i] = false;
-				count++;
-			}
-		}
-	}
-
-	pArray->numKnownAssignedXids -= count;
-	Assert(pArray->numKnownAssignedXids >= 0);
-
-	/*
-	 * Advance the tail pointer if we've marked the tail item invalid.
-	 */
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-			break;
-	}
-	if (i >= head)
-	{
-		/* Array is empty, so we can reset both pointers */
-		pArray->headKnownAssignedXids = 0;
-		pArray->tailKnownAssignedXids = 0;
-	}
-	else
-	{
-		pArray->tailKnownAssignedXids = i;
-	}
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_PRUNE, true);
-}
-
-/*
- * KnownAssignedXidsGet - Get an array of xids by scanning KnownAssignedXids.
- * We filter out anything >= xmax.
- *
- * Returns the number of XIDs stored into xarray[].  Caller is responsible
- * that array is large enough.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax)
-{
-	TransactionId xtmp = InvalidTransactionId;
-
-	return KnownAssignedXidsGetAndSetXmin(xarray, &xtmp, xmax);
-}
-
-/*
- * KnownAssignedXidsGetAndSetXmin - as KnownAssignedXidsGet, plus
- * we reduce *xmin to the lowest xid value seen if not already lower.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
-							   TransactionId xmax)
-{
-	int			count = 0;
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop. We can stop
-	 * once we reach the initially seen head, since we are certain that an xid
-	 * cannot enter and then leave the array while we hold ProcArrayLock.  We
-	 * might miss newly-added xids, but they should be >= xmax so irrelevant
-	 * anyway.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			/*
-			 * Update xmin if required.  Only the first XID need be checked,
-			 * since the array is sorted.
-			 */
-			if (count == 0 &&
-				TransactionIdPrecedes(knownXid, *xmin))
-				*xmin = knownXid;
-
-			/*
-			 * Filter out anything >= xmax, again relying on sorted property
-			 * of array.
-			 */
-			if (TransactionIdIsValid(xmax) &&
-				TransactionIdFollowsOrEquals(knownXid, xmax))
-				break;
-
-			/* Add knownXid into output array */
-			xarray[count++] = knownXid;
-		}
-	}
-
-	return count;
-}
-
-/*
- * Get oldest XID in the KnownAssignedXids array, or InvalidTransactionId
- * if nothing there.
- */
-static TransactionId
-KnownAssignedXidsGetOldestXmin(void)
-{
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-			return KnownAssignedXids[i];
-	}
-
-	return InvalidTransactionId;
-}
-
-/*
- * Display KnownAssignedXids to provide debug trail
- *
- * Currently this is only called within startup process, so we need no
- * special locking.
- *
- * Note this is pretty expensive, and much of the expense will be incurred
- * even if the elog message will get discarded.  It's not currently called
- * in any performance-critical places, however, so no need to be tenser.
- */
-static void
-KnownAssignedXidsDisplay(int trace_level)
-{
-	ProcArrayStruct *pArray = procArray;
-	StringInfoData buf;
-	int			head,
-				tail,
-				i;
-	int			nxids = 0;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	initStringInfo(&buf);
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			nxids++;
-			appendStringInfo(&buf, "[%d]=%u ", i, KnownAssignedXids[i]);
-		}
-	}
-
-	elog(trace_level, "%d KnownAssignedXids (num=%d tail=%d head=%d) %s",
-		 nxids,
-		 pArray->numKnownAssignedXids,
-		 pArray->tailKnownAssignedXids,
-		 pArray->headKnownAssignedXids,
-		 buf.data);
-
-	pfree(buf.data);
-}
-
-/*
- * KnownAssignedXidsReset
- *		Resets KnownAssignedXids to be empty
- */
-static void
-KnownAssignedXidsReset(void)
-{
-	ProcArrayStruct *pArray = procArray;
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(lsn > TransamVariables->latestCommitLSN);
+	TransamVariables->latestCommitLSN = lsn;
 
-	pArray->numKnownAssignedXids = 0;
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = 0;
+	procArray->oldest_running_primary_xid = oldest_running_primary_xid;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 25267f0f85..e02c9ab842 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -139,8 +139,6 @@ InitRecoveryTransactionEnvironment(void)
 	vxid.procNumber = MyProcNumber;
 	vxid.localTransactionId = GetNextLocalTransactionId();
 	VirtualXactLockTableInsert(vxid);
-
-	standbyState = STANDBY_INITIALIZED;
 }
 
 /*
@@ -168,9 +166,6 @@ ShutdownRecoveryTransactionEnvironment(void)
 	if (RecoveryLockHash == NULL)
 		return;
 
-	/* Mark all tracked in-progress transactions as finished. */
-	ExpireAllKnownAssignedTransactionIds();
-
 	/* Release all locks the tracked transactions were holding */
 	StandbyReleaseAllLocks();
 
@@ -1167,7 +1162,7 @@ standby_redo(XLogReaderState *record)
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
 	/* Do nothing if we're not in hot standby mode */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 		return;
 
 	if (info == XLOG_STANDBY_LOCK)
@@ -1182,18 +1177,21 @@ standby_redo(XLogReaderState *record)
 	}
 	else if (info == XLOG_RUNNING_XACTS)
 	{
+		/*
+		 * XXX: running-xacts records were previously used to update the
+		 * known-assigned XIDs, but now we only need them for the snapshot
+		 * builder in logical decoding, and for the pgstat_report_stat(true)
+		 * call below.
+		 */
 		xl_running_xacts *xlrec = (xl_running_xacts *) XLogRecGetData(record);
-		RunningTransactionsData running;
 
-		running.xcnt = xlrec->xcnt;
-		running.subxcnt = xlrec->subxcnt;
-		running.subxid_status = xlrec->subxid_overflow ? SUBXIDS_MISSING : SUBXIDS_IN_ARRAY;
-		running.nextXid = xlrec->nextXid;
-		running.latestCompletedXid = xlrec->latestCompletedXid;
-		running.oldestRunningXid = xlrec->oldestRunningXid;
-		running.xids = xlrec->xids;
-
-		ProcArrayApplyRecoveryInfo(&running);
+		/*
+		 * Remember the oldest XID that was running at the time. Normally, all
+		 * transaction aborts and commits are WAL-logged, so our
+		 * oldestRunningXid value should be up-to-date, but if not, this
+		 * allows us to resynchronize.
+		 */
+		ProcArrayUpdateOldestRunningXid(xlrec->oldestRunningXid);
 
 		/*
 		 * The startup process currently has no convenient way to schedule
@@ -1224,50 +1222,46 @@ standby_redo(XLogReaderState *record)
  *
  * This is used for Hot Standby as follows:
  *
- * We can move directly to STANDBY_SNAPSHOT_READY at startup if we
- * start from a shutdown checkpoint because we know nothing was running
- * at that time and our recovery snapshot is known empty. In the more
- * typical case of an online checkpoint we need to jump through a few
- * hoops to get a correct recovery snapshot and this requires a two or
- * sometimes a three stage process.
+ * We can enter hot standby mode and start accepting read-only queries
+ * immediately at startup if we start from a shutdown checkpoint, because we
+ * know nothing was running at that time and our recovery snapshot is known
+ * empty. In the more typical case of an online checkpoint, the checkpoint
+ * record doesn't contain all the necessary information about running
+ * transaction state, and we need to jump through a few hoops to get a correct
+ * recovery snapshot.
  *
- * The initial snapshot must contain all running xids and all current
- * AccessExclusiveLocks at a point in time on the standby. Assembling
- * that information while the server is running requires many and
- * various LWLocks, so we choose to derive that information piece by
- * piece and then re-assemble that info on the standby. When that
- * information is fully assembled we move to STANDBY_SNAPSHOT_READY.
+ * The initial snapshot must contain all current AccessExclusiveLocks at a
+ * point in time on the standby. Assembling that information while the server
+ * is running requires many and various LWLocks, so we choose to derive that
+ * information piece by piece and then re-assemble that info on the standby.
  *
- * Since locking on the primary when we derive the information is not
- * strict, we note that there is a time window between the derivation and
- * writing to WAL of the derived information. That allows race conditions
- * that we must resolve, since xids and locks may enter or leave the
- * snapshot during that window. This creates the issue that an xid or
- * lock may start *after* the snapshot has been derived yet *before* the
- * snapshot is logged in the running xacts WAL record. We resolve this by
- * starting to accumulate changes at a point just prior to when we derive
- * the snapshot on the primary, then ignore duplicates when we later apply
- * the snapshot from the running xacts record. This is implemented during
- * CreateCheckPoint() where we use the logical checkpoint location as
- * our starting point and then write the running xacts record immediately
- * before writing the main checkpoint WAL record. Since we always start
- * up from a checkpoint and are immediately at our starting point, we
- * unconditionally move to STANDBY_INITIALIZED. After this point we
- * must do 4 things:
+ * Since locking on the primary when we derive the information is not strict,
+ * there is a time window between the derivation and writing to WAL of the
+ * derived information. That allows race conditions that we must resolve,
+ * since xids and locks may enter or leave the snapshot during that
+ * window. This creates the issue that an xid or lock may start *after* the
+ * snapshot has been derived yet *before* the snapshot is logged in the
+ * running xacts WAL record. We resolve this by starting to accumulate changes
+ * at a point just prior to when we collect the lock information on the
+ * primary, then ignore duplicates when we later apply the snapshot from the
+ * running xacts record. This is implemented during CreateCheckPoint() where
+ * we use the logical checkpoint location as our starting point and then write
+ * the running xacts record immediately before writing the main checkpoint WAL
+ * record. Since we always start up from a checkpoint's redo pointer, we will
+ * always see a running-xacts record before reaching the checkpoint
+ * record, and can immediately enter hot standby mode. After this point we
+ * must do 3 things:
  *	* move shared nextXid forwards as we see new xids
  *	* extend the clog and subtrans with each new xid
- *	* keep track of uncommitted known assigned xids
  *	* keep track of uncommitted AccessExclusiveLocks
  *
- * When we see a commit/abort we must remove known assigned xids and locks
- * from the completing transaction. Attempted removals that cannot locate
- * an entry are expected and must not cause an error when we are in state
- * STANDBY_INITIALIZED. This is implemented in StandbyReleaseLocks() and
- * KnownAssignedXidsRemove().
- *
- * Later, when we apply the running xact data we must be careful to ignore
- * transactions already committed, since those commits raced ahead when
- * making WAL entries.
+ * When we see a commit/abort we must advance oldest_running_primary_xid and
+ * remove locks from the completing transaction. Attempted removals that
+ * cannot locate an entry are expected and must not cause an error until we
+ * have seen the running-xacts record. (We don't throw an error even after
+ * that, because whatever the reason was, after the transaction has completed
+ * the issue has already been resolved anyway.) This is implemented in
+ * StandbyReleaseLocks().
  *
  * For logical decoding only the running xacts information is needed;
  * there's no need to look at the locking information, but it's logged anyway,
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db6ed784ab..60f93a39a4 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -130,6 +130,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_XACT_BUFFER] = "XactBuffer",
 	[LWTRANCHE_COMMITTS_BUFFER] = "CommitTsBuffer",
 	[LWTRANCHE_SUBTRANS_BUFFER] = "SubtransBuffer",
+	[LWTRANCHE_CSN_LOG_BUFFER] = "CsnLogBuffer",
 	[LWTRANCHE_MULTIXACTOFFSET_BUFFER] = "MultiXactOffsetBuffer",
 	[LWTRANCHE_MULTIXACTMEMBER_BUFFER] = "MultiXactMemberBuffer",
 	[LWTRANCHE_NOTIFY_BUFFER] = "NotifyBuffer",
@@ -166,6 +167,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
 	[LWTRANCHE_XACT_SLRU] = "XactSLRU",
 	[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+	[LWTRANCHE_CSN_LOG_SLRU] = "CsnLogSLRU",
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 16144c2b72..aaceab7771 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -357,6 +357,7 @@ SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> s
 XactBuffer	"Waiting for I/O on a transaction status SLRU buffer."
 CommitTsBuffer	"Waiting for I/O on a commit timestamp SLRU buffer."
 SubtransBuffer	"Waiting for I/O on a sub-transaction SLRU buffer."
+CsnLogBuffer	"Waiting for I/O on a CSN log SLRU buffer."
 MultiXactOffsetBuffer	"Waiting for I/O on a multixact offset SLRU buffer."
 MultiXactMemberBuffer	"Waiting for I/O on a multixact member SLRU buffer."
 NotifyBuffer	"Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index d772544377..ffbfae84b8 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -77,6 +77,8 @@ provider postgresql {
 	probe clog__checkpoint__done(bool);
 	probe subtrans__checkpoint__start(bool);
 	probe subtrans__checkpoint__done(bool);
+	probe csnlog__checkpoint__start(bool);
+	probe csnlog__checkpoint__done(bool);
 	probe multixact__checkpoint__start(bool);
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f2..da82def846 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -48,6 +48,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -201,6 +202,7 @@ typedef struct SerializedSnapshotData
 	CommandId	curcid;
 	TimestampTz whenTaken;
 	XLogRecPtr	lsn;
+	XLogRecPtr	snapshotCsn;
 } SerializedSnapshotData;
 
 /*
@@ -1729,6 +1731,7 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
 	serialized_snapshot.curcid = snapshot->curcid;
 	serialized_snapshot.whenTaken = snapshot->whenTaken;
 	serialized_snapshot.lsn = snapshot->lsn;
+	serialized_snapshot.snapshotCsn = snapshot->snapshotCsn;
 
 	/*
 	 * Ignore the SubXID array if it has overflowed, unless the snapshot was
@@ -1803,6 +1806,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->curcid = serialized_snapshot.curcid;
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
+	snapshot->snapshotCsn = serialized_snapshot.snapshotCsn;
 	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
@@ -1913,36 +1917,11 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 	}
 	else
 	{
-		/*
-		 * In recovery we store all xids in the subxip array because it is by
-		 * far the bigger array, and we mostly don't know which xids are
-		 * top-level and which are subxacts. The xip array is empty.
-		 *
-		 * We start by searching subtrans, if we overflowed.
-		 */
-		if (snapshot->suboverflowed)
-		{
-			/*
-			 * Snapshot overflowed, so convert xid to top-level.  This is safe
-			 * because we eliminated too-old XIDs above.
-			 */
-			xid = SubTransGetTopmostTransaction(xid);
+		XLogRecPtr	csn = CSNLogGetCSNByXid(xid);
 
-			/*
-			 * If xid was indeed a subxact, we might now have an xid < xmin,
-			 * so recheck to avoid an array scan.  No point in rechecking
-			 * xmax.
-			 */
-			if (TransactionIdPrecedes(xid, snapshot->xmin))
-				return false;
-		}
-
-		/*
-		 * We now have either a top-level xid higher than xmin or an
-		 * indeterminate xid. We don't know whether it's top level or subxact
-		 * but it doesn't matter. If it's present, the xid is visible.
-		 */
-		if (pg_lfind32(xid, snapshot->subxip, snapshot->subxcnt))
+		if (csn != InvalidXLogRecPtr && csn <= snapshot->snapshotCsn)
+			return false;
+		else
 			return true;
 	}
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9a91830783..dfe80eaa0d 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -249,7 +249,8 @@ static const char *const subdirs[] = {
 	"pg_xact",
 	"pg_logical",
 	"pg_logical/snapshots",
-	"pg_logical/mappings"
+	"pg_logical/mappings",
+	"pg_csn"
 };
 
 
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
new file mode 100644
index 0000000000..f8cdf573ae
--- /dev/null
+++ b/src/include/access/csn_log.h
@@ -0,0 +1,30 @@
+/*
+ * csn_log.h
+ *
+ * Mapping from XID to commit record's LSN (Commit Sequence Number).
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_log.h
+ */
+#ifndef CSNLOG_H
+#define CSNLOG_H
+
+#include "access/xlog.h"
+#include "utils/snapshot.h"
+
+extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
+						 TransactionId *subxids, XLogRecPtr csn);
+extern XLogRecPtr CSNLogGetCSNByXid(TransactionId xid);
+
+extern Size CSNLogShmemSize(void);
+extern void CSNLogShmemInit(void);
+extern void BootStrapCSNLog(void);
+extern void StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn);
+extern void ShutdownCSNLog(void);
+extern void CheckPointCSNLog(void);
+extern void ExtendCSNLog(TransactionId newestXact);
+extern void TruncateCSNLog(TransactionId oldestXact);
+
+#endif							/* CSNLOG_H */
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 28a2d287fd..a7054fe11c 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -238,6 +238,9 @@ typedef struct TransamVariablesData
 	FullTransactionId latestCompletedXid;	/* newest full XID that has
 											 * committed or aborted */
 
+	/* During recovery, LSN of latest replayed commit record */
+	XLogRecPtr	latestCommitLSN;
+
 	/*
 	 * Number of top-level transactions with xids (i.e. which may have
 	 * modified the database) that completed in some form since the start of
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index b85b65c604..58ed0fc038 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -47,8 +47,7 @@ extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
 
-extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
-												 int *nxids_p);
+extern TransactionId PrescanPreparedTransactions(void);
 extern void StandbyRecoverPreparedTransactions(void);
 extern void RecoverPreparedTransactions(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index fb64d7413a..240cbfd417 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -171,7 +171,7 @@ typedef struct SavedTransactionCharacteristics
 #define XLOG_XACT_ABORT				0x20
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
-#define XLOG_XACT_ASSIGNMENT		0x50
+/* 0x50 is unused, was XLOG_XACT_ASSIGNMENT */
 #define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
@@ -215,15 +215,6 @@ typedef struct SavedTransactionCharacteristics
 #define XactCompletionForceSyncCommit(xinfo) \
 	((xinfo & XACT_COMPLETION_FORCE_SYNC_COMMIT) != 0)
 
-typedef struct xl_xact_assignment
-{
-	TransactionId xtop;			/* assigned XID's top-level XID */
-	int			nsubxacts;		/* number of subtransaction XIDs */
-	TransactionId xsub[FLEXIBLE_ARRAY_MEMBER];	/* assigned subxids */
-} xl_xact_assignment;
-
-#define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
-
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -448,7 +439,6 @@ extern FullTransactionId GetTopFullTransactionId(void);
 extern FullTransactionId GetTopFullTransactionIdIfAny(void);
 extern FullTransactionId GetCurrentFullTransactionId(void);
 extern FullTransactionId GetCurrentFullTransactionIdIfAny(void);
-extern void MarkCurrentTransactionIdLoggedIfAny(void);
 extern bool SubTransactionIsActive(SubTransactionId subxid);
 extern CommandId GetCurrentCommandId(bool used);
 extern void SetParallelStartTimestamps(TimestampTz xact_ts, TimestampTz stmt_ts);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 20950ce033..19cb5f33bd 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -27,37 +27,10 @@ extern PGDLLIMPORT bool ignore_invalid_pages;
 extern PGDLLIMPORT bool InRecovery;
 
 /*
- * Like InRecovery, standbyState is only valid in the startup process.
- * In all other processes it will have the value STANDBY_DISABLED (so
- * InHotStandby will read as false).
- *
- * In DISABLED state, we're performing crash recovery or hot standby was
- * disabled in postgresql.conf.
- *
- * In INITIALIZED state, we've run InitRecoveryTransactionEnvironment, but
- * we haven't yet processed a RUNNING_XACTS or shutdown-checkpoint WAL record
- * to initialize our primary-transaction tracking system.
- *
- * When the transaction tracking is initialized, we enter the SNAPSHOT_PENDING
- * state. The tracked information might still be incomplete, so we can't allow
- * connections yet, but redo functions must update the in-memory state when
- * appropriate.
- *
- * In SNAPSHOT_READY mode, we have full knowledge of transactions that are
- * (or were) running on the primary at the current WAL location. Snapshots
- * can be taken, and read-only queries can be run.
+ * Like InRecovery, InHotStandby is only valid in the startup process.
+ * In all other processes it will be false.
  */
-typedef enum
-{
-	STANDBY_DISABLED,
-	STANDBY_INITIALIZED,
-	STANDBY_SNAPSHOT_PENDING,
-	STANDBY_SNAPSHOT_READY,
-} HotStandbyState;
-
-extern PGDLLIMPORT HotStandbyState standbyState;
-
-#define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)
+extern PGDLLIMPORT bool InHotStandby;
 
 
 extern bool XLogHaveInvalidPages(void);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e0..c2156aca12 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -179,6 +179,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_XACT_BUFFER = NUM_INDIVIDUAL_LWLOCKS,
 	LWTRANCHE_COMMITTS_BUFFER,
 	LWTRANCHE_SUBTRANS_BUFFER,
+	LWTRANCHE_CSN_LOG_BUFFER,
 	LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 	LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 	LWTRANCHE_NOTIFY_BUFFER,
@@ -215,6 +216,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_SUBTRANS_SLRU,
 	LWTRANCHE_XACT_SLRU,
 	LWTRANCHE_PARALLEL_VACUUM_DSA,
+	LWTRANCHE_CSN_LOG_SLRU,
 	LWTRANCHE_FIRST_USER_DEFINED,
 }			BuiltinTrancheIds;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 56af0b40b3..de74fce24e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -28,18 +28,11 @@ extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayClearTransaction(PGPROC *proc);
 
+extern void ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID);
 extern void ProcArrayInitRecovery(TransactionId initializedUptoXID);
-extern void ProcArrayApplyRecoveryInfo(RunningTransactions running);
-extern void ProcArrayApplyXidAssignment(TransactionId topxid,
-										int nsubxids, TransactionId *subxids);
 
 extern void RecordKnownAssignedTransactionIds(TransactionId xid);
-extern void ExpireTreeKnownAssignedTransactionIds(TransactionId xid,
-												  int nsubxids, TransactionId *subxids,
-												  TransactionId max_xid);
-extern void ExpireAllKnownAssignedTransactionIds(void);
-extern void ExpireOldKnownAssignedTransactionIds(TransactionId xid);
-extern void KnownAssignedTransactionIdsIdleMaintenance(void);
+extern void ProcArrayRecoveryEndTransaction(TransactionId max_xid, XLogRecPtr lsn);
 
 extern int	GetMaxSnapshotXidCount(void);
 extern int	GetMaxSnapshotSubxidCount(void);
@@ -56,7 +49,7 @@ extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
 extern TransactionId GetOldestNonRemovableTransactionId(Relation rel);
 extern TransactionId GetOldestTransactionIdConsideredRunning(void);
-extern TransactionId GetOldestActiveTransactionId(void);
+extern TransactionId GetOldestActiveTransactionId(bool allDbs);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *xmin, TransactionId *catalog_xmin);
 
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 8d1e31e888..1fda5b06f6 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -181,6 +181,13 @@ typedef struct SnapshotData
 	int32		subxcnt;		/* # of xact ids in subxip[] */
 	bool		suboverflowed;	/* has the subxip array overflowed? */
 
+	/*
+	 * MVCC snapshots taken during recovery use this CSN instead of the xip
+	 * and subxip arrays. Any transaction that committed at or before this
+	 * LSN is considered visible.
+	 */
+	XLogRecPtr	snapshotCsn;
+
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 	bool		copied;			/* false if it's a static snapshot */
 
-- 
2.39.5
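
To make the visibility rule in the XidInMVCCSnapshot() hunk of the patch above concrete, here is a minimal standalone sketch in C. It is not code from the patch: the Demo* types, the fixed-size array standing in for the CSN log SLRU, and both helper functions are hypothetical names made up for illustration. The xmin/xmax range checks that come before the lookup in the real function are left out, and only replayed commits are modeled. The rule it shows is the one from the diff: an XID counts as still in progress for a snapshot unless a commit LSN is recorded for it and that LSN is at or before the snapshot's CSN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for TransactionId and XLogRecPtr. */
typedef uint32_t DemoXid;
typedef uint64_t DemoLsn;

#define DEMO_INVALID_LSN ((DemoLsn) 0)
#define DEMO_MAX_XIDS 16

/*
 * Toy CSN log: demo_csn_log[xid] holds the LSN of the replayed commit
 * record for xid, or DEMO_INVALID_LSN if no commit has been replayed.
 * The patch keeps this mapping in an SLRU; an array is an assumption
 * made only to keep the example small.
 */
static DemoLsn demo_csn_log[DEMO_MAX_XIDS];

/* Record the commit LSN of xid, as commit replay would. */
void
demo_set_csn(DemoXid xid, DemoLsn commit_lsn)
{
	assert(xid < DEMO_MAX_XIDS);
	assert(commit_lsn != DEMO_INVALID_LSN);
	demo_csn_log[xid] = commit_lsn;
}

/*
 * Returns true if xid is "in the snapshot", i.e. must be treated as still
 * in progress, and false if its commit is visible to the snapshot.
 */
bool
demo_xid_in_snapshot(DemoXid xid, DemoLsn snapshot_csn)
{
	DemoLsn		csn;

	assert(xid < DEMO_MAX_XIDS);
	csn = demo_csn_log[xid];

	/* Committed at or before the snapshot's CSN: visible. */
	if (csn != DEMO_INVALID_LSN && csn <= snapshot_csn)
		return false;

	/* No commit replayed yet, or it committed after the snapshot. */
	return true;
}
```

For instance, after demo_set_csn(3, 1000) and demo_set_csn(5, 2000), a snapshot taken with CSN 1500 sees XID 3 as committed, while XID 5 and any XID without a recorded commit are still treated as in progress. Nothing in the check depends on how many XIDs or sub-XIDs were running, which is why the fixed-size array and the overflow handling are no longer needed.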

Attachment: v5-0003-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch (text/x-patch; charset=UTF-8)

From 0de32cdb87c84440516e4d6ee1ac2d55e2f2c3d2 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 20:01:07 +0300
Subject: [PATCH v5 3/5] Make SnapBuildWaitSnapshot work without
 xl_running_xacts.xids array

SnapBuildWaitSnapshot looped through all the XIDs in the
xl_running_xacts, waiting for them to finish. Change it to grab the
list of running XIDs from the proc array instead. This removes the
last usage of the XIDs array in the xl_running_xacts record, allowing
it to be removed in the next commit.

When SnapBuildWaitSnapshot() is called with running->nextXid as the
'cutoff' point, the new code should wait for exactly the same set of
transactions as before. But when called with initial_xmin_horizon as
the 'cutoff', this might wait for more transactions than before: those
between running->nextXid and initial_xmin_horizon. For example,
imagine that we see a running-xacts record with nextXid 100, and
initial_xmin_horizon is 200. Before, we would wait for all XIDs < 100
to complete, and then log the standby snapshot and proceed, but now we
will wait for all XIDs < 200. I believe that's a good thing, because
we won't actually be able to move to the next state in the snapshot
building until all transactions < 200 have completed. The
running-xacts snapshot that we logged after waiting up to XID 100
would not be useful to us either, if there are still XIDs between 100
and 200 running.

SnapBuildWaitSnapshot() used to do useless work when called in a
standby, because in a standby, there are no XID locks and the
XactLockTableWait() calls returned immediately, even if the XIDs were
in fact still running in the primary. But as the comment says, the
waiting isn't necessary for correctness, so that was harmless. In any
case, stop doing the futile work on a standby.
---
 src/backend/replication/logical/snapbuild.c | 50 ++++++++++++++-------
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 734865ce62..31da0832cc 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -168,7 +168,7 @@ static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, Transaction
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
-static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
+static void SnapBuildWaitSnapshot(TransactionId cutoff);
 
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
@@ -1222,14 +1222,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		NormalTransactionIdPrecedes(running->oldestRunningXid,
 									builder->initial_xmin_horizon))
 	{
+		TransactionId cutoff;
+
 		ereport(DEBUG1,
 				(errmsg_internal("skipping snapshot at %X/%X while building logical decoding snapshot, xmin horizon too low",
 								 LSN_FORMAT_ARGS(lsn)),
 				 errdetail_internal("initial xmin horizon of %u vs the snapshot's %u",
 									builder->initial_xmin_horizon, running->oldestRunningXid)));
 
-
-		SnapBuildWaitSnapshot(running, builder->initial_xmin_horizon);
+		cutoff = builder->initial_xmin_horizon;
+		TransactionIdRetreat(cutoff);
+		SnapBuildWaitSnapshot(cutoff);
 
 		return true;
 	}
@@ -1316,7 +1319,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
 						   running->xcnt, running->nextXid)));
 
-		SnapBuildWaitSnapshot(running, running->nextXid);
+		SnapBuildWaitSnapshot(running->nextXid);
 	}
 
 	/*
@@ -1340,7 +1343,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
 						   running->xcnt, running->nextXid)));
 
-		SnapBuildWaitSnapshot(running, running->nextXid);
+		SnapBuildWaitSnapshot(running->nextXid);
 	}
 
 	/*
@@ -1373,8 +1376,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 }
 
 /* ---
- * Iterate through xids in record, wait for all older than the cutoff to
- * finish.  Then, if possible, log a new xl_running_xacts record.
+ * Wait for all transactions older than or equal to the cutoff to finish.
+ * Then, if possible, log a new xl_running_xacts record.
  *
  * This isn't required for the correctness of decoding, but to:
  * a) allow isolationtester to notice that we're currently waiting for
@@ -1384,13 +1387,31 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
  * ---
  */
 static void
-SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
+SnapBuildWaitSnapshot(TransactionId cutoff)
 {
-	int			off;
+	RunningTransactions running;
+
+	if (RecoveryInProgress())
+	{
+		/*
+		 * During recovery, we have no mechanism for waiting for an XID to
+		 * finish, and we cannot create new running-xacts records either.
+		 */
+		return;
+	}
+
+	running = GetRunningTransactionData();
+
+	/*
+	 * GetRunningTransactionData returns with XidGenLock and ProcArrayLock
+	 * held, but we don't need them.
+	 */
+	LWLockRelease(XidGenLock);
+	LWLockRelease(ProcArrayLock);
 
-	for (off = 0; off < running->xcnt; off++)
+	for (int i = 0; i < running->xcnt; i++)
 	{
-		TransactionId xid = running->xids[off];
+		TransactionId xid = running->xids[i];
 
 		/*
 		 * Upper layers should prevent that we ever need to wait on ourselves.
@@ -1400,7 +1421,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
 		if (TransactionIdIsCurrentTransactionId(xid))
 			elog(ERROR, "waiting for ourselves");
 
-		if (TransactionIdFollows(xid, cutoff))
+		if (TransactionIdFollowsOrEquals(xid, cutoff))
 			continue;
 
 		XactLockTableWait(xid, NULL, NULL, XLTW_None);
@@ -1412,10 +1433,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
 	 * wait for bgwriter or checkpointer to log one.  During recovery we can't
 	 * enforce that, so we'll have to wait.
 	 */
-	if (!RecoveryInProgress())
-	{
-		LogStandbySnapshot();
-	}
+	LogStandbySnapshot();
 }
 
 #define SnapBuildOnDiskConstantSize \
-- 
2.39.5

v5-0004-Remove-the-now-unused-xids-array-from-xl_running_.patch (text/x-patch; charset=UTF-8)

From 2b8672e020b9e1701ff2a16d18313ad8d37cb79e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 16:40:57 +0300
Subject: [PATCH v5 4/5] Remove the now-unused xids array from xl_running_xacts

We still generate running-xacts records, because they are still needed
to initialize the snapshot in logical decoding.
---
 src/backend/access/rmgrdesc/standbydesc.c   | 18 ------------
 src/backend/replication/logical/snapbuild.c |  8 +++---
 src/backend/storage/ipc/standby.c           | 32 +++++----------------
 src/include/storage/standby.h               |  2 --
 src/include/storage/standbydefs.h           | 16 +++++++----
 5 files changed, 21 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index 32e509a400..99f08beb4a 100644
--- a/src/backend/access/rmgrdesc/standbydesc.c
+++ b/src/backend/access/rmgrdesc/standbydesc.c
@@ -19,28 +19,10 @@
 static void
 standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
 {
-	int			i;
-
 	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u",
 					 xlrec->nextXid,
 					 xlrec->latestCompletedXid,
 					 xlrec->oldestRunningXid);
-	if (xlrec->xcnt > 0)
-	{
-		appendStringInfo(buf, "; %d xacts:", xlrec->xcnt);
-		for (i = 0; i < xlrec->xcnt; i++)
-			appendStringInfo(buf, " %u", xlrec->xids[i]);
-	}
-
-	if (xlrec->subxid_overflow)
-		appendStringInfoString(buf, "; subxid overflowed");
-
-	if (xlrec->subxcnt > 0)
-	{
-		appendStringInfo(buf, "; %d subxacts:", xlrec->subxcnt);
-		for (i = 0; i < xlrec->subxcnt; i++)
-			appendStringInfo(buf, " %u", xlrec->xids[xlrec->xcnt + i]);
-	}
 }
 
 void
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 31da0832cc..cac3ffe577 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1316,8 +1316,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		ereport(LOG,
 				(errmsg("logical decoding found initial starting point at %X/%X",
 						LSN_FORMAT_ARGS(lsn)),
-				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
-						   running->xcnt, running->nextXid)));
+				 errdetail("Waiting for transactions older than %u to end.",
+						   running->nextXid)));
 
 		SnapBuildWaitSnapshot(running->nextXid);
 	}
@@ -1340,8 +1340,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		ereport(LOG,
 				(errmsg("logical decoding found initial consistent point at %X/%X",
 						LSN_FORMAT_ARGS(lsn)),
-				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
-						   running->xcnt, running->nextXid)));
+				 errdetail("Waiting for transactions older than %u to end.",
+						   running->nextXid)));
 
 		SnapBuildWaitSnapshot(running->nextXid);
 	}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index e02c9ab842..6ed46bed03 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1337,9 +1337,6 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	xl_running_xacts xlrec;
 	XLogRecPtr	recptr;
 
-	xlrec.xcnt = CurrRunningXacts->xcnt;
-	xlrec.subxcnt = CurrRunningXacts->subxcnt;
-	xlrec.subxid_overflow = (CurrRunningXacts->subxid_status != SUBXIDS_IN_ARRAY);
 	xlrec.nextXid = CurrRunningXacts->nextXid;
 	xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid;
 	xlrec.latestCompletedXid = CurrRunningXacts->latestCompletedXid;
@@ -1347,31 +1344,16 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	/* Header */
 	XLogBeginInsert();
 	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
-	XLogRegisterData((char *) (&xlrec), MinSizeOfXactRunningXacts);
-
-	/* array of TransactionIds */
-	if (xlrec.xcnt > 0)
-		XLogRegisterData((char *) CurrRunningXacts->xids,
-						 (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
+	XLogRegisterData((char *) (&xlrec), SizeOfXactRunningXacts);
 
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
-	if (xlrec.subxid_overflow)
-		elog(DEBUG2,
-			 "snapshot of %d running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
-			 CurrRunningXacts->xcnt,
-			 LSN_FORMAT_ARGS(recptr),
-			 CurrRunningXacts->oldestRunningXid,
-			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
-	else
-		elog(DEBUG2,
-			 "snapshot of %d+%d running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
-			 CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
-			 LSN_FORMAT_ARGS(recptr),
-			 CurrRunningXacts->oldestRunningXid,
-			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
+	elog(DEBUG2,
+		 "logging running transaction bounds (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
+		 LSN_FORMAT_ARGS(recptr),
+		 CurrRunningXacts->oldestRunningXid,
+		 CurrRunningXacts->latestCompletedXid,
+		 CurrRunningXacts->nextXid);
 
 	/*
 	 * Ensure running_xacts information is synced to disk not too far in the
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index cce0bc521e..9d5a298a39 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -60,8 +60,6 @@ extern void StandbyReleaseLockTree(TransactionId xid,
 extern void StandbyReleaseAllLocks(void);
 extern void StandbyReleaseOldLocks(TransactionId oldxid);
 
-#define MinSizeOfXactRunningXacts offsetof(xl_running_xacts, xids)
-
 
 /*
  * Declarations for GetRunningTransactionData(). Similar to Snapshots, but
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index fe12f463a8..d858209447 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -42,20 +42,24 @@ typedef struct xl_standby_locks
 } xl_standby_locks;
 
 /*
- * When we write running xact data to WAL, we use this structure.
+ * Data included in an XLOG_RUNNING_XACTS record.
+ *
+ * This used to include a list of running XIDs, hence the name, but nowadays
+ * this only contains the min and max bounds of the transactions that were
+ * running when the record was written.  They are needed to initialize logical
+ * decoding.  They are also used in hot standby to prune information about old
+ * running transactions, in case the primary didn't write a COMMIT/ABORT
+ * record for some reason.
  */
 typedef struct xl_running_xacts
 {
-	int			xcnt;			/* # of xact ids in xids[] */
-	int			subxcnt;		/* # of subxact ids in xids[] */
-	bool		subxid_overflow;	/* snapshot overflowed, subxids missing */
 	TransactionId nextXid;		/* xid from TransamVariables->nextXid */
 	TransactionId oldestRunningXid; /* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
-
-	TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
 } xl_running_xacts;
 
+#define SizeOfXactRunningXacts sizeof(xl_running_xacts)
+
 /*
  * Invalidations for standby, currently only when transactions without an
  * assigned xid commit.
-- 
2.39.5

v5-0005-Add-a-cache-to-Snapshot-to-avoid-repeated-CSN-loo.patch (text/x-patch; charset=UTF-8)

From 67ad6b0b77df83d6ea95ab2b921540e2288be372 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 3 Dec 2024 15:42:03 +0200
Subject: [PATCH v5 5/5] Add a cache to Snapshot to avoid repeated CSN lookups

Cache the status of all XIDs that have been looked up in the CSN log
in the SnapshotData. This avoids having to go to the CSN log in the
common case that the same XIDs are looked up over and over again.
---
 src/backend/utils/time/snapmgr.c | 92 ++++++++++++++++++++++++++++++--
 src/include/utils/snapshot.h     | 10 +++-
 2 files changed, 96 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index da82def846..df9e8ba37f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -66,6 +66,35 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Define a radix tree implementation to cache CSN lookups in a snapshot.
+ *
+ * We need only one bit of information for each XID stored in the cache: was
+ * the XID running or not.  However, the radix tree implementation uses 8
+ * bytes for each entry (on 64-bit machines) even if the value type is smaller
+ * than that.  To reduce memory usage, we use uint64 as the value type, and
+ * store multiple XIDs in each value.
+ *
+ * The 64-bit value word holds two bits for each XID: whether the XID is
+ * present in the cache or not, and if it's present, whether it's considered
+ * as in-progress by the snapshot or not.  So each entry in the radix tree
+ * holds the status for 32 XIDs.
+ */
+#define RT_PREFIX inprogress_cache
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define INPROGRESS_CACHE_BITS 2
+#define INPROGRESS_CACHE_XIDS_PER_WORD 32
+
+#define INPROGRESS_CACHE_XID_IS_CACHED(word, slotno) \
+	((((word) & (UINT64CONST(1) << (slotno)))) != 0)
+
+#define INPROGRESS_CACHE_XID_IS_IN_PROGRESS(word, slotno) \
+	((((word) & (UINT64CONST(1) << ((slotno) + 1)))) != 0)
 
 /*
  * CurrentSnapshot points to the only snapshot taken in transaction-snapshot
@@ -595,6 +624,12 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->copied = true;
 	newsnap->snapXactCompletionCount = 0;
 
+	/*
+	 * TODO: If we had a separate reference count on the cache, we could share
+	 * it between the copies.
+	 */
+	newsnap->inprogress_cache = NULL;
+
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
 	{
@@ -609,7 +644,7 @@ CopySnapshot(Snapshot snapshot)
 	 * Setup subXID array. Don't bother to copy it if it had overflowed,
 	 * though, because it's not used anywhere in that case. Except if it's a
 	 * snapshot taken during recovery; all the top-level XIDs are in subxip as
-	 * well in that case, so we mustn't lose them.
+	 * well in that case, so we mustn't lose them. XXX
 	 */
 	if (snapshot->subxcnt > 0 &&
 		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
@@ -635,6 +670,8 @@ FreeSnapshot(Snapshot snapshot)
 	Assert(snapshot->active_count == 0);
 	Assert(snapshot->copied);
 
+	if (snapshot->inprogress_cache)
+		inprogress_cache_free(snapshot->inprogress_cache);
 	pfree(snapshot);
 }
 
@@ -1807,6 +1844,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
 	snapshot->snapshotCsn = serialized_snapshot.snapshotCsn;
+	snapshot->inprogress_cache = NULL;
 	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
@@ -1917,12 +1955,56 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 	}
 	else
 	{
-		XLogRecPtr	csn = CSNLogGetCSNByXid(xid);
+		XLogRecPtr	csn;
+		bool		inprogress;
+		uint64	   *cache_entry = NULL;
+		uint64		cache_word = 0;
+
+		/*
+		 * Calculate the word and bit slot for the XID in the cache. We use an
+		 * offset from xmax as the key instead of the XID directly, because
+		 * the radix tree can compact away leading zeros and is thus slightly
+		 * more efficient with keys closer to 0.
+		 */
+		uint32		cache_idx = snapshot->xmax - xid;
+		uint64		wordno = cache_idx / INPROGRESS_CACHE_XIDS_PER_WORD;
+		uint64		slotno = (cache_idx % INPROGRESS_CACHE_XIDS_PER_WORD) * INPROGRESS_CACHE_BITS;
+
+		if (snapshot->inprogress_cache)
+		{
+			cache_entry = inprogress_cache_find(snapshot->inprogress_cache, wordno);
+			if (cache_entry)
+			{
+				cache_word = *cache_entry;
+				if (INPROGRESS_CACHE_XID_IS_CACHED(cache_word, slotno))
+					return INPROGRESS_CACHE_XID_IS_IN_PROGRESS(cache_word, slotno);
+			}
+		}
+
+		/* Not found in cache, look up the CSN */
+		csn = CSNLogGetCSNByXid(xid);
+		inprogress = (csn == InvalidXLogRecPtr || csn > snapshot->snapshotCsn);
 
-		if (csn != InvalidXLogRecPtr && csn <= snapshot->snapshotCsn)
-			return false;
+		/* Update the cache word, and store it back to the radix tree */
+		cache_word |= UINT64CONST(1) << slotno;	/* cached */
+		if (inprogress)
+			cache_word |= UINT64CONST(1) << (slotno + 1);	/* in-progress */
+
+		if (!snapshot->inprogress_cache)
+		{
+			MemoryContext cache_ctx;
+
+			cache_ctx = AllocSetContextCreate(TopTransactionContext,
+											  "snapshot inprogress cache context",
+											  ALLOCSET_SMALL_SIZES);
+			snapshot->inprogress_cache = inprogress_cache_create(cache_ctx);
+		}
+		if (cache_entry)
+			*cache_entry = cache_word;
 		else
-			return true;
+			inprogress_cache_set(snapshot->inprogress_cache, wordno, &cache_word);
+
+		return inprogress;
 	}
 
 	return false;
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 1fda5b06f6..3fb7572879 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -122,6 +122,8 @@ typedef struct SnapshotData *Snapshot;
 
 #define InvalidSnapshot		((Snapshot) NULL)
 
+struct inprogress_cache_radix_tree; /* private to snapmgr.c */
+
 /*
  * Struct representing all kind of possible snapshots.
  *
@@ -158,7 +160,7 @@ typedef struct SnapshotData
 	TransactionId xmax;			/* all XID >= xmax are invisible to me */
 
 	/*
-	 * For normal MVCC snapshot this contains the all xact IDs that are in
+	 * For normal MVCC snapshot this contains all the xact IDs that are in
 	 * progress, unless the snapshot was taken during recovery in which case
 	 * it's empty. For historic MVCC snapshots, the meaning is inverted, i.e.
 	 * it contains *committed* transactions between xmin and xmax.
@@ -188,6 +190,12 @@ typedef struct SnapshotData
 	 */
 	XLogRecPtr	snapshotCsn;
 
+	/*
+	 * Cache of XIDs known to be running or not according to the snapshot.
+	 * Used in snapshots taken during recovery.
+	 */
+	struct inprogress_cache_radix_tree *inprogress_cache;
+
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 	bool		copied;			/* false if it's a static snapshot */
 
-- 
2.39.5

#12

John Naylor

johncnaylorls@gmail.com

about 1 year ago

In reply to: Heikki Linnakangas (#11)

Re: CSN snapshots in hot standby

On Tue, Dec 3, 2024 at 9:25 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 20/11/2024 15:33, John Naylor wrote:
I did find one weird thing that makes a big difference: I originally
used AllocSetContextCreate(..., ALLOCSET_DEFAULT_SIZES) for the radix
tree's memory context. With that, XidInMVCCSnapshot() takes about 19% of
the CPU time in that test. When I changed that to ALLOCSET_SMALL_SIZES,
it falls down to the 4% figure. And weirdly enough, in both cases the time
seems to be spent in the malloc() call from SlabContextCreate(), not
AllocSetContextCreate(). I think doing this particular mix of large and
small allocations with malloc() somehow poisons its free list or
something. So this is probably heavily dependent on the malloc()
implementation. In any case, ALLOCSET_SMALL_SIZES is clearly a better
choice here, even without that effect.

Hmm, interesting. That passed context is needed for 4 things:
1. allocated values (not used here for 64-bit, and 32-bit could be
made to work the same way)
2. iteration state (not used here)
3. a convenient place to put slab child contexts so we can free them easily
4. a place to put the "control object" -- this is really only needed
for shared memory and I have a personal todo to embed it rather than
allocate it for the local memory case.

Removing the need for a passed context for callers that don't need it
is additional possible future work.

Anyway, 0005 looks good to me.

--
John Naylor
Amazon Web Services

#13

Heikki Linnakangas

hlinnaka@iki.fi

10 months ago

In reply to: Heikki Linnakangas (#11)

12 attachment(s)

Re: CSN snapshots in hot standby

Here's a new patchset version. Not much has changed in the actual CSN
patches. But I spent a lot of time refactoring the snapshot management
code, so that there is a simple place to add the "inprogress XID cache"
for the CSN snapshots, in a way that avoids duplicating the cache if a
snapshot is copied around.

Patches 0001-0002 are the patches I posted on a separate thread earlier.
See
/messages/by-id/ec10d398-c9b3-4542-8095-5fc6408b17d1@iki.fi.

Patches 0003-0006 contain more snapshot manager changes. The end state
is that an MVCC snapshot consists of two structs: a shared "inner"
struct that contains xmin, xmax and the XID lists, and an "outer" struct
that contains a pointer to the shared struct and the current command ID.
As a snapshot is copied around, all the copies share the same shared,
reference-counted struct.
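
To make the inner/outer split concrete, here is a minimal standalone sketch of that shape. It is not the code from patches 0003-0006: the struct and function names (SharedSnapshotSketch, OuterSnapshotSketch, snapshot_sketch_*) are hypothetical, plain malloc/free stands in for PostgreSQL memory contexts, and the typedefs only mimic the real ones.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */
typedef uint32_t CommandId;		/* stand-in for the PostgreSQL typedef */

/* Hypothetical shared "inner" part: xmin, xmax and the XID list. */
typedef struct SharedSnapshotSketch
{
	uint32_t	refcount;		/* number of outer structs pointing here */
	TransactionId xmin;
	TransactionId xmax;
	uint32_t	xcnt;			/* # of entries in xip (0 in this sketch) */
	TransactionId *xip;
} SharedSnapshotSketch;

/* Hypothetical "outer" part: pointer to the shared part plus command ID. */
typedef struct OuterSnapshotSketch
{
	SharedSnapshotSketch *shared;
	CommandId	curcid;
} OuterSnapshotSketch;

/* Build a new snapshot that owns the first reference to its shared part. */
static OuterSnapshotSketch
snapshot_sketch_create(TransactionId xmin, TransactionId xmax, CommandId curcid)
{
	OuterSnapshotSketch snap;

	snap.shared = calloc(1, sizeof(SharedSnapshotSketch));
	assert(snap.shared != NULL);
	snap.shared->refcount = 1;
	snap.shared->xmin = xmin;
	snap.shared->xmax = xmax;
	snap.curcid = curcid;
	return snap;
}

/* Copying only bumps the reference count; the XID data is not duplicated. */
static OuterSnapshotSketch
snapshot_sketch_copy(const OuterSnapshotSketch *src, CommandId curcid)
{
	OuterSnapshotSketch copy = *src;

	copy.shared->refcount++;
	copy.curcid = curcid;
	return copy;
}

/* Drop one reference; free the shared part when the last one goes away. */
static uint32_t
snapshot_sketch_release(OuterSnapshotSketch *snap)
{
	uint32_t	remaining = --snap->shared->refcount;

	if (remaining == 0)
	{
		free(snap->shared->xip);
		free(snap->shared);
	}
	snap->shared = NULL;
	return remaining;
}
```

With this shape, anything attached to the shared struct (such as a lookup cache) is automatically shared by all copies, while each copy can still carry its own command ID.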

The rest of the patches are the same CSN patches I posted before,
rebased over the snapshot manager changes.

There's one thing that hasn't been discussed yet: The
ProcArrayRecoveryEndTransaction() function, which replaces
ExpireTreeKnownAssignedTransactionIds() and is called on replay of every
commit/abort record, does this:

    /*
     * If this was the oldest XID that was still running, advance it. This is
     * important for advancing the global xmin, which avoids unnecessary
     * recovery conflicts
     *
     * No locking required because this runs in the startup process.
     *
     * XXX: the caller actually has a list of XIDs that just committed. We
     * could save some clog lookups by taking advantage of that list.
     */
    oldest_running_primary_xid = procArray->oldest_running_primary_xid;
    while (oldest_running_primary_xid < max_xid)
    {
        if (!TransactionIdDidCommit(oldest_running_primary_xid) &&
            !TransactionIdDidAbort(oldest_running_primary_xid))
        {
            break;
        }
        TransactionIdAdvance(oldest_running_primary_xid);
    }
    if (max_xid == oldest_running_primary_xid)
        TransactionIdAdvance(oldest_running_primary_xid);

The point is to maintain an "oldest xmin" value based on the WAL records
that are being replayed. Whenever the currently oldest running XID
finishes, we scan the CLOG to find the next oldest XID that hasn't
completed yet.

That adds approximately one or two CLOG lookups to every commit record
replay on average. I haven't tried measuring that, but it seems like it
could slow down recovery. There are ways that could be improved. For
example, do it in larger batches.
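
To see where those lookups come from, here is a small standalone model of the scan. It is only a sketch under simplifying assumptions: a boolean array stands in for the CLOG, XIDs are plain integers without wraparound or special values, and every sketch_* name is hypothetical. It does not implement the batching idea, it only counts lookups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

#define SKETCH_NXIDS 16

/* Mock "CLOG": true once an XID has committed or aborted. */
static bool sketch_xid_finished[SKETCH_NXIDS];

/* Counts how many status lookups the scan performed. */
static int	sketch_lookup_count = 0;

/* Stand-in for TransactionIdDidCommit() || TransactionIdDidAbort(). */
static bool
sketch_xid_is_finished(TransactionId xid)
{
	assert(xid < SKETCH_NXIDS);
	sketch_lookup_count++;
	return sketch_xid_finished[xid];
}

/*
 * Advance the oldest still-running XID after replaying a commit/abort
 * record whose highest XID is max_xid.  Mirrors the loop quoted above,
 * but uses plain integer arithmetic (no wraparound handling).
 */
static TransactionId
sketch_advance_oldest_running(TransactionId oldest, TransactionId max_xid)
{
	while (oldest < max_xid)
	{
		if (!sketch_xid_is_finished(oldest))
			break;				/* found an XID that is still running */
		oldest++;
	}

	/* max_xid itself just finished, so step past it without a lookup. */
	if (oldest == max_xid)
		oldest++;

	return oldest;
}
```

In this model, if XIDs 3 and 4 have already finished and a commit record with max_xid 6 is replayed while the oldest running XID is 3, the scan performs three lookups (3, 4 and 5) and stops at 5, the first XID that is still running.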

A bunch of other small XXX comments remain, but they're just markers for
comments that need to be adjusted, or for further cleanups that are now
possible.

There are also several ways the inprogress cache could be made more
efficient, which I haven't explored:

- For each XID in the cache radix tree, we store one bit to indicate
whether the lookup has been performed, i.e. if the cache is valid for
the XID, and another bit to indicate if the XID is visible or not. With
64-bit cache words stored in the radix tree, each cache word can store
the status of 32 transactions. It would probably be better to work in
bigger chunks. For example, when doing a lookup in the cache, check the
status of 64 transactions at once. Assuming they're all stored on the
same CSN page, it would not be much more expensive than a single XID
lookup. That would make the cache 2x more compact, and save on future
lookups of XIDs falling on the same cache word.

- Initializing the radix tree cache is fairly expensive, with several
memory allocations. Many of those allocations could be done lazily with
some effort in radixtree.h.

- Or start the cache as a small array of XIDs, and switch to the radix
tree only after it fills up.
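
For reference, the two-bits-per-XID layout from the first point can be sketched in isolation as follows. This follows the key and bit calculation in the v5 0005 patch (offset from xmax, 32 XIDs per 64-bit word), but leaves out the radix tree entirely, and the sketch_* names and SketchCachePos struct are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

#define SKETCH_BITS_PER_XID		2
#define SKETCH_XIDS_PER_WORD	32	/* 64 bits / 2 bits per XID */

/* Which cache word an XID lives in, and the position of its first bit. */
typedef struct SketchCachePos
{
	uint64_t	wordno;
	unsigned	bitno;
} SketchCachePos;

/* Key the cache by distance from xmax, so that keys stay close to zero. */
static SketchCachePos
sketch_cache_pos(TransactionId xmax, TransactionId xid)
{
	uint32_t	idx = xmax - xid;
	SketchCachePos pos;

	pos.wordno = idx / SKETCH_XIDS_PER_WORD;
	pos.bitno = (idx % SKETCH_XIDS_PER_WORD) * SKETCH_BITS_PER_XID;
	return pos;
}

/* First bit of the pair: has this XID been looked up already? */
static bool
sketch_cache_is_cached(uint64_t word, unsigned bitno)
{
	return (word & (UINT64_C(1) << bitno)) != 0;
}

/* Second bit of the pair: was the XID in progress per the snapshot? */
static bool
sketch_cache_is_inprogress(uint64_t word, unsigned bitno)
{
	return (word & (UINT64_C(1) << (bitno + 1))) != 0;
}

/* Record a lookup result and return the updated cache word. */
static uint64_t
sketch_cache_store(uint64_t word, unsigned bitno, bool inprogress)
{
	assert(bitno + 1 < 64);
	word |= UINT64_C(1) << bitno;
	if (inprogress)
		word |= UINT64_C(1) << (bitno + 1);
	return word;
}
```

For example, with xmax 1000, XID 990 maps to word 0 at bit 20, and XID 960 maps to word 1 at bit 16. Checking 64 XIDs at once, as suggested above, would change the per-word layout rather than this keying scheme.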

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v6-0001-Split-SnapshotData-into-separate-structs-for-each.patch (text/x-patch; charset=UTF-8)

From c2b5bc5f1f2cd959c695a91bd2eec047440426fc Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 20 Dec 2024 00:36:33 +0200
Subject: [PATCH v6 01/12] Split SnapshotData into separate structs for each
 kind of snapshot

The SnapshotData fields were repurposed for different uses depending on
the kind of snapshot. Split it into separate structs for different
kinds of snapshots, so that it is more clear which fields are used
with which snapshot kind, and the fields can have more descriptive
names.
---
 contrib/amcheck/verify_heapam.c               |   2 +-
 contrib/amcheck/verify_nbtree.c               |   2 +-
 src/backend/access/heap/heapam.c              |   3 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/heap/heapam_visibility.c   |  24 +--
 src/backend/access/index/indexam.c            |  11 +-
 src/backend/access/nbtree/nbtinsert.c         |   4 +-
 src/backend/access/spgist/spgvacuum.c         |   2 +-
 src/backend/access/table/tableam.c            |   8 +-
 src/backend/access/transam/parallel.c         |  14 +-
 src/backend/catalog/pg_inherits.c             |   2 +-
 src/backend/commands/async.c                  |   4 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/commands/tablecmds.c              |   2 +-
 src/backend/executor/execIndexing.c           |   4 +-
 src/backend/executor/execReplication.c        |   8 +-
 src/backend/partitioning/partdesc.c           |   2 +-
 src/backend/replication/logical/decode.c      |   2 +-
 src/backend/replication/logical/origin.c      |   4 +-
 .../replication/logical/reorderbuffer.c       | 114 +++++-----
 src/backend/replication/logical/snapbuild.c   | 114 +++++-----
 src/backend/replication/walsender.c           |   2 +-
 src/backend/storage/ipc/procarray.c           |   6 +-
 src/backend/storage/lmgr/predicate.c          |  32 +--
 src/backend/utils/adt/xid8funcs.c             |   4 +-
 src/backend/utils/time/snapmgr.c              | 198 +++++++++++-------
 src/include/access/heapam.h                   |   2 +-
 src/include/access/relscan.h                  |   6 +-
 src/include/replication/reorderbuffer.h       |  12 +-
 src/include/replication/snapbuild.h           |   6 +-
 src/include/replication/snapbuild_internal.h  |   2 +-
 src/include/storage/predicate.h               |   4 +-
 src/include/storage/procarray.h               |   2 +-
 src/include/utils/snapmgr.h                   |  16 +-
 src/include/utils/snapshot.h                  | 155 +++++++++-----
 src/tools/pgindent/typedefs.list              |   4 +
 36 files changed, 451 insertions(+), 336 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 1970fc8620a..6665cafc179 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -310,7 +310,7 @@ verify_heapam(PG_FUNCTION_ARGS)
 	 * Any xmin newer than the xmin of our snapshot can't become all-visible
 	 * while we're running.
 	 */
-	ctx.safe_xmin = GetTransactionSnapshot()->xmin;
+	ctx.safe_xmin = GetTransactionSnapshot()->mvcc.xmin;
 
 	/*
 	 * If we report corruption when not examining some individual attribute,
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f11c43a0ed7..e90b4a2ad5a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -458,7 +458,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 			 */
 			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
 				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
+									   snapshot->mvcc.xmin))
 				ereport(ERROR,
 						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6e433db039e..0cfa100cbd1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -605,7 +605,8 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	 * full page write. Until we can prove that beyond doubt, let's check each
 	 * tuple for visibility the hard way.
 	 */
-	all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery;
+	all_visible = PageIsAllVisible(page) &&
+		(snapshot->snapshot_type != SNAPSHOT_MVCC || !snapshot->mvcc.takenDuringRecovery);
 	check_serializable =
 		CheckForSerializableConflictOutNeeded(scan->rs_base.rs_rd, snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 24d3765aa20..fce657f00f6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -390,7 +390,7 @@ tuple_lock_retry:
 
 		if (!ItemPointerEquals(&tmfd->ctid, &tuple->t_self))
 		{
-			SnapshotData SnapshotDirty;
+			DirtySnapshotData SnapshotDirty;
 			TransactionId priorXmax;
 
 			/* it was updated, so look at the updated version */
@@ -415,7 +415,7 @@ tuple_lock_retry:
 							 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
 
 				tuple->t_self = *tid;
-				if (heap_fetch(relation, &SnapshotDirty, tuple, &buffer, true))
+				if (heap_fetch(relation, (Snapshot) &SnapshotDirty, tuple, &buffer, true))
 				{
 					/*
 					 * If xmin isn't what we're expecting, the slot must have
@@ -2308,7 +2308,7 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
 
 	page = (Page) BufferGetPage(hscan->rs_cbuf);
 	all_visible = PageIsAllVisible(page) &&
-		!scan->rs_snapshot->takenDuringRecovery;
+		(scan->rs_snapshot->snapshot_type != SNAPSHOT_MVCC || !scan->rs_snapshot->mvcc.takenDuringRecovery);
 	maxoffset = PageGetMaxOffsetNumber(page);
 
 	for (;;)
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 05f6946fe60..f5d69b558f1 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -740,7 +740,7 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
  * token is also returned in snapshot->speculativeToken.
  */
 static bool
-HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
+HeapTupleSatisfiesDirty(HeapTuple htup, DirtySnapshotData *snapshot,
 						Buffer buffer)
 {
 	HeapTupleHeader tuple = htup->t_data;
@@ -957,7 +957,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
  * and more contention on ProcArrayLock.
  */
 static bool
-HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
+HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 					   Buffer buffer)
 {
 	HeapTupleHeader tuple = htup->t_data;
@@ -1435,7 +1435,7 @@ HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *de
  *	snapshot->vistest must have been set up with the horizon to use.
  */
 static bool
-HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
+HeapTupleSatisfiesNonVacuumable(HeapTuple htup, NonVacuumableSnapshotData *snapshot,
 								Buffer buffer)
 {
 	TransactionId dead_after = InvalidTransactionId;
@@ -1593,7 +1593,7 @@ TransactionIdInArray(TransactionId xid, TransactionId *xip, Size num)
  * complicated than when dealing "only" with the present.
  */
 static bool
-HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
+HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, HistoricMVCCSnapshot snapshot,
 							   Buffer buffer)
 {
 	HeapTupleHeader tuple = htup->t_data;
@@ -1610,7 +1610,7 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 		return false;
 	}
 	/* check if it's one of our txids, toplevel is also in there */
-	else if (TransactionIdInArray(xmin, snapshot->subxip, snapshot->subxcnt))
+	else if (TransactionIdInArray(xmin, snapshot->curxip, snapshot->curxcnt))
 	{
 		bool		resolved;
 		CommandId	cmin = HeapTupleHeaderGetRawCommandId(tuple);
@@ -1669,7 +1669,7 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 		return false;
 	}
 	/* check if it's a committed transaction in [xmin, xmax) */
-	else if (TransactionIdInArray(xmin, snapshot->xip, snapshot->xcnt))
+	else if (TransactionIdInArray(xmin, snapshot->committed_xids, snapshot->xcnt))
 	{
 		/* fall through */
 	}
@@ -1702,7 +1702,7 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 	}
 
 	/* check if it's one of our txids, toplevel is also in there */
-	if (TransactionIdInArray(xmax, snapshot->subxip, snapshot->subxcnt))
+	if (TransactionIdInArray(xmax, snapshot->curxip, snapshot->curxcnt))
 	{
 		bool		resolved;
 		CommandId	cmin;
@@ -1755,7 +1755,7 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 	else if (TransactionIdFollowsOrEquals(xmax, snapshot->xmax))
 		return true;
 	/* xmax is between [xmin, xmax), check known committed array */
-	else if (TransactionIdInArray(xmax, snapshot->xip, snapshot->xcnt))
+	else if (TransactionIdInArray(xmax, snapshot->committed_xids, snapshot->xcnt))
 		return false;
 	/* xmax is between [xmin, xmax), but known not to have committed yet */
 	else
@@ -1778,7 +1778,7 @@ HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot, Buffer buffer)
 	switch (snapshot->snapshot_type)
 	{
 		case SNAPSHOT_MVCC:
-			return HeapTupleSatisfiesMVCC(htup, snapshot, buffer);
+			return HeapTupleSatisfiesMVCC(htup, &snapshot->mvcc, buffer);
 		case SNAPSHOT_SELF:
 			return HeapTupleSatisfiesSelf(htup, snapshot, buffer);
 		case SNAPSHOT_ANY:
@@ -1786,11 +1786,11 @@ HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot, Buffer buffer)
 		case SNAPSHOT_TOAST:
 			return HeapTupleSatisfiesToast(htup, snapshot, buffer);
 		case SNAPSHOT_DIRTY:
-			return HeapTupleSatisfiesDirty(htup, snapshot, buffer);
+			return HeapTupleSatisfiesDirty(htup, &snapshot->dirty, buffer);
 		case SNAPSHOT_HISTORIC_MVCC:
-			return HeapTupleSatisfiesHistoricMVCC(htup, snapshot, buffer);
+			return HeapTupleSatisfiesHistoricMVCC(htup, &snapshot->historic_mvcc, buffer);
 		case SNAPSHOT_NON_VACUUMABLE:
-			return HeapTupleSatisfiesNonVacuumable(htup, snapshot, buffer);
+			return HeapTupleSatisfiesNonVacuumable(htup, &snapshot->nonvacuumable, buffer);
 	}
 
 	return false;				/* keep compiler quiet */
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 55ec4c10352..769170a37d5 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -469,7 +469,7 @@ index_parallelscan_estimate(Relation indexRelation, int nkeys, int norderbys,
 	RELATION_CHECKS;
 
 	nbytes = offsetof(ParallelIndexScanDescData, ps_snapshot_data);
-	nbytes = add_size(nbytes, EstimateSnapshotSpace(snapshot));
+	nbytes = add_size(nbytes, EstimateSnapshotSpace(&snapshot->mvcc));
 	nbytes = MAXALIGN(nbytes);
 
 	if (instrument)
@@ -517,16 +517,17 @@ index_parallelscan_initialize(Relation heapRelation, Relation indexRelation,
 	Assert(instrument || parallel_aware);
 
 	RELATION_CHECKS;
+	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
 
 	offset = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
-					  EstimateSnapshotSpace(snapshot));
+					  EstimateSnapshotSpace((MVCCSnapshot) snapshot));
 	offset = MAXALIGN(offset);
 
 	target->ps_locator = heapRelation->rd_locator;
 	target->ps_indexlocator = indexRelation->rd_locator;
 	target->ps_offset_ins = 0;
 	target->ps_offset_am = 0;
-	SerializeSnapshot(snapshot, target->ps_snapshot_data);
+	SerializeSnapshot((MVCCSnapshot) snapshot, target->ps_snapshot_data);
 
 	if (instrument)
 	{
@@ -590,8 +591,8 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
 	Assert(RelFileLocatorEquals(heaprel->rd_locator, pscan->ps_locator));
 	Assert(RelFileLocatorEquals(indexrel->rd_locator, pscan->ps_indexlocator));
 
-	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
-	RegisterSnapshot(snapshot);
+	snapshot = (Snapshot) RestoreSnapshot(pscan->ps_snapshot_data);
+	snapshot = RegisterSnapshot(snapshot);
 	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
 									pscan, true);
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index aa82cede30a..714e4ee3f0b 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -413,7 +413,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	IndexTuple	curitup = NULL;
 	ItemId		curitemid = NULL;
 	BTScanInsert itup_key = insertstate->itup_key;
-	SnapshotData SnapshotDirty;
+	DirtySnapshotData SnapshotDirty;
 	OffsetNumber offset;
 	OffsetNumber maxoff;
 	Page		page;
@@ -558,7 +558,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				 * index entry for the entire chain.
 				 */
 				else if (table_index_fetch_tuple_check(heapRel, &htid,
-													   &SnapshotDirty,
+													   (Snapshot) &SnapshotDirty,
 													   &all_dead))
 				{
 					TransactionId xwait;
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index b3df2d89074..850ad36cd0a 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -808,7 +808,7 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
+	bds->myXmin = GetActiveSnapshot()->mvcc.xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..4eb81e40d99 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -133,7 +133,7 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 	Size		sz = 0;
 
 	if (IsMVCCSnapshot(snapshot))
-		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
+		sz = add_size(sz, EstimateSnapshotSpace((MVCCSnapshot) snapshot));
 	else
 		Assert(snapshot == SnapshotAny);
 
@@ -152,7 +152,7 @@ table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
 
 	if (IsMVCCSnapshot(snapshot))
 	{
-		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
+		SerializeSnapshot((MVCCSnapshot) snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
 	}
 	else
@@ -174,8 +174,8 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 	if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
-		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
-		RegisterSnapshot(snapshot);
+		snapshot = (Snapshot) RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+		snapshot = RegisterSnapshot(snapshot);
 		flags |= SO_TEMP_SNAPSHOT;
 	}
 	else
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..8046e14abf7 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -275,10 +275,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		shm_toc_estimate_chunk(&pcxt->estimator, combocidlen);
 		if (IsolationUsesXactSnapshot())
 		{
-			tsnaplen = EstimateSnapshotSpace(transaction_snapshot);
+			tsnaplen = EstimateSnapshotSpace((MVCCSnapshot) transaction_snapshot);
 			shm_toc_estimate_chunk(&pcxt->estimator, tsnaplen);
 		}
-		asnaplen = EstimateSnapshotSpace(active_snapshot);
+		asnaplen = EstimateSnapshotSpace((MVCCSnapshot) active_snapshot);
 		shm_toc_estimate_chunk(&pcxt->estimator, asnaplen);
 		tstatelen = EstimateTransactionStateSpace();
 		shm_toc_estimate_chunk(&pcxt->estimator, tstatelen);
@@ -400,14 +400,14 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		if (IsolationUsesXactSnapshot())
 		{
 			tsnapspace = shm_toc_allocate(pcxt->toc, tsnaplen);
-			SerializeSnapshot(transaction_snapshot, tsnapspace);
+			SerializeSnapshot((MVCCSnapshot) transaction_snapshot, tsnapspace);
 			shm_toc_insert(pcxt->toc, PARALLEL_KEY_TRANSACTION_SNAPSHOT,
 						   tsnapspace);
 		}
 
 		/* Serialize the active snapshot. */
 		asnapspace = shm_toc_allocate(pcxt->toc, asnaplen);
-		SerializeSnapshot(active_snapshot, asnapspace);
+		SerializeSnapshot((MVCCSnapshot) active_snapshot, asnapspace);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ACTIVE_SNAPSHOT, asnapspace);
 
 		/* Provide the handle for per-session segment. */
@@ -1493,9 +1493,9 @@ ParallelWorkerMain(Datum main_arg)
 	 */
 	asnapspace = shm_toc_lookup(toc, PARALLEL_KEY_ACTIVE_SNAPSHOT, false);
 	tsnapspace = shm_toc_lookup(toc, PARALLEL_KEY_TRANSACTION_SNAPSHOT, true);
-	asnapshot = RestoreSnapshot(asnapspace);
-	tsnapshot = tsnapspace ? RestoreSnapshot(tsnapspace) : asnapshot;
-	RestoreTransactionSnapshot(tsnapshot,
+	asnapshot = (Snapshot) RestoreSnapshot(asnapspace);
+	tsnapshot = tsnapspace ? (Snapshot) RestoreSnapshot(tsnapspace) : asnapshot;
+	RestoreTransactionSnapshot((MVCCSnapshot) tsnapshot,
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 929bb53b620..b658601bf77 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -148,7 +148,7 @@ find_inheritance_children_extended(Oid parentrelId, bool omit_detached,
 				xmin = HeapTupleHeaderGetXmin(inheritsTuple->t_data);
 				snap = GetActiveSnapshot();
 
-				if (!XidInMVCCSnapshot(xmin, snap))
+				if (!XidInMVCCSnapshot(xmin, (MVCCSnapshot) snap))
 				{
 					if (detached_xmin)
 					{
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..1ffb6f5fa70 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2022,6 +2022,8 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 	bool		reachedEndOfPage;
 	AsyncQueueEntry *qe;
 
+	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
 	do
 	{
 		QueuePosition thisentry = *current;
@@ -2041,7 +2043,7 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 		/* Ignore messages destined for other databases */
 		if (qe->dboid == MyDatabaseId)
 		{
-			if (XidInMVCCSnapshot(qe->xid, snapshot))
+			if (XidInMVCCSnapshot(qe->xid, (MVCCSnapshot) snapshot))
 			{
 				/*
 				 * The source transaction is still in progress, so we can't
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 33c2106c17c..da3e02398bb 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1761,7 +1761,7 @@ DefineIndex(Oid tableId,
 	 * they must wait for.  But first, save the snapshot's xmin to use as
 	 * limitXmin for GetCurrentVirtualXIDs().
 	 */
-	limitXmin = snapshot->xmin;
+	limitXmin = snapshot->mvcc.xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
@@ -4156,7 +4156,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * We can now do away with our active snapshot, we still need to save
 		 * the xmin limit to wait for older snapshots.
 		 */
-		limitXmin = snapshot->xmin;
+		limitXmin = snapshot->mvcc.xmin;
 
 		PopActiveSnapshot();
 		UnregisterSnapshot(snapshot);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 10624353b0a..c55b5a7a014 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -20797,7 +20797,7 @@ ATExecDetachPartitionFinalize(Relation rel, RangeVar *name)
 	 * all such queries are complete (otherwise we would present them with an
 	 * inconsistent view of catalogs).
 	 */
-	WaitForOlderSnapshots(snap->xmin, false);
+	WaitForOlderSnapshots(snap->mvcc.xmin, false);
 
 	DetachPartitionFinalize(rel, partRel, true, InvalidOid);
 
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index e3fe9b78bb5..a3955792729 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -717,7 +717,7 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(index);
 	IndexScanDesc index_scan;
 	ScanKeyData scankeys[INDEX_MAX_KEYS];
-	SnapshotData DirtySnapshot;
+	DirtySnapshotData DirtySnapshot;
 	int			i;
 	bool		conflict;
 	bool		found_self;
@@ -816,7 +816,7 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, NULL, indnkeyatts, 0);
+	index_scan = index_beginscan(heap, index, (Snapshot) &DirtySnapshot, NULL, indnkeyatts, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
 	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index ede89ea3cf9..84aa7c3268c 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -184,7 +184,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	ScanKeyData skey[INDEX_MAX_KEYS];
 	int			skey_attoff;
 	IndexScanDesc scan;
-	SnapshotData snap;
+	DirtySnapshotData snap;
 	TransactionId xwait;
 	Relation	idxrel;
 	bool		found;
@@ -202,7 +202,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
 
 	/* Start an index scan. */
-	scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+	scan = index_beginscan(rel, idxrel, (Snapshot) &snap, NULL, skey_attoff, 0);
 
 retry:
 	found = false;
@@ -357,7 +357,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
 {
 	TupleTableSlot *scanslot;
 	TableScanDesc scan;
-	SnapshotData snap;
+	DirtySnapshotData snap;
 	TypeCacheEntry **eq;
 	TransactionId xwait;
 	bool		found;
@@ -369,7 +369,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
 
 	/* Start a heap scan. */
 	InitDirtySnapshot(snap);
-	scan = table_beginscan(rel, &snap, 0, NULL);
+	scan = table_beginscan(rel, (Snapshot) &snap, 0, NULL);
 	scanslot = table_slot_create(rel, NULL);
 
 retry:
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 328b4d450e4..7c15c634181 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -102,7 +102,7 @@ RelationGetPartitionDesc(Relation rel, bool omit_detached)
 		Assert(TransactionIdIsValid(rel->rd_partdesc_nodetached_xmin));
 		activesnap = GetActiveSnapshot();
 
-		if (!XidInMVCCSnapshot(rel->rd_partdesc_nodetached_xmin, activesnap))
+		if (!XidInMVCCSnapshot(rel->rd_partdesc_nodetached_xmin, &activesnap->mvcc))
 			return rel->rd_partdesc_nodetached;
 	}
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 78f9a0a11c4..6a428e9720e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -586,7 +586,7 @@ logicalmsg_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	TransactionId xid = XLogRecGetXid(r);
 	uint8		info = XLogRecGetInfo(r) & ~XLR_INFO_MASK;
 	RepOriginId origin_id = XLogRecGetOrigin(r);
-	Snapshot	snapshot = NULL;
+	HistoricMVCCSnapshot snapshot = NULL;
 	xl_logical_message *message;
 
 	if (info != XLOG_LOGICAL_MESSAGE)
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 6583dd497da..51fc6460251 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -260,7 +260,7 @@ replorigin_create(const char *roname)
 	HeapTuple	tuple = NULL;
 	Relation	rel;
 	Datum		roname_d;
-	SnapshotData SnapshotDirty;
+	DirtySnapshotData SnapshotDirty;
 	SysScanDesc scan;
 	ScanKeyData key;
 
@@ -302,7 +302,7 @@ replorigin_create(const char *roname)
 
 		scan = systable_beginscan(rel, ReplicationOriginIdentIndex,
 								  true /* indexOK */ ,
-								  &SnapshotDirty,
+								  (Snapshot) &SnapshotDirty,
 								  1, &key);
 
 		collides = HeapTupleIsValid(systable_getnext(scan));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 977fbcd2474..e8196a8d5d5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -268,9 +268,9 @@ static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
 static int	ReorderBufferTXNSizeCompare(const pairingheap_node *a, const pairingheap_node *b, void *arg);
 
-static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
-static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
-									  ReorderBufferTXN *txn, CommandId cid);
+static void ReorderBufferFreeSnap(ReorderBuffer *rb, HistoricMVCCSnapshot snap);
+static HistoricMVCCSnapshot ReorderBufferCopySnap(ReorderBuffer *rb, HistoricMVCCSnapshot orig_snap,
+												  ReorderBufferTXN *txn, CommandId cid);
 
 /*
  * ---------------------------------------
@@ -852,7 +852,7 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
  */
 void
 ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
-						  Snapshot snap, XLogRecPtr lsn,
+						  HistoricMVCCSnapshot snap, XLogRecPtr lsn,
 						  bool transactional, const char *prefix,
 						  Size message_size, const char *message)
 {
@@ -886,7 +886,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 	else
 	{
 		ReorderBufferTXN *txn = NULL;
-		volatile Snapshot snapshot_now = snap;
+		volatile	HistoricMVCCSnapshot snapshot_now = snap;
 
 		/* Non-transactional changes require a valid snapshot. */
 		Assert(snapshot_now);
@@ -1886,55 +1886,55 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * that catalog modifying transactions can look into intermediate catalog
  * states.
  */
-static Snapshot
-ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
+static HistoricMVCCSnapshot
+ReorderBufferCopySnap(ReorderBuffer *rb, HistoricMVCCSnapshot orig_snap,
 					  ReorderBufferTXN *txn, CommandId cid)
 {
-	Snapshot	snap;
+	HistoricMVCCSnapshot snap;
 	dlist_iter	iter;
 	int			i = 0;
 	Size		size;
 
-	size = sizeof(SnapshotData) +
+	size = sizeof(HistoricMVCCSnapshotData) +
 		sizeof(TransactionId) * orig_snap->xcnt +
 		sizeof(TransactionId) * (txn->nsubtxns + 1);
 
 	snap = MemoryContextAllocZero(rb->context, size);
-	memcpy(snap, orig_snap, sizeof(SnapshotData));
+	memcpy(snap, orig_snap, sizeof(HistoricMVCCSnapshotData));
 
 	snap->copied = true;
-	snap->active_count = 1;		/* mark as active so nobody frees it */
+	snap->refcount = 1;			/* mark as active so nobody frees it */
 	snap->regd_count = 0;
-	snap->xip = (TransactionId *) (snap + 1);
+	snap->committed_xids = (TransactionId *) (snap + 1);
 
-	memcpy(snap->xip, orig_snap->xip, sizeof(TransactionId) * snap->xcnt);
+	memcpy(snap->committed_xids, orig_snap->committed_xids, sizeof(TransactionId) * snap->xcnt);
 
 	/*
-	 * snap->subxip contains all txids that belong to our transaction which we
+	 * snap->curxip contains all txids that belong to our transaction which we
 	 * need to check via cmin/cmax. That's why we store the toplevel
 	 * transaction in there as well.
 	 */
-	snap->subxip = snap->xip + snap->xcnt;
-	snap->subxip[i++] = txn->xid;
+	snap->curxip = snap->committed_xids + snap->xcnt;
+	snap->curxip[i++] = txn->xid;
 
 	/*
 	 * txn->nsubtxns isn't decreased when subtransactions abort, so count
 	 * manually. Since it's an upper boundary it is safe to use it for the
 	 * allocation above.
 	 */
-	snap->subxcnt = 1;
+	snap->curxcnt = 1;
 
 	dlist_foreach(iter, &txn->subtxns)
 	{
 		ReorderBufferTXN *sub_txn;
 
 		sub_txn = dlist_container(ReorderBufferTXN, node, iter.cur);
-		snap->subxip[i++] = sub_txn->xid;
-		snap->subxcnt++;
+		snap->curxip[i++] = sub_txn->xid;
+		snap->curxcnt++;
 	}
 
 	/* sort so we can bsearch() later */
-	qsort(snap->subxip, snap->subxcnt, sizeof(TransactionId), xidComparator);
+	qsort(snap->curxip, snap->curxcnt, sizeof(TransactionId), xidComparator);
 
 	/* store the specified current CommandId */
 	snap->curcid = cid;
@@ -1946,7 +1946,7 @@ ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
  * Free a previously ReorderBufferCopySnap'ed snapshot
  */
 static void
-ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
+ReorderBufferFreeSnap(ReorderBuffer *rb, HistoricMVCCSnapshot snap)
 {
 	if (snap->copied)
 		pfree(snap);
@@ -2099,7 +2099,7 @@ ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
  */
 static inline void
 ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
-							 Snapshot snapshot_now, CommandId command_id)
+							 HistoricMVCCSnapshot snapshot_now, CommandId command_id)
 {
 	txn->command_id = command_id;
 
@@ -2144,7 +2144,7 @@ ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn)
  */
 static void
 ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
-					  Snapshot snapshot_now,
+					  HistoricMVCCSnapshot snapshot_now,
 					  CommandId command_id,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
@@ -2191,7 +2191,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void
 ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn,
-						volatile Snapshot snapshot_now,
+						volatile HistoricMVCCSnapshot snapshot_now,
 						volatile CommandId command_id,
 						bool streaming)
 {
@@ -2779,7 +2779,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 					TimestampTz commit_time,
 					RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	Snapshot	snapshot_now;
+	HistoricMVCCSnapshot snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
 	txn->final_lsn = commit_lsn;
@@ -3251,7 +3251,7 @@ ReorderBufferProcessXid(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
  */
 void
 ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
-						 XLogRecPtr lsn, Snapshot snap)
+						 XLogRecPtr lsn, HistoricMVCCSnapshot snap)
 {
 	ReorderBufferChange *change = ReorderBufferAllocChange(rb);
 
@@ -3269,7 +3269,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
  */
 void
 ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
-							 XLogRecPtr lsn, Snapshot snap)
+							 XLogRecPtr lsn, HistoricMVCCSnapshot snap)
 {
 	ReorderBufferTXN *txn;
 	bool		is_new;
@@ -4043,14 +4043,14 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
-				Snapshot	snap;
+				HistoricMVCCSnapshot snap;
 				char	   *data;
 
 				snap = change->data.snapshot;
 
-				sz += sizeof(SnapshotData) +
+				sz += sizeof(HistoricMVCCSnapshotData) +
 					sizeof(TransactionId) * snap->xcnt +
-					sizeof(TransactionId) * snap->subxcnt;
+					sizeof(TransactionId) * snap->curxcnt;
 
 				/* make sure we have enough space */
 				ReorderBufferSerializeReserve(rb, sz);
@@ -4058,21 +4058,21 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				/* might have been reallocated above */
 				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
 
-				memcpy(data, snap, sizeof(SnapshotData));
-				data += sizeof(SnapshotData);
+				memcpy(data, snap, sizeof(HistoricMVCCSnapshotData));
+				data += sizeof(HistoricMVCCSnapshotData);
 
 				if (snap->xcnt)
 				{
-					memcpy(data, snap->xip,
+					memcpy(data, snap->committed_xids,
 						   sizeof(TransactionId) * snap->xcnt);
 					data += sizeof(TransactionId) * snap->xcnt;
 				}
 
-				if (snap->subxcnt)
+				if (snap->curxcnt)
 				{
-					memcpy(data, snap->subxip,
-						   sizeof(TransactionId) * snap->subxcnt);
-					data += sizeof(TransactionId) * snap->subxcnt;
+					memcpy(data, snap->curxip,
+						   sizeof(TransactionId) * snap->curxcnt);
+					data += sizeof(TransactionId) * snap->curxcnt;
 				}
 				break;
 			}
@@ -4177,7 +4177,7 @@ ReorderBufferCanStartStreaming(ReorderBuffer *rb)
 static void
 ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	Snapshot	snapshot_now;
+	HistoricMVCCSnapshot snapshot_now;
 	CommandId	command_id;
 	Size		stream_bytes;
 	bool		txn_is_streamed;
@@ -4196,10 +4196,10 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * After that we need to reuse the snapshot from the previous run.
 	 *
 	 * Unlike DecodeCommit which adds xids of all the subtransactions in
-	 * snapshot's xip array via SnapBuildCommitTxn, we can't do that here but
-	 * we do add them to subxip array instead via ReorderBufferCopySnap. This
-	 * allows the catalog changes made in subtransactions decoded till now to
-	 * be visible.
+	 * snapshot's committed_xids array via SnapBuildCommitTxn, we can't do
+	 * that here but we do add them to curxip array instead via
+	 * ReorderBufferCopySnap. This allows the catalog changes made in
+	 * subtransactions decoded till now to be visible.
 	 */
 	if (txn->snapshot_now == NULL)
 	{
@@ -4345,13 +4345,13 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
-				Snapshot	snap;
+				HistoricMVCCSnapshot snap;
 
 				snap = change->data.snapshot;
 
-				sz += sizeof(SnapshotData) +
+				sz += sizeof(HistoricMVCCSnapshotData) +
 					sizeof(TransactionId) * snap->xcnt +
-					sizeof(TransactionId) * snap->subxcnt;
+					sizeof(TransactionId) * snap->curxcnt;
 
 				break;
 			}
@@ -4629,24 +4629,24 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
-				Snapshot	oldsnap;
-				Snapshot	newsnap;
+				HistoricMVCCSnapshot oldsnap;
+				HistoricMVCCSnapshot newsnap;
 				Size		size;
 
-				oldsnap = (Snapshot) data;
+				oldsnap = (HistoricMVCCSnapshot) data;
 
-				size = sizeof(SnapshotData) +
+				size = sizeof(HistoricMVCCSnapshotData) +
 					sizeof(TransactionId) * oldsnap->xcnt +
-					sizeof(TransactionId) * (oldsnap->subxcnt + 0);
+					sizeof(TransactionId) * (oldsnap->curxcnt + 0);
 
 				change->data.snapshot = MemoryContextAllocZero(rb->context, size);
 
 				newsnap = change->data.snapshot;
 
 				memcpy(newsnap, data, size);
-				newsnap->xip = (TransactionId *)
-					(((char *) newsnap) + sizeof(SnapshotData));
-				newsnap->subxip = newsnap->xip + newsnap->xcnt;
+				newsnap->committed_xids = (TransactionId *)
+					(((char *) newsnap) + sizeof(HistoricMVCCSnapshotData));
+				newsnap->curxip = newsnap->committed_xids + newsnap->xcnt;
 				newsnap->copied = true;
 				break;
 			}
@@ -5316,7 +5316,7 @@ file_sort_by_lsn(const ListCell *a_p, const ListCell *b_p)
  * transaction for relid.
  */
 static void
-UpdateLogicalMappings(HTAB *tuplecid_data, Oid relid, Snapshot snapshot)
+UpdateLogicalMappings(HTAB *tuplecid_data, Oid relid, HistoricMVCCSnapshot snapshot)
 {
 	DIR		   *mapping_dir;
 	struct dirent *mapping_de;
@@ -5364,7 +5364,7 @@ UpdateLogicalMappings(HTAB *tuplecid_data, Oid relid, Snapshot snapshot)
 			continue;
 
 		/* not for our transaction */
-		if (!TransactionIdInArray(f_mapped_xid, snapshot->subxip, snapshot->subxcnt))
+		if (!TransactionIdInArray(f_mapped_xid, snapshot->curxip, snapshot->curxcnt))
 			continue;
 
 		/* ok, relevant, queue for apply */
@@ -5383,7 +5383,7 @@ UpdateLogicalMappings(HTAB *tuplecid_data, Oid relid, Snapshot snapshot)
 		RewriteMappingFile *f = (RewriteMappingFile *) lfirst(file);
 
 		elog(DEBUG1, "applying mapping: \"%s\" in %u", f->fname,
-			 snapshot->subxip[0]);
+			 snapshot->curxip[0]);
 		ApplyLogicalMappingFile(tuplecid_data, relid, f->fname);
 		pfree(f);
 	}
@@ -5395,7 +5395,7 @@ UpdateLogicalMappings(HTAB *tuplecid_data, Oid relid, Snapshot snapshot)
  */
 bool
 ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
-							  Snapshot snapshot,
+							  HistoricMVCCSnapshot snapshot,
 							  HeapTuple htup, Buffer buffer,
 							  CommandId *cmin, CommandId *cmax)
 {
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index b64e53de017..7a341418a74 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -155,11 +155,11 @@ static bool ExportInProgress = false;
 static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static HistoricMVCCSnapshot SnapBuildBuildSnapshot(SnapBuild *builder);
 
-static void SnapBuildFreeSnapshot(Snapshot snap);
+static void SnapBuildFreeSnapshot(HistoricMVCCSnapshot snap);
 
-static void SnapBuildSnapIncRefcount(Snapshot snap);
+static void SnapBuildSnapIncRefcount(HistoricMVCCSnapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
@@ -249,23 +249,21 @@ FreeSnapshotBuilder(SnapBuild *builder)
  * Free an unreferenced snapshot that has previously been built by us.
  */
 static void
-SnapBuildFreeSnapshot(Snapshot snap)
+SnapBuildFreeSnapshot(HistoricMVCCSnapshot snap)
 {
 	/* make sure we don't get passed an external snapshot */
 	Assert(snap->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
 
 	/* make sure nobody modified our snapshot */
 	Assert(snap->curcid == FirstCommandId);
-	Assert(!snap->suboverflowed);
-	Assert(!snap->takenDuringRecovery);
 	Assert(snap->regd_count == 0);
 
 	/* slightly more likely, so it's checked even without c-asserts */
 	if (snap->copied)
 		elog(ERROR, "cannot free a copied snapshot");
 
-	if (snap->active_count)
-		elog(ERROR, "cannot free an active snapshot");
+	if (snap->refcount)
+		elog(ERROR, "cannot free a snapshot that's in use");
 
 	pfree(snap);
 }
@@ -313,9 +311,9 @@ SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
  * adding a Snapshot as builder->snapshot.
  */
 static void
-SnapBuildSnapIncRefcount(Snapshot snap)
+SnapBuildSnapIncRefcount(HistoricMVCCSnapshot snap)
 {
-	snap->active_count++;
+	snap->refcount++;
 }
 
 /*
@@ -325,26 +323,23 @@ SnapBuildSnapIncRefcount(Snapshot snap)
  * IncRef'ed Snapshot can adjust its refcount easily.
  */
 void
-SnapBuildSnapDecRefcount(Snapshot snap)
+SnapBuildSnapDecRefcount(HistoricMVCCSnapshot snap)
 {
 	/* make sure we don't get passed an external snapshot */
 	Assert(snap->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
 
 	/* make sure nobody modified our snapshot */
 	Assert(snap->curcid == FirstCommandId);
-	Assert(!snap->suboverflowed);
-	Assert(!snap->takenDuringRecovery);
 
+	Assert(snap->refcount > 0);
 	Assert(snap->regd_count == 0);
 
-	Assert(snap->active_count > 0);
-
 	/* slightly more likely, so it's checked even without casserts */
 	if (snap->copied)
 		elog(ERROR, "cannot free a copied snapshot");
 
-	snap->active_count--;
-	if (snap->active_count == 0)
+	snap->refcount--;
+	if (snap->refcount == 0)
 		SnapBuildFreeSnapshot(snap);
 }
 
@@ -356,15 +351,15 @@ SnapBuildSnapDecRefcount(Snapshot snap)
  * these snapshots; they have to copy them and fill in appropriate ->curcid
  * and ->subxip/subxcnt values.
  */
-static Snapshot
+static HistoricMVCCSnapshot
 SnapBuildBuildSnapshot(SnapBuild *builder)
 {
-	Snapshot	snapshot;
+	HistoricMVCCSnapshot snapshot;
 	Size		ssize;
 
 	Assert(builder->state >= SNAPBUILD_FULL_SNAPSHOT);
 
-	ssize = sizeof(SnapshotData)
+	ssize = sizeof(HistoricMVCCSnapshotData)
 		+ sizeof(TransactionId) * builder->committed.xcnt
 		+ sizeof(TransactionId) * 1 /* toplevel xid */ ;
 
@@ -400,31 +395,28 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->xmax = builder->xmax;
 
 	/* store all transactions to be treated as committed by this snapshot */
-	snapshot->xip =
-		(TransactionId *) ((char *) snapshot + sizeof(SnapshotData));
+	snapshot->committed_xids =
+		(TransactionId *) ((char *) snapshot + sizeof(HistoricMVCCSnapshotData));
 	snapshot->xcnt = builder->committed.xcnt;
-	memcpy(snapshot->xip,
+	memcpy(snapshot->committed_xids,
 		   builder->committed.xip,
 		   builder->committed.xcnt * sizeof(TransactionId));
 
 	/* sort so we can bsearch() */
-	qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
+	qsort(snapshot->committed_xids, snapshot->xcnt, sizeof(TransactionId), xidComparator);
 
 	/*
-	 * Initially, subxip is empty, i.e. it's a snapshot to be used by
+	 * Initially, curxip is empty, i.e. it's a snapshot to be used by
 	 * transactions that don't modify the catalog. Will be filled by
 	 * ReorderBufferCopySnap() if necessary.
 	 */
-	snapshot->subxcnt = 0;
-	snapshot->subxip = NULL;
+	snapshot->curxcnt = 0;
+	snapshot->curxip = NULL;
 
-	snapshot->suboverflowed = false;
-	snapshot->takenDuringRecovery = false;
 	snapshot->copied = false;
 	snapshot->curcid = FirstCommandId;
-	snapshot->active_count = 0;
+	snapshot->refcount = 0;
 	snapshot->regd_count = 0;
-	snapshot->snapXactCompletionCount = 0;
 
 	return snapshot;
 }
@@ -436,13 +428,13 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
  * The snapshot will be usable directly in current transaction or exported
  * for loading in different transaction.
  */
-Snapshot
+MVCCSnapshot
 SnapBuildInitialSnapshot(SnapBuild *builder)
 {
-	Snapshot	snap;
+	HistoricMVCCSnapshot historicsnap;
+	MVCCSnapshot mvccsnap;
 	TransactionId xid;
 	TransactionId safeXid;
-	TransactionId *newxip;
 	int			newxcnt = 0;
 
 	Assert(XactIsoLevel == XACT_REPEATABLE_READ);
@@ -464,10 +456,10 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	if (TransactionIdIsValid(MyProc->xmin))
 		elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
 
-	snap = SnapBuildBuildSnapshot(builder);
+	historicsnap = SnapBuildBuildSnapshot(builder);
 
 	/*
-	 * We know that snap->xmin is alive, enforced by the logical xmin
+	 * We know that historicsnap->xmin is alive, enforced by the logical xmin
 	 * mechanism. Due to that we can do this without locks, we're only
 	 * changing our own value.
 	 *
@@ -479,15 +471,18 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	safeXid = GetOldestSafeDecodingTransactionId(false);
 	LWLockRelease(ProcArrayLock);
 
-	if (TransactionIdFollows(safeXid, snap->xmin))
+	if (TransactionIdFollows(safeXid, historicsnap->xmin))
 		elog(ERROR, "cannot build an initial slot snapshot as oldest safe xid %u follows snapshot's xmin %u",
-			 safeXid, snap->xmin);
+			 safeXid, historicsnap->xmin);
 
-	MyProc->xmin = snap->xmin;
+	MyProc->xmin = historicsnap->xmin;
 
 	/* allocate in transaction context */
-	newxip = (TransactionId *)
-		palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
+	mvccsnap = palloc(sizeof(MVCCSnapshotData) + sizeof(TransactionId) * GetMaxSnapshotXidCount());
+	mvccsnap->snapshot_type = SNAPSHOT_MVCC;
+	mvccsnap->xmin = historicsnap->xmin;
+	mvccsnap->xmax = historicsnap->xmax;
+	mvccsnap->xip = (TransactionId *) ((char *) mvccsnap + sizeof(MVCCSnapshotData));
 
 	/*
 	 * snapbuild.c builds transactions in an "inverted" manner, which means it
@@ -495,15 +490,15 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	 * classical snapshot by marking all non-committed transactions as
 	 * in-progress. This can be expensive.
 	 */
-	for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+	for (xid = historicsnap->xmin; NormalTransactionIdPrecedes(xid, historicsnap->xmax);)
 	{
 		void	   *test;
 
 		/*
-		 * Check whether transaction committed using the decoding snapshot
-		 * meaning of ->xip.
+		 * Check whether transaction committed using the decoding snapshot's
+		 * committed_xids array.
 		 */
-		test = bsearch(&xid, snap->xip, snap->xcnt,
+		test = bsearch(&xid, historicsnap->committed_xids, historicsnap->xcnt,
 					   sizeof(TransactionId), xidComparator);
 
 		if (test == NULL)
@@ -513,18 +508,27 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 						 errmsg("initial slot snapshot too large")));
 
-			newxip[newxcnt++] = xid;
+			mvccsnap->xip[newxcnt++] = xid;
 		}
 
 		TransactionIdAdvance(xid);
 	}
-
-	/* adjust remaining snapshot fields as needed */
-	snap->snapshot_type = SNAPSHOT_MVCC;
-	snap->xcnt = newxcnt;
-	snap->xip = newxip;
-
-	return snap;
+	mvccsnap->xcnt = newxcnt;
+
+	/* Initialize remaining MVCCSnapshot fields */
+	mvccsnap->subxip = NULL;
+	mvccsnap->subxcnt = 0;
+	mvccsnap->suboverflowed = false;
+	mvccsnap->takenDuringRecovery = false;
+	mvccsnap->copied = true;
+	mvccsnap->curcid = FirstCommandId;
+	mvccsnap->active_count = 0;
+	mvccsnap->regd_count = 0;
+	mvccsnap->snapXactCompletionCount = 0;
+
+	pfree(historicsnap);
+
+	return mvccsnap;
 }
 
 /*
@@ -538,7 +542,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 const char *
 SnapBuildExportSnapshot(SnapBuild *builder)
 {
-	Snapshot	snap;
+	MVCCSnapshot snap;
 	char	   *snapname;
 
 	if (IsTransactionOrTransactionBlock())
@@ -575,7 +579,7 @@ SnapBuildExportSnapshot(SnapBuild *builder)
 /*
  * Ensure there is a snapshot and if not build one for current transaction.
  */
-Snapshot
+HistoricMVCCSnapshot
 SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
 {
 	Assert(builder->state == SNAPBUILD_CONSISTENT);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1028919aecb..1a7a35e25eb 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1307,7 +1307,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		}
 		else if (snapshot_action == CRS_USE_SNAPSHOT)
 		{
-			Snapshot	snap;
+			MVCCSnapshot snap;
 
 			snap = SnapBuildInitialSnapshot(ctx->snapshot_builder);
 			RestoreTransactionSnapshot(snap, MyProc);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e5b945a9ee3..535755614a9 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2092,7 +2092,7 @@ GetMaxSnapshotSubxidCount(void)
  * least in the case we already hold a snapshot), but that's for another day.
  */
 static bool
-GetSnapshotDataReuse(Snapshot snapshot)
+GetSnapshotDataReuse(MVCCSnapshot snapshot)
 {
 	uint64		curXactCompletionCount;
 
@@ -2171,8 +2171,8 @@ GetSnapshotDataReuse(Snapshot snapshot)
  * Note: this function should probably not be called with an argument that's
  * not statically allocated (see xip allocation below).
  */
-Snapshot
-GetSnapshotData(Snapshot snapshot)
+MVCCSnapshot
+GetSnapshotData(MVCCSnapshot snapshot)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId *other_xids = ProcGlobal->xids;
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a053981..dd52782ff22 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -449,10 +449,10 @@ static void SerialSetActiveSerXmin(TransactionId xid);
 
 static uint32 predicatelock_hash(const void *key, Size keysize);
 static void SummarizeOldestCommittedSxact(void);
-static Snapshot GetSafeSnapshot(Snapshot origSnapshot);
-static Snapshot GetSerializableTransactionSnapshotInt(Snapshot snapshot,
-													  VirtualTransactionId *sourcevxid,
-													  int sourcepid);
+static MVCCSnapshot GetSafeSnapshot(MVCCSnapshot origSnapshot);
+static MVCCSnapshot GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
+														  VirtualTransactionId *sourcevxid,
+														  int sourcepid);
 static bool PredicateLockExists(const PREDICATELOCKTARGETTAG *targettag);
 static bool GetParentPredicateLockTag(const PREDICATELOCKTARGETTAG *tag,
 									  PREDICATELOCKTARGETTAG *parent);
@@ -1544,10 +1544,10 @@ SummarizeOldestCommittedSxact(void)
  *		for), the passed-in Snapshot pointer should reference a static data
  *		area that can safely be passed to GetSnapshotData.
  */
-static Snapshot
-GetSafeSnapshot(Snapshot origSnapshot)
+static MVCCSnapshot
+GetSafeSnapshot(MVCCSnapshot origSnapshot)
 {
-	Snapshot	snapshot;
+	MVCCSnapshot snapshot;
 
 	Assert(XactReadOnly && XactDeferrable);
 
@@ -1668,8 +1668,8 @@ GetSafeSnapshotBlockingPids(int blocked_pid, int *output, int output_size)
  * always this same pointer; no new snapshot data structure is allocated
  * within this function.
  */
-Snapshot
-GetSerializableTransactionSnapshot(Snapshot snapshot)
+MVCCSnapshot
+GetSerializableTransactionSnapshot(MVCCSnapshot snapshot)
 {
 	Assert(IsolationIsSerializable());
 
@@ -1709,7 +1709,7 @@ GetSerializableTransactionSnapshot(Snapshot snapshot)
  * read-only.
  */
 void
-SetSerializableTransactionSnapshot(Snapshot snapshot,
+SetSerializableTransactionSnapshot(MVCCSnapshot snapshot,
 								   VirtualTransactionId *sourcevxid,
 								   int sourcepid)
 {
@@ -1750,8 +1750,8 @@ SetSerializableTransactionSnapshot(Snapshot snapshot,
  * source xact is still running after we acquire SerializableXactHashLock.
  * We do that by calling ProcArrayInstallImportedXmin.
  */
-static Snapshot
-GetSerializableTransactionSnapshotInt(Snapshot snapshot,
+static MVCCSnapshot
+GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
 									  VirtualTransactionId *sourcevxid,
 									  int sourcepid)
 {
@@ -3961,12 +3961,12 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
 static bool
 XidIsConcurrent(TransactionId xid)
 {
-	Snapshot	snap;
+	MVCCSnapshot snap;
 
 	Assert(TransactionIdIsValid(xid));
 	Assert(!TransactionIdEquals(xid, GetTopTransactionIdIfAny()));
 
-	snap = GetTransactionSnapshot();
+	snap = (MVCCSnapshot) GetTransactionSnapshot();
 
 	if (TransactionIdPrecedes(xid, snap->xmin))
 		return false;
@@ -4214,7 +4214,7 @@ CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
 		}
 		else if (!SxactIsDoomed(sxact)
 				 && (!SxactIsCommitted(sxact)
-					 || TransactionIdPrecedes(GetTransactionSnapshot()->xmin,
+					 || TransactionIdPrecedes(TransactionXmin,
 											  sxact->finishedBefore))
 				 && !RWConflictExists(sxact, MySerializableXact))
 		{
@@ -4227,7 +4227,7 @@ CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
 			 */
 			if (!SxactIsDoomed(sxact)
 				&& (!SxactIsCommitted(sxact)
-					|| TransactionIdPrecedes(GetTransactionSnapshot()->xmin,
+					|| TransactionIdPrecedes(TransactionXmin,
 											 sxact->finishedBefore))
 				&& !RWConflictExists(sxact, MySerializableXact))
 			{
diff --git a/src/backend/utils/adt/xid8funcs.c b/src/backend/utils/adt/xid8funcs.c
index 1da3964ca6f..d4aa8ef9e4e 100644
--- a/src/backend/utils/adt/xid8funcs.c
+++ b/src/backend/utils/adt/xid8funcs.c
@@ -372,10 +372,10 @@ pg_current_snapshot(PG_FUNCTION_ARGS)
 	pg_snapshot *snap;
 	uint32		nxip,
 				i;
-	Snapshot	cur;
+	MVCCSnapshot cur;
 	FullTransactionId next_fxid = ReadNextFullTransactionId();
 
-	cur = GetActiveSnapshot();
+	cur = (MVCCSnapshot) GetActiveSnapshot();
 	if (cur == NULL)
 		elog(ERROR, "no active snapshot set");
 
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..78adb6d575a 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -137,18 +137,18 @@
  * These SnapshotData structs are static to simplify memory allocation
  * (see the hack in GetSnapshotData to avoid repeated malloc/free).
  */
-static SnapshotData CurrentSnapshotData = {SNAPSHOT_MVCC};
-static SnapshotData SecondarySnapshotData = {SNAPSHOT_MVCC};
-static SnapshotData CatalogSnapshotData = {SNAPSHOT_MVCC};
+static MVCCSnapshotData CurrentSnapshotData = {SNAPSHOT_MVCC};
+static MVCCSnapshotData SecondarySnapshotData = {SNAPSHOT_MVCC};
+static MVCCSnapshotData CatalogSnapshotData = {SNAPSHOT_MVCC};
 SnapshotData SnapshotSelfData = {SNAPSHOT_SELF};
 SnapshotData SnapshotAnyData = {SNAPSHOT_ANY};
 SnapshotData SnapshotToastData = {SNAPSHOT_TOAST};
 
 /* Pointers to valid snapshots */
-static Snapshot CurrentSnapshot = NULL;
-static Snapshot SecondarySnapshot = NULL;
-static Snapshot CatalogSnapshot = NULL;
-static Snapshot HistoricSnapshot = NULL;
+static MVCCSnapshot CurrentSnapshot = NULL;
+static MVCCSnapshot SecondarySnapshot = NULL;
+static MVCCSnapshot CatalogSnapshot = NULL;
+static HistoricMVCCSnapshot HistoricSnapshot = NULL;
 
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
@@ -171,7 +171,7 @@ static HTAB *tuplecid_data = NULL;
  */
 typedef struct ActiveSnapshotElt
 {
-	Snapshot	as_snap;
+	MVCCSnapshot as_snap;
 	int			as_level;
 	struct ActiveSnapshotElt *as_next;
 } ActiveSnapshotElt;
@@ -196,7 +196,7 @@ bool		FirstSnapshotSet = false;
  * FirstSnapshotSet in combination with IsolationUsesXactSnapshot(), because
  * GUC may be reset before us, changing the value of IsolationUsesXactSnapshot.
  */
-static Snapshot FirstXactSnapshot = NULL;
+static MVCCSnapshot FirstXactSnapshot = NULL;
 
 /* Define pathname of exported-snapshot files */
 #define SNAPSHOT_EXPORT_DIR "pg_snapshots"
@@ -205,16 +205,16 @@ static Snapshot FirstXactSnapshot = NULL;
 typedef struct ExportedSnapshot
 {
 	char	   *snapfile;
-	Snapshot	snapshot;
+	MVCCSnapshot snapshot;
 } ExportedSnapshot;
 
 /* Current xact's exported snapshots (a list of ExportedSnapshot structs) */
 static List *exportedSnapshots = NIL;
 
 /* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
+static MVCCSnapshot CopyMVCCSnapshot(MVCCSnapshot snapshot);
 static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
+static void FreeMVCCSnapshot(MVCCSnapshot snapshot);
 static void SnapshotResetXmin(void);
 
 /* ResourceOwner callbacks to track snapshot references */
@@ -308,8 +308,9 @@ GetTransactionSnapshot(void)
 				CurrentSnapshot = GetSerializableTransactionSnapshot(&CurrentSnapshotData);
 			else
 				CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
+
 			/* Make a saved copy */
-			CurrentSnapshot = CopySnapshot(CurrentSnapshot);
+			CurrentSnapshot = CopyMVCCSnapshot(CurrentSnapshot);
 			FirstXactSnapshot = CurrentSnapshot;
 			/* Mark it as "registered" in FirstXactSnapshot */
 			FirstXactSnapshot->regd_count++;
@@ -319,18 +320,18 @@ GetTransactionSnapshot(void)
 			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
 
 		FirstSnapshotSet = true;
-		return CurrentSnapshot;
+		return (Snapshot) CurrentSnapshot;
 	}
 
 	if (IsolationUsesXactSnapshot())
-		return CurrentSnapshot;
+		return (Snapshot) CurrentSnapshot;
 
 	/* Don't allow catalog snapshot to be older than xact snapshot. */
 	InvalidateCatalogSnapshot();
 
 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
 
-	return CurrentSnapshot;
+	return (Snapshot) CurrentSnapshot;
 }
 
 /*
@@ -361,7 +362,7 @@ GetLatestSnapshot(void)
 
 	SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData);
 
-	return SecondarySnapshot;
+	return (Snapshot) SecondarySnapshot;
 }
 
 /*
@@ -380,7 +381,7 @@ GetCatalogSnapshot(Oid relid)
 	 * finishing decoding.
 	 */
 	if (HistoricSnapshotActive())
-		return HistoricSnapshot;
+		return (Snapshot) HistoricSnapshot;
 
 	return GetNonHistoricCatalogSnapshot(relid);
 }
@@ -426,7 +427,7 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 		pairingheap_add(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 	}
 
-	return CatalogSnapshot;
+	return (Snapshot) CatalogSnapshot;
 }
 
 /*
@@ -495,7 +496,7 @@ SnapshotSetCommandId(CommandId curcid)
  * in GetTransactionSnapshot.
  */
 static void
-SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
+SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid,
 					   int sourcepid, PGPROC *sourceproc)
 {
 	/* Caller should have checked this already */
@@ -574,7 +575,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 			SetSerializableTransactionSnapshot(CurrentSnapshot, sourcevxid,
 											   sourcepid);
 		/* Make a saved copy */
-		CurrentSnapshot = CopySnapshot(CurrentSnapshot);
+		CurrentSnapshot = CopyMVCCSnapshot(CurrentSnapshot);
 		FirstXactSnapshot = CurrentSnapshot;
 		/* Mark it as "registered" in FirstXactSnapshot */
 		FirstXactSnapshot->regd_count++;
@@ -585,29 +586,27 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 }
 
 /*
- * CopySnapshot
+ * CopyMVCCSnapshot
  *		Copy the given snapshot.
  *
  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
  * to 0.  The returned snapshot has the copied flag set.
  */
-static Snapshot
-CopySnapshot(Snapshot snapshot)
+static MVCCSnapshot
+CopyMVCCSnapshot(MVCCSnapshot snapshot)
 {
-	Snapshot	newsnap;
+	MVCCSnapshot newsnap;
 	Size		subxipoff;
 	Size		size;
 
-	Assert(snapshot != InvalidSnapshot);
-
 	/* We allocate any XID arrays needed in the same palloc block. */
-	size = subxipoff = sizeof(SnapshotData) +
+	size = subxipoff = sizeof(MVCCSnapshotData) +
 		snapshot->xcnt * sizeof(TransactionId);
 	if (snapshot->subxcnt > 0)
 		size += snapshot->subxcnt * sizeof(TransactionId);
 
-	newsnap = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
-	memcpy(newsnap, snapshot, sizeof(SnapshotData));
+	newsnap = (MVCCSnapshot) MemoryContextAlloc(TopTransactionContext, size);
+	memcpy(newsnap, snapshot, sizeof(MVCCSnapshotData));
 
 	newsnap->regd_count = 0;
 	newsnap->active_count = 0;
@@ -644,11 +643,11 @@ CopySnapshot(Snapshot snapshot)
 }
 
 /*
- * FreeSnapshot
+ * FreeMVCCSnapshot
  *		Free the memory associated with a snapshot.
  */
 static void
-FreeSnapshot(Snapshot snapshot)
+FreeMVCCSnapshot(MVCCSnapshot snapshot)
 {
 	Assert(snapshot->regd_count == 0);
 	Assert(snapshot->active_count == 0);
@@ -664,6 +663,8 @@ FreeSnapshot(Snapshot snapshot)
  * If the passed snapshot is a statically-allocated one, or it is possibly
  * subject to a future command counter update, create a new long-lived copy
  * with active refcount=1.  Otherwise, only increment the refcount.
+ *
+ * Only regular MVCC snapshots can be used as the active snapshot.
  */
 void
 PushActiveSnapshot(Snapshot snapshot)
@@ -682,9 +683,12 @@ PushActiveSnapshot(Snapshot snapshot)
 void
 PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 {
+	MVCCSnapshot origsnap;
 	ActiveSnapshotElt *newactive;
 
-	Assert(snapshot != InvalidSnapshot);
+	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+	origsnap = &snapshot->mvcc;
+
 	Assert(ActiveSnapshot == NULL || snap_level >= ActiveSnapshot->as_level);
 
 	newactive = MemoryContextAlloc(TopTransactionContext, sizeof(ActiveSnapshotElt));
@@ -693,11 +697,11 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 	 * Checking SecondarySnapshot is probably useless here, but it seems
 	 * better to be sure.
 	 */
-	if (snapshot == CurrentSnapshot || snapshot == SecondarySnapshot ||
-		!snapshot->copied)
-		newactive->as_snap = CopySnapshot(snapshot);
+	if (origsnap == CurrentSnapshot || origsnap == SecondarySnapshot ||
+		!origsnap->copied)
+		newactive->as_snap = CopyMVCCSnapshot(origsnap);
 	else
-		newactive->as_snap = snapshot;
+		newactive->as_snap = origsnap;
 
 	newactive->as_next = ActiveSnapshot;
 	newactive->as_level = snap_level;
@@ -718,7 +722,8 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 void
 PushCopiedSnapshot(Snapshot snapshot)
 {
-	PushActiveSnapshot(CopySnapshot(snapshot));
+	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+	PushActiveSnapshot((Snapshot) CopyMVCCSnapshot(&snapshot->mvcc));
 }
 
 /*
@@ -771,7 +776,7 @@ PopActiveSnapshot(void)
 
 	if (ActiveSnapshot->as_snap->active_count == 0 &&
 		ActiveSnapshot->as_snap->regd_count == 0)
-		FreeSnapshot(ActiveSnapshot->as_snap);
+		FreeMVCCSnapshot(ActiveSnapshot->as_snap);
 
 	pfree(ActiveSnapshot);
 	ActiveSnapshot = newstack;
@@ -788,7 +793,7 @@ GetActiveSnapshot(void)
 {
 	Assert(ActiveSnapshot != NULL);
 
-	return ActiveSnapshot->as_snap;
+	return (Snapshot) ActiveSnapshot->as_snap;
 }
 
 /*
@@ -805,7 +810,8 @@ ActiveSnapshotSet(void)
  * RegisterSnapshot
  *		Register a snapshot as being in use by the current resource owner
  *
- * If InvalidSnapshot is passed, it is not registered.
+ * Only regular MVCC snapshots and "historic" MVCC snapshots can be registered.
+ * InvalidSnapshot is also accepted, as a no-op.
  */
 Snapshot
 RegisterSnapshot(Snapshot snapshot)
@@ -821,25 +827,39 @@ RegisterSnapshot(Snapshot snapshot)
  *		As above, but use the specified resource owner
  */
 Snapshot
-RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner)
+RegisterSnapshotOnOwner(Snapshot orig_snapshot, ResourceOwner owner)
 {
-	Snapshot	snap;
+	MVCCSnapshot snapshot;
 
-	if (snapshot == InvalidSnapshot)
+	if (orig_snapshot == InvalidSnapshot)
 		return InvalidSnapshot;
 
+	if (orig_snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
+	{
+		HistoricMVCCSnapshot historicsnap = &orig_snapshot->historic_mvcc;
+
+		ResourceOwnerEnlarge(owner);
+		historicsnap->regd_count++;
+		ResourceOwnerRememberSnapshot(owner, (Snapshot) historicsnap);
+
+		return (Snapshot) historicsnap;
+	}
+
+	Assert(orig_snapshot->snapshot_type == SNAPSHOT_MVCC);
+	snapshot = &orig_snapshot->mvcc;
+
 	/* Static snapshot?  Create a persistent copy */
-	snap = snapshot->copied ? snapshot : CopySnapshot(snapshot);
+	snapshot = snapshot->copied ? snapshot : CopyMVCCSnapshot(snapshot);
 
 	/* and tell resowner.c about it */
 	ResourceOwnerEnlarge(owner);
-	snap->regd_count++;
-	ResourceOwnerRememberSnapshot(owner, snap);
+	snapshot->regd_count++;
+	ResourceOwnerRememberSnapshot(owner, (Snapshot) snapshot);
 
-	if (snap->regd_count == 1)
-		pairingheap_add(&RegisteredSnapshots, &snap->ph_node);
+	if (snapshot->regd_count == 1)
+		pairingheap_add(&RegisteredSnapshots, &snapshot->ph_node);
 
-	return snap;
+	return (Snapshot) snapshot;
 }
 
 /*
@@ -875,18 +895,41 @@ UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner)
 static void
 UnregisterSnapshotNoOwner(Snapshot snapshot)
 {
-	Assert(snapshot->regd_count > 0);
-	Assert(!pairingheap_is_empty(&RegisteredSnapshots));
+	if (snapshot->snapshot_type == SNAPSHOT_MVCC)
+	{
+		MVCCSnapshot mvccsnap = &snapshot->mvcc;
+
+		Assert(mvccsnap->regd_count > 0);
+		Assert(!pairingheap_is_empty(&RegisteredSnapshots));
 
-	snapshot->regd_count--;
-	if (snapshot->regd_count == 0)
-		pairingheap_remove(&RegisteredSnapshots, &snapshot->ph_node);
+		mvccsnap->regd_count--;
+		if (mvccsnap->regd_count == 0)
+			pairingheap_remove(&RegisteredSnapshots, &mvccsnap->ph_node);
 
-	if (snapshot->regd_count == 0 && snapshot->active_count == 0)
+		if (mvccsnap->regd_count == 0 && mvccsnap->active_count == 0)
+		{
+			FreeMVCCSnapshot(mvccsnap);
+			SnapshotResetXmin();
+		}
+	}
+	else if (snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
 	{
-		FreeSnapshot(snapshot);
-		SnapshotResetXmin();
+		HistoricMVCCSnapshot historicsnap = &snapshot->historic_mvcc;
+
+		/*
+		 * Historic snapshots don't rely on the resource owner machinery for
+		 * cleanup; the snapbuild.c machinery ensures that whenever a historic
+		 * snapshot is in use, it has a non-zero refcount.  Registration is
+		 * only supported so that the callers don't need to treat regular MVCC
+		 * catalog snapshots and historic snapshots differently.
+		 */
+		Assert(historicsnap->refcount > 0);
+
+		Assert(historicsnap->regd_count > 0);
+		historicsnap->regd_count--;
 	}
+	else
+		elog(ERROR, "registered snapshot has unexpected type");
 }
 
 /*
@@ -896,8 +939,8 @@ UnregisterSnapshotNoOwner(Snapshot snapshot)
 static int
 xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 {
-	const SnapshotData *asnap = pairingheap_const_container(SnapshotData, ph_node, a);
-	const SnapshotData *bsnap = pairingheap_const_container(SnapshotData, ph_node, b);
+	const MVCCSnapshotData *asnap = pairingheap_const_container(MVCCSnapshotData, ph_node, a);
+	const MVCCSnapshotData *bsnap = pairingheap_const_container(MVCCSnapshotData, ph_node, b);
 
 	if (TransactionIdPrecedes(asnap->xmin, bsnap->xmin))
 		return 1;
@@ -923,7 +966,7 @@ xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 static void
 SnapshotResetXmin(void)
 {
-	Snapshot	minSnapshot;
+	MVCCSnapshot minSnapshot;
 
 	if (ActiveSnapshot != NULL)
 		return;
@@ -934,7 +977,7 @@ SnapshotResetXmin(void)
 		return;
 	}
 
-	minSnapshot = pairingheap_container(SnapshotData, ph_node,
+	minSnapshot = pairingheap_container(MVCCSnapshotData, ph_node,
 										pairingheap_first(&RegisteredSnapshots));
 
 	if (TransactionIdPrecedes(MyProc->xmin, minSnapshot->xmin))
@@ -984,7 +1027,7 @@ AtSubAbort_Snapshot(int level)
 
 		if (ActiveSnapshot->as_snap->active_count == 0 &&
 			ActiveSnapshot->as_snap->regd_count == 0)
-			FreeSnapshot(ActiveSnapshot->as_snap);
+			FreeMVCCSnapshot(ActiveSnapshot->as_snap);
 
 		/* and free the stack element */
 		pfree(ActiveSnapshot);
@@ -1006,7 +1049,7 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	 * In transaction-snapshot mode we must release our privately-managed
 	 * reference to the transaction snapshot.  We must remove it from
 	 * RegisteredSnapshots to keep the check below happy.  But we don't bother
-	 * to do FreeSnapshot, for two reasons: the memory will go away with
+	 * to do FreeMVCCSnapshot, for two reasons: the memory will go away with
 	 * TopTransactionContext anyway, and if someone has left the snapshot
 	 * stacked as active, we don't want the code below to be chasing through a
 	 * dangling pointer.
@@ -1099,7 +1142,7 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
  *		snapshot.
  */
 char *
-ExportSnapshot(Snapshot snapshot)
+ExportSnapshot(MVCCSnapshot snapshot)
 {
 	TransactionId topXid;
 	TransactionId *children;
@@ -1163,7 +1206,7 @@ ExportSnapshot(Snapshot snapshot)
 	 * ensure that the snapshot's xmin is honored for the rest of the
 	 * transaction.
 	 */
-	snapshot = CopySnapshot(snapshot);
+	snapshot = CopyMVCCSnapshot(snapshot);
 
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 	esnap = (ExportedSnapshot *) palloc(sizeof(ExportedSnapshot));
@@ -1280,7 +1323,7 @@ pg_export_snapshot(PG_FUNCTION_ARGS)
 {
 	char	   *snapshotName;
 
-	snapshotName = ExportSnapshot(GetActiveSnapshot());
+	snapshotName = ExportSnapshot((MVCCSnapshot) GetActiveSnapshot());
 	PG_RETURN_TEXT_P(cstring_to_text(snapshotName));
 }
 
@@ -1384,7 +1427,7 @@ ImportSnapshot(const char *idstr)
 	Oid			src_dbid;
 	int			src_isolevel;
 	bool		src_readonly;
-	SnapshotData snapshot;
+	MVCCSnapshotData snapshot;
 
 	/*
 	 * Must be at top level of a fresh transaction.  Note in particular that
@@ -1653,7 +1696,7 @@ HaveRegisteredOrActiveSnapshot(void)
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(HistoricMVCCSnapshot historic_snapshot, HTAB *tuplecids)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -1696,11 +1739,10 @@ HistoricSnapshotGetTupleCids(void)
  * SerializedSnapshotData.
  */
 Size
-EstimateSnapshotSpace(Snapshot snapshot)
+EstimateSnapshotSpace(MVCCSnapshot snapshot)
 {
 	Size		size;
 
-	Assert(snapshot != InvalidSnapshot);
 	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
 
 	/* We allocate any XID arrays needed in the same palloc block. */
@@ -1720,7 +1762,7 @@ EstimateSnapshotSpace(Snapshot snapshot)
  *		memory location at start_address.
  */
 void
-SerializeSnapshot(Snapshot snapshot, char *start_address)
+SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
 {
 	SerializedSnapshotData serialized_snapshot;
 
@@ -1776,12 +1818,12 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
  * to 0.  The returned snapshot has the copied flag set.
  */
-Snapshot
+MVCCSnapshot
 RestoreSnapshot(char *start_address)
 {
 	SerializedSnapshotData serialized_snapshot;
 	Size		size;
-	Snapshot	snapshot;
+	MVCCSnapshot snapshot;
 	TransactionId *serialized_xids;
 
 	memcpy(&serialized_snapshot, start_address,
@@ -1790,12 +1832,12 @@ RestoreSnapshot(char *start_address)
 		(start_address + sizeof(SerializedSnapshotData));
 
 	/* We allocate any XID arrays needed in the same palloc block. */
-	size = sizeof(SnapshotData)
+	size = sizeof(MVCCSnapshotData)
 		+ serialized_snapshot.xcnt * sizeof(TransactionId)
 		+ serialized_snapshot.subxcnt * sizeof(TransactionId);
 
 	/* Copy all required fields */
-	snapshot = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
+	snapshot = (MVCCSnapshot) MemoryContextAlloc(TopTransactionContext, size);
 	snapshot->snapshot_type = SNAPSHOT_MVCC;
 	snapshot->xmin = serialized_snapshot.xmin;
 	snapshot->xmax = serialized_snapshot.xmax;
@@ -1840,7 +1882,7 @@ RestoreSnapshot(char *start_address)
  * the declaration for PGPROC.
  */
 void
-RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
+RestoreTransactionSnapshot(MVCCSnapshot snapshot, void *source_pgproc)
 {
 	SetTransactionSnapshot(snapshot, NULL, InvalidPid, source_pgproc);
 }
@@ -1856,7 +1898,7 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
  * XID could not be ours anyway.
  */
 bool
-XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
+XidInMVCCSnapshot(TransactionId xid, MVCCSnapshot snapshot)
 {
 	/*
 	 * Make a quick range check to eliminate most XIDs without looking at the
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..3d3ea109a4c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -431,7 +431,7 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
  */
 struct HTAB;
 extern bool ResolveCminCmaxDuringDecoding(struct HTAB *tuplecid_data,
-										  Snapshot snapshot,
+										  HistoricMVCCSnapshot snapshot,
 										  HeapTuple htup,
 										  Buffer buffer,
 										  CommandId *cmin, CommandId *cmax);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..2626f2996d8 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -34,7 +34,7 @@ typedef struct TableScanDescData
 {
 	/* scan parameters */
 	Relation	rs_rd;			/* heap relation descriptor */
-	struct SnapshotData *rs_snapshot;	/* snapshot to see */
+	union SnapshotData *rs_snapshot;	/* snapshot to see */
 	int			rs_nkeys;		/* number of scan keys */
 	struct ScanKeyData *rs_key; /* array of scan key descriptors */
 
@@ -135,7 +135,7 @@ typedef struct IndexScanDescData
 	/* scan parameters */
 	Relation	heapRelation;	/* heap relation descriptor, or NULL */
 	Relation	indexRelation;	/* index relation descriptor */
-	struct SnapshotData *xs_snapshot;	/* snapshot to see */
+	union SnapshotData *xs_snapshot;	/* snapshot to see */
 	int			numberOfKeys;	/* number of index qualifier conditions */
 	int			numberOfOrderBys;	/* number of ordering operators */
 	struct ScanKeyData *keyData;	/* array of index qualifier descriptors */
@@ -210,7 +210,7 @@ typedef struct SysScanDescData
 	Relation	irel;			/* NULL if doing heap scan */
 	struct TableScanDescData *scan; /* only valid in storage-scan case */
 	struct IndexScanDescData *iscan;	/* only valid in index-scan case */
-	struct SnapshotData *snapshot;	/* snapshot to unregister at end of scan */
+	union SnapshotData *snapshot;	/* snapshot to unregister at end of scan */
 	struct TupleTableSlot *slot;
 }			SysScanDescData;
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3be0cbd7ebe..8bf72c64c94 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -127,7 +127,7 @@ typedef struct ReorderBufferChange
 		}			msg;
 
 		/* New snapshot, set when action == *_INTERNAL_SNAPSHOT */
-		Snapshot	snapshot;
+		HistoricMVCCSnapshot snapshot;
 
 		/*
 		 * New command id for existing snapshot in a catalog changing tx. Set
@@ -359,7 +359,7 @@ typedef struct ReorderBufferTXN
 	 * transaction modifies the catalog, or another catalog-modifying
 	 * transaction commits.
 	 */
-	Snapshot	base_snapshot;
+	HistoricMVCCSnapshot base_snapshot;
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
@@ -367,7 +367,7 @@ typedef struct ReorderBufferTXN
 	 * Snapshot/CID from the previous streaming run. Only valid for already
 	 * streamed transactions (NULL/InvalidCommandId otherwise).
 	 */
-	Snapshot	snapshot_now;
+	HistoricMVCCSnapshot snapshot_now;
 	CommandId	command_id;
 
 	/*
@@ -703,7 +703,7 @@ extern void ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid,
 									 XLogRecPtr lsn, ReorderBufferChange *change,
 									 bool toast_insert);
 extern void ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
-									  Snapshot snap, XLogRecPtr lsn,
+									  HistoricMVCCSnapshot snap, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
 extern void ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
@@ -727,9 +727,9 @@ extern void ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr
 extern void ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn);
 
 extern void ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
-										 XLogRecPtr lsn, Snapshot snap);
+										 XLogRecPtr lsn, HistoricMVCCSnapshot snap);
 extern void ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
-									 XLogRecPtr lsn, Snapshot snap);
+									 XLogRecPtr lsn, HistoricMVCCSnapshot snap);
 extern void ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 										 XLogRecPtr lsn, CommandId cid);
 extern void ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 44031dcf6e3..5930ffb55a8 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -70,15 +70,15 @@ extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
 										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *builder);
 
-extern void SnapBuildSnapDecRefcount(Snapshot snap);
+extern void SnapBuildSnapDecRefcount(HistoricMVCCSnapshot snap);
 
-extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern MVCCSnapshot SnapBuildInitialSnapshot(SnapBuild *builder);
 extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
 extern void SnapBuildClearExportedSnapshot(void);
 extern void SnapBuildResetExportedSnapshotState(void);
 
 extern SnapBuildState SnapBuildCurrentState(SnapBuild *builder);
-extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder);
+extern HistoricMVCCSnapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr);
 extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
diff --git a/src/include/replication/snapbuild_internal.h b/src/include/replication/snapbuild_internal.h
index 3b915dc8793..9bed20efa31 100644
--- a/src/include/replication/snapbuild_internal.h
+++ b/src/include/replication/snapbuild_internal.h
@@ -74,7 +74,7 @@ struct SnapBuild
 	/*
 	 * Snapshot that's valid to see the catalog state seen at this moment.
 	 */
-	Snapshot	snapshot;
+	HistoricMVCCSnapshot snapshot;
 
 	/*
 	 * LSN of the last location we are sure a snapshot has been serialized to.
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 267d5d90e94..6a78dfeac96 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -47,8 +47,8 @@ extern void CheckPointPredicate(void);
 extern bool PageIsPredicateLocked(Relation relation, BlockNumber blkno);
 
 /* predicate lock maintenance */
-extern Snapshot GetSerializableTransactionSnapshot(Snapshot snapshot);
-extern void SetSerializableTransactionSnapshot(Snapshot snapshot,
+extern MVCCSnapshot GetSerializableTransactionSnapshot(MVCCSnapshot snapshot);
+extern void SetSerializableTransactionSnapshot(MVCCSnapshot snapshot,
 											   VirtualTransactionId *sourcevxid,
 											   int sourcepid);
 extern void RegisterPredicateLockingXid(TransactionId xid);
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index ef0b733ebe8..7f5727c2586 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -44,7 +44,7 @@ extern void KnownAssignedTransactionIdsIdleMaintenance(void);
 extern int	GetMaxSnapshotXidCount(void);
 extern int	GetMaxSnapshotSubxidCount(void);
 
-extern Snapshot GetSnapshotData(Snapshot snapshot);
+extern MVCCSnapshot GetSnapshotData(MVCCSnapshot snapshot);
 
 extern bool ProcArrayInstallImportedXmin(TransactionId xmin,
 										 VirtualTransactionId *sourcevxid);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index d346be71642..1f627ff966d 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -49,7 +49,7 @@ extern PGDLLIMPORT SnapshotData SnapshotToastData;
  */
 #define InitNonVacuumableSnapshot(snapshotdata, vistestp)  \
 	((snapshotdata).snapshot_type = SNAPSHOT_NON_VACUUMABLE, \
-	 (snapshotdata).vistest = (vistestp))
+	 (snapshotdata).nonvacuumable.vistest = (vistestp))
 
 /* This macro encodes the knowledge of which snapshots are MVCC-safe */
 #define IsMVCCSnapshot(snapshot)  \
@@ -89,7 +89,7 @@ extern void WaitForOlderSnapshots(TransactionId limitXmin, bool progress);
 extern bool ThereAreNoPriorRegisteredSnapshots(void);
 extern bool HaveRegisteredOrActiveSnapshot(void);
 
-extern char *ExportSnapshot(Snapshot snapshot);
+extern char *ExportSnapshot(MVCCSnapshot snapshot);
 
 /*
  * These live in procarray.c because they're intimately linked to the
@@ -105,18 +105,18 @@ extern bool GlobalVisCheckRemovableFullXid(Relation rel, FullTransactionId fxid)
 /*
  * Utility functions for implementing visibility routines in table AMs.
  */
-extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
+extern bool XidInMVCCSnapshot(TransactionId xid, MVCCSnapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot historic_snapshot, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(HistoricMVCCSnapshot historic_snapshot, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-extern Size EstimateSnapshotSpace(Snapshot snapshot);
-extern void SerializeSnapshot(Snapshot snapshot, char *start_address);
-extern Snapshot RestoreSnapshot(char *start_address);
-extern void RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc);
+extern Size EstimateSnapshotSpace(MVCCSnapshot snapshot);
+extern void SerializeSnapshot(MVCCSnapshot snapshot, char *start_address);
+extern MVCCSnapshot RestoreSnapshot(char *start_address);
+extern void RestoreTransactionSnapshot(MVCCSnapshot snapshot, void *source_pgproc);
 
 #endif							/* SNAPMGR_H */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec1497..93c1f51784f 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -17,7 +17,7 @@
 
 
 /*
- * The different snapshot types.  We use SnapshotData structures to represent
+ * The different snapshot types.  We use the SnapshotData union to represent
  * both "regular" (MVCC) snapshots and "special" snapshots that have non-MVCC
  * semantics.  The specific semantics of a snapshot are encoded by its type.
  *
@@ -27,6 +27,9 @@
  * The reason the snapshot type rather than a callback as it used to be is
  * that that allows to use the same snapshot for different table AMs without
  * having one callback per AM.
+ *
+ * The executor deals with MVCC snapshots, but the table AM and some other
+ * parts of the system also support the special snapshots.
  */
 typedef enum SnapshotType
 {
@@ -100,7 +103,9 @@ typedef enum SnapshotType
 	/*
 	 * A tuple is visible iff it follows the rules of SNAPSHOT_MVCC, but
 	 * supports being called in timetravel context (for decoding catalog
-	 * contents in the context of logical decoding).
+	 * contents in the context of logical decoding).  A historic MVCC snapshot
+	 * should only be used on catalog tables, as we only track XIDs that
+	 * modify catalogs during logical decoding.
 	 */
 	SNAPSHOT_HISTORIC_MVCC,
 
@@ -114,37 +119,18 @@ typedef enum SnapshotType
 	SNAPSHOT_NON_VACUUMABLE,
 } SnapshotType;
 
-typedef struct SnapshotData *Snapshot;
-
-#define InvalidSnapshot		((Snapshot) NULL)
-
 /*
- * Struct representing all kind of possible snapshots.
+ * Struct representing a normal MVCC snapshot.
  *
- * There are several different kinds of snapshots:
- * * Normal MVCC snapshots
- * * MVCC snapshots taken during recovery (in Hot-Standby mode)
- * * Historic MVCC snapshots used during logical decoding
- * * snapshots passed to HeapTupleSatisfiesDirty()
- * * snapshots passed to HeapTupleSatisfiesNonVacuumable()
- * * snapshots used for SatisfiesAny, Toast, Self where no members are
- *	 accessed.
- *
- * TODO: It's probably a good idea to split this struct using a NodeTag
- * similar to how parser and executor nodes are handled, with one type for
- * each different kind of snapshot to avoid overloading the meaning of
- * individual fields.
+ * MVCC snapshots come in two variants: those taken during recovery in hot
+ * standby mode, and "normal" MVCC snapshots.  They are distinguished by
+ * takenDuringRecovery.
  */
-typedef struct SnapshotData
+typedef struct MVCCSnapshotData
 {
-	SnapshotType snapshot_type; /* type of snapshot */
+	SnapshotType snapshot_type; /* type of snapshot, must be first */
 
 	/*
-	 * The remaining fields are used only for MVCC snapshots, and are normally
-	 * just zeroes in special snapshots.  (But xmin and xmax are used
-	 * specially by HeapTupleSatisfiesDirty, and xmin is used specially by
-	 * HeapTupleSatisfiesNonVacuumable.)
-	 *
 	 * An MVCC snapshot can never see the effects of XIDs >= xmax. It can see
 	 * the effects of all older XIDs except those listed in the snapshot. xmin
 	 * is stored as an optimization to avoid needing to search the XID arrays
@@ -154,10 +140,8 @@ typedef struct SnapshotData
 	TransactionId xmax;			/* all XID >= xmax are invisible to me */
 
 	/*
-	 * For normal MVCC snapshot this contains the all xact IDs that are in
-	 * progress, unless the snapshot was taken during recovery in which case
-	 * it's empty. For historic MVCC snapshots, the meaning is inverted, i.e.
-	 * it contains *committed* transactions between xmin and xmax.
+	 * xip contains all the xact IDs that are in progress, unless the snapshot
+	 * was taken during recovery in which case it's empty.
 	 *
 	 * note: all ids in xip[] satisfy xmin <= xip[i] < xmax
 	 */
@@ -165,10 +149,8 @@ typedef struct SnapshotData
 	uint32		xcnt;			/* # of xact ids in xip[] */
 
 	/*
-	 * For non-historic MVCC snapshots, this contains subxact IDs that are in
-	 * progress (and other transactions that are in progress if taken during
-	 * recovery). For historic snapshot it contains *all* xids assigned to the
-	 * replayed transaction, including the toplevel xid.
+	 * subxip contains subxact IDs that are in progress (and other
+	 * transactions that are in progress if taken during recovery).
 	 *
 	 * note: all ids in subxip[] are >= xmin, but we don't bother filtering
 	 * out any that are >= xmax
@@ -182,18 +164,6 @@ typedef struct SnapshotData
 
 	CommandId	curcid;			/* in my xact, CID < curcid are visible */
 
-	/*
-	 * An extra return value for HeapTupleSatisfiesDirty, not used in MVCC
-	 * snapshots.
-	 */
-	uint32		speculativeToken;
-
-	/*
-	 * For SNAPSHOT_NON_VACUUMABLE (and hopefully more in the future) this is
-	 * used to determine whether row could be vacuumed.
-	 */
-	struct GlobalVisState *vistest;
-
 	/*
 	 * Book-keeping information, used by the snapshot manager
 	 */
@@ -207,6 +177,97 @@ typedef struct SnapshotData
 	 * transactions completed since the last GetSnapshotData().
 	 */
 	uint64		snapXactCompletionCount;
+} MVCCSnapshotData;
+
+typedef struct MVCCSnapshotData *MVCCSnapshot;
+
+#define InvalidMVCCSnapshot ((MVCCSnapshot) NULL)
+
+/*
+ * Struct representing a "historic" MVCC snapshot during logical decoding.
+ * These are constructed by src/backend/replication/logical/snapbuild.c.
+ */
+typedef struct HistoricMVCCSnapshotData
+{
+	SnapshotType snapshot_type; /* type of snapshot, must be first */
+
+	/*
+	 * xmin and xmax like in a normal MVCC snapshot.
+	 */
+	TransactionId xmin;			/* all XID < xmin are visible to me */
+	TransactionId xmax;			/* all XID >= xmax are invisible to me */
+
+	/*
+	 * committed_xids contains *committed* transactions between xmin and xmax.
+	 * (This is the inverse of 'xip' in normal MVCC snapshots, which contains
+	 * all non-committed transactions.)  The array is sorted by XID to allow
+	 * binary search.
+	 *
+	 * note: all ids in committed_xids[] satisfy xmin <= committed_xids[i] <
+	 * xmax
+	 */
+	TransactionId *committed_xids;
+	uint32		xcnt;			/* # of xact ids in committed_xids[] */
+
+	/*
+	 * curxip contains *all* xids assigned to the replayed transaction,
+	 * including the toplevel xid.
+	 */
+	TransactionId *curxip;
+	int32		curxcnt;		/* # of xact ids in curxip[] */
+
+	CommandId	curcid;			/* in my xact, CID < curcid are visible */
+
+	bool		copied;			/* false if it's a "base" snapshot */
+
+	uint32		refcount;		/* refcount managed by snapbuild.c  */
+	uint32		regd_count;		/* refcount registered with resource owners */
+
+} HistoricMVCCSnapshotData;
+
+typedef struct HistoricMVCCSnapshotData *HistoricMVCCSnapshot;
+
+/*
+ * Struct representing a special "snapshot" which sees all tuples as visible
+ * if they are visible to anyone, i.e. if they are not vacuumable.  This is
+ * the struct used for SNAPSHOT_NON_VACUUMABLE.
+ */
+typedef struct NonVacuumableSnapshotData
+{
+	SnapshotType snapshot_type; /* type of snapshot, must be first */
+
+	/* This is used to determine whether row could be vacuumed. */
+	struct GlobalVisState *vistest;
+} NonVacuumableSnapshotData;
+
+/*
+ * Return values to the caller of HeapTupleSatisfiesDirty.
+ */
+typedef struct DirtySnapshotData
+{
+	SnapshotType snapshot_type; /* type of snapshot, must be first */
+
+	TransactionId xmin;
+	TransactionId xmax;
+	uint32		speculativeToken;
+} DirtySnapshotData;
+
+/*
+ * Generic union representing all kinds of possible snapshots.  Some have
+ * type-specific structs.
+ */
+typedef union SnapshotData
+{
+	SnapshotType snapshot_type; /* type of snapshot */
+
+	MVCCSnapshotData mvcc;
+	DirtySnapshotData dirty;
+	HistoricMVCCSnapshotData historic_mvcc;
+	NonVacuumableSnapshotData nonvacuumable;
 } SnapshotData;
 
+typedef union SnapshotData *Snapshot;
+
+#define InvalidSnapshot		((Snapshot) NULL)
+
 #endif							/* SNAPSHOT_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b66cecd8799..c8ed18cf580 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -633,6 +633,7 @@ DictThesaurus
 DimensionInfo
 DirectoryMethodData
 DirectoryMethodFile
+DirtySnapshotData
 DisableTimeoutParams
 DiscardMode
 DiscardStmt
@@ -1183,6 +1184,7 @@ HeapTupleFreeze
 HeapTupleHeader
 HeapTupleHeaderData
 HeapTupleTableSlot
+HistoricMVCCSnapshotData
 HistControl
 HotStandbyState
 I32
@@ -1633,6 +1635,7 @@ MINIDUMPWRITEDUMP
 MINIDUMP_TYPE
 MJEvalResult
 MTTargetRelLookup
+MVCCSnapshotData
 MVDependencies
 MVDependency
 MVNDistinct
@@ -1732,6 +1735,7 @@ NextValueExpr
 Node
 NodeTag
 NonEmptyRange
+NonVacuumableSnapshotData
 Notification
 NotificationList
 NotifyStmt
-- 
2.39.5
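
For readers following along, here is a rough sketch of the layout idea behind the SnapshotData union in the patch above: every variant struct starts with snapshot_type, so the type can be checked through the union before a variant is touched. This is not code from the patch. The Sketch* names and sketch_as_mvcc() are made up for illustration, and the structs are cut down to a few fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the types in snapshot.h. */
typedef enum SketchSnapshotType
{
	SKETCH_SNAPSHOT_MVCC,
	SKETCH_SNAPSHOT_DIRTY
} SketchSnapshotType;

typedef struct SketchMVCCSnapshotData
{
	SketchSnapshotType snapshot_type;	/* type of snapshot, must be first */
	uint32_t	xmin;
	uint32_t	xmax;
} SketchMVCCSnapshotData;

typedef struct SketchDirtySnapshotData
{
	SketchSnapshotType snapshot_type;	/* type of snapshot, must be first */
	uint32_t	xmin;
	uint32_t	xmax;
	uint32_t	speculativeToken;
} SketchDirtySnapshotData;

/* The generic union: the type tag overlays the first field of each variant. */
typedef union SketchSnapshotData
{
	SketchSnapshotType snapshot_type;
	SketchMVCCSnapshotData mvcc;
	SketchDirtySnapshotData dirty;
} SketchSnapshotData;

/*
 * Return the MVCC variant if the snapshot is an MVCC snapshot, else NULL.
 * (Hypothetical helper; the patch uses an Assert on snapshot_type followed
 * by &snapshot->mvcc at the call sites.)
 */
SketchMVCCSnapshotData *
sketch_as_mvcc(SketchSnapshotData *snapshot)
{
	assert(snapshot != NULL);
	if (snapshot->snapshot_type != SKETCH_SNAPSHOT_MVCC)
		return NULL;
	return &snapshot->mvcc;
}
```

Because the tag sits at offset zero in every member, code that only has the generic pointer can dispatch on snapshot_type first, and functions that only make sense for one kind of snapshot can take the narrower pointer type in their signature.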

Attachment: v6-0002-Simplify-historic-snapshot-refcounting.patch (text/x-patch; charset=UTF-8)

From 3228848876610c7b13216ffca6b42a9f5465e300 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 13 Mar 2025 16:45:12 +0200
Subject: [PATCH v6 02/12] Simplify historic snapshot refcounting

ReorderBufferProcessTXN() handled "copied" snapshots created with
ReorderBufferCopySnap() differently from "base" historic snapshots
created by snapbuild.c. The base snapshots used a reference count,
while copied snapshots did not. Simplify by using the reference count
for both.
---
 .../replication/logical/reorderbuffer.c       | 97 ++++++++-----------
 src/backend/replication/logical/snapbuild.c   | 48 +--------
 src/include/replication/snapbuild.h           |  1 +
 src/include/utils/snapshot.h                  |  2 -
 4 files changed, 46 insertions(+), 102 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e8196a8d5d5..e47970f1c82 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -103,7 +103,7 @@
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
 #include "replication/slot.h"
-#include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
+#include "replication/snapbuild.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
@@ -268,7 +268,6 @@ static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
 static int	ReorderBufferTXNSizeCompare(const pairingheap_node *a, const pairingheap_node *b, void *arg);
 
-static void ReorderBufferFreeSnap(ReorderBuffer *rb, HistoricMVCCSnapshot snap);
 static HistoricMVCCSnapshot ReorderBufferCopySnap(ReorderBuffer *rb, HistoricMVCCSnapshot orig_snap,
 												  ReorderBufferTXN *txn, CommandId cid);
 
@@ -543,7 +542,7 @@ ReorderBufferFreeChange(ReorderBuffer *rb, ReorderBufferChange *change,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
-				ReorderBufferFreeSnap(rb, change->data.snapshot);
+				SnapBuildSnapDecRefcount(change->data.snapshot);
 				change->data.snapshot = NULL;
 			}
 			break;
@@ -1593,7 +1592,8 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	if (txn->snapshot_now != NULL)
 	{
 		Assert(rbtxn_is_streamed(txn));
-		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		SnapBuildSnapDecRefcount(txn->snapshot_now);
+		txn->snapshot_now = NULL;
 	}
 
 	/*
@@ -1902,7 +1902,6 @@ ReorderBufferCopySnap(ReorderBuffer *rb, HistoricMVCCSnapshot orig_snap,
 	snap = MemoryContextAllocZero(rb->context, size);
 	memcpy(snap, orig_snap, sizeof(HistoricMVCCSnapshotData));
 
-	snap->copied = true;
 	snap->refcount = 1;			/* mark as active so nobody frees it */
 	snap->regd_count = 0;
 	snap->committed_xids = (TransactionId *) (snap + 1);
@@ -1942,18 +1941,6 @@ ReorderBufferCopySnap(ReorderBuffer *rb, HistoricMVCCSnapshot orig_snap,
 	return snap;
 }
 
-/*
- * Free a previously ReorderBufferCopySnap'ed snapshot
- */
-static void
-ReorderBufferFreeSnap(ReorderBuffer *rb, HistoricMVCCSnapshot snap)
-{
-	if (snap->copied)
-		pfree(snap);
-	else
-		SnapBuildSnapDecRefcount(snap);
-}
-
 /*
  * If the transaction was (partially) streamed, we need to prepare or commit
  * it in a 'streamed' way.  That is, we first stream the remaining part of the
@@ -2104,11 +2091,8 @@ ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	txn->command_id = command_id;
 
 	/* Avoid copying if it's already copied. */
-	if (snapshot_now->copied)
-		txn->snapshot_now = snapshot_now;
-	else
-		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
-												  txn, command_id);
+	txn->snapshot_now = snapshot_now;
+	SnapBuildSnapIncRefcount(txn->snapshot_now);
 }
 
 /*
@@ -2208,6 +2192,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	/* setup the initial snapshot */
 	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	/* increase refcount for the installed historic snapshot */
+	SnapBuildSnapIncRefcount(snapshot_now);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -2511,33 +2497,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 					/* get rid of the old */
 					TeardownHistoricSnapshot(false);
-
-					if (snapshot_now->copied)
-					{
-						ReorderBufferFreeSnap(rb, snapshot_now);
-						snapshot_now =
-							ReorderBufferCopySnap(rb, change->data.snapshot,
-												  txn, command_id);
-					}
-
-					/*
-					 * Restored from disk, need to be careful not to double
-					 * free. We could introduce refcounting for that, but for
-					 * now this seems infrequent enough not to care.
-					 */
-					else if (change->data.snapshot->copied)
-					{
-						snapshot_now =
-							ReorderBufferCopySnap(rb, change->data.snapshot,
-												  txn, command_id);
-					}
-					else
-					{
-						snapshot_now = change->data.snapshot;
-					}
+					SnapBuildSnapDecRefcount(snapshot_now);
 
 					/* and continue with the new one */
+					snapshot_now = change->data.snapshot;
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SnapBuildSnapIncRefcount(snapshot_now);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -2547,16 +2512,26 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					{
 						command_id = change->data.command_id;
 
-						if (!snapshot_now->copied)
+						TeardownHistoricSnapshot(false);
+
+						/*
+						 * Construct a new snapshot with the new command ID.
+						 *
+						 * If this is the only reference to the snapshot, and
+						 * it's a "copied" snapshot that already contains all
+						 * the replayed transaction's XIDs (curxcnt > 0), we
+						 * can take a shortcut and update the snapshot's
+						 * command ID in place.
+						 */
+						if (snapshot_now->refcount == 1 && snapshot_now->curxcnt > 0)
+							snapshot_now->curcid = command_id;
+						else
 						{
-							/* we don't use the global one anymore */
+							SnapBuildSnapDecRefcount(snapshot_now);
 							snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
 																 txn, command_id);
 						}
 
-						snapshot_now->curcid = command_id;
-
-						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					}
 
@@ -2646,11 +2621,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
-		else if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
 
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
+		SnapBuildSnapDecRefcount(snapshot_now);
+		snapshot_now = NULL;
 
 		/*
 		 * Aborting the current (sub-)transaction as a whole has the right
@@ -2703,6 +2678,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		TeardownHistoricSnapshot(true);
 
+		/*
+		 * don't decrement the refcount on snapshot_now yet, we still use it
+		 * in the ReorderBufferResetTXN() call below.
+		 */
+
 		/*
 		 * Force cache invalidation to happen outside of a valid transaction
 		 * to prevent catalog access as we just caught an error.
@@ -2751,9 +2731,15 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
+
+			SnapBuildSnapDecRefcount(snapshot_now);
+			snapshot_now = NULL;
 		}
 		else
 		{
+			SnapBuildSnapDecRefcount(snapshot_now);
+			snapshot_now = NULL;
+
 			ReorderBufferCleanupTXN(rb, txn);
 			MemoryContextSwitchTo(ecxt);
 			PG_RE_THROW();
@@ -4256,8 +4242,7 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 											 txn, command_id);
 
 		/* Free the previously copied snapshot. */
-		Assert(txn->snapshot_now->copied);
-		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		SnapBuildSnapDecRefcount(txn->snapshot_now);
 		txn->snapshot_now = NULL;
 	}
 
@@ -4647,7 +4632,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				newsnap->committed_xids = (TransactionId *)
 					(((char *) newsnap) + sizeof(HistoricMVCCSnapshotData));
 				newsnap->curxip = newsnap->committed_xids + newsnap->xcnt;
-				newsnap->copied = true;
+				newsnap->refcount = 1;
 				break;
 			}
 			/* the base struct contains all the data, easy peasy */
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 7a341418a74..50dca7cb758 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -157,10 +157,6 @@ static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 /* snapshot building/manipulation/distribution functions */
 static HistoricMVCCSnapshot SnapBuildBuildSnapshot(SnapBuild *builder);
 
-static void SnapBuildFreeSnapshot(HistoricMVCCSnapshot snap);
-
-static void SnapBuildSnapIncRefcount(HistoricMVCCSnapshot snap);
-
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
@@ -245,29 +241,6 @@ FreeSnapshotBuilder(SnapBuild *builder)
 	MemoryContextDelete(context);
 }
 
-/*
- * Free an unreferenced snapshot that has previously been built by us.
- */
-static void
-SnapBuildFreeSnapshot(HistoricMVCCSnapshot snap)
-{
-	/* make sure we don't get passed an external snapshot */
-	Assert(snap->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
-
-	/* make sure nobody modified our snapshot */
-	Assert(snap->curcid == FirstCommandId);
-	Assert(snap->regd_count == 0);
-
-	/* slightly more likely, so it's checked even without c-asserts */
-	if (snap->copied)
-		elog(ERROR, "cannot free a copied snapshot");
-
-	if (snap->refcount)
-		elog(ERROR, "cannot free a snapshot that's in use");
-
-	pfree(snap);
-}
-
 /*
  * In which state of snapshot building are we?
  */
@@ -310,7 +283,7 @@ SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
  * This is used when handing out a snapshot to some external resource or when
  * adding a Snapshot as builder->snapshot.
  */
-static void
+void
 SnapBuildSnapIncRefcount(HistoricMVCCSnapshot snap)
 {
 	snap->refcount++;
@@ -318,9 +291,6 @@ SnapBuildSnapIncRefcount(HistoricMVCCSnapshot snap)
 
 /*
  * Decrease refcount of a snapshot and free if the refcount reaches zero.
- *
- * Externally visible, so that external resources that have been handed an
- * IncRef'ed Snapshot can adjust its refcount easily.
  */
 void
 SnapBuildSnapDecRefcount(HistoricMVCCSnapshot snap)
@@ -328,19 +298,12 @@ SnapBuildSnapDecRefcount(HistoricMVCCSnapshot snap)
 	/* make sure we don't get passed an external snapshot */
 	Assert(snap->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
 
-	/* make sure nobody modified our snapshot */
-	Assert(snap->curcid == FirstCommandId);
-
 	Assert(snap->refcount > 0);
 	Assert(snap->regd_count == 0);
 
-	/* slightly more likely, so it's checked even without casserts */
-	if (snap->copied)
-		elog(ERROR, "cannot free a copied snapshot");
-
 	snap->refcount--;
 	if (snap->refcount == 0)
-		SnapBuildFreeSnapshot(snap);
+		pfree(snap);
 }
 
 /*
@@ -413,7 +376,6 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->curxcnt = 0;
 	snapshot->curxip = NULL;
 
-	snapshot->copied = false;
 	snapshot->curcid = FirstCommandId;
 	snapshot->refcount = 0;
 	snapshot->regd_count = 0;
@@ -1037,18 +999,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			SnapBuildSnapDecRefcount(builder->snapshot);
 
 		builder->snapshot = SnapBuildBuildSnapshot(builder);
+		SnapBuildSnapIncRefcount(builder->snapshot);
 
 		/* we might need to execute invalidations, add snapshot */
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
 		{
-			SnapBuildSnapIncRefcount(builder->snapshot);
 			ReorderBufferSetBaseSnapshot(builder->reorder, xid, lsn,
 										 builder->snapshot);
+			SnapBuildSnapIncRefcount(builder->snapshot);
 		}
 
-		/* refcount of the snapshot builder for the new snapshot */
-		SnapBuildSnapIncRefcount(builder->snapshot);
-
 		/* add a new catalog snapshot to all currently running transactions */
 		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
 	}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 5930ffb55a8..6095013a299 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -70,6 +70,7 @@ extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
 										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *builder);
 
+extern void SnapBuildSnapIncRefcount(HistoricMVCCSnapshot snap);
 extern void SnapBuildSnapDecRefcount(HistoricMVCCSnapshot snap);
 
 extern MVCCSnapshot SnapBuildInitialSnapshot(SnapBuild *builder);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 93c1f51784f..bca0ad16e68 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -218,8 +218,6 @@ typedef struct HistoricMVCCSnapshotData
 
 	CommandId	curcid;			/* in my xact, CID < curcid are visible */
 
-	bool		copied;			/* false if it's a "base" snapshot */
-
 	uint32		refcount;		/* refcount managed by snapbuild.c  */
 	uint32		regd_count;		/* refcount registered with resource owners */
 
-- 
2.39.5
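
To make the refcounting rule in the patch above a bit more concrete, here is a minimal standalone sketch, assuming malloc/free in place of palloc/pfree. The Sketch* names, sketch_frees and sketch_snap_with_cid() are all hypothetical. The last function imitates the command-ID shortcut in ReorderBufferProcessTXN(): update in place when we hold the only reference, otherwise make a private copy. The real code additionally checks curxcnt > 0, which is left out here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, cut-down stand-in for HistoricMVCCSnapshotData. */
typedef struct SketchHistoricSnapshot
{
	uint32_t	curcid;
	uint32_t	refcount;
} SketchHistoricSnapshot;

/* Counts how many snapshots were freed; only here for demonstration. */
int			sketch_frees = 0;

/* Allocate a snapshot that starts out with one reference. */
SketchHistoricSnapshot *
sketch_snap_new(uint32_t curcid)
{
	SketchHistoricSnapshot *snap = malloc(sizeof(SketchHistoricSnapshot));

	if (snap == NULL)
		abort();
	snap->curcid = curcid;
	snap->refcount = 1;
	return snap;
}

void
sketch_snap_incref(SketchHistoricSnapshot *snap)
{
	snap->refcount++;
}

/* Drop one reference and free the snapshot when none are left. */
void
sketch_snap_decref(SketchHistoricSnapshot *snap)
{
	assert(snap->refcount > 0);
	snap->refcount--;
	if (snap->refcount == 0)
	{
		free(snap);
		sketch_frees++;
	}
}

/*
 * Return a snapshot carrying the new command ID.  The caller's reference to
 * 'snap' is consumed; the returned pointer carries one reference.
 */
SketchHistoricSnapshot *
sketch_snap_with_cid(SketchHistoricSnapshot *snap, uint32_t cid)
{
	SketchHistoricSnapshot *copy;

	if (snap->refcount == 1)
	{
		/* sole owner: nobody else can observe the change */
		snap->curcid = cid;
		return snap;
	}

	/* shared: copy first, then give up our reference to the original */
	copy = sketch_snap_new(cid);
	sketch_snap_decref(snap);
	return copy;
}
```

In this sketch the copy is made before the reference is dropped, so the source is guaranteed to be live while it is read. That ordering is a choice made for the sketch, not a statement about the patch.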

Attachment: v6-0003-Add-an-explicit-valid-flag-to-MVCCSnapshotData.patch (text/x-patch; charset=UTF-8)

From 1705639a73555d9b3f5884c7fd90540c268d3db5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 31 Mar 2025 23:47:48 +0300
Subject: [PATCH v6 03/12] Add an explicit 'valid' flag to MVCCSnapshotData

The lifetime of the "static" snapshots returned by
GetTransactionSnapshot(), GetLatestSnapshot() and GetCatalogSnapshot()
is a bit vague. By adding an explicit 'valid' flag, we can make it
more clear when a function call updates a static snapshot, making it
valid, and when another function makes it invalid again. It's
currently only used in assertions, and can also be handy when
debugging.
---
 src/backend/storage/ipc/procarray.c |  2 ++
 src/backend/utils/time/snapmgr.c    | 15 +++++++++++++++
 src/include/utils/snapshot.h        |  1 +
 3 files changed, 18 insertions(+)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 535755614a9..ba5ed8960dd 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2135,6 +2135,7 @@ GetSnapshotDataReuse(MVCCSnapshot snapshot)
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
+	snapshot->valid = true;
 
 	return true;
 }
@@ -2514,6 +2515,7 @@ GetSnapshotData(MVCCSnapshot snapshot)
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
+	snapshot->valid = true;
 
 	return snapshot;
 }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 78adb6d575a..69ed86b2101 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -447,6 +447,7 @@ InvalidateCatalogSnapshot(void)
 	{
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
+		CatalogSnapshotData.valid = false;
 		SnapshotResetXmin();
 	}
 }
@@ -611,6 +612,7 @@ CopyMVCCSnapshot(MVCCSnapshot snapshot)
 	newsnap->regd_count = 0;
 	newsnap->active_count = 0;
 	newsnap->copied = true;
+	newsnap->valid = true;
 	newsnap->snapXactCompletionCount = 0;
 
 	/* setup XID array */
@@ -652,6 +654,7 @@ FreeMVCCSnapshot(MVCCSnapshot snapshot)
 	Assert(snapshot->regd_count == 0);
 	Assert(snapshot->active_count == 0);
 	Assert(snapshot->copied);
+	Assert(snapshot->valid);
 
 	pfree(snapshot);
 }
@@ -688,6 +691,7 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 
 	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
 	origsnap = &snapshot->mvcc;
+	Assert(origsnap->valid);
 
 	Assert(ActiveSnapshot == NULL || snap_level >= ActiveSnapshot->as_level);
 
@@ -847,6 +851,7 @@ RegisterSnapshotOnOwner(Snapshot orig_snapshot, ResourceOwner owner)
 
 	Assert(orig_snapshot->snapshot_type == SNAPSHOT_MVCC);
 	snapshot = &orig_snapshot->mvcc;
+	Assert(snapshot->valid);
 
 	/* Static snapshot?  Create a persistent copy */
 	snapshot = snapshot->copied ? snapshot : CopyMVCCSnapshot(snapshot);
@@ -968,6 +973,15 @@ SnapshotResetXmin(void)
 {
 	MVCCSnapshot minSnapshot;
 
+	/*
+	 * These static snapshots are not in the RegisteredSnapshots list, so we
+	 * might advance MyProc->xmin past their xmin. (Note that in case of
+	 * IsolationUsesXactSnapshot() == true, CurrentSnapshot points to the copy
+	 * in FirstXactSnapshot rather than CurrentSnapshotData.)
+	 */
+	CurrentSnapshotData.valid = false;
+	SecondarySnapshotData.valid = false;
+
 	if (ActiveSnapshot != NULL)
 		return;
 
@@ -1871,6 +1885,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->regd_count = 0;
 	snapshot->active_count = 0;
 	snapshot->copied = true;
+	snapshot->valid = true;
 
 	return snapshot;
 }
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index bca0ad16e68..1697c6df856 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -161,6 +161,7 @@ typedef struct MVCCSnapshotData
 
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 	bool		copied;			/* false if it's a static snapshot */
+	bool		valid;			/* is this snapshot valid? */
 
 	CommandId	curcid;			/* in my xact, CID < curcid are visible */
 
-- 
2.39.5
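
As an illustration of what the 'valid' flag in the patch above buys, here is a small sketch of a static snapshot whose lifetime is tracked by such a flag. It is an assumption-laden toy: the Sketch* names are invented, xmin is the only payload, and sketch_snapshot_usable() returns a bool where the patch uses Assert(snapshot->valid).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for MVCCSnapshotData. */
typedef struct SketchSnapshot
{
	uint32_t	xmin;
	bool		copied;			/* false if it's a static snapshot */
	bool		valid;			/* is this snapshot valid? */
} SketchSnapshot;

/* Static snapshot, zero-initialized and therefore invalid at start. */
static SketchSnapshot sketch_current_snapshot_data;

/* Fill the static snapshot and mark it valid, like GetSnapshotData(). */
SketchSnapshot *
sketch_get_snapshot(uint32_t xmin)
{
	sketch_current_snapshot_data.xmin = xmin;
	sketch_current_snapshot_data.copied = false;
	sketch_current_snapshot_data.valid = true;
	return &sketch_current_snapshot_data;
}

/*
 * Imitates the part of SnapshotResetXmin() that invalidates the static
 * snapshot, since xmin may now advance past what it protects.
 */
void
sketch_reset_xmin(void)
{
	sketch_current_snapshot_data.valid = false;
}

/* The check that callers such as a push or register function would make. */
bool
sketch_snapshot_usable(const SketchSnapshot *snapshot)
{
	return snapshot->valid;
}
```

The point is that a stale pointer to the static struct is still a non-NULL pointer, so only an explicit flag lets an assertion catch its use after invalidation.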

Attachment: v6-0004-Replace-static-snapshot-pointers-with-the-valid-f.patch (text/x-patch; charset=UTF-8)

From 8cc814dc2e9fef8feda7cca9a0f2591c371b8ece Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 31 Mar 2025 21:44:43 +0300
Subject: [PATCH v6 04/12] Replace static snapshot pointers with the 'valid'
 flags

Previously, we used pointers like SecondarySnapshot and
CatalogSnapshot to indicate whether the corresponding static snapshot
is valid or not, but now that we have an explicit flag in
MVCCSnapshotData for that, we replace checks like "SecondarySnapshot
!= NULL" with "SecondarySnapshotData.valid", and get rid of the
separate pointer variables.

The situation with CurrentSnapshot was a bit more
complicated. Usually, it pointed to CurrentSnapshotData, but could
also point to the palloc'd FirstXactSnapshot. This gets rid of the
palloc'd FirstXactSnapshot, and instead we just refrain from modifying
CurrentSnapshotData when in a serializable transaction.
---
 src/backend/utils/time/snapmgr.c | 147 +++++++++++++++----------------
 1 file changed, 70 insertions(+), 77 deletions(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 69ed86b2101..ea1e7d17b04 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -67,8 +67,8 @@
  * In addition to snapshots pushed to the active snapshot stack, a snapshot
  * can be registered with a resource owner.
  *
- * The FirstXactSnapshot, if any, is treated a bit specially: we increment its
- * regd_count and list it in RegisteredSnapshots, but this reference is not
+ * If FirstXactSnapshotRegistered is set, we increment the static
+ * CurrentSnapshotData's regd_count and list it in RegisteredSnapshots, but this reference is not
  * tracked by a resource owner. We used to use the TopTransactionResourceOwner
  * to track this snapshot reference, but that introduces logical circularity
  * and thus makes it impossible to clean up in a sane fashion.  It's better to
@@ -145,9 +145,6 @@ SnapshotData SnapshotAnyData = {SNAPSHOT_ANY};
 SnapshotData SnapshotToastData = {SNAPSHOT_TOAST};
 
 /* Pointers to valid snapshots */
-static MVCCSnapshot CurrentSnapshot = NULL;
-static MVCCSnapshot SecondarySnapshot = NULL;
-static MVCCSnapshot CatalogSnapshot = NULL;
 static HistoricMVCCSnapshot HistoricSnapshot = NULL;
 
 /*
@@ -196,7 +193,7 @@ bool		FirstSnapshotSet = false;
  * FirstSnapshotSet in combination with IsolationUsesXactSnapshot(), because
  * GUC may be reset before us, changing the value of IsolationUsesXactSnapshot.
  */
-static MVCCSnapshot FirstXactSnapshot = NULL;
+static bool FirstXactSnapshotRegistered = false;
 
 /* Define pathname of exported-snapshot files */
 #define SNAPSHOT_EXPORT_DIR "pg_snapshots"
@@ -288,7 +285,7 @@ GetTransactionSnapshot(void)
 		InvalidateCatalogSnapshot();
 
 		Assert(pairingheap_is_empty(&RegisteredSnapshots));
-		Assert(FirstXactSnapshot == NULL);
+		Assert(!FirstXactSnapshotRegistered);
 
 		if (IsInParallelMode())
 			elog(ERROR,
@@ -296,42 +293,44 @@ GetTransactionSnapshot(void)
 
 		/*
 		 * In transaction-snapshot mode, the first snapshot must live until
-		 * end of xact regardless of what the caller does with it, so we must
-		 * make a copy of it rather than returning CurrentSnapshotData
-		 * directly.  Furthermore, if we're running in serializable mode,
-		 * predicate.c needs to wrap the snapshot fetch in its own processing.
+		 * end of xact regardless of what the caller does with it, so we keep
+		 * it in RegisteredSnapshots even though it's not tracked by any
+		 * resource owner.  Furthermore, if we're running in serializable
+		 * mode, predicate.c needs to wrap the snapshot fetch in its own
+		 * processing.
 		 */
 		if (IsolationUsesXactSnapshot())
 		{
 			/* First, create the snapshot in CurrentSnapshotData */
 			if (IsolationIsSerializable())
-				CurrentSnapshot = GetSerializableTransactionSnapshot(&CurrentSnapshotData);
+				GetSerializableTransactionSnapshot(&CurrentSnapshotData);
 			else
-				CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
-
-			/* Make a saved copy */
-			CurrentSnapshot = CopyMVCCSnapshot(CurrentSnapshot);
-			FirstXactSnapshot = CurrentSnapshot;
-			/* Mark it as "registered" in FirstXactSnapshot */
-			FirstXactSnapshot->regd_count++;
-			pairingheap_add(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
+				GetSnapshotData(&CurrentSnapshotData);
+
+			/* Mark it as "registered" */
+			CurrentSnapshotData.regd_count++;
+			FirstXactSnapshotRegistered = true;
+			pairingheap_add(&RegisteredSnapshots, &CurrentSnapshotData.ph_node);
 		}
 		else
-			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
+			GetSnapshotData(&CurrentSnapshotData);
 
 		FirstSnapshotSet = true;
-		return (Snapshot) CurrentSnapshot;
+		return (Snapshot) &CurrentSnapshotData;
 	}
 
 	if (IsolationUsesXactSnapshot())
-		return (Snapshot) CurrentSnapshot;
+	{
+		Assert(CurrentSnapshotData.valid);
+		return (Snapshot) &CurrentSnapshotData;
+	}
 
 	/* Don't allow catalog snapshot to be older than xact snapshot. */
 	InvalidateCatalogSnapshot();
 
-	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
+	GetSnapshotData(&CurrentSnapshotData);
 
-	return (Snapshot) CurrentSnapshot;
+	return (Snapshot) &CurrentSnapshotData;
 }
 
 /*
@@ -360,9 +359,9 @@ GetLatestSnapshot(void)
 	if (!FirstSnapshotSet)
 		return GetTransactionSnapshot();
 
-	SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData);
+	GetSnapshotData(&SecondarySnapshotData);
 
-	return (Snapshot) SecondarySnapshot;
+	return (Snapshot) &SecondarySnapshotData;
 }
 
 /*
@@ -402,15 +401,15 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 	 * scan a relation for which neither catcache nor snapshot invalidations
 	 * are sent, we must refresh the snapshot every time.
 	 */
-	if (CatalogSnapshot &&
+	if (CatalogSnapshotData.valid &&
 		!RelationInvalidatesSnapshotsOnly(relid) &&
 		!RelationHasSysCache(relid))
 		InvalidateCatalogSnapshot();
 
-	if (CatalogSnapshot == NULL)
+	if (!CatalogSnapshotData.valid)
 	{
 		/* Get new snapshot. */
-		CatalogSnapshot = GetSnapshotData(&CatalogSnapshotData);
+		GetSnapshotData(&CatalogSnapshotData);
 
 		/*
 		 * Make sure the catalog snapshot will be accounted for in decisions
@@ -424,10 +423,10 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 		 * NB: it had better be impossible for this to throw error, since the
 		 * CatalogSnapshot pointer is already valid.
 		 */
-		pairingheap_add(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
+		pairingheap_add(&RegisteredSnapshots, &CatalogSnapshotData.ph_node);
 	}
 
-	return (Snapshot) CatalogSnapshot;
+	return (Snapshot) &CatalogSnapshotData;
 }
 
 /*
@@ -443,10 +442,9 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 void
 InvalidateCatalogSnapshot(void)
 {
-	if (CatalogSnapshot)
+	if (CatalogSnapshotData.valid)
 	{
-		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
-		CatalogSnapshot = NULL;
+		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshotData.ph_node);
 		CatalogSnapshotData.valid = false;
 		SnapshotResetXmin();
 	}
@@ -465,7 +463,7 @@ InvalidateCatalogSnapshot(void)
 void
 InvalidateCatalogSnapshotConditionally(void)
 {
-	if (CatalogSnapshot &&
+	if (CatalogSnapshotData.valid &&
 		ActiveSnapshot == NULL &&
 		pairingheap_is_singular(&RegisteredSnapshots))
 		InvalidateCatalogSnapshot();
@@ -481,10 +479,10 @@ SnapshotSetCommandId(CommandId curcid)
 	if (!FirstSnapshotSet)
 		return;
 
-	if (CurrentSnapshot)
-		CurrentSnapshot->curcid = curcid;
-	if (SecondarySnapshot)
-		SecondarySnapshot->curcid = curcid;
+	if (CurrentSnapshotData.valid)
+		CurrentSnapshotData.curcid = curcid;
+	if (SecondarySnapshotData.valid)
+		SecondarySnapshotData.curcid = curcid;
 	/* Should we do the same with CatalogSnapshot? */
 }
 
@@ -507,7 +505,7 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 	InvalidateCatalogSnapshot();
 
 	Assert(pairingheap_is_empty(&RegisteredSnapshots));
-	Assert(FirstXactSnapshot == NULL);
+	Assert(!FirstXactSnapshotRegistered);
 	Assert(!HistoricSnapshotActive());
 
 	/*
@@ -516,28 +514,28 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 	 * CurrentSnapshotData's XID arrays have been allocated, and (2) to update
 	 * the state for GlobalVis*.
 	 */
-	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
+	GetSnapshotData(&CurrentSnapshotData);
 
 	/*
 	 * Now copy appropriate fields from the source snapshot.
 	 */
-	CurrentSnapshot->xmin = sourcesnap->xmin;
-	CurrentSnapshot->xmax = sourcesnap->xmax;
-	CurrentSnapshot->xcnt = sourcesnap->xcnt;
+	CurrentSnapshotData.xmin = sourcesnap->xmin;
+	CurrentSnapshotData.xmax = sourcesnap->xmax;
+	CurrentSnapshotData.xcnt = sourcesnap->xcnt;
 	Assert(sourcesnap->xcnt <= GetMaxSnapshotXidCount());
 	if (sourcesnap->xcnt > 0)
-		memcpy(CurrentSnapshot->xip, sourcesnap->xip,
+		memcpy(CurrentSnapshotData.xip, sourcesnap->xip,
 			   sourcesnap->xcnt * sizeof(TransactionId));
-	CurrentSnapshot->subxcnt = sourcesnap->subxcnt;
+	CurrentSnapshotData.subxcnt = sourcesnap->subxcnt;
 	Assert(sourcesnap->subxcnt <= GetMaxSnapshotSubxidCount());
 	if (sourcesnap->subxcnt > 0)
-		memcpy(CurrentSnapshot->subxip, sourcesnap->subxip,
+		memcpy(CurrentSnapshotData.subxip, sourcesnap->subxip,
 			   sourcesnap->subxcnt * sizeof(TransactionId));
-	CurrentSnapshot->suboverflowed = sourcesnap->suboverflowed;
-	CurrentSnapshot->takenDuringRecovery = sourcesnap->takenDuringRecovery;
+	CurrentSnapshotData.suboverflowed = sourcesnap->suboverflowed;
+	CurrentSnapshotData.takenDuringRecovery = sourcesnap->takenDuringRecovery;
 	/* NB: curcid should NOT be copied, it's a local matter */
 
-	CurrentSnapshot->snapXactCompletionCount = 0;
+	CurrentSnapshotData.snapXactCompletionCount = 0;
 
 	/*
 	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
@@ -552,13 +550,13 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 	 */
 	if (sourceproc != NULL)
 	{
-		if (!ProcArrayInstallRestoredXmin(CurrentSnapshot->xmin, sourceproc))
+		if (!ProcArrayInstallRestoredXmin(CurrentSnapshotData.xmin, sourceproc))
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("could not import the requested snapshot"),
 					 errdetail("The source transaction is not running anymore.")));
 	}
-	else if (!ProcArrayInstallImportedXmin(CurrentSnapshot->xmin, sourcevxid))
+	else if (!ProcArrayInstallImportedXmin(CurrentSnapshotData.xmin, sourcevxid))
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("could not import the requested snapshot"),
@@ -567,20 +565,19 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 
 	/*
 	 * In transaction-snapshot mode, the first snapshot must live until end of
-	 * xact, so we must make a copy of it.  Furthermore, if we're running in
-	 * serializable mode, predicate.c needs to do its own processing.
+	 * xact, so we include it in RegisteredSnapshots.  Furthermore, if we're
+	 * running in serializable mode, predicate.c needs to do its own
+	 * processing.
 	 */
 	if (IsolationUsesXactSnapshot())
 	{
 		if (IsolationIsSerializable())
-			SetSerializableTransactionSnapshot(CurrentSnapshot, sourcevxid,
+			SetSerializableTransactionSnapshot(&CurrentSnapshotData, sourcevxid,
 											   sourcepid);
-		/* Make a saved copy */
-		CurrentSnapshot = CopyMVCCSnapshot(CurrentSnapshot);
-		FirstXactSnapshot = CurrentSnapshot;
-		/* Mark it as "registered" in FirstXactSnapshot */
-		FirstXactSnapshot->regd_count++;
-		pairingheap_add(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
+		/* Mark it as "registered" */
+		FirstXactSnapshotRegistered = true;
+		CurrentSnapshotData.regd_count++;
+		pairingheap_add(&RegisteredSnapshots, &CurrentSnapshotData.ph_node);
 	}
 
 	FirstSnapshotSet = true;
@@ -701,8 +698,7 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 	 * Checking SecondarySnapshot is probably useless here, but it seems
 	 * better to be sure.
 	 */
-	if (origsnap == CurrentSnapshot || origsnap == SecondarySnapshot ||
-		!origsnap->copied)
+	if (!origsnap->copied)
 		newactive->as_snap = CopyMVCCSnapshot(origsnap);
 	else
 		newactive->as_snap = origsnap;
@@ -974,12 +970,10 @@ SnapshotResetXmin(void)
 	MVCCSnapshot minSnapshot;
 
 	/*
-	 * These static snapshots are not in the RegisteredSnapshots list, so we
-	 * might advance MyProc->xmin past their xmin. (Note that in case of
-	 * IsolationUsesXactSnapshot() == true, CurrentSnapshot points to the copy
-	 * in FirstSnapshot rather than CurrentSnapshotData.)
+	 * Invalidate these static snapshots so that we can advance xmin.
 	 */
-	CurrentSnapshotData.valid = false;
+	if (!FirstXactSnapshotRegistered)
+		CurrentSnapshotData.valid = false;
 	SecondarySnapshotData.valid = false;
 
 	if (ActiveSnapshot != NULL)
@@ -1068,13 +1062,13 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	 * stacked as active, we don't want the code below to be chasing through a
 	 * dangling pointer.
 	 */
-	if (FirstXactSnapshot != NULL)
+	if (FirstXactSnapshotRegistered)
 	{
-		Assert(FirstXactSnapshot->regd_count > 0);
+		Assert(CurrentSnapshotData.regd_count > 0);
 		Assert(!pairingheap_is_empty(&RegisteredSnapshots));
-		pairingheap_remove(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
+		pairingheap_remove(&RegisteredSnapshots, &CurrentSnapshotData.ph_node);
+		FirstXactSnapshotRegistered = false;
 	}
-	FirstXactSnapshot = NULL;
 
 	/*
 	 * If we exported any snapshots, clean them up.
@@ -1132,9 +1126,8 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	ActiveSnapshot = NULL;
 	pairingheap_reset(&RegisteredSnapshots);
 
-	CurrentSnapshot = NULL;
-	SecondarySnapshot = NULL;
-
+	CurrentSnapshotData.valid = false;
+	SecondarySnapshotData.valid = false;
 	FirstSnapshotSet = false;
 
 	/*
@@ -1695,7 +1688,7 @@ HaveRegisteredOrActiveSnapshot(void)
 	 * removed at any time due to invalidation processing. If explicitly
 	 * registered more than one snapshot has to be in RegisteredSnapshots.
 	 */
-	if (CatalogSnapshot != NULL &&
+	if (CatalogSnapshotData.valid &&
 		pairingheap_is_singular(&RegisteredSnapshots))
 		return false;
 
-- 
2.39.5

Attachment: v6-0005-Make-RestoreSnapshot-register-the-snapshot-with-c.patch (text/x-patch; charset=UTF-8)

From 34b92db816f87fb06d8eff3c07e60c81b322e44d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 31 Mar 2025 19:54:39 +0300
Subject: [PATCH v6 05/12] Make RestoreSnapshot register the snapshot with
 current resowner

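RestoreSnapshot() now registers the returned snapshot with
CurrentResourceOwner, so callers no longer call RegisterSnapshot() on
the result, and ParallelWorkerMain() unregisters the restored
snapshots once they are installed. This simplifies the next commit.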
---
 src/backend/access/index/indexam.c    | 1 -
 src/backend/access/table/tableam.c    | 1 -
 src/backend/access/transam/parallel.c | 4 ++++
 src/backend/utils/time/snapmgr.c      | 8 +++++++-
 4 files changed, 11 insertions(+), 3 deletions(-)
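
A rough sketch of the reference counting that this change relies on,
assuming a toy model rather than the real snapmgr.c code: the restored
snapshot starts out with one registration reference, and it is only
released once both the registration and the active-stack reference are
gone. ToySnap and the toy_* functions are hypothetical names; 'released'
marks the point where the real code would free the snapshot.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a palloc'd, copied MVCC snapshot. */
typedef struct ToySnap
{
	int			regd_count;		/* references held by resource owners */
	int			active_count;	/* references from the active stack */
	bool		released;		/* set where the real code would free it */
} ToySnap;

/* Release the snapshot once nothing references it any more. */
void
toy_maybe_release(ToySnap *s)
{
	if (s->regd_count == 0 && s->active_count == 0)
		s->released = true;
}

/* After the patch: restoring also takes the registration reference. */
void
toy_restore(ToySnap *s)
{
	s->regd_count = 1;
	s->active_count = 0;
	s->released = false;
}

void
toy_push_active(ToySnap *s)
{
	s->active_count++;
}

void
toy_pop_active(ToySnap *s)
{
	assert(s->active_count > 0);
	s->active_count--;
	toy_maybe_release(s);
}

void
toy_unregister(ToySnap *s)
{
	assert(s->regd_count > 0);
	s->regd_count--;
	toy_maybe_release(s);
}
```

In this model, the sequence used in ParallelWorkerMain() (restore, push
as active, then unregister) keeps the snapshot alive through the active
stack alone, and it goes away when it is popped.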

diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 769170a37d5..8f0ae02221c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -592,7 +592,6 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
 	Assert(RelFileLocatorEquals(indexrel->rd_locator, pscan->ps_indexlocator));
 
 	snapshot = (Snapshot) RestoreSnapshot(pscan->ps_snapshot_data);
-	snapshot = RegisterSnapshot(snapshot);
 	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
 									pscan, true);
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4eb81e40d99..fc823cf84e5 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -175,7 +175,6 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = (Snapshot) RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
-		snapshot = RegisterSnapshot(snapshot);
 		flags |= SO_TEMP_SNAPSHOT;
 	}
 	else
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8046e14abf7..e13ea57efff 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -1499,6 +1499,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	UnregisterSnapshot(asnapshot);
+	if (tsnapshot != asnapshot)
+		UnregisterSnapshot(tsnapshot);
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea1e7d17b04..ef579128d3f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1823,7 +1823,7 @@ SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
  *		Restore a serialized snapshot from the specified address.
  *
  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
- * to 0.  The returned snapshot has the copied flag set.
+ * to 0.  The returned snapshot is registered with the current resource owner.
  */
 MVCCSnapshot
 RestoreSnapshot(char *start_address)
@@ -1880,6 +1880,12 @@ RestoreSnapshot(char *start_address)
 	snapshot->copied = true;
 	snapshot->valid = true;
 
+	/* and tell resowner.c about it, just like RegisterSnapshot() */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	snapshot->regd_count++;
+	ResourceOwnerRememberSnapshot(CurrentResourceOwner, (Snapshot) snapshot);
+	pairingheap_add(&RegisteredSnapshots, &snapshot->ph_node);
+
 	return snapshot;
 }
 
-- 
2.39.5

Attachment: v6-0006-Replace-the-RegisteredSnapshot-pairing-heap-with-.patch (text/x-patch; charset=UTF-8)

From db70117e68b6f745c5ab9289e263aede7a068ac7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 31 Mar 2025 23:50:55 +0300
Subject: [PATCH v6 06/12] Replace the RegisteredSnapshot pairing heap with a
 linked list

Previously, we kept all the registered snapshots in a pairing heap, so
that we could cheaply find the snapshot with the smallest xmin.
However, we can easily use a doubly-linked list instead, which is a
little simpler. A newly acquired snapshot's xmin is always greater
than or equal to that of any previous snapshot, so we can simply push
new snapshots to the tail of the list, and the oldest xmin is always
at the head.

Previously, we would only push a snapshot to the heap when it was
registered or pushed to the active stack, not immediately when
GetSnapshotData() was called. Because of that, snapshots were
sometimes added to the heap out of order. But if we update the list
earlier, after each GetSnapshotData() call, it stays in order. That
means that the list now contains *all* valid snapshots, including the
snapshots that are in the active stack, and the static CurrentSnapshot
and SecondarySnapshot, whenever they are valid. (CatalogSnapshot was
already tracked by the heap.)
---
 src/backend/utils/time/snapmgr.c    | 279 +++++++++++++++++-----------
 src/include/access/spgist_private.h |   1 +
 src/include/utils/snapshot.h        |   6 +-
 3 files changed, 175 insertions(+), 111 deletions(-)
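
The ordering argument in the commit message above can be sketched with
a small stand-alone list. This is a toy model under stated assumptions,
not the patch's code: ToyNode, ToySnapshot and the toy_* functions are
hypothetical, XIDs are plain unsigned integers without wraparound, and
0 stands in for "no snapshot". It shows that pushing new snapshots to
the tail and linking copies right after their originals keeps the
oldest xmin at the head without any comparison function.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyNode
{
	struct ToyNode *prev;
	struct ToyNode *next;
} ToyNode;

typedef struct ToySnapshot
{
	unsigned int xmin;
	ToyNode		node;
} ToySnapshot;

/* Circular list with a sentinel head; empty when it points to itself. */
ToyNode		toy_valid = {&toy_valid, &toy_valid};

#define TOY_SNAPSHOT(n) \
	((ToySnapshot *) ((char *) (n) - offsetof(ToySnapshot, node)))

/* Link 'n' right after 'pos'. */
void
toy_link_after(ToyNode *pos, ToyNode *n)
{
	n->prev = pos;
	n->next = pos->next;
	pos->next->prev = n;
	pos->next = n;
}

/* A newly taken snapshot has the newest xmin, so it goes to the tail. */
void
toy_push_tail(ToySnapshot *snap)
{
	/* Toy check of the ordering rule; plain integers, no XID wraparound. */
	assert(toy_valid.prev == &toy_valid ||
		   TOY_SNAPSHOT(toy_valid.prev)->xmin <= snap->xmin);
	toy_link_after(toy_valid.prev, &snap->node);
}

/* A copy shares its original's xmin, so it is linked right after it. */
void
toy_insert_copy(ToySnapshot *orig, ToySnapshot *copy)
{
	copy->xmin = orig->xmin;
	toy_link_after(&orig->node, &copy->node);
}

/* Unlinking is O(1) and needs no comparisons. */
void
toy_delete(ToySnapshot *snap)
{
	snap->node.prev->next = snap->node.next;
	snap->node.next->prev = snap->node.prev;
}

/* Oldest xmin is at the head; 0 stands in for "no snapshot". */
unsigned int
toy_oldest_xmin(void)
{
	if (toy_valid.next == &toy_valid)
		return 0;
	return TOY_SNAPSHOT(toy_valid.next)->xmin;
}
```

Compared with a heap, the list gives up the ability to insert in
arbitrary order, which is exactly the property the commit message argues
is no longer needed once every GetSnapshotData() call updates the list.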

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ef579128d3f..1c39cc11609 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -67,32 +67,22 @@
  * In addition to snapshots pushed to the active snapshot stack, a snapshot
  * can be registered with a resource owner.
  *
- * If FirstXactSnapshotRegistered is set, we increment the static
- * CurrentSnapshotData's regd_count and list it in RegisteredSnapshots, but this reference is not
- * tracked by a resource owner. We used to use the TopTransactionResourceOwner
- * to track this snapshot reference, but that introduces logical circularity
- * and thus makes it impossible to clean up in a sane fashion.  It's better to
- * handle this reference as an internally-tracked registration, so that this
- * module is entirely lower-level than ResourceOwners.
+ * Xmin tracking
+ * -------------
  *
- * Likewise, any snapshots that have been exported by pg_export_snapshot
- * have regd_count = 1 and are listed in RegisteredSnapshots, but are not
- * tracked by any resource owner.
+ * All valid snapshots, whether they are "static", included in the active
+ * stack, or registered with a resource owner, are tracked in a doubly-linked
+ * list, ValidSnapshots.  Any snapshots that have been exported by
+ * pg_export_snapshot() are also listed there.  (They have regd_count = 1,
+ * even though they are not tracked by any resource owner).
  *
- * Likewise, the CatalogSnapshot is listed in RegisteredSnapshots when it
- * is valid, but is not tracked by any resource owner.
- *
- * The same is true for historic snapshots used during logical decoding,
- * their lifetime is managed separately (as they live longer than one xact.c
- * transaction).
- *
- * These arrangements let us reset MyProc->xmin when there are no snapshots
+ * The list is in xmin order, so that the head always contains the oldest
+ * snapshot.  That lets us reset MyProc->xmin when there are no snapshots
  * referenced by this transaction, and advance it when the one with oldest
- * Xmin is no longer referenced.  For simplicity however, only registered
- * snapshots not active snapshots participate in tracking which one is oldest;
- * we don't try to change MyProc->xmin except when the active-snapshot
- * stack is empty.
+ * Xmin is no longer referenced.
  *
+ * The lifetime of historic snapshots used during logical decoding is managed
+ * separately (as they live longer than one xact.c transaction).
  *
  * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -111,7 +101,6 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "datatype/timestamp.h"
-#include "lib/pairingheap.h"
 #include "miscadmin.h"
 #include "port/pg_lfind.h"
 #include "storage/fd.h"
@@ -177,13 +166,10 @@ typedef struct ActiveSnapshotElt
 static ActiveSnapshotElt *ActiveSnapshot = NULL;
 
 /*
- * Currently registered Snapshots.  Ordered in a heap by xmin, so that we can
+ * Currently valid Snapshots.  Ordered in a list by xmin, so that we can
  * quickly find the one with lowest xmin, to advance our MyProc->xmin.
  */
-static int	xmin_cmp(const pairingheap_node *a, const pairingheap_node *b,
-					 void *arg);
-
-static pairingheap RegisteredSnapshots = {&xmin_cmp, NULL, NULL};
+static dlist_head ValidSnapshots = DLIST_STATIC_INIT(ValidSnapshots);
 
 /* first GetTransactionSnapshot call in a transaction? */
 bool		FirstSnapshotSet = false;
@@ -213,6 +199,8 @@ static MVCCSnapshot CopyMVCCSnapshot(MVCCSnapshot snapshot);
 static void UnregisterSnapshotNoOwner(Snapshot snapshot);
 static void FreeMVCCSnapshot(MVCCSnapshot snapshot);
 static void SnapshotResetXmin(void);
+static void valid_snapshots_push_tail(MVCCSnapshot snapshot);
+static void valid_snapshots_push_out_of_order(MVCCSnapshot snapshot);
 
 /* ResourceOwner callbacks to track snapshot references */
 static void ResOwnerReleaseSnapshot(Datum res);
@@ -284,7 +272,7 @@ GetTransactionSnapshot(void)
 		 */
 		InvalidateCatalogSnapshot();
 
-		Assert(pairingheap_is_empty(&RegisteredSnapshots));
+		Assert(dlist_is_empty(&ValidSnapshots));
 		Assert(!FirstXactSnapshotRegistered);
 
 		if (IsInParallelMode())
@@ -308,12 +296,13 @@ GetTransactionSnapshot(void)
 				GetSnapshotData(&CurrentSnapshotData);
 
 			/* Mark it as "registered" */
-			CurrentSnapshotData.regd_count++;
 			FirstXactSnapshotRegistered = true;
-			pairingheap_add(&RegisteredSnapshots, &CurrentSnapshotData.ph_node);
 		}
 		else
+		{
 			GetSnapshotData(&CurrentSnapshotData);
+		}
+		valid_snapshots_push_tail(&CurrentSnapshotData);
 
 		FirstSnapshotSet = true;
 		return (Snapshot) &CurrentSnapshotData;
@@ -321,6 +310,7 @@ GetTransactionSnapshot(void)
 
 	if (IsolationUsesXactSnapshot())
 	{
+		Assert(FirstXactSnapshotRegistered);
 		Assert(CurrentSnapshotData.valid);
 		return (Snapshot) &CurrentSnapshotData;
 	}
@@ -328,7 +318,10 @@ GetTransactionSnapshot(void)
 	/* Don't allow catalog snapshot to be older than xact snapshot. */
 	InvalidateCatalogSnapshot();
 
+	if (CurrentSnapshotData.valid)
+		dlist_delete(&CurrentSnapshotData.node);
 	GetSnapshotData(&CurrentSnapshotData);
+	valid_snapshots_push_tail(&CurrentSnapshotData);
 
 	return (Snapshot) &CurrentSnapshotData;
 }
@@ -359,7 +352,10 @@ GetLatestSnapshot(void)
 	if (!FirstSnapshotSet)
 		return GetTransactionSnapshot();
 
+	if (SecondarySnapshotData.valid)
+		dlist_delete(&SecondarySnapshotData.node);
 	GetSnapshotData(&SecondarySnapshotData);
+	valid_snapshots_push_tail(&SecondarySnapshotData);
 
 	return (Snapshot) &SecondarySnapshotData;
 }
@@ -423,7 +419,7 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 		 * NB: it had better be impossible for this to throw error, since the
 		 * CatalogSnapshot pointer is already valid.
 		 */
-		pairingheap_add(&RegisteredSnapshots, &CatalogSnapshotData.ph_node);
+		valid_snapshots_push_tail(&CatalogSnapshotData);
 	}
 
 	return (Snapshot) &CatalogSnapshotData;
@@ -444,10 +440,21 @@ InvalidateCatalogSnapshot(void)
 {
 	if (CatalogSnapshotData.valid)
 	{
-		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshotData.ph_node);
+		dlist_delete(&CatalogSnapshotData.node);
 		CatalogSnapshotData.valid = false;
-		SnapshotResetXmin();
 	}
+	if (!FirstXactSnapshotRegistered && CurrentSnapshotData.valid)
+	{
+		dlist_delete(&CurrentSnapshotData.node);
+		CurrentSnapshotData.valid = false;
+	}
+	if (SecondarySnapshotData.valid)
+	{
+		dlist_delete(&SecondarySnapshotData.node);
+		SecondarySnapshotData.valid = false;
+	}
+
+	SnapshotResetXmin();
 }
 
 /*
@@ -464,8 +471,7 @@ void
 InvalidateCatalogSnapshotConditionally(void)
 {
 	if (CatalogSnapshotData.valid &&
-		ActiveSnapshot == NULL &&
-		pairingheap_is_singular(&RegisteredSnapshots))
+		dlist_head_node(&ValidSnapshots) == &CatalogSnapshotData.node)
 		InvalidateCatalogSnapshot();
 }
 
@@ -504,7 +510,6 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 	/* Better do this to ensure following Assert succeeds. */
 	InvalidateCatalogSnapshot();
 
-	Assert(pairingheap_is_empty(&RegisteredSnapshots));
 	Assert(!FirstXactSnapshotRegistered);
 	Assert(!HistoricSnapshotActive());
 
@@ -576,9 +581,8 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 											   sourcepid);
 		/* Mark it as "registered" */
 		FirstXactSnapshotRegistered = true;
-		CurrentSnapshotData.regd_count++;
-		pairingheap_add(&RegisteredSnapshots, &CurrentSnapshotData.ph_node);
 	}
+	valid_snapshots_push_tail(&CurrentSnapshotData);
 
 	FirstSnapshotSet = true;
 }
@@ -699,7 +703,10 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 	 * better to be sure.
 	 */
 	if (!origsnap->copied)
+	{
 		newactive->as_snap = CopyMVCCSnapshot(origsnap);
+		dlist_insert_after(&origsnap->node, &newactive->as_snap->node);
+	}
 	else
 		newactive->as_snap = origsnap;
 
@@ -722,8 +729,13 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 void
 PushCopiedSnapshot(Snapshot snapshot)
 {
+	MVCCSnapshot copy;
+
 	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
-	PushActiveSnapshot((Snapshot) CopyMVCCSnapshot(&snapshot->mvcc));
+
+	copy = CopyMVCCSnapshot(&snapshot->mvcc);
+	dlist_insert_after(&snapshot->mvcc.node, &copy->node);
+	PushActiveSnapshot((Snapshot) copy);
 }
 
 /*
@@ -776,7 +788,10 @@ PopActiveSnapshot(void)
 
 	if (ActiveSnapshot->as_snap->active_count == 0 &&
 		ActiveSnapshot->as_snap->regd_count == 0)
+	{
+		dlist_delete(&ActiveSnapshot->as_snap->node);
 		FreeMVCCSnapshot(ActiveSnapshot->as_snap);
+	}
 
 	pfree(ActiveSnapshot);
 	ActiveSnapshot = newstack;
@@ -850,16 +865,17 @@ RegisterSnapshotOnOwner(Snapshot orig_snapshot, ResourceOwner owner)
 	Assert(snapshot->valid);
 
 	/* Static snapshot?  Create a persistent copy */
-	snapshot = snapshot->copied ? snapshot : CopyMVCCSnapshot(snapshot);
+	if (!snapshot->copied)
+	{
+		snapshot = CopyMVCCSnapshot(snapshot);
+		dlist_insert_after(&orig_snapshot->mvcc.node, &snapshot->node);
+	}
 
 	/* and tell resowner.c about it */
 	ResourceOwnerEnlarge(owner);
 	snapshot->regd_count++;
 	ResourceOwnerRememberSnapshot(owner, (Snapshot) snapshot);
 
-	if (snapshot->regd_count == 1)
-		pairingheap_add(&RegisteredSnapshots, &snapshot->ph_node);
-
 	return (Snapshot) snapshot;
 }
 
@@ -901,14 +917,12 @@ UnregisterSnapshotNoOwner(Snapshot snapshot)
 		MVCCSnapshot mvccsnap = &snapshot->mvcc;
 
 		Assert(mvccsnap->regd_count > 0);
-		Assert(!pairingheap_is_empty(&RegisteredSnapshots));
+		Assert(!dlist_is_empty(&ValidSnapshots));
 
 		mvccsnap->regd_count--;
-		if (mvccsnap->regd_count == 0)
-			pairingheap_remove(&RegisteredSnapshots, &mvccsnap->ph_node);
-
 		if (mvccsnap->regd_count == 0 && mvccsnap->active_count == 0)
 		{
+			dlist_delete(&mvccsnap->node);
 			FreeMVCCSnapshot(mvccsnap);
 			SnapshotResetXmin();
 		}
@@ -933,24 +947,6 @@ UnregisterSnapshotNoOwner(Snapshot snapshot)
 		elog(ERROR, "registered snapshot has unexpected type");
 }
 
-/*
- * Comparison function for RegisteredSnapshots heap.  Snapshots are ordered
- * by xmin, so that the snapshot with smallest xmin is at the top.
- */
-static int
-xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
-{
-	const MVCCSnapshotData *asnap = pairingheap_const_container(MVCCSnapshotData, ph_node, a);
-	const MVCCSnapshotData *bsnap = pairingheap_const_container(MVCCSnapshotData, ph_node, b);
-
-	if (TransactionIdPrecedes(asnap->xmin, bsnap->xmin))
-		return 1;
-	else if (TransactionIdFollows(asnap->xmin, bsnap->xmin))
-		return -1;
-	else
-		return 0;
-}
-
 /*
  * SnapshotResetXmin
  *
@@ -972,21 +968,27 @@ SnapshotResetXmin(void)
 	/*
 	 * Invalidate these static snapshots so that we can advance xmin.
 	 */
-	if (!FirstXactSnapshotRegistered)
+	if (!FirstXactSnapshotRegistered && CurrentSnapshotData.valid)
+	{
+		dlist_delete(&CurrentSnapshotData.node);
 		CurrentSnapshotData.valid = false;
-	SecondarySnapshotData.valid = false;
+	}
+	if (SecondarySnapshotData.valid)
+	{
+		dlist_delete(&SecondarySnapshotData.node);
+		SecondarySnapshotData.valid = false;
+	}
 
 	if (ActiveSnapshot != NULL)
 		return;
 
-	if (pairingheap_is_empty(&RegisteredSnapshots))
+	if (dlist_is_empty(&ValidSnapshots))
 	{
 		MyProc->xmin = TransactionXmin = InvalidTransactionId;
 		return;
 	}
 
-	minSnapshot = pairingheap_container(MVCCSnapshotData, ph_node,
-										pairingheap_first(&RegisteredSnapshots));
+	minSnapshot = dlist_head_element(MVCCSnapshotData, node, &ValidSnapshots);
 
 	if (TransactionIdPrecedes(MyProc->xmin, minSnapshot->xmin))
 		MyProc->xmin = TransactionXmin = minSnapshot->xmin;
@@ -1035,7 +1037,10 @@ AtSubAbort_Snapshot(int level)
 
 		if (ActiveSnapshot->as_snap->active_count == 0 &&
 			ActiveSnapshot->as_snap->regd_count == 0)
+		{
+			dlist_delete(&ActiveSnapshot->as_snap->node);
 			FreeMVCCSnapshot(ActiveSnapshot->as_snap);
+		}
 
 		/* and free the stack element */
 		pfree(ActiveSnapshot);
@@ -1053,23 +1058,6 @@ AtSubAbort_Snapshot(int level)
 void
 AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 {
-	/*
-	 * In transaction-snapshot mode we must release our privately-managed
-	 * reference to the transaction snapshot.  We must remove it from
-	 * RegisteredSnapshots to keep the check below happy.  But we don't bother
-	 * to do FreeMVCCSnapshot, for two reasons: the memory will go away with
-	 * TopTransactionContext anyway, and if someone has left the snapshot
-	 * stacked as active, we don't want the code below to be chasing through a
-	 * dangling pointer.
-	 */
-	if (FirstXactSnapshotRegistered)
-	{
-		Assert(CurrentSnapshotData.regd_count > 0);
-		Assert(!pairingheap_is_empty(&RegisteredSnapshots));
-		pairingheap_remove(&RegisteredSnapshots, &CurrentSnapshotData.ph_node);
-		FirstXactSnapshotRegistered = false;
-	}
-
 	/*
 	 * If we exported any snapshots, clean them up.
 	 */
@@ -1082,8 +1070,8 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 		 * it's too late to abort the transaction, and (2) leaving a leaked
 		 * file around has little real consequence anyway.
 		 *
-		 * We also need to remove the snapshots from RegisteredSnapshots to
-		 * prevent a warning below.
+		 * We also need to remove the snapshots from ValidSnapshots to prevent
+		 * a warning below.
 		 *
 		 * As with the FirstXactSnapshot, we don't need to free resources of
 		 * the snapshot itself as it will go away with the memory context.
@@ -1096,22 +1084,35 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 				elog(WARNING, "could not unlink file \"%s\": %m",
 					 esnap->snapfile);
 
-			pairingheap_remove(&RegisteredSnapshots,
-							   &esnap->snapshot->ph_node);
+			dlist_delete(&esnap->snapshot->node);
 		}
 
 		exportedSnapshots = NIL;
 	}
 
-	/* Drop catalog snapshot if any */
-	InvalidateCatalogSnapshot();
+	/* Drop all static snapshots */
+	if (CatalogSnapshotData.valid)
+	{
+		dlist_delete(&CatalogSnapshotData.node);
+		CatalogSnapshotData.valid = false;
+	}
+	if (CurrentSnapshotData.valid)
+	{
+		dlist_delete(&CurrentSnapshotData.node);
+		CurrentSnapshotData.valid = false;
+	}
+	if (SecondarySnapshotData.valid)
+	{
+		dlist_delete(&SecondarySnapshotData.node);
+		SecondarySnapshotData.valid = false;
+	}
 
 	/* On commit, complain about leftover snapshots */
 	if (isCommit)
 	{
 		ActiveSnapshotElt *active;
 
-		if (!pairingheap_is_empty(&RegisteredSnapshots))
+		if (!dlist_is_empty(&ValidSnapshots))
 			elog(WARNING, "registered snapshots seem to remain after cleanup");
 
 		/* complain about unpopped active snapshots */
@@ -1124,11 +1125,12 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	 * it'll go away with TopTransactionContext.
 	 */
 	ActiveSnapshot = NULL;
-	pairingheap_reset(&RegisteredSnapshots);
+	dlist_init(&ValidSnapshots);
 
 	CurrentSnapshotData.valid = false;
 	SecondarySnapshotData.valid = false;
 	FirstSnapshotSet = false;
+	FirstXactSnapshotRegistered = false;
 
 	/*
 	 * During normal commit processing, we call ProcArrayEndTransaction() to
@@ -1151,6 +1153,7 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 char *
 ExportSnapshot(MVCCSnapshot snapshot)
 {
+	MVCCSnapshot orig_snapshot;
 	TransactionId topXid;
 	TransactionId *children;
 	ExportedSnapshot *esnap;
@@ -1213,7 +1216,8 @@ ExportSnapshot(MVCCSnapshot snapshot)
 	 * ensure that the snapshot's xmin is honored for the rest of the
 	 * transaction.
 	 */
-	snapshot = CopyMVCCSnapshot(snapshot);
+	orig_snapshot = snapshot;
+	snapshot = CopyMVCCSnapshot(orig_snapshot);
 
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 	esnap = (ExportedSnapshot *) palloc(sizeof(ExportedSnapshot));
@@ -1223,7 +1227,7 @@ ExportSnapshot(MVCCSnapshot snapshot)
 	MemoryContextSwitchTo(oldcxt);
 
 	snapshot->regd_count++;
-	pairingheap_add(&RegisteredSnapshots, &snapshot->ph_node);
+	dlist_insert_after(&orig_snapshot->node, &snapshot->node);
 
 	/*
 	 * Fill buf with a text serialization of the snapshot, plus identification
@@ -1653,7 +1657,7 @@ DeleteAllExportedSnapshotFiles(void)
 
 /*
  * ThereAreNoPriorRegisteredSnapshots
- *		Is the registered snapshot count less than or equal to one?
+ *		Are there any snapshots other than the current active snapshot?
  *
  * Don't use this to settle important decisions.  While zero registrations and
  * no ActiveSnapshot would confirm a certain idleness, the system makes no
@@ -1662,11 +1666,25 @@ DeleteAllExportedSnapshotFiles(void)
 bool
 ThereAreNoPriorRegisteredSnapshots(void)
 {
-	if (pairingheap_is_empty(&RegisteredSnapshots) ||
-		pairingheap_is_singular(&RegisteredSnapshots))
-		return true;
+	dlist_iter	iter;
 
-	return false;
+	dlist_foreach(iter, &ValidSnapshots)
+	{
+		MVCCSnapshot cur = dlist_container(MVCCSnapshotData, node, iter.cur);
+
+		if (FirstXactSnapshotRegistered)
+		{
+			Assert(CurrentSnapshotData.valid);
+			if (cur == &CurrentSnapshotData)
+				continue;
+		}
+		if (ActiveSnapshot && cur == ActiveSnapshot->as_snap)
+			continue;
+
+		return false;
+	}
+
+	return true;
 }
 
 /*
@@ -1684,15 +1702,18 @@ HaveRegisteredOrActiveSnapshot(void)
 		return true;
 
 	/*
-	 * The catalog snapshot is in RegisteredSnapshots when valid, but can be
+	 * The catalog snapshot is in ValidSnapshots when valid, but can be
 	 * removed at any time due to invalidation processing. If explicitly
-	 * registered more than one snapshot has to be in RegisteredSnapshots.
+	 * registered more than one snapshot has to be in ValidSnapshots.
 	 */
 	if (CatalogSnapshotData.valid &&
-		pairingheap_is_singular(&RegisteredSnapshots))
+		dlist_head_node(&ValidSnapshots) == &CatalogSnapshotData.node &&
+		dlist_tail_node(&ValidSnapshots) == &CatalogSnapshotData.node)
+	{
 		return false;
+	}
 
-	return !pairingheap_is_empty(&RegisteredSnapshots);
+	return !dlist_is_empty(&ValidSnapshots);
 }
 
 
@@ -1884,7 +1905,7 @@ RestoreSnapshot(char *start_address)
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 	snapshot->regd_count++;
 	ResourceOwnerRememberSnapshot(CurrentResourceOwner, (Snapshot) snapshot);
-	pairingheap_add(&RegisteredSnapshots, &snapshot->ph_node);
+	valid_snapshots_push_out_of_order(snapshot);
 
 	return snapshot;
 }
@@ -2015,3 +2036,45 @@ ResOwnerReleaseSnapshot(Datum res)
 {
 	UnregisterSnapshotNoOwner((Snapshot) DatumGetPointer(res));
 }
+
+
+/* Helper functions to manipulate the ValidSnapshots list */
+
+/* dlist_push_tail, with assertion that the list stays ordered by xmin */
+static void
+valid_snapshots_push_tail(MVCCSnapshot snapshot)
+{
+#ifdef USE_ASSERT_CHECKING
+	if (!dlist_is_empty(&ValidSnapshots))
+	{
+		MVCCSnapshot tail = dlist_tail_element(MVCCSnapshotData, node, &ValidSnapshots);
+
+		Assert(TransactionIdFollowsOrEquals(snapshot->xmin, tail->xmin));
+	}
+#endif
+	dlist_push_tail(&ValidSnapshots, &snapshot->node);
+}
+
+/*
+ * Add an entry to the right position in the list, keeping it ordered by xmin.
+ *
+ * This is O(n), but that's OK because it's only used on rare occasions, when
+ * the list is small.
+ */
+static void
+valid_snapshots_push_out_of_order(MVCCSnapshot snapshot)
+{
+	dlist_iter	iter;
+
+	dlist_reverse_foreach(iter, &ValidSnapshots)
+	{
+		MVCCSnapshot cur = dlist_container(MVCCSnapshotData, node, iter.cur);
+
+		if (TransactionIdFollowsOrEquals(snapshot->xmin, cur->xmin))
+		{
+			dlist_insert_after(&cur->node, &snapshot->node);
+			return;
+		}
+	}
+	dlist_push_head(&ValidSnapshots, &snapshot->node);
+}
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index cb43a278f46..27ed1d77c9b 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -17,6 +17,7 @@
 #include "access/itup.h"
 #include "access/spgist.h"
 #include "catalog/pg_am_d.h"
+#include "lib/pairingheap.h"
 #include "nodes/tidbitmap.h"
 #include "storage/buf.h"
 #include "utils/geo_decls.h"
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 1697c6df856..44b3b20f73c 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -13,7 +13,7 @@
 #ifndef SNAPSHOT_H
 #define SNAPSHOT_H
 
-#include "lib/pairingheap.h"
+#include "lib/ilist.h"
 
 
 /*
@@ -169,8 +169,8 @@ typedef struct MVCCSnapshotData
 	 * Book-keeping information, used by the snapshot manager
 	 */
 	uint32		active_count;	/* refcount on ActiveSnapshot stack */
-	uint32		regd_count;		/* refcount on RegisteredSnapshots */
-	pairingheap_node ph_node;	/* link in the RegisteredSnapshots heap */
+	uint32		regd_count;		/* refcount of registrations in resowners */
+	dlist_node	node;			/* link in ValidSnapshots */
 
 	/*
 	 * The transaction completion count at the time GetSnapshotData() built
-- 
2.39.5
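
The helpers at the end of the patch above keep ValidSnapshots ordered by xmin, with the newest entries at the tail, so the oldest xmin can be read from the head. As a rough standalone illustration of that idea (not the patch's code), the sketch below inserts into a circular doubly-linked list by scanning backwards from the tail. All `Demo*`/`demo_*` names are hypothetical, and the XID comparison is a simplified stand-in for TransactionIdFollowsOrEquals() that ignores special XIDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list node; the sentinel head uses the same struct. */
typedef struct DemoNode
{
	struct DemoNode *prev;
	struct DemoNode *next;
	uint32_t	xmin;
} DemoNode;

/* Circular list with a sentinel: head->next is oldest, head->prev newest. */
static void
demo_list_init(DemoNode *head)
{
	head->prev = head->next = head;
	head->xmin = 0;
}

/* Simplified wraparound-aware "a >= b" on 32-bit XIDs (assumption). */
static int
demo_xid_follows_or_equals(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) >= 0;
}

/* Link 'node' immediately after 'pos'. */
static void
demo_insert_after(DemoNode *pos, DemoNode *node)
{
	node->prev = pos;
	node->next = pos->next;
	pos->next->prev = node;
	pos->next = node;
}

/* O(n) insert that keeps the list ordered by ascending xmin. */
static void
demo_insert_ordered(DemoNode *head, DemoNode *node)
{
	DemoNode   *cur;

	/* Walk from the newest entry towards the oldest. */
	for (cur = head->prev; cur != head; cur = cur->prev)
	{
		if (demo_xid_follows_or_equals(node->xmin, cur->xmin))
			break;
	}

	/*
	 * 'cur' is now the last entry whose xmin is <= ours, or the sentinel
	 * if every entry is newer, in which case we land at the front.
	 */
	demo_insert_after(cur, node);
}

/* The oldest xmin is simply the first element. */
static uint32_t
demo_oldest_xmin(const DemoNode *head)
{
	assert(head->next != head);
	return head->next->xmin;
}
```

Scanning from the tail is cheap in the common case, where a new snapshot's xmin is at least as new as everything already in the list.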

Attachment: v6-0007-Split-MVCCSnapshot-into-inner-and-outer-parts.patch (text/x-patch; charset=UTF-8)

From 05443030201d59216b3125d51c641b68decd4379 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 31 Mar 2025 21:46:54 +0300
Subject: [PATCH v6 07/12] Split MVCCSnapshot into inner and outer parts

Split MVCCSnapshot into two parts: inner struct to hold the xmin, xmax
and XID arrays that determine which transactions are visible, and an
outer shell that includes the command ID and a pointer to the inner
struct. That way, the inner struct can be shared by snapshots derived
from the same original snapshot, just with different command IDs.

The inner struct, MVCCSnapshotShared, is reference counted separately
so that we can avoid copying it when pushing or registering a snapshot
for the first time. Also, GetMVCCSnapshotData() can reuse it more
aggressively: we always keep a pointer to the latest shared struct
(latestSnapshotShared), and GetMVCCSnapshotData() always tries to
reuse the same latest snapshot, regardless of whether it was called
from GetTransactionSnapshot(), GetLatestSnapshot(), or
GetCatalogSnapshot(). That avoids unnecessary copying. Snapshots are
usually small enough that the copying doesn't matter, but avoiding it
can help in extreme cases where you have thousands of (sub-)XIDs in
progress.

Now that the shared inner structs are reference counted, it seems
unnecessary to reference count the outer MVCCSnapshots
separately. That means that RegisterSnapshot() always makes a new
palloc'd copy of the outer struct, but that's pretty small. The
ActiveSnapshot stack entries now embed the outer struct directly, so
the 'active_count' is gone too.

The ValidSnapshots list now tracks the shared structs rather than the
outer snapshots. That's sufficient for finding the oldest xmin, but if
we ever wanted to also know the oldest command ID in use, we'd need to
track the outer structs instead.
---
 contrib/amcheck/verify_heapam.c             |   2 +-
 contrib/amcheck/verify_nbtree.c             |   2 +-
 src/backend/access/heap/heapam.c            |   2 +-
 src/backend/access/heap/heapam_handler.c    |   2 +-
 src/backend/access/heap/heapam_visibility.c |  18 +-
 src/backend/access/spgist/spgvacuum.c       |   2 +-
 src/backend/access/transam/README           |  26 +-
 src/backend/catalog/pg_inherits.c           |   6 +-
 src/backend/commands/async.c                |   2 +-
 src/backend/commands/indexcmds.c            |   4 +-
 src/backend/commands/tablecmds.c            |   2 +-
 src/backend/executor/execMain.c             |  12 +-
 src/backend/executor/execParallel.c         |   3 +-
 src/backend/partitioning/partdesc.c         |   2 +-
 src/backend/replication/logical/snapbuild.c |  40 +-
 src/backend/replication/walsender.c         |   2 +-
 src/backend/storage/ipc/procarray.c         | 138 +++--
 src/backend/storage/lmgr/predicate.c        | 109 ++--
 src/backend/utils/adt/xid8funcs.c           |   8 +-
 src/backend/utils/time/snapmgr.c            | 605 ++++++++++----------
 src/include/access/transam.h                |   4 +-
 src/include/storage/predicate.h             |   8 +-
 src/include/storage/proc.h                  |   2 +-
 src/include/storage/procarray.h             |   2 +-
 src/include/utils/snapmgr.h                 |  11 +-
 src/include/utils/snapshot.h                |  51 +-
 src/tools/pgindent/typedefs.list            |   2 +
 27 files changed, 536 insertions(+), 531 deletions(-)
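
The inner/outer split described in the commit message above can be pictured with a small standalone sketch. This is only an illustration under assumptions, not the patch's definitions: the `Demo*` types and `demo_*` functions are hypothetical, plain malloc/free stand in for palloc, and the inner struct carries only xmin/xmax rather than the full XID arrays.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for TransactionId and CommandId. */
typedef uint32_t DemoXid;
typedef uint32_t DemoCommandId;

/* Inner part: visibility information, shared and reference counted. */
typedef struct DemoSnapshotShared
{
	DemoXid		xmin;
	DemoXid		xmax;
	uint32_t	refcount;
} DemoSnapshotShared;

/* Outer shell: a command ID plus a pointer to the shared inner part. */
typedef struct DemoSnapshot
{
	DemoCommandId curcid;
	DemoSnapshotShared *shared;
} DemoSnapshot;

/* Allocate an inner struct that nobody references yet. */
static DemoSnapshotShared *
demo_shared_create(DemoXid xmin, DemoXid xmax)
{
	DemoSnapshotShared *shared = malloc(sizeof(*shared));

	assert(shared != NULL);
	shared->xmin = xmin;
	shared->xmax = xmax;
	shared->refcount = 0;
	return shared;
}

/*
 * Make a new outer shell for an existing inner struct.  Only the small
 * outer struct is allocated; the inner part is shared, not copied.
 */
static DemoSnapshot *
demo_snapshot_create(DemoSnapshotShared *shared, DemoCommandId curcid)
{
	DemoSnapshot *snap = malloc(sizeof(*snap));

	assert(snap != NULL);
	snap->curcid = curcid;
	snap->shared = shared;
	shared->refcount++;
	return snap;
}

/*
 * Free an outer shell and drop its reference.  Returns 1 if this was the
 * last reference and the inner struct was freed too, otherwise 0.
 */
static int
demo_snapshot_release(DemoSnapshot *snap)
{
	DemoSnapshotShared *shared = snap->shared;
	int			freed = 0;

	free(snap);
	assert(shared->refcount > 0);
	if (--shared->refcount == 0)
	{
		free(shared);
		freed = 1;
	}
	return freed;
}
```

Two shells created from the same inner struct with different command IDs see the same xmin/xmax, and the inner struct goes away only when the last shell is released.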

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 6665cafc179..d7f0b772f94 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -310,7 +310,7 @@ verify_heapam(PG_FUNCTION_ARGS)
 	 * Any xmin newer than the xmin of our snapshot can't become all-visible
 	 * while we're running.
 	 */
-	ctx.safe_xmin = GetTransactionSnapshot()->mvcc.xmin;
+	ctx.safe_xmin = GetTransactionSnapshot()->mvcc.shared->xmin;
 
 	/*
 	 * If we report corruption when not examining some individual attribute,
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index e90b4a2ad5a..d77ded4cc40 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -458,7 +458,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 			 */
 			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
 				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->mvcc.xmin))
+									   snapshot->mvcc.shared->xmin))
 				ereport(ERROR,
 						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0cfa100cbd1..0615ffa2bd1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -606,7 +606,7 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	 * tuple for visibility the hard way.
 	 */
 	all_visible = PageIsAllVisible(page) &&
-		(snapshot->snapshot_type != SNAPSHOT_MVCC || !snapshot->mvcc.takenDuringRecovery);
+		(snapshot->snapshot_type != SNAPSHOT_MVCC || !snapshot->mvcc.shared->takenDuringRecovery);
 	check_serializable =
 		CheckForSerializableConflictOutNeeded(scan->rs_base.rs_rd, snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index fce657f00f6..b9a5b38dd08 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2308,7 +2308,7 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
 
 	page = (Page) BufferGetPage(hscan->rs_cbuf);
 	all_visible = PageIsAllVisible(page) &&
-		(scan->rs_snapshot->snapshot_type != SNAPSHOT_MVCC || !scan->rs_snapshot->mvcc.takenDuringRecovery);
+		(scan->rs_snapshot->snapshot_type != SNAPSHOT_MVCC || !scan->rs_snapshot->mvcc.shared->takenDuringRecovery);
 	maxoffset = PageGetMaxOffsetNumber(page);
 
 	for (;;)
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index f5d69b558f1..07f155498d4 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -19,7 +19,7 @@
  * That fixes that problem, but it also means there is a window where
  * TransactionIdIsInProgress and TransactionIdDidCommit will both return true.
  * If we check only TransactionIdDidCommit, we could consider a tuple
- * committed when a later GetSnapshotData call will still think the
+ * committed when a later GetMVCCSnapshotData call will still think the
  * originating transaction is in progress, which leads to application-level
  * inconsistency.  The upshot is that we gotta check TransactionIdIsInProgress
  * first in all code paths, except for a few cases where we are looking at
@@ -969,7 +969,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 	 * get invalidated while it's still in use, and this is a convenient place
 	 * to check for that.
 	 */
-	Assert(snapshot->regd_count > 0 || snapshot->active_count > 0);
+	Assert(snapshot->kind == SNAPSHOT_ACTIVE || snapshot->kind == SNAPSHOT_REGISTERED);
 
 	Assert(ItemPointerIsValid(&htup->t_self));
 	Assert(htup->t_tableOid != InvalidOid);
@@ -986,7 +986,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 
 			if (TransactionIdIsCurrentTransactionId(xvac))
 				return false;
-			if (!XidInMVCCSnapshot(xvac, snapshot))
+			if (!XidInMVCCSnapshot(xvac, snapshot->shared))
 			{
 				if (TransactionIdDidCommit(xvac))
 				{
@@ -1005,7 +1005,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 
 			if (!TransactionIdIsCurrentTransactionId(xvac))
 			{
-				if (XidInMVCCSnapshot(xvac, snapshot))
+				if (XidInMVCCSnapshot(xvac, snapshot->shared))
 					return false;
 				if (TransactionIdDidCommit(xvac))
 					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
@@ -1060,7 +1060,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 			else
 				return false;	/* deleted before scan started */
 		}
-		else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
+		else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot->shared))
 			return false;
 		else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
 			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
@@ -1077,7 +1077,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 	{
 		/* xmin is committed, but maybe not according to our snapshot */
 		if (!HeapTupleHeaderXminFrozen(tuple) &&
-			XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
+			XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot->shared))
 			return false;		/* treat as still in progress */
 	}
 
@@ -1108,7 +1108,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 			else
 				return false;	/* deleted before scan started */
 		}
-		if (XidInMVCCSnapshot(xmax, snapshot))
+		if (XidInMVCCSnapshot(xmax, snapshot->shared))
 			return true;
 		if (TransactionIdDidCommit(xmax))
 			return false;		/* updating transaction committed */
@@ -1126,7 +1126,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 				return false;	/* deleted before scan started */
 		}
 
-		if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot))
+		if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot->shared))
 			return true;
 
 		if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple)))
@@ -1144,7 +1144,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 	else
 	{
 		/* xmax is committed, but maybe not according to our snapshot */
-		if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot))
+		if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot->shared))
 			return true;		/* treat as still in progress */
 	}
 
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 850ad36cd0a..0a8d7b0a0d6 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -808,7 +808,7 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->mvcc.xmin;
+	bds->myXmin = GetActiveSnapshot()->mvcc.shared->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 231106270fd..81792f0eab3 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -231,7 +231,7 @@ we must ensure consistency about the commit order of transactions.
 For example, suppose an UPDATE in xact A is blocked by xact B's prior
 update of the same row, and xact B is doing commit while xact C gets a
 snapshot.  Xact A can complete and commit as soon as B releases its locks.
-If xact C's GetSnapshotData sees xact B as still running, then it had
+If xact C's GetMVCCSnapshotData sees xact B as still running, then it had
 better see xact A as still running as well, or it will be able to see two
 tuple versions - one deleted by xact B and one inserted by xact A.  Another
 reason why this would be bad is that C would see (in the row inserted by A)
@@ -248,8 +248,8 @@ with snapshot-taking: we do not allow any transaction to exit the set of
 running transactions while a snapshot is being taken.  (This rule is
 stronger than necessary for consistency, but is relatively simple to
 enforce, and it assists with some other issues as explained below.)  The
-implementation of this is that GetSnapshotData takes the ProcArrayLock in
-shared mode (so that multiple backends can take snapshots in parallel),
+implementation of this is that GetMVCCSnapshotData takes the ProcArrayLock
+in shared mode (so that multiple backends can take snapshots in parallel),
 but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
 while clearing the ProcGlobal->xids[] entry at transaction end (either
 commit or abort). (To reduce context switching, when multiple transactions
@@ -257,7 +257,7 @@ commit nearly simultaneously, we have one backend take ProcArrayLock and
 clear the XIDs of multiple processes at once.)
 
 ProcArrayEndTransaction also holds the lock while advancing the shared
-latestCompletedXid variable.  This allows GetSnapshotData to use
+latestCompletedXid variable.  This allows GetMVCCSnapshotData to use
 latestCompletedXid + 1 as xmax for its snapshot: there can be no
 transaction >= this xid value that the snapshot needs to consider as
 completed.
@@ -301,7 +301,7 @@ if it currently has no live snapshots (eg, if it's between transactions or
 hasn't yet set a snapshot for a new transaction).  ComputeXidHorizons takes
 the MIN() of the valid xmin fields.  It does this with only shared lock on
 ProcArrayLock, which means there is a potential race condition against other
-backends doing GetSnapshotData concurrently: we must be certain that a
+backends doing GetMVCCSnapshotData concurrently: we must be certain that a
 concurrent backend that is about to set its xmin does not compute an xmin
 less than what ComputeXidHorizons determines.  We ensure that by including
 all the active XIDs into the MIN() calculation, along with the valid xmins.
@@ -310,27 +310,27 @@ ensures that concurrent holders of shared ProcArrayLock will compute the
 same minimum of currently-active XIDs: no xact, in particular not the
 oldest, can exit while we hold shared ProcArrayLock.  So
 ComputeXidHorizons's view of the minimum active XID will be the same as that
-of any concurrent GetSnapshotData, and so it can't produce an overestimate.
+of any concurrent GetMVCCSnapshotData, and so it can't produce an overestimate.
 If there is no active transaction at all, ComputeXidHorizons uses
 latestCompletedXid + 1, which is a lower bound for the xmin that might
-be computed by concurrent or later GetSnapshotData calls.  (We know that no
+be computed by concurrent or later GetMVCCSnapshotData calls.  (We know that no
 XID less than this could be about to appear in the ProcArray, because of the
 XidGenLock interlock discussed above.)
 
-As GetSnapshotData is performance critical, it does not perform an accurate
+As GetMVCCSnapshotData is performance critical, it does not perform an accurate
 oldest-xmin calculation (it used to, until v14). The contents of a snapshot
 only depend on the xids of other backends, not their xmin. As backend's xmin
-changes much more often than its xid, having GetSnapshotData look at xmins
+changes much more often than its xid, having GetMVCCSnapshotData look at xmins
 can lead to a lot of unnecessary cacheline ping-pong.  Instead
-GetSnapshotData updates approximate thresholds (one that guarantees that all
-deleted rows older than it can be removed, another determining that deleted
+GetMVCCSnapshotData updates approximate thresholds (one that guarantees that
+all deleted rows older than it can be removed, another determining that deleted
 rows newer than it can not be removed). GlobalVisTest* uses those thresholds
 to make invisibility decision, falling back to ComputeXidHorizons if
 necessary.
 
 Note that while it is certain that two concurrent executions of
-GetSnapshotData will compute the same xmin for their own snapshots, there is
-no such guarantee for the horizons computed by ComputeXidHorizons.  This is
+GetMVCCSnapshotData will compute the same xmin for their own snapshots, there
+is no such guarantee for the horizons computed by ComputeXidHorizons.  This is
 because we allow XID-less transactions to clear their MyProc->xmin
 asynchronously (without taking ProcArrayLock), so one execution might see
 what had been the oldest xmin, and another not.  This is OK since the
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index b658601bf77..f1148dbe4a3 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -143,12 +143,12 @@ find_inheritance_children_extended(Oid parentrelId, bool omit_detached,
 			if (omit_detached && ActiveSnapshotSet())
 			{
 				TransactionId xmin;
-				Snapshot	snap;
+				MVCCSnapshot snap;
 
 				xmin = HeapTupleHeaderGetXmin(inheritsTuple->t_data);
-				snap = GetActiveSnapshot();
+				snap = (MVCCSnapshot) GetActiveSnapshot();
 
-				if (!XidInMVCCSnapshot(xmin, (MVCCSnapshot) snap))
+				if (!XidInMVCCSnapshot(xmin, snap->shared))
 				{
 					if (detached_xmin)
 					{
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 1ffb6f5fa70..037ca6c5444 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2043,7 +2043,7 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 		/* Ignore messages destined for other databases */
 		if (qe->dboid == MyDatabaseId)
 		{
-			if (XidInMVCCSnapshot(qe->xid, (MVCCSnapshot) snapshot))
+			if (XidInMVCCSnapshot(qe->xid, ((MVCCSnapshot) snapshot)->shared))
 			{
 				/*
 				 * The source transaction is still in progress, so we can't
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index da3e02398bb..7fa044f6f1c 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1761,7 +1761,7 @@ DefineIndex(Oid tableId,
 	 * they must wait for.  But first, save the snapshot's xmin to use as
 	 * limitXmin for GetCurrentVirtualXIDs().
 	 */
-	limitXmin = snapshot->mvcc.xmin;
+	limitXmin = snapshot->mvcc.shared->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
@@ -4156,7 +4156,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * We can now do away with our active snapshot, we still need to save
 		 * the xmin limit to wait for older snapshots.
 		 */
-		limitXmin = snapshot->mvcc.xmin;
+		limitXmin = snapshot->mvcc.shared->xmin;
 
 		PopActiveSnapshot();
 		UnregisterSnapshot(snapshot);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c55b5a7a014..9aca810f9d5 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -20797,7 +20797,7 @@ ATExecDetachPartitionFinalize(Relation rel, RangeVar *name)
 	 * all such queries are complete (otherwise we would present them with an
 	 * inconsistent view of catalogs).
 	 */
-	WaitForOlderSnapshots(snap->mvcc.xmin, false);
+	WaitForOlderSnapshots(snap->mvcc.shared->xmin, false);
 
 	DetachPartitionFinalize(rel, partRel, true, InvalidOid);
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 2da848970be..9ee10050873 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -157,8 +157,8 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	Assert(queryDesc != NULL);
 	Assert(queryDesc->estate == NULL);
 
-	/* caller must ensure the query's snapshot is active */
-	Assert(GetActiveSnapshot() == queryDesc->snapshot);
+	/* ensure the query's snapshot is active */
+	PushActiveSnapshot(queryDesc->snapshot);
 
 	/*
 	 * If the transaction is read-only, we need to check if any writes are
@@ -272,6 +272,8 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 
 	MemoryContextSwitchTo(oldcontext);
 
+	PopActiveSnapshot();
+
 	return ExecPlanStillValid(queryDesc->estate);
 }
 
@@ -390,8 +392,8 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	Assert(!estate->es_aborted);
 	Assert(!(estate->es_top_eflags & EXEC_FLAG_EXPLAIN_ONLY));
 
-	/* caller must ensure the query's snapshot is active */
-	Assert(GetActiveSnapshot() == estate->es_snapshot);
+	/* ensure the query's snapshot is active */
+	PushActiveSnapshot(estate->es_snapshot);
 
 	/*
 	 * Switch into per-query memory context
@@ -455,6 +457,8 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 		InstrStopNode(queryDesc->totaltime, estate->es_processed);
 
 	MemoryContextSwitchTo(oldcontext);
+
+	PopActiveSnapshot();
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 39c990ae638..af3f8f28144 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -737,7 +737,8 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	 * worker, which uses it to set es_snapshot.  Make sure we don't set
 	 * es_snapshot differently in the child.
 	 */
-	Assert(GetActiveSnapshot() == estate->es_snapshot);
+	Assert(((MVCCSnapshot) GetActiveSnapshot())->shared == ((MVCCSnapshot) estate->es_snapshot)->shared);
+	Assert(((MVCCSnapshot) GetActiveSnapshot())->curcid == ((MVCCSnapshot) estate->es_snapshot)->curcid);
 
 	/* Everyone's had a chance to ask for space, so now create the DSM. */
 	InitializeParallelDSM(pcxt);
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 7c15c634181..c5000b37b87 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -102,7 +102,7 @@ RelationGetPartitionDesc(Relation rel, bool omit_detached)
 		Assert(TransactionIdIsValid(rel->rd_partdesc_nodetached_xmin));
 		activesnap = GetActiveSnapshot();
 
-		if (!XidInMVCCSnapshot(rel->rd_partdesc_nodetached_xmin, &activesnap->mvcc))
+		if (!XidInMVCCSnapshot(rel->rd_partdesc_nodetached_xmin, activesnap->mvcc.shared))
 			return rel->rd_partdesc_nodetached;
 	}
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 50dca7cb758..3c94a62cdf6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -389,6 +389,12 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
  *
  * The snapshot will be usable directly in current transaction or exported
  * for loading in different transaction.
+ *
+ * XXX: The snapshot manager doesn't know anything about the returned
+ * snapshot.  It does not hold back MyProc->xmin, nor is it registered with
+ * any resource owner.  There's also no good way to free it, but leaking it is
+ * acceptable for the current usage where only one snapshot is built for the
+ * whole session.
  */
 MVCCSnapshot
 SnapBuildInitialSnapshot(SnapBuild *builder)
@@ -440,11 +446,14 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	MyProc->xmin = historicsnap->xmin;
 
 	/* allocate in transaction context */
-	mvccsnap = palloc(sizeof(MVCCSnapshotData) + sizeof(TransactionId) * GetMaxSnapshotXidCount());
+	mvccsnap = palloc(sizeof(MVCCSnapshotData));
+	mvccsnap->kind = SNAPSHOT_STATIC;
+	mvccsnap->shared = AllocMVCCSnapshotShared();
+	mvccsnap->shared->refcount = 1;
 	mvccsnap->snapshot_type = SNAPSHOT_MVCC;
-	mvccsnap->xmin = historicsnap->xmin;
-	mvccsnap->xmax = historicsnap->xmax;
-	mvccsnap->xip = (TransactionId *) ((char *) mvccsnap + sizeof(MVCCSnapshotData));
+	mvccsnap->shared->xmin = historicsnap->xmin;
+	mvccsnap->shared->xmax = historicsnap->xmax;
+	mvccsnap->shared->xip = (TransactionId *) ((char *) mvccsnap->shared + sizeof(MVCCSnapshotData));
 
 	/*
 	 * snapbuild.c builds transactions in an "inverted" manner, which means it
@@ -470,23 +479,20 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 						 errmsg("initial slot snapshot too large")));
 
-			mvccsnap->xip[newxcnt++] = xid;
+			mvccsnap->shared->xip[newxcnt++] = xid;
 		}
 
 		TransactionIdAdvance(xid);
 	}
-	mvccsnap->xcnt = newxcnt;
+	mvccsnap->shared->xcnt = newxcnt;
 
 	/* Initialize remaining MVCCSnapshot fields */
-	mvccsnap->subxip = NULL;
-	mvccsnap->subxcnt = 0;
-	mvccsnap->suboverflowed = false;
-	mvccsnap->takenDuringRecovery = false;
-	mvccsnap->copied = true;
+	mvccsnap->shared->subxip = NULL;
+	mvccsnap->shared->subxcnt = 0;
+	mvccsnap->shared->suboverflowed = false;
+	mvccsnap->shared->takenDuringRecovery = false;
+	mvccsnap->shared->snapXactCompletionCount = 0;
 	mvccsnap->curcid = FirstCommandId;
-	mvccsnap->active_count = 0;
-	mvccsnap->regd_count = 0;
-	mvccsnap->snapXactCompletionCount = 0;
 
 	pfree(historicsnap);
 
@@ -528,13 +534,13 @@ SnapBuildExportSnapshot(SnapBuild *builder)
 	 * now that we've built a plain snapshot, make it active and use the
 	 * normal mechanisms for exporting it
 	 */
-	snapname = ExportSnapshot(snap);
+	snapname = ExportSnapshot(snap->shared);
 
 	ereport(LOG,
 			(errmsg_plural("exported logical decoding snapshot: \"%s\" with %u transaction ID",
 						   "exported logical decoding snapshot: \"%s\" with %u transaction IDs",
-						   snap->xcnt,
-						   snapname, snap->xcnt)));
+						   snap->shared->xcnt,
+						   snapname, snap->shared->xcnt)));
 	return snapname;
 }
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1a7a35e25eb..513449ea9de 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2620,7 +2620,7 @@ ProcessStandbyHSFeedbackMessage(void)
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
-	 * the xmin will be taken into account by GetSnapshotData() /
+	 * the xmin will be taken into account by GetMVCCSnapshotData() /
 	 * ComputeXidHorizons().  This will hold back the removal of dead rows and
 	 * thereby prevent the generation of cleanup conflicts on the standby
 	 * server.
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ba5ed8960dd..819649741f6 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -62,6 +62,7 @@
 #include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
+#include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
@@ -105,7 +106,7 @@ typedef struct ProcArrayStruct
  * MVCC semantics: If the deleted row's xmax is not considered to be running
  * by anyone, the row can be removed.
  *
- * To avoid slowing down GetSnapshotData(), we don't calculate a precise
+ * To avoid slowing down GetMVCCSnapshotData(), we don't calculate a precise
  * cutoff XID while building a snapshot (looking at the frequently changing
  * xmins scales badly). Instead we compute two boundaries while building the
  * snapshot:
@@ -159,7 +160,7 @@ typedef struct ProcArrayStruct
  *
  * The boundaries are FullTransactionIds instead of TransactionIds to avoid
  * wraparound dangers. There e.g. would otherwise exist no procarray state to
- * prevent maybe_needed to become old enough after the GetSnapshotData()
+ * prevent maybe_needed to become old enough after the GetMVCCSnapshotData()
  * call.
  *
  * The typedef is in the header.
@@ -386,7 +387,7 @@ ProcArrayShmemSize(void)
 	/*
 	 * During Hot Standby processing we have a data structure called
 	 * KnownAssignedXids, created in shared memory. Local data structures are
-	 * also created in various backends during GetSnapshotData(),
+	 * also created in various backends during GetMVCCSnapshotData(),
 	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
 	 * main structures created in those functions must be identically sized,
 	 * since we may at times copy the whole of the data structures around. We
@@ -938,7 +939,7 @@ ProcArrayClearTransaction(PGPROC *proc)
 
 	/*
 	 * Need to increment completion count even though transaction hasn't
-	 * really committed yet. The reason for that is that GetSnapshotData()
+	 * really committed yet. The reason for that is that GetMVCCSnapshotData()
 	 * omits the xid of the current transaction, thus without the increment we
 	 * otherwise could end up reusing the snapshot later. Which would be bad,
 	 * because it might not count the prepared transaction as running.
@@ -2083,7 +2084,7 @@ GetMaxSnapshotSubxidCount(void)
 }
 
 /*
- * Helper function for GetSnapshotData() that checks if the bulk of the
+ * Helper function for GetMVCCSnapshotData() that checks if the bulk of the
  * visibility information in the snapshot is still valid. If so, it updates
  * the fields that need to change and returns true. Otherwise it returns
  * false.
@@ -2092,7 +2093,7 @@ GetMaxSnapshotSubxidCount(void)
  * least in the case we already hold a snapshot), but that's for another day.
  */
 static bool
-GetSnapshotDataReuse(MVCCSnapshot snapshot)
+GetMVCCSnapshotDataReuse(MVCCSnapshotShared snapshot)
 {
 	uint64		curXactCompletionCount;
 
@@ -2112,17 +2113,18 @@ GetSnapshotDataReuse(MVCCSnapshot snapshot)
 	 * contents:
 	 *
 	 * As explained in transam/README, the set of xids considered running by
-	 * GetSnapshotData() cannot change while ProcArrayLock is held. Snapshot
-	 * contents only depend on transactions with xids and xactCompletionCount
-	 * is incremented whenever a transaction with an xid finishes (while
-	 * holding ProcArrayLock exclusively). Thus the xactCompletionCount check
-	 * ensures we would detect if the snapshot would have changed.
+	 * GetMVCCSnapshotData() cannot change while ProcArrayLock is held.
+	 * Snapshot contents only depend on transactions with xids and
+	 * xactCompletionCount is incremented whenever a transaction with an xid
+	 * finishes (while holding ProcArrayLock exclusively). Thus the
+	 * xactCompletionCount check ensures we would detect if the snapshot would
+	 * have changed.
 	 *
 	 * As the snapshot contents are the same as it was before, it is safe to
 	 * re-enter the snapshot's xmin into the PGPROC array. None of the rows
 	 * visible under the snapshot could already have been removed (that'd
 	 * require the set of running transactions to change) and it fulfills the
-	 * requirement that concurrent GetSnapshotData() calls yield the same
+	 * requirement that concurrent GetMVCCSnapshotData() calls yield the same
 	 * xmin.
 	 */
 	if (!TransactionIdIsValid(MyProc->xmin))
@@ -2131,17 +2133,11 @@ GetSnapshotDataReuse(MVCCSnapshot snapshot)
 	RecentXmin = snapshot->xmin;
 	Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
 
-	snapshot->curcid = GetCurrentCommandId(false);
-	snapshot->active_count = 0;
-	snapshot->regd_count = 0;
-	snapshot->copied = false;
-	snapshot->valid = true;
-
 	return true;
 }
 
 /*
- * GetSnapshotData -- returns information about running transactions.
+ * GetMVCCSnapshotData -- returns information about running transactions.
  *
  * The returned snapshot includes xmin (lowest still-running xact ID),
  * xmax (highest completed xact ID + 1), and a list of running xact IDs
@@ -2168,12 +2164,9 @@ GetSnapshotDataReuse(MVCCSnapshot snapshot)
  *
  * And try to advance the bounds of GlobalVis{Shared,Catalog,Data,Temp}Rels
  * for the benefit of the GlobalVisTest* family of functions.
- *
- * Note: this function should probably not be called with an argument that's
- * not statically allocated (see xip allocation below).
  */
-MVCCSnapshot
-GetSnapshotData(MVCCSnapshot snapshot)
+MVCCSnapshotShared
+GetMVCCSnapshotData(void)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId *other_xids = ProcGlobal->xids;
@@ -2187,43 +2180,34 @@ GetSnapshotData(MVCCSnapshot snapshot)
 	int			mypgxactoff;
 	TransactionId myxid;
 	uint64		curXactCompletionCount;
+	MVCCSnapshotShared snapshot;
 
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
-	Assert(snapshot != NULL);
-
-	/*
-	 * Allocating space for maxProcs xids is usually overkill; numProcs would
-	 * be sufficient.  But it seems better to do the malloc while not holding
-	 * the lock, so we can't look at numProcs.  Likewise, we allocate much
-	 * more subxip storage than is probably needed.
+	/*---
+	 * Allocate an MVCCSnapshotShared struct.  There are three cases:
+	 *
+	 * 1. No transactions have completed since the last call: we can reuse the
+	 *    latest snapshot information.  See GetMVCCSnapshotDataReuse().
+	 *
+	 * 2. Need to recalculate the snapshot, and 'latestSnapshotShared' is not
+	 *    currently in use by any snapshot.  We can overwrite its contents.
+	 *
+	 * 3. Need to recalculate the snapshot, and 'latestSnapshotShared' is
+	 *    still in use.  We need to allocate a new MVCCSnapshotShared struct.
 	 *
-	 * This does open a possibility for avoiding repeated malloc/free: since
-	 * maxProcs does not change at runtime, we can simply reuse the previous
-	 * xip arrays if any.  (This relies on the fact that all callers pass
-	 * static SnapshotData structs.)
+	 * We don't know if 'latestSnapshotShared' can be reused before we acquire
+	 * the lock, but if we do need to allocate, we want to do it before
+	 * acquiring the lock.  Therefore, we always make the allocation if we
+	 * might need it, and if it turns out to have been unnecessary, we stash
+	 * away the allocated struct in 'spareSnapshotShared' to be reused on the
+	 * next call.  This way, the unnecessary allocation is very cheap.
 	 */
-	if (snapshot->xip == NULL)
-	{
-		/*
-		 * First call for this snapshot. Snapshot is same size whether or not
-		 * we are in recovery, see later comments.
-		 */
-		snapshot->xip = (TransactionId *)
-			malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId));
-		if (snapshot->xip == NULL)
-			ereport(ERROR,
-					(errcode(ERRCODE_OUT_OF_MEMORY),
-					 errmsg("out of memory")));
-		Assert(snapshot->subxip == NULL);
-		snapshot->subxip = (TransactionId *)
-			malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId));
-		if (snapshot->subxip == NULL)
-			ereport(ERROR,
-					(errcode(ERRCODE_OUT_OF_MEMORY),
-					 errmsg("out of memory")));
-	}
+	if (latestSnapshotShared && latestSnapshotShared->refcount == 0)
+		snapshot = latestSnapshotShared;	/* case 1 or 2 */
+	else
+		snapshot = AllocMVCCSnapshotShared();	/* case 1 or 3 */
 
 	/*
 	 * It is sufficient to get shared lock on ProcArrayLock, even if we are
@@ -2231,10 +2215,14 @@ GetSnapshotData(MVCCSnapshot snapshot)
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
-	if (GetSnapshotDataReuse(snapshot))
+	if (latestSnapshotShared && GetMVCCSnapshotDataReuse(latestSnapshotShared))
 	{
 		LWLockRelease(ProcArrayLock);
-		return snapshot;
+
+		/* if we made an allocation, stash it away for next call */
+		if (snapshot != latestSnapshotShared)
+			spareSnapshotShared = snapshot;
+		return latestSnapshotShared;
 	}
 
 	latest_completed = TransamVariables->latestCompletedXid;
@@ -2506,16 +2494,18 @@ GetSnapshotData(MVCCSnapshot snapshot)
 	snapshot->suboverflowed = suboverflowed;
 	snapshot->snapXactCompletionCount = curXactCompletionCount;
 
-	snapshot->curcid = GetCurrentCommandId(false);
-
 	/*
-	 * This is a new snapshot, so set both refcounts are zero, and mark it as
-	 * not copied in persistent memory.
+	 * If we allocated a new struct for this, remember that it is the latest
+	 * now and adjust the refcounts accordingly.
 	 */
-	snapshot->active_count = 0;
-	snapshot->regd_count = 0;
-	snapshot->copied = false;
-	snapshot->valid = true;
+	if (snapshot != latestSnapshotShared)
+	{
+		Assert(snapshot->refcount == 0);
+
+		if (latestSnapshotShared && latestSnapshotShared->refcount == 0)
+			FreeMVCCSnapshotShared(latestSnapshotShared);
+		latestSnapshotShared = snapshot;
+	}
 
 	return snapshot;
 }
@@ -2585,10 +2575,10 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 			continue;
 
 		/*
-		 * We're good.  Install the new xmin.  As in GetSnapshotData, set
+		 * We're good.  Install the new xmin.  As in GetMVCCSnapshotData, set
 		 * TransactionXmin too.  (Note that because snapmgr.c called
-		 * GetSnapshotData first, we'll be overwriting a valid xmin here, so
-		 * we don't check that.)
+		 * GetMVCCSnapshotData first, we'll be overwriting a valid xmin here,
+		 * so we don't check that.)
 		 */
 		MyProc->xmin = TransactionXmin = xmin;
 
@@ -2659,7 +2649,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 /*
  * GetRunningTransactionData -- returns information about running transactions.
  *
- * Similar to GetSnapshotData but returns more information. We include
+ * Similar to GetMVCCSnapshotData but returns more information. We include
  * all PGPROCs with an assigned TransactionId, even VACUUM processes and
  * prepared transactions.
  *
@@ -2681,7 +2671,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * entries here to not hold on ProcArrayLock more than necessary.
  *
  * We don't worry about updating other counters, we want to keep this as
- * simple as possible and leave GetSnapshotData() as the primary code for
+ * simple as possible and leave GetMVCCSnapshotData() as the primary code for
  * that bookkeeping.
  *
  * Note that if any transaction has overflowed its cached subtransactions
@@ -2866,8 +2856,8 @@ GetRunningTransactionData(void)
 /*
  * GetOldestActiveTransactionId()
  *
- * Similar to GetSnapshotData but returns just oldestActiveXid. We include
- * all PGPROCs with an assigned TransactionId, even VACUUM processes.
+ * Similar to GetMVCCSnapshotData but returns just oldestActiveXid. We
+ * include all PGPROCs with an assigned TransactionId, even VACUUM processes.
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
@@ -2875,7 +2865,7 @@ GetRunningTransactionData(void)
  * KnownAssignedXids.
  *
  * We don't worry about updating other counters, we want to keep this as
- * simple as possible and leave GetSnapshotData() as the primary code for
+ * simple as possible and leave GetMVCCSnapshotData() as the primary code for
  * that bookkeeping.
  */
 TransactionId
@@ -4356,7 +4346,7 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
  * During hot standby we do not fret too much about the distinction between
  * top-level XIDs and subtransaction XIDs. We store both together in the
  * KnownAssignedXids list.  In backends, this is copied into snapshots in
- * GetSnapshotData(), taking advantage of the fact that XidInMVCCSnapshot()
+ * GetMVCCSnapshotData(), taking advantage of the fact that XidInMVCCSnapshot()
  * doesn't care about the distinction either.  Subtransaction XIDs are
  * effectively treated as top-level XIDs and in the typical case pg_subtrans
  * links are *not* maintained (which does not affect visibility).
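
The three allocation cases described in the new GetMVCCSnapshotData() comment can be modelled in isolation. What follows is a minimal standalone sketch, not code from the patch: SharedSnap, alloc_shared() and get_snapshot_data() are hypothetical names, the lock is only marked by a comment, and the assumption that the allocator consumes the spare struct is inferred from the comment rather than shown in this part of the diff.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for MVCCSnapshotShared; only the fields needed here. */
typedef struct SharedSnap
{
	int			refcount;			/* snapshots currently pointing at this */
	uint64_t	completion_count;	/* global counter value when computed */
} SharedSnap;

static SharedSnap *latest = NULL;	/* models latestSnapshotShared */
static SharedSnap *spare = NULL;	/* models spareSnapshotShared */
static uint64_t xact_completion_count = 0;	/* models xactCompletionCount */

/* Assumed behaviour: hand out the stashed spare if there is one. */
static SharedSnap *
alloc_shared(void)
{
	SharedSnap *s = spare;

	if (s != NULL)
		spare = NULL;
	else
	{
		s = calloc(1, sizeof(SharedSnap));
		if (s == NULL)
			abort();
	}
	s->refcount = 0;
	return s;
}

static SharedSnap *
get_snapshot_data(void)
{
	SharedSnap *snap;

	/* Pick the target struct before the lock would be taken. */
	if (latest != NULL && latest->refcount == 0)
		snap = latest;			/* case 1 or 2 */
	else
		snap = alloc_shared();	/* case 1 or 3 */

	/* ... the shared lock would be acquired here ... */

	/* Case 1: nothing completed since 'latest' was computed, so reuse it. */
	if (latest != NULL && latest->completion_count == xact_completion_count)
	{
		if (snap != latest)
			spare = snap;		/* allocation was unnecessary; keep it */
		return latest;
	}

	/* Cases 2 and 3: recompute into 'snap' (only the counter in this model). */
	snap->completion_count = xact_completion_count;

	/*
	 * Case 3: the old 'latest' is still referenced; its holders are assumed
	 * to free it when they release it.
	 */
	if (snap != latest)
		latest = snap;
	return snap;
}
```

The point of the spare is that the "allocate before taking the lock" rule never costs more than one idle struct: an allocation that turns out to be unneeded is handed back on the next call that needs one.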
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index dd52782ff22..edc6b9de7ca 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -449,10 +449,10 @@ static void SerialSetActiveSerXmin(TransactionId xid);
 
 static uint32 predicatelock_hash(const void *key, Size keysize);
 static void SummarizeOldestCommittedSxact(void);
-static MVCCSnapshot GetSafeSnapshot(MVCCSnapshot origSnapshot);
-static MVCCSnapshot GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
-														  VirtualTransactionId *sourcevxid,
-														  int sourcepid);
+static MVCCSnapshotShared GetSafeSnapshot(void);
+static MVCCSnapshotShared GetSerializableTransactionSnapshotInt(VirtualTransactionId *sourcevxid,
+																TransactionId sourcexmin,
+																int sourcepid);
 static bool PredicateLockExists(const PREDICATELOCKTARGETTAG *targettag);
 static bool GetParentPredicateLockTag(const PREDICATELOCKTARGETTAG *tag,
 									  PREDICATELOCKTARGETTAG *parent);
@@ -1542,25 +1542,20 @@ SummarizeOldestCommittedSxact(void)
  *
  *		As with GetSerializableTransactionSnapshot (which this is a subroutine
  *		for), the passed-in Snapshot pointer should reference a static data
- *		area that can safely be passed to GetSnapshotData.
+ *		area that can safely be passed to GetMVCCSnapshotData.
  */
-static MVCCSnapshot
-GetSafeSnapshot(MVCCSnapshot origSnapshot)
+static MVCCSnapshotShared
+GetSafeSnapshot(void)
 {
-	MVCCSnapshot snapshot;
+	MVCCSnapshotShared snapshot;
 
 	Assert(XactReadOnly && XactDeferrable);
 
 	while (true)
 	{
-		/*
-		 * GetSerializableTransactionSnapshotInt is going to call
-		 * GetSnapshotData, so we need to provide it the static snapshot area
-		 * our caller passed to us.  The pointer returned is actually the same
-		 * one passed to it, but we avoid assuming that here.
-		 */
-		snapshot = GetSerializableTransactionSnapshotInt(origSnapshot,
-														 NULL, InvalidPid);
+		snapshot = GetSerializableTransactionSnapshotInt(NULL,
+														 InvalidTransactionId,
+														 InvalidPid);
 
 		if (MySerializableXact == InvalidSerializableXact)
 			return snapshot;	/* no concurrent r/w xacts; it's safe */
@@ -1663,13 +1658,11 @@ GetSafeSnapshotBlockingPids(int blocked_pid, int *output, int output_size)
  * Make sure we have a SERIALIZABLEXACT reference in MySerializableXact.
  * It should be current for this process and be contained in PredXact.
  *
- * The passed-in Snapshot pointer should reference a static data area that
- * can safely be passed to GetSnapshotData.  The return value is actually
- * always this same pointer; no new snapshot data structure is allocated
- * within this function.
+ * This calls GetMVCCSnapshotData to do the heavy lifting, but also sets up
+ * shared memory data structures specific to serializable transactions.
  */
-MVCCSnapshot
-GetSerializableTransactionSnapshot(MVCCSnapshot snapshot)
+MVCCSnapshotShared
+GetSerializableTransactionSnapshotData(void)
 {
 	Assert(IsolationIsSerializable());
 
@@ -1692,26 +1685,25 @@ GetSerializableTransactionSnapshot(MVCCSnapshot snapshot)
 	 * thereby avoid all SSI overhead once it's running.
 	 */
 	if (XactReadOnly && XactDeferrable)
-		return GetSafeSnapshot(snapshot);
+		return GetSafeSnapshot();
 
-	return GetSerializableTransactionSnapshotInt(snapshot,
-												 NULL, InvalidPid);
+	return GetSerializableTransactionSnapshotInt(NULL, InvalidTransactionId, InvalidPid);
 }
 
 /*
  * Import a snapshot to be used for the current transaction.
  *
- * This is nearly the same as GetSerializableTransactionSnapshot, except that
- * we don't take a new snapshot, but rather use the data we're handed.
+ * This is nearly the same as GetSerializableTransactionSnapshotData, except
+ * that we don't take a new snapshot, but rather use the data we're handed.
  *
  * The caller must have verified that the snapshot came from a serializable
  * transaction; and if we're read-write, the source transaction must not be
  * read-only.
  */
 void
-SetSerializableTransactionSnapshot(MVCCSnapshot snapshot,
-								   VirtualTransactionId *sourcevxid,
-								   int sourcepid)
+SetSerializableTransactionSnapshotData(MVCCSnapshotShared snapshot,
+									   VirtualTransactionId *sourcevxid,
+									   int sourcepid)
 {
 	Assert(IsolationIsSerializable());
 
@@ -1737,28 +1729,29 @@ SetSerializableTransactionSnapshot(MVCCSnapshot snapshot,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("a snapshot-importing transaction must not be READ ONLY DEFERRABLE")));
 
-	(void) GetSerializableTransactionSnapshotInt(snapshot, sourcevxid,
-												 sourcepid);
+	(void) GetSerializableTransactionSnapshotInt(sourcevxid, snapshot->xmin, sourcepid);
 }
 
 /*
  * Guts of GetSerializableTransactionSnapshot
  *
  * If sourcevxid is valid, this is actually an import operation and we should
- * skip calling GetSnapshotData, because the snapshot contents are already
+ * skip calling GetMVCCSnapshotData, because the snapshot contents are already
  * loaded up.  HOWEVER: to avoid race conditions, we must check that the
  * source xact is still running after we acquire SerializableXactHashLock.
  * We do that by calling ProcArrayInstallImportedXmin.
  */
-static MVCCSnapshot
-GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
-									  VirtualTransactionId *sourcevxid,
+static MVCCSnapshotShared
+GetSerializableTransactionSnapshotInt(VirtualTransactionId *sourcevxid,
+									  TransactionId sourcexmin,
 									  int sourcepid)
 {
 	PGPROC	   *proc;
 	VirtualTransactionId vxid;
 	SERIALIZABLEXACT *sxact,
 			   *othersxact;
+	MVCCSnapshotShared snapshot;
+	TransactionId xmin;
 
 	/* We only do this for serializable transactions.  Once. */
 	Assert(MySerializableXact == InvalidSerializableXact);
@@ -1783,7 +1776,7 @@ GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
 	 *
 	 * We must hold SerializableXactHashLock when taking/checking the snapshot
 	 * to avoid race conditions, for much the same reasons that
-	 * GetSnapshotData takes the ProcArrayLock.  Since we might have to
+	 * GetMVCCSnapshotData takes the ProcArrayLock.  Since we might have to
 	 * release SerializableXactHashLock to call SummarizeOldestCommittedSxact,
 	 * this means we have to create the sxact first, which is a bit annoying
 	 * (in particular, an elog(ERROR) in procarray.c would cause us to leak
@@ -1807,16 +1800,24 @@ GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
 
 	/* Get the snapshot, or check that it's safe to use */
 	if (!sourcevxid)
-		snapshot = GetSnapshotData(snapshot);
-	else if (!ProcArrayInstallImportedXmin(snapshot->xmin, sourcevxid))
 	{
-		ReleasePredXact(sxact);
-		LWLockRelease(SerializableXactHashLock);
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("could not import the requested snapshot"),
-				 errdetail("The source process with PID %d is not running anymore.",
-						   sourcepid)));
+		snapshot = GetMVCCSnapshotData();
+		xmin = snapshot->xmin;
+	}
+	else
+	{
+		if (!ProcArrayInstallImportedXmin(sourcexmin, sourcevxid))
+		{
+			ReleasePredXact(sxact);
+			LWLockRelease(SerializableXactHashLock);
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("could not import the requested snapshot"),
+					 errdetail("The source process with PID %d is not running anymore.",
+							   sourcepid)));
+		}
+		snapshot = NULL;
+		xmin = sourcexmin;
 	}
 
 	/*
@@ -1848,7 +1849,7 @@ GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
 	dlist_init(&(sxact->possibleUnsafeConflicts));
 	sxact->topXid = GetTopTransactionIdIfAny();
 	sxact->finishedBefore = InvalidTransactionId;
-	sxact->xmin = snapshot->xmin;
+	sxact->xmin = xmin;
 	sxact->pid = MyProcPid;
 	sxact->pgprocno = MyProcNumber;
 	dlist_init(&sxact->predicateLocks);
@@ -1902,18 +1903,18 @@ GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
 	if (!TransactionIdIsValid(PredXact->SxactGlobalXmin))
 	{
 		Assert(PredXact->SxactGlobalXminCount == 0);
-		PredXact->SxactGlobalXmin = snapshot->xmin;
+		PredXact->SxactGlobalXmin = xmin;
 		PredXact->SxactGlobalXminCount = 1;
-		SerialSetActiveSerXmin(snapshot->xmin);
+		SerialSetActiveSerXmin(xmin);
 	}
-	else if (TransactionIdEquals(snapshot->xmin, PredXact->SxactGlobalXmin))
+	else if (TransactionIdEquals(xmin, PredXact->SxactGlobalXmin))
 	{
 		Assert(PredXact->SxactGlobalXminCount > 0);
 		PredXact->SxactGlobalXminCount++;
 	}
 	else
 	{
-		Assert(TransactionIdFollows(snapshot->xmin, PredXact->SxactGlobalXmin));
+		Assert(TransactionIdFollows(xmin, PredXact->SxactGlobalXmin));
 	}
 
 	MySerializableXact = sxact;
@@ -3968,13 +3969,13 @@ XidIsConcurrent(TransactionId xid)
 
 	snap = (MVCCSnapshot) GetTransactionSnapshot();
 
-	if (TransactionIdPrecedes(xid, snap->xmin))
+	if (TransactionIdPrecedes(xid, snap->shared->xmin))
 		return false;
 
-	if (TransactionIdFollowsOrEquals(xid, snap->xmax))
+	if (TransactionIdFollowsOrEquals(xid, snap->shared->xmax))
 		return true;
 
-	return pg_lfind32(xid, snap->xip, snap->xcnt);
+	return pg_lfind32(xid, snap->shared->xip, snap->shared->xcnt);
 }
 
 bool
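
The XidIsConcurrent() hunk above shows the shape of the new layout: the per-use snapshot keeps local state such as curcid, while xmin, xmax and the xip array live in a shared struct reached through snap->shared. As a rough illustration only, here is a standalone sketch of that split and of the three-step check. The type and function names are hypothetical, and xid_precedes() is a simplified modulo-2^32 comparison that ignores the special XIDs the real TransactionIdPrecedes() handles.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;

/* Hypothetical model of the shared part: the set of running transactions. */
typedef struct SharedSnap
{
	Xid			xmin;		/* all XIDs before this are finished */
	Xid			xmax;		/* all XIDs at or after this are unfinished */
	const Xid  *xip;		/* in-progress XIDs in [xmin, xmax) */
	uint32_t	xcnt;
} SharedSnap;

/* Hypothetical model of the per-use part: local fields plus a pointer. */
typedef struct Snap
{
	const SharedSnap *shared;
	uint32_t	curcid;		/* local to each snapshot, never shared */
} Snap;

/* Simplified wraparound-aware "a is older than b" for normal XIDs. */
static bool
xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/* Same three steps as the hunk: below xmin, at or above xmax, else search. */
static bool
xid_is_concurrent(const Snap *snap, Xid xid)
{
	const SharedSnap *s = snap->shared;

	if (xid_precedes(xid, s->xmin))
		return false;
	if (!xid_precedes(xid, s->xmax))
		return true;
	for (uint32_t i = 0; i < s->xcnt; i++)
	{
		if (s->xip[i] == xid)
			return true;
	}
	return false;
}
```

Two Snap values with different curcid can point at the same SharedSnap, which is what lets the patch replace array copies with a refcount bump.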
diff --git a/src/backend/utils/adt/xid8funcs.c b/src/backend/utils/adt/xid8funcs.c
index d4aa8ef9e4e..eef632390cb 100644
--- a/src/backend/utils/adt/xid8funcs.c
+++ b/src/backend/utils/adt/xid8funcs.c
@@ -380,7 +380,7 @@ pg_current_snapshot(PG_FUNCTION_ARGS)
 		elog(ERROR, "no active snapshot set");
 
 	/* allocate */
-	nxip = cur->xcnt;
+	nxip = cur->shared->xcnt;
 	snap = palloc(PG_SNAPSHOT_SIZE(nxip));
 
 	/*
@@ -389,12 +389,12 @@ pg_current_snapshot(PG_FUNCTION_ARGS)
 	 * advance past any of these XIDs.  Hence, these XIDs remain allowable
 	 * relative to next_fxid.
 	 */
-	snap->xmin = FullTransactionIdFromAllowableAt(next_fxid, cur->xmin);
-	snap->xmax = FullTransactionIdFromAllowableAt(next_fxid, cur->xmax);
+	snap->xmin = FullTransactionIdFromAllowableAt(next_fxid, cur->shared->xmin);
+	snap->xmax = FullTransactionIdFromAllowableAt(next_fxid, cur->shared->xmax);
 	snap->nxip = nxip;
 	for (i = 0; i < nxip; i++)
 		snap->xip[i] =
-			FullTransactionIdFromAllowableAt(next_fxid, cur->xip[i]);
+			FullTransactionIdFromAllowableAt(next_fxid, cur->shared->xip[i]);
 
 	/*
 	 * We want them guaranteed to be in ascending order.  This also removes
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c39cc11609..5f9f2b9d8b2 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -122,9 +122,6 @@
  * special-purpose code (say, RI checking.)  CatalogSnapshot points to an
  * MVCC snapshot intended to be used for catalog scans; we must invalidate it
  * whenever a system catalog change occurs.
- *
- * These SnapshotData structs are static to simplify memory allocation
- * (see the hack in GetSnapshotData to avoid repeated malloc/free).
  */
 static MVCCSnapshotData CurrentSnapshotData = {SNAPSHOT_MVCC};
 static MVCCSnapshotData SecondarySnapshotData = {SNAPSHOT_MVCC};
@@ -137,7 +134,7 @@ SnapshotData SnapshotToastData = {SNAPSHOT_TOAST};
 static HistoricMVCCSnapshot HistoricSnapshot = NULL;
 
 /*
- * These are updated by GetSnapshotData.  We initialize them this way
+ * These are updated by GetMVCCSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
  */
@@ -150,14 +147,12 @@ static HTAB *tuplecid_data = NULL;
 /*
  * Elements of the active snapshot stack.
  *
- * Each element here accounts for exactly one active_count on SnapshotData.
- *
  * NB: the code assumes that elements in this list are in non-increasing
  * order of as_level; also, the list must be NULL-terminated.
  */
 typedef struct ActiveSnapshotElt
 {
-	MVCCSnapshot as_snap;
+	MVCCSnapshotData as_snap;
 	int			as_level;
 	struct ActiveSnapshotElt *as_next;
 } ActiveSnapshotElt;
@@ -188,19 +183,23 @@ static bool FirstXactSnapshotRegistered = false;
 typedef struct ExportedSnapshot
 {
 	char	   *snapfile;
-	MVCCSnapshot snapshot;
+	MVCCSnapshotShared snapshot;
 } ExportedSnapshot;
 
 /* Current xact's exported snapshots (a list of ExportedSnapshot structs) */
 static List *exportedSnapshots = NIL;
 
+MVCCSnapshotShared latestSnapshotShared = NULL;
+MVCCSnapshotShared spareSnapshotShared = NULL;
+
 /* Prototypes for local functions */
-static MVCCSnapshot CopyMVCCSnapshot(MVCCSnapshot snapshot);
+static void UpdateStaticMVCCSnapshot(MVCCSnapshot snapshot, MVCCSnapshotShared shared);
 static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeMVCCSnapshot(MVCCSnapshot snapshot);
 static void SnapshotResetXmin(void);
-static void valid_snapshots_push_tail(MVCCSnapshot snapshot);
-static void valid_snapshots_push_out_of_order(MVCCSnapshot snapshot);
+static void ReleaseMVCCSnapshotShared(MVCCSnapshotShared shared);
+static void valid_snapshots_push_tail(MVCCSnapshotShared snapshot);
+static void valid_snapshots_push_out_of_order(MVCCSnapshotShared snapshot);
+
 
 /* ResourceOwner callbacks to track snapshot references */
 static void ResOwnerReleaseSnapshot(Datum res);
@@ -266,6 +265,8 @@ GetTransactionSnapshot(void)
 	/* First call in transaction? */
 	if (!FirstSnapshotSet)
 	{
+		MVCCSnapshotShared shared;
+
 		/*
 		 * Don't allow catalog snapshot to be older than xact snapshot.  Must
 		 * do this first to allow the empty-heap Assert to succeed.
@@ -287,23 +288,18 @@ GetTransactionSnapshot(void)
 		 * mode, predicate.c needs to wrap the snapshot fetch in its own
 		 * processing.
 		 */
+		if (IsolationIsSerializable())
+			shared = GetSerializableTransactionSnapshotData();
+		else
+			shared = GetMVCCSnapshotData();
+
+		UpdateStaticMVCCSnapshot(&CurrentSnapshotData, shared);
+
 		if (IsolationUsesXactSnapshot())
 		{
-			/* First, create the snapshot in CurrentSnapshotData */
-			if (IsolationIsSerializable())
-				GetSerializableTransactionSnapshot(&CurrentSnapshotData);
-			else
-				GetSnapshotData(&CurrentSnapshotData);
-
-			/* Mark it as "registered" */
+			/* keep it */
 			FirstXactSnapshotRegistered = true;
 		}
-		else
-		{
-			GetSnapshotData(&CurrentSnapshotData);
-		}
-		valid_snapshots_push_tail(&CurrentSnapshotData);
-
 		FirstSnapshotSet = true;
 		return (Snapshot) &CurrentSnapshotData;
 	}
@@ -318,14 +314,31 @@ GetTransactionSnapshot(void)
 	/* Don't allow catalog snapshot to be older than xact snapshot. */
 	InvalidateCatalogSnapshot();
 
-	if (CurrentSnapshotData.valid)
-		dlist_delete(&CurrentSnapshotData.node);
-	GetSnapshotData(&CurrentSnapshotData);
-	valid_snapshots_push_tail(&CurrentSnapshotData);
-
+	UpdateStaticMVCCSnapshot(&CurrentSnapshotData, GetMVCCSnapshotData());
 	return (Snapshot) &CurrentSnapshotData;
 }
 
+/*
+ * Update a static snapshot with the given shared struct.
+ *
+ * If the static snapshot was previously valid, release its old 'shared'
+ * struct first.
+ */
+static void
+UpdateStaticMVCCSnapshot(MVCCSnapshot snapshot, MVCCSnapshotShared shared)
+{
+	/* Replace the 'shared' struct */
+	if (snapshot->shared)
+		ReleaseMVCCSnapshotShared(snapshot->shared);
+	snapshot->shared = shared;
+	snapshot->shared->refcount++;
+	if (snapshot->shared->refcount == 1)
+		valid_snapshots_push_tail(shared);
+
+	snapshot->curcid = GetCurrentCommandId(false);
+	snapshot->valid = true;
+}
+
 /*
  * GetLatestSnapshot
  *		Get a snapshot that is up-to-date as of the current instant,
@@ -352,10 +365,7 @@ GetLatestSnapshot(void)
 	if (!FirstSnapshotSet)
 		return GetTransactionSnapshot();
 
-	if (SecondarySnapshotData.valid)
-		dlist_delete(&SecondarySnapshotData.node);
-	GetSnapshotData(&SecondarySnapshotData);
-	valid_snapshots_push_tail(&SecondarySnapshotData);
+	UpdateStaticMVCCSnapshot(&SecondarySnapshotData, GetMVCCSnapshotData());
 
 	return (Snapshot) &SecondarySnapshotData;
 }
@@ -405,7 +415,7 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 	if (!CatalogSnapshotData.valid)
 	{
 		/* Get new snapshot. */
-		GetSnapshotData(&CatalogSnapshotData);
+		UpdateStaticMVCCSnapshot(&CatalogSnapshotData, GetMVCCSnapshotData());
 
 		/*
 		 * Make sure the catalog snapshot will be accounted for in decisions
@@ -419,7 +429,6 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 		 * NB: it had better be impossible for this to throw error, since the
 		 * CatalogSnapshot pointer is already valid.
 		 */
-		valid_snapshots_push_tail(&CatalogSnapshotData);
 	}
 
 	return (Snapshot) &CatalogSnapshotData;
@@ -440,17 +449,20 @@ InvalidateCatalogSnapshot(void)
 {
 	if (CatalogSnapshotData.valid)
 	{
-		dlist_delete(&CatalogSnapshotData.node);
+		ReleaseMVCCSnapshotShared(CatalogSnapshotData.shared);
+		CatalogSnapshotData.shared = NULL;
 		CatalogSnapshotData.valid = false;
 	}
 	if (!FirstXactSnapshotRegistered && CurrentSnapshotData.valid)
 	{
-		dlist_delete(&CurrentSnapshotData.node);
+		ReleaseMVCCSnapshotShared(CurrentSnapshotData.shared);
+		CurrentSnapshotData.shared = NULL;
 		CurrentSnapshotData.valid = false;
 	}
 	if (SecondarySnapshotData.valid)
 	{
-		dlist_delete(&SecondarySnapshotData.node);
+		ReleaseMVCCSnapshotShared(SecondarySnapshotData.shared);
+		SecondarySnapshotData.shared = NULL;
 		SecondarySnapshotData.valid = false;
 	}
 
@@ -465,13 +477,14 @@ InvalidateCatalogSnapshot(void)
  * want to continue holding the catalog snapshot if it might mean that the
  * global xmin horizon can't advance.  However, if there are other snapshots
  * still active or registered, the catalog snapshot isn't likely to be the
- * oldest one, so we might as well keep it.
+ * oldest one, so we might as well keep it. XXX
  */
 void
 InvalidateCatalogSnapshotConditionally(void)
 {
 	if (CatalogSnapshotData.valid &&
-		dlist_head_node(&ValidSnapshots) == &CatalogSnapshotData.node)
+		dlist_tail_node(&ValidSnapshots) == &CatalogSnapshotData.shared->node &&
+		CatalogSnapshotData.shared->refcount == 1)
 		InvalidateCatalogSnapshot();
 }
 
@@ -501,7 +514,7 @@ SnapshotSetCommandId(CommandId curcid)
  * in GetTransactionSnapshot.
  */
 static void
-SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid,
+SetTransactionSnapshot(MVCCSnapshotShared sourcesnap, VirtualTransactionId *sourcevxid,
 					   int sourcepid, PGPROC *sourceproc)
 {
 	/* Caller should have checked this already */
@@ -512,38 +525,25 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 
 	Assert(!FirstXactSnapshotRegistered);
 	Assert(!HistoricSnapshotActive());
+	Assert(sourcesnap->refcount > 0);
 
 	/*
 	 * Even though we are not going to use the snapshot it computes, we must
-	 * call GetSnapshotData, for two reasons: (1) to be sure that
-	 * CurrentSnapshotData's XID arrays have been allocated, and (2) to update
-	 * the state for GlobalVis*.
+	 * call GetMVCCSnapshotData to update the state for GlobalVis*.
 	 */
-	GetSnapshotData(&CurrentSnapshotData);
+	UpdateStaticMVCCSnapshot(&CurrentSnapshotData, GetMVCCSnapshotData());
 
 	/*
 	 * Now copy appropriate fields from the source snapshot.
 	 */
-	CurrentSnapshotData.xmin = sourcesnap->xmin;
-	CurrentSnapshotData.xmax = sourcesnap->xmax;
-	CurrentSnapshotData.xcnt = sourcesnap->xcnt;
-	Assert(sourcesnap->xcnt <= GetMaxSnapshotXidCount());
-	if (sourcesnap->xcnt > 0)
-		memcpy(CurrentSnapshotData.xip, sourcesnap->xip,
-			   sourcesnap->xcnt * sizeof(TransactionId));
-	CurrentSnapshotData.subxcnt = sourcesnap->subxcnt;
-	Assert(sourcesnap->subxcnt <= GetMaxSnapshotSubxidCount());
-	if (sourcesnap->subxcnt > 0)
-		memcpy(CurrentSnapshotData.subxip, sourcesnap->subxip,
-			   sourcesnap->subxcnt * sizeof(TransactionId));
-	CurrentSnapshotData.suboverflowed = sourcesnap->suboverflowed;
-	CurrentSnapshotData.takenDuringRecovery = sourcesnap->takenDuringRecovery;
-	/* NB: curcid should NOT be copied, it's a local matter */
+	ReleaseMVCCSnapshotShared(CurrentSnapshotData.shared);
+	CurrentSnapshotData.shared = sourcesnap;
+	CurrentSnapshotData.shared->refcount++;
 
-	CurrentSnapshotData.snapXactCompletionCount = 0;
+	/* NB: curcid should NOT be copied, it's a local matter */
 
 	/*
-	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
+	 * Now we have to fix what GetMVCCSnapshotData did with MyProc->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
 	 * causing the global xmin to go backwards, we have to test that the
 	 * source transaction is still running, and that has to be done
@@ -555,13 +555,13 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 	 */
 	if (sourceproc != NULL)
 	{
-		if (!ProcArrayInstallRestoredXmin(CurrentSnapshotData.xmin, sourceproc))
+		if (!ProcArrayInstallRestoredXmin(CurrentSnapshotData.shared->xmin, sourceproc))
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("could not import the requested snapshot"),
 					 errdetail("The source transaction is not running anymore.")));
 	}
-	else if (!ProcArrayInstallImportedXmin(CurrentSnapshotData.xmin, sourcevxid))
+	else if (!ProcArrayInstallImportedXmin(CurrentSnapshotData.shared->xmin, sourcevxid))
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("could not import the requested snapshot"),
@@ -577,96 +577,22 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 	if (IsolationUsesXactSnapshot())
 	{
 		if (IsolationIsSerializable())
-			SetSerializableTransactionSnapshot(&CurrentSnapshotData, sourcevxid,
-											   sourcepid);
-		/* Mark it as "registered" */
+			SetSerializableTransactionSnapshotData(CurrentSnapshotData.shared,
+												   sourcevxid, sourcepid);
+		/* keep it */
 		FirstXactSnapshotRegistered = true;
 	}
-	valid_snapshots_push_tail(&CurrentSnapshotData);
 
 	FirstSnapshotSet = true;
 }
 
-/*
- * CopyMVCCSnapshot
- *		Copy the given snapshot.
- *
- * The copy is palloc'd in TopTransactionContext and has initial refcounts set
- * to 0.  The returned snapshot has the copied flag set.
- */
-static MVCCSnapshot
-CopyMVCCSnapshot(MVCCSnapshot snapshot)
-{
-	MVCCSnapshot newsnap;
-	Size		subxipoff;
-	Size		size;
-
-	/* We allocate any XID arrays needed in the same palloc block. */
-	size = subxipoff = sizeof(MVCCSnapshotData) +
-		snapshot->xcnt * sizeof(TransactionId);
-	if (snapshot->subxcnt > 0)
-		size += snapshot->subxcnt * sizeof(TransactionId);
-
-	newsnap = (MVCCSnapshot) MemoryContextAlloc(TopTransactionContext, size);
-	memcpy(newsnap, snapshot, sizeof(MVCCSnapshotData));
-
-	newsnap->regd_count = 0;
-	newsnap->active_count = 0;
-	newsnap->copied = true;
-	newsnap->valid = true;
-	newsnap->snapXactCompletionCount = 0;
-
-	/* setup XID array */
-	if (snapshot->xcnt > 0)
-	{
-		newsnap->xip = (TransactionId *) (newsnap + 1);
-		memcpy(newsnap->xip, snapshot->xip,
-			   snapshot->xcnt * sizeof(TransactionId));
-	}
-	else
-		newsnap->xip = NULL;
-
-	/*
-	 * Setup subXID array. Don't bother to copy it if it had overflowed,
-	 * though, because it's not used anywhere in that case. Except if it's a
-	 * snapshot taken during recovery; all the top-level XIDs are in subxip as
-	 * well in that case, so we mustn't lose them.
-	 */
-	if (snapshot->subxcnt > 0 &&
-		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
-	{
-		newsnap->subxip = (TransactionId *) ((char *) newsnap + subxipoff);
-		memcpy(newsnap->subxip, snapshot->subxip,
-			   snapshot->subxcnt * sizeof(TransactionId));
-	}
-	else
-		newsnap->subxip = NULL;
-
-	return newsnap;
-}
-
-/*
- * FreeMVCCSnapshot
- *		Free the memory associated with a snapshot.
- */
-static void
-FreeMVCCSnapshot(MVCCSnapshot snapshot)
-{
-	Assert(snapshot->regd_count == 0);
-	Assert(snapshot->active_count == 0);
-	Assert(snapshot->copied);
-	Assert(snapshot->valid);
-
-	pfree(snapshot);
-}
-
 /*
  * PushActiveSnapshot
  *		Set the given snapshot as the current active snapshot
  *
  * If the passed snapshot is a statically-allocated one, or it is possibly
  * subject to a future command counter update, create a new long-lived copy
- * with active refcount=1.  Otherwise, only increment the refcount.
+ * with active refcount=1.  Otherwise, only increment the refcount. XXX
  *
  * Only regular MVCC snaphots can be used as the active snapshot.
  */
@@ -697,24 +623,13 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 	Assert(ActiveSnapshot == NULL || snap_level >= ActiveSnapshot->as_level);
 
 	newactive = MemoryContextAlloc(TopTransactionContext, sizeof(ActiveSnapshotElt));
-
-	/*
-	 * Checking SecondarySnapshot is probably useless here, but it seems
-	 * better to be sure.
-	 */
-	if (!origsnap->copied)
-	{
-		newactive->as_snap = CopyMVCCSnapshot(origsnap);
-		dlist_insert_after(&origsnap->node, &newactive->as_snap->node);
-	}
-	else
-		newactive->as_snap = origsnap;
+	memcpy(&newactive->as_snap, origsnap, sizeof(MVCCSnapshotData));
+	newactive->as_snap.kind = SNAPSHOT_ACTIVE;
+	newactive->as_snap.shared->refcount++;
 
 	newactive->as_next = ActiveSnapshot;
 	newactive->as_level = snap_level;
 
-	newactive->as_snap->active_count++;
-
 	ActiveSnapshot = newactive;
 }
 
@@ -729,20 +644,20 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 void
 PushCopiedSnapshot(Snapshot snapshot)
 {
-	MVCCSnapshot copy;
-
 	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
 
-	copy = CopyMVCCSnapshot(&snapshot->mvcc);
-	dlist_insert_after(&snapshot->mvcc.node, &copy->node);
-	PushActiveSnapshot((Snapshot) copy);
+	/*
+	 * This used to be different from PushActiveSnapshot, but these days
+	 * PushActiveSnapshot creates a copy too and there's no difference.
+	 */
+	PushActiveSnapshot(snapshot);
 }
 
 /*
  * UpdateActiveSnapshotCommandId
  *
  * Update the current CID of the active snapshot.  This can only be applied
- * to a snapshot that is not referenced elsewhere.
+ * to a snapshot that is not referenced elsewhere. XXX
  */
 void
 UpdateActiveSnapshotCommandId(void)
@@ -751,8 +666,6 @@ UpdateActiveSnapshotCommandId(void)
 				curcid;
 
 	Assert(ActiveSnapshot != NULL);
-	Assert(ActiveSnapshot->as_snap->active_count == 1);
-	Assert(ActiveSnapshot->as_snap->regd_count == 0);
 
 	/*
 	 * Don't allow modification of the active snapshot during parallel
@@ -762,11 +675,12 @@ UpdateActiveSnapshotCommandId(void)
 	 * CommandCounterIncrement, but there are a few places that call this
 	 * directly, so we put an additional guard here.
 	 */
-	save_curcid = ActiveSnapshot->as_snap->curcid;
+	save_curcid = ActiveSnapshot->as_snap.curcid;
 	curcid = GetCurrentCommandId(false);
 	if (IsInParallelMode() && save_curcid != curcid)
 		elog(ERROR, "cannot modify commandid in active snapshot during a parallel operation");
-	ActiveSnapshot->as_snap->curcid = curcid;
+
+	ActiveSnapshot->as_snap.curcid = curcid;
 }
 
 /*
@@ -782,16 +696,7 @@ PopActiveSnapshot(void)
 
 	newstack = ActiveSnapshot->as_next;
 
-	Assert(ActiveSnapshot->as_snap->active_count > 0);
-
-	ActiveSnapshot->as_snap->active_count--;
-
-	if (ActiveSnapshot->as_snap->active_count == 0 &&
-		ActiveSnapshot->as_snap->regd_count == 0)
-	{
-		dlist_delete(&ActiveSnapshot->as_snap->node);
-		FreeMVCCSnapshot(ActiveSnapshot->as_snap);
-	}
+	ReleaseMVCCSnapshotShared(ActiveSnapshot->as_snap.shared);
 
 	pfree(ActiveSnapshot);
 	ActiveSnapshot = newstack;
@@ -808,7 +713,7 @@ GetActiveSnapshot(void)
 {
 	Assert(ActiveSnapshot != NULL);
 
-	return (Snapshot) ActiveSnapshot->as_snap;
+	return (Snapshot) &ActiveSnapshot->as_snap;
 }
 
 /*
@@ -844,7 +749,7 @@ RegisterSnapshot(Snapshot snapshot)
 Snapshot
 RegisterSnapshotOnOwner(Snapshot orig_snapshot, ResourceOwner owner)
 {
-	MVCCSnapshot snapshot;
+	MVCCSnapshot newsnap;
 
 	if (orig_snapshot == InvalidSnapshot)
 		return InvalidSnapshot;
@@ -861,22 +766,19 @@ RegisterSnapshotOnOwner(Snapshot orig_snapshot, ResourceOwner owner)
 	}
 
 	Assert(orig_snapshot->snapshot_type == SNAPSHOT_MVCC);
-	snapshot = &orig_snapshot->mvcc;
-	Assert(snapshot->valid);
+	Assert(orig_snapshot->mvcc.valid);
 
-	/* Static snapshot?  Create a persistent copy */
-	if (!snapshot->copied)
-	{
-		snapshot = CopyMVCCSnapshot(snapshot);
-		dlist_insert_after(&orig_snapshot->mvcc.node, &snapshot->node);
-	}
+	/* Create a copy */
+	newsnap = MemoryContextAlloc(TopTransactionContext, sizeof(MVCCSnapshotData));
+	memcpy(newsnap, &orig_snapshot->mvcc, sizeof(MVCCSnapshotData));
+	newsnap->kind = SNAPSHOT_REGISTERED;
+	newsnap->shared->refcount++;
 
 	/* and tell resowner.c about it */
 	ResourceOwnerEnlarge(owner);
-	snapshot->regd_count++;
-	ResourceOwnerRememberSnapshot(owner, (Snapshot) snapshot);
+	ResourceOwnerRememberSnapshot(owner, (Snapshot) newsnap);
 
-	return (Snapshot) snapshot;
+	return (Snapshot) newsnap;
 }
 
 /*
@@ -914,18 +816,12 @@ UnregisterSnapshotNoOwner(Snapshot snapshot)
 {
 	if (snapshot->snapshot_type == SNAPSHOT_MVCC)
 	{
-		MVCCSnapshot mvccsnap = &snapshot->mvcc;
-
-		Assert(mvccsnap->regd_count > 0);
+		Assert(snapshot->mvcc.kind == SNAPSHOT_REGISTERED);
 		Assert(!dlist_is_empty(&ValidSnapshots));
 
-		mvccsnap->regd_count--;
-		if (mvccsnap->regd_count == 0 && mvccsnap->active_count == 0)
-		{
-			dlist_delete(&mvccsnap->node);
-			FreeMVCCSnapshot(mvccsnap);
-			SnapshotResetXmin();
-		}
+		ReleaseMVCCSnapshotShared(snapshot->mvcc.shared);
+		pfree(snapshot);
+		SnapshotResetXmin();
 	}
 	else if (snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
 	{
@@ -963,19 +859,21 @@ UnregisterSnapshotNoOwner(Snapshot snapshot)
 static void
 SnapshotResetXmin(void)
 {
-	MVCCSnapshot minSnapshot;
+	MVCCSnapshotShared minSnapshot;
 
 	/*
 	 * Invalidate these static snapshots so that we can advance xmin.
 	 */
 	if (!FirstXactSnapshotRegistered && CurrentSnapshotData.valid)
 	{
-		dlist_delete(&CurrentSnapshotData.node);
+		ReleaseMVCCSnapshotShared(CurrentSnapshotData.shared);
+		CurrentSnapshotData.shared = NULL;
 		CurrentSnapshotData.valid = false;
 	}
 	if (SecondarySnapshotData.valid)
 	{
-		dlist_delete(&SecondarySnapshotData.node);
+		ReleaseMVCCSnapshotShared(SecondarySnapshotData.shared);
+		SecondarySnapshotData.shared = NULL;
 		SecondarySnapshotData.valid = false;
 	}
 
@@ -988,7 +886,7 @@ SnapshotResetXmin(void)
 		return;
 	}
 
-	minSnapshot = dlist_head_element(MVCCSnapshotData, node, &ValidSnapshots);
+	minSnapshot = dlist_head_element(MVCCSnapshotSharedData, node, &ValidSnapshots);
 
 	if (TransactionIdPrecedes(MyProc->xmin, minSnapshot->xmin))
 		MyProc->xmin = TransactionXmin = minSnapshot->xmin;
@@ -1028,21 +926,7 @@ AtSubAbort_Snapshot(int level)
 
 		next = ActiveSnapshot->as_next;
 
-		/*
-		 * Decrement the snapshot's active count.  If it's still registered or
-		 * marked as active by an outer subtransaction, we can't free it yet.
-		 */
-		Assert(ActiveSnapshot->as_snap->active_count >= 1);
-		ActiveSnapshot->as_snap->active_count -= 1;
-
-		if (ActiveSnapshot->as_snap->active_count == 0 &&
-			ActiveSnapshot->as_snap->regd_count == 0)
-		{
-			dlist_delete(&ActiveSnapshot->as_snap->node);
-			FreeMVCCSnapshot(ActiveSnapshot->as_snap);
-		}
-
-		/* and free the stack element */
+		ReleaseMVCCSnapshotShared(ActiveSnapshot->as_snap.shared);
 		pfree(ActiveSnapshot);
 
 		ActiveSnapshot = next;
@@ -1058,6 +942,8 @@ AtSubAbort_Snapshot(int level)
 void
 AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 {
+	dlist_mutable_iter iter;
+
 	/*
 	 * If we exported any snapshots, clean them up.
 	 */
@@ -1084,7 +970,7 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 				elog(WARNING, "could not unlink file \"%s\": %m",
 					 esnap->snapfile);
 
-			dlist_delete(&esnap->snapshot->node);
+			ReleaseMVCCSnapshotShared(esnap->snapshot);
 		}
 
 		exportedSnapshots = NIL;
@@ -1093,17 +979,20 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	/* Drop all static snapshot */
 	if (CatalogSnapshotData.valid)
 	{
-		dlist_delete(&CatalogSnapshotData.node);
+		ReleaseMVCCSnapshotShared(CatalogSnapshotData.shared);
+		CatalogSnapshotData.shared = NULL;
 		CatalogSnapshotData.valid = false;
 	}
 	if (CurrentSnapshotData.valid)
 	{
-		dlist_delete(&CurrentSnapshotData.node);
+		ReleaseMVCCSnapshotShared(CurrentSnapshotData.shared);
+		CurrentSnapshotData.shared = NULL;
 		CurrentSnapshotData.valid = false;
 	}
 	if (SecondarySnapshotData.valid)
 	{
-		dlist_delete(&SecondarySnapshotData.node);
+		ReleaseMVCCSnapshotShared(SecondarySnapshotData.shared);
+		SecondarySnapshotData.shared = NULL;
 		SecondarySnapshotData.valid = false;
 	}
 
@@ -1124,11 +1013,23 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	 * And reset our state.  We don't need to free the memory explicitly --
 	 * it'll go away with TopTransactionContext.
 	 */
-	ActiveSnapshot = NULL;
-	dlist_init(&ValidSnapshots);
+	dlist_foreach_modify(iter, &ValidSnapshots)
+	{
+		MVCCSnapshotShared cur = dlist_container(MVCCSnapshotSharedData, node, iter.cur);
 
-	CurrentSnapshotData.valid = false;
-	SecondarySnapshotData.valid = false;
+		dlist_delete(iter.cur);
+		cur->refcount = 0;
+		if (cur == latestSnapshotShared)
+		{
+			/* keep it */
+		}
+		else if (spareSnapshotShared == NULL)
+			spareSnapshotShared = cur;
+		else
+			pfree(cur);
+	}
+
+	ActiveSnapshot = NULL;
 	FirstSnapshotSet = false;
 	FirstXactSnapshotRegistered = false;
 
@@ -1151,9 +1052,8 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
  *		snapshot.
  */
 char *
-ExportSnapshot(MVCCSnapshot snapshot)
+ExportSnapshot(MVCCSnapshotShared snapshot)
 {
-	MVCCSnapshot orig_snapshot;
 	TransactionId topXid;
 	TransactionId *children;
 	ExportedSnapshot *esnap;
@@ -1214,21 +1114,16 @@ ExportSnapshot(MVCCSnapshot snapshot)
 	 * Copy the snapshot into TopTransactionContext, add it to the
 	 * exportedSnapshots list, and mark it pseudo-registered.  We do this to
 	 * ensure that the snapshot's xmin is honored for the rest of the
-	 * transaction.
+	 * transaction. XXX
 	 */
-	orig_snapshot = snapshot;
-	snapshot = CopyMVCCSnapshot(orig_snapshot);
-
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 	esnap = (ExportedSnapshot *) palloc(sizeof(ExportedSnapshot));
 	esnap->snapfile = pstrdup(path);
 	esnap->snapshot = snapshot;
+	snapshot->refcount++;
 	exportedSnapshots = lappend(exportedSnapshots, esnap);
 	MemoryContextSwitchTo(oldcxt);
 
-	snapshot->regd_count++;
-	dlist_insert_after(&orig_snapshot->node, &snapshot->node);
-
 	/*
 	 * Fill buf with a text serialization of the snapshot, plus identification
 	 * data about this transaction.  The format expected by ImportSnapshot is
@@ -1248,8 +1143,8 @@ ExportSnapshot(MVCCSnapshot snapshot)
 	/*
 	 * We must include our own top transaction ID in the top-xid data, since
 	 * by definition we will still be running when the importing transaction
-	 * adopts the snapshot, but GetSnapshotData never includes our own XID in
-	 * the snapshot.  (There must, therefore, be enough room to add it.)
+	 * adopts the snapshot, but GetMVCCSnapshotData never includes our own XID
+	 * in the snapshot.  (There must, therefore, be enough room to add it.)
 	 *
 	 * However, it could be that our topXid is after the xmax, in which case
 	 * we shouldn't include it because xip[] members are expected to be before
@@ -1334,7 +1229,7 @@ pg_export_snapshot(PG_FUNCTION_ARGS)
 {
 	char	   *snapshotName;
 
-	snapshotName = ExportSnapshot((MVCCSnapshot) GetActiveSnapshot());
+	snapshotName = ExportSnapshot(((MVCCSnapshot) GetActiveSnapshot())->shared);
 	PG_RETURN_TEXT_P(cstring_to_text(snapshotName));
 }
 
@@ -1438,7 +1333,7 @@ ImportSnapshot(const char *idstr)
 	Oid			src_dbid;
 	int			src_isolevel;
 	bool		src_readonly;
-	MVCCSnapshotData snapshot;
+	MVCCSnapshotShared snapshot;
 
 	/*
 	 * Must be at top level of a fresh transaction.  Note in particular that
@@ -1508,8 +1403,6 @@ ImportSnapshot(const char *idstr)
 	/*
 	 * Construct a snapshot struct by parsing the file content.
 	 */
-	memset(&snapshot, 0, sizeof(snapshot));
-
 	parseVxidFromText("vxid:", &filebuf, path, &src_vxid);
 	src_pid = parseIntFromText("pid:", &filebuf, path);
 	/* we abuse parseXidFromText a bit here ... */
@@ -1517,12 +1410,11 @@ ImportSnapshot(const char *idstr)
 	src_isolevel = parseIntFromText("iso:", &filebuf, path);
 	src_readonly = parseIntFromText("ro:", &filebuf, path);
 
-	snapshot.snapshot_type = SNAPSHOT_MVCC;
-
-	snapshot.xmin = parseXidFromText("xmin:", &filebuf, path);
-	snapshot.xmax = parseXidFromText("xmax:", &filebuf, path);
+	snapshot = AllocMVCCSnapshotShared();
+	snapshot->xmin = parseXidFromText("xmin:", &filebuf, path);
+	snapshot->xmax = parseXidFromText("xmax:", &filebuf, path);
 
-	snapshot.xcnt = xcnt = parseIntFromText("xcnt:", &filebuf, path);
+	snapshot->xcnt = xcnt = parseIntFromText("xcnt:", &filebuf, path);
 
 	/* sanity-check the xid count before palloc */
 	if (xcnt < 0 || xcnt > GetMaxSnapshotXidCount())
@@ -1530,15 +1422,15 @@ ImportSnapshot(const char *idstr)
 				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 				 errmsg("invalid snapshot data in file \"%s\"", path)));
 
-	snapshot.xip = (TransactionId *) palloc(xcnt * sizeof(TransactionId));
+	snapshot->xip = (TransactionId *) palloc(xcnt * sizeof(TransactionId));
 	for (i = 0; i < xcnt; i++)
-		snapshot.xip[i] = parseXidFromText("xip:", &filebuf, path);
+		snapshot->xip[i] = parseXidFromText("xip:", &filebuf, path);
 
-	snapshot.suboverflowed = parseIntFromText("sof:", &filebuf, path);
+	snapshot->suboverflowed = parseIntFromText("sof:", &filebuf, path);
 
-	if (!snapshot.suboverflowed)
+	if (!snapshot->suboverflowed)
 	{
-		snapshot.subxcnt = xcnt = parseIntFromText("sxcnt:", &filebuf, path);
+		snapshot->subxcnt = xcnt = parseIntFromText("sxcnt:", &filebuf, path);
 
 		/* sanity-check the xid count before palloc */
 		if (xcnt < 0 || xcnt > GetMaxSnapshotSubxidCount())
@@ -1546,17 +1438,19 @@ ImportSnapshot(const char *idstr)
 					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 					 errmsg("invalid snapshot data in file \"%s\"", path)));
 
-		snapshot.subxip = (TransactionId *) palloc(xcnt * sizeof(TransactionId));
+		snapshot->subxip = (TransactionId *) palloc(xcnt * sizeof(TransactionId));
 		for (i = 0; i < xcnt; i++)
-			snapshot.subxip[i] = parseXidFromText("sxp:", &filebuf, path);
+			snapshot->subxip[i] = parseXidFromText("sxp:", &filebuf, path);
 	}
 	else
 	{
-		snapshot.subxcnt = 0;
-		snapshot.subxip = NULL;
+		snapshot->subxcnt = 0;
 	}
 
-	snapshot.takenDuringRecovery = parseIntFromText("rec:", &filebuf, path);
+	snapshot->takenDuringRecovery = parseIntFromText("rec:", &filebuf, path);
+
+	snapshot->refcount = 1;
+	valid_snapshots_push_out_of_order(snapshot);
 
 	/*
 	 * Do some additional sanity checking, just to protect ourselves.  We
@@ -1565,8 +1459,8 @@ ImportSnapshot(const char *idstr)
 	 */
 	if (!VirtualTransactionIdIsValid(src_vxid) ||
 		!OidIsValid(src_dbid) ||
-		!TransactionIdIsNormal(snapshot.xmin) ||
-		!TransactionIdIsNormal(snapshot.xmax))
+		!TransactionIdIsNormal(snapshot->xmin) ||
+		!TransactionIdIsNormal(snapshot->xmax))
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 				 errmsg("invalid snapshot data in file \"%s\"", path)));
@@ -1604,7 +1498,7 @@ ImportSnapshot(const char *idstr)
 				 errmsg("cannot import a snapshot from a different database")));
 
 	/* OK, install the snapshot */
-	SetTransactionSnapshot(&snapshot, &src_vxid, src_pid, NULL);
+	SetTransactionSnapshot(snapshot, &src_vxid, src_pid, NULL);
 }
 
 /*
@@ -1670,18 +1564,21 @@ ThereAreNoPriorRegisteredSnapshots(void)
 
 	dlist_foreach(iter, &ValidSnapshots)
 	{
-		MVCCSnapshot cur = dlist_container(MVCCSnapshotData, node, iter.cur);
+		MVCCSnapshotShared cur =
+			dlist_container(MVCCSnapshotSharedData, node, iter.cur);
+		uint32		allowedcount = 0;
 
 		if (FirstXactSnapshotRegistered)
 		{
 			Assert(CurrentSnapshotData.valid);
-			if (cur != &CurrentSnapshotData)
-				continue;
+			if (cur == CurrentSnapshotData.shared)
+				allowedcount++;
 		}
-		if (ActiveSnapshot && cur == ActiveSnapshot->as_snap)
-			continue;
+		if (ActiveSnapshot && cur == ActiveSnapshot->as_snap.shared)
+			allowedcount++;
 
-		return false;
+		if (cur->refcount != allowedcount)
+			return false;
 	}
 
 	return true;
@@ -1707,8 +1604,9 @@ HaveRegisteredOrActiveSnapshot(void)
 	 * registered more than one snapshot has to be in ValidSnapshots.
 	 */
 	if (CatalogSnapshotData.valid &&
-		dlist_head_node(&ValidSnapshots) == &CatalogSnapshotData.node &&
-		dlist_tail_node(&ValidSnapshots) == &CatalogSnapshotData.node)
+		CatalogSnapshotData.shared->refcount == 1 &&
+		dlist_head_node(&ValidSnapshots) == &CatalogSnapshotData.shared->node &&
+		dlist_tail_node(&ValidSnapshots) == &CatalogSnapshotData.shared->node)
 	{
 		return false;
 	}
@@ -1775,11 +1673,11 @@ EstimateSnapshotSpace(MVCCSnapshot snapshot)
 
 	/* We allocate any XID arrays needed in the same palloc block. */
 	size = add_size(sizeof(SerializedSnapshotData),
-					mul_size(snapshot->xcnt, sizeof(TransactionId)));
-	if (snapshot->subxcnt > 0 &&
-		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
+					mul_size(snapshot->shared->xcnt, sizeof(TransactionId)));
+	if (snapshot->shared->subxcnt > 0 &&
+		(!snapshot->shared->suboverflowed || snapshot->shared->takenDuringRecovery))
 		size = add_size(size,
-						mul_size(snapshot->subxcnt, sizeof(TransactionId)));
+						mul_size(snapshot->shared->subxcnt, sizeof(TransactionId)));
 
 	return size;
 }
@@ -1794,15 +1692,15 @@ SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
 {
 	SerializedSnapshotData serialized_snapshot;
 
-	Assert(snapshot->subxcnt >= 0);
+	Assert(snapshot->shared->subxcnt >= 0);
 
 	/* Copy all required fields */
-	serialized_snapshot.xmin = snapshot->xmin;
-	serialized_snapshot.xmax = snapshot->xmax;
-	serialized_snapshot.xcnt = snapshot->xcnt;
-	serialized_snapshot.subxcnt = snapshot->subxcnt;
-	serialized_snapshot.suboverflowed = snapshot->suboverflowed;
-	serialized_snapshot.takenDuringRecovery = snapshot->takenDuringRecovery;
+	serialized_snapshot.xmin = snapshot->shared->xmin;
+	serialized_snapshot.xmax = snapshot->shared->xmax;
+	serialized_snapshot.xcnt = snapshot->shared->xcnt;
+	serialized_snapshot.subxcnt = snapshot->shared->subxcnt;
+	serialized_snapshot.suboverflowed = snapshot->shared->suboverflowed;
+	serialized_snapshot.takenDuringRecovery = snapshot->shared->takenDuringRecovery;
 	serialized_snapshot.curcid = snapshot->curcid;
 
 	/*
@@ -1810,7 +1708,7 @@ SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
 	 * taken during recovery - in that case, top-level XIDs are in subxip as
 	 * well, and we mustn't lose them.
 	 */
-	if (serialized_snapshot.suboverflowed && !snapshot->takenDuringRecovery)
+	if (serialized_snapshot.suboverflowed && !snapshot->shared->takenDuringRecovery)
 		serialized_snapshot.subxcnt = 0;
 
 	/* Copy struct to possibly-unaligned buffer */
@@ -1818,10 +1716,10 @@ SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
 		   &serialized_snapshot, sizeof(SerializedSnapshotData));
 
 	/* Copy XID array */
-	if (snapshot->xcnt > 0)
+	if (snapshot->shared->xcnt > 0)
 		memcpy((TransactionId *) (start_address +
 								  sizeof(SerializedSnapshotData)),
-			   snapshot->xip, snapshot->xcnt * sizeof(TransactionId));
+			   snapshot->shared->xip, snapshot->shared->xcnt * sizeof(TransactionId));
 
 	/*
 	 * Copy SubXID array. Don't bother to copy it if it had overflowed,
@@ -1832,10 +1730,10 @@ SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
 	if (serialized_snapshot.subxcnt > 0)
 	{
 		Size		subxipoff = sizeof(SerializedSnapshotData) +
-			snapshot->xcnt * sizeof(TransactionId);
+			snapshot->shared->xcnt * sizeof(TransactionId);
 
 		memcpy((TransactionId *) (start_address + subxipoff),
-			   snapshot->subxip, snapshot->subxcnt * sizeof(TransactionId));
+			   snapshot->shared->subxip, snapshot->shared->subxcnt * sizeof(TransactionId));
 	}
 }
 
@@ -1863,49 +1761,46 @@ RestoreSnapshot(char *start_address)
 	size = sizeof(MVCCSnapshotData)
 		+ serialized_snapshot.xcnt * sizeof(TransactionId)
 		+ serialized_snapshot.subxcnt * sizeof(TransactionId);
+	Assert(serialized_snapshot.xcnt <= GetMaxSnapshotXidCount());
+	Assert(serialized_snapshot.subxcnt <= GetMaxSnapshotSubxidCount());
 
 	/* Copy all required fields */
 	snapshot = (MVCCSnapshot) MemoryContextAlloc(TopTransactionContext, size);
 	snapshot->snapshot_type = SNAPSHOT_MVCC;
-	snapshot->xmin = serialized_snapshot.xmin;
-	snapshot->xmax = serialized_snapshot.xmax;
-	snapshot->xip = NULL;
-	snapshot->xcnt = serialized_snapshot.xcnt;
-	snapshot->subxip = NULL;
-	snapshot->subxcnt = serialized_snapshot.subxcnt;
-	snapshot->suboverflowed = serialized_snapshot.suboverflowed;
-	snapshot->takenDuringRecovery = serialized_snapshot.takenDuringRecovery;
+	snapshot->kind = SNAPSHOT_REGISTERED;
+	snapshot->shared = AllocMVCCSnapshotShared();
+	snapshot->shared->xmin = serialized_snapshot.xmin;
+	snapshot->shared->xmax = serialized_snapshot.xmax;
+	snapshot->shared->xcnt = serialized_snapshot.xcnt;
+	snapshot->shared->subxcnt = serialized_snapshot.subxcnt;
+	snapshot->shared->suboverflowed = serialized_snapshot.suboverflowed;
+	snapshot->shared->takenDuringRecovery = serialized_snapshot.takenDuringRecovery;
+	snapshot->shared->snapXactCompletionCount = 0;
+
+	snapshot->shared->refcount = 1;
+	valid_snapshots_push_out_of_order(snapshot->shared);
+
 	snapshot->curcid = serialized_snapshot.curcid;
-	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot.xcnt > 0)
 	{
-		snapshot->xip = (TransactionId *) (snapshot + 1);
-		memcpy(snapshot->xip, serialized_xids,
+		memcpy(snapshot->shared->xip, serialized_xids,
 			   serialized_snapshot.xcnt * sizeof(TransactionId));
 	}
 
 	/* Copy SubXIDs, if present. */
 	if (serialized_snapshot.subxcnt > 0)
 	{
-		snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
-			serialized_snapshot.xcnt;
-		memcpy(snapshot->subxip, serialized_xids + serialized_snapshot.xcnt,
+		memcpy(snapshot->shared->subxip, serialized_xids + serialized_snapshot.xcnt,
 			   serialized_snapshot.subxcnt * sizeof(TransactionId));
 	}
 
-	/* Set the copied flag so that the caller will set refcounts correctly. */
-	snapshot->regd_count = 0;
-	snapshot->active_count = 0;
-	snapshot->copied = true;
 	snapshot->valid = true;
 
 	/* and tell resowner.c about it, just like RegisterSnapshot() */
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	snapshot->regd_count++;
 	ResourceOwnerRememberSnapshot(CurrentResourceOwner, (Snapshot) snapshot);
-	valid_snapshots_push_out_of_order(snapshot);
 
 	return snapshot;
 }
@@ -1919,21 +1814,21 @@ RestoreSnapshot(char *start_address)
 void
 RestoreTransactionSnapshot(MVCCSnapshot snapshot, void *source_pgproc)
 {
-	SetTransactionSnapshot(snapshot, NULL, InvalidPid, source_pgproc);
+	SetTransactionSnapshot(snapshot->shared, NULL, InvalidPid, source_pgproc);
 }
 
 /*
  * XidInMVCCSnapshot
  *		Is the given XID still-in-progress according to the snapshot?
  *
- * Note: GetSnapshotData never stores either top xid or subxids of our own
- * backend into a snapshot, so these xids will not be reported as "running"
- * by this function.  This is OK for current uses, because we always check
- * TransactionIdIsCurrentTransactionId first, except when it's known the
- * XID could not be ours anyway.
+ * Note: GetMVCCSnapshotData never stores either top xid or subxids of our own
+ * backend into a snapshot, so these xids will not be reported as "running" by
+ * this function.  This is OK for current uses, because we always check
+ * TransactionIdIsCurrentTransactionId first, except when it's known the XID
+ * could not be ours anyway.
  */
 bool
-XidInMVCCSnapshot(TransactionId xid, MVCCSnapshot snapshot)
+XidInMVCCSnapshot(TransactionId xid, MVCCSnapshotShared snapshot)
 {
 	/*
 	 * Make a quick range check to eliminate most XIDs without looking at the
@@ -2029,6 +1924,84 @@ XidInMVCCSnapshot(TransactionId xid, MVCCSnapshot snapshot)
 	return false;
 }
 
+/*
+ * Allocate an MVCCSnapshotShared struct
+ *
+ * The 'xip' and 'subxip' arrays are allocated so that they can hold the max
+ * number of XIDs. That's usually overkill, but it allows us to do the
+ * allocation while not holding ProcArrayLock.
+ *
+ * MVCCSnapshotShared structs are kept in TopMemoryContext and refcounted.
+ * The refcount is initially zero; the caller is expected to increment it.
+ */
+MVCCSnapshotShared
+AllocMVCCSnapshotShared(void)
+{
+	MemoryContext save_cxt;
+	MVCCSnapshotShared shared;
+	size_t		size;
+	char	   *p;
+
+	/*
+	 * To reduce alloc/free overhead in GetMVCCSnapshotData(), we have a
+	 * single-element pool.
+	 */
+	if (spareSnapshotShared)
+	{
+		shared = spareSnapshotShared;
+		spareSnapshotShared = NULL;
+		return shared;
+	}
+
+	save_cxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	size = sizeof(MVCCSnapshotSharedData) +
+		GetMaxSnapshotXidCount() * sizeof(TransactionId) +
+		GetMaxSnapshotSubxidCount() * sizeof(TransactionId);
+	p = palloc(size);
+
+	shared = (MVCCSnapshotShared) p;
+	p += sizeof(MVCCSnapshotSharedData);
+	shared->xip = (TransactionId *) p;
+	p += GetMaxSnapshotXidCount() * sizeof(TransactionId);
+	shared->subxip = (TransactionId *) p;
+
+	shared->snapXactCompletionCount = 0;
+	shared->refcount = 0;
+
+	MemoryContextSwitchTo(save_cxt);
+
+	return shared;
+}
+
+/*
+ * Decrement the refcount on an MVCCSnapshotShared struct, freeing it if it
+ * reaches zero.
+ */
+static void
+ReleaseMVCCSnapshotShared(MVCCSnapshotShared shared)
+{
+	Assert(shared->refcount > 0);
+	shared->refcount--;
+
+	if (shared->refcount == 0)
+	{
+		dlist_delete(&shared->node);
+		if (shared != latestSnapshotShared)
+			FreeMVCCSnapshotShared(shared);
+	}
+}
+
+void
+FreeMVCCSnapshotShared(MVCCSnapshotShared shared)
+{
+	Assert(shared->refcount == 0);
+	if (spareSnapshotShared == NULL)
+		spareSnapshotShared = shared;
+	else
+		pfree(shared);
+}
+
 /* ResourceOwner callbacks */
 
 static void
@@ -2042,12 +2015,13 @@ ResOwnerReleaseSnapshot(Datum res)
 
 /* dlist_push_tail, with assertion that the list stays ordered by xmin */
 static void
-valid_snapshots_push_tail(MVCCSnapshot snapshot)
+valid_snapshots_push_tail(MVCCSnapshotShared snapshot)
 {
 #ifdef USE_ASSERT_CHECKING
 	if (!dlist_is_empty(&ValidSnapshots))
 	{
-		MVCCSnapshot tail = dlist_tail_element(MVCCSnapshotData, node, &ValidSnapshots);
+		MVCCSnapshotShared tail =
+			dlist_tail_element(MVCCSnapshotSharedData, node, &ValidSnapshots);
 
 		Assert(TransactionIdFollowsOrEquals(snapshot->xmin, tail->xmin));
 	}
@@ -2062,13 +2036,14 @@ valid_snapshots_push_tail(MVCCSnapshot snapshot)
  * the list is small.
  */
 static void
-valid_snapshots_push_out_of_order(MVCCSnapshot snapshot)
+valid_snapshots_push_out_of_order(MVCCSnapshotShared snapshot)
 {
 	dlist_iter	iter;
 
 	dlist_foreach(iter, &ValidSnapshots)
 	{
-		MVCCSnapshot cur = dlist_container(MVCCSnapshotData, node, iter.cur);
+		MVCCSnapshotShared cur =
+			dlist_container(MVCCSnapshotSharedData, node, iter.cur);
 
 		if (TransactionIdFollowsOrEquals(snapshot->xmin, cur->xmin))
 		{
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..e71c660118e 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -242,8 +242,8 @@ typedef struct TransamVariablesData
 	 * Number of top-level transactions with xids (i.e. which may have
 	 * modified the database) that completed in some form since the start of
 	 * the server. This currently is solely used to check whether
-	 * GetSnapshotData() needs to recompute the contents of the snapshot, or
-	 * not. There are likely other users of this.  Always above 1.
+	 * GetMVCCSnapshotData() needs to recompute the contents of the snapshot,
+	 * or not. There are likely other users of this.  Always above 1.
 	 */
 	uint64		xactCompletionCount;
 
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 6a78dfeac96..e68862576ee 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -47,10 +47,10 @@ extern void CheckPointPredicate(void);
 extern bool PageIsPredicateLocked(Relation relation, BlockNumber blkno);
 
 /* predicate lock maintenance */
-extern MVCCSnapshot GetSerializableTransactionSnapshot(MVCCSnapshot snapshot);
-extern void SetSerializableTransactionSnapshot(MVCCSnapshot snapshot,
-											   VirtualTransactionId *sourcevxid,
-											   int sourcepid);
+extern MVCCSnapshotShared GetSerializableTransactionSnapshotData(void);
+extern void SetSerializableTransactionSnapshotData(MVCCSnapshotShared snapshot,
+												   VirtualTransactionId *sourcevxid,
+												   int sourcepid);
 extern void RegisterPredicateLockingXid(TransactionId xid);
 extern void PredicateLockRelation(Relation relation, Snapshot snapshot);
 extern void PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index f51b03d3822..46b58a17489 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -324,7 +324,7 @@ extern PGDLLIMPORT PGPROC *MyProc;
  * Adding/Removing an entry into the procarray requires holding *both*
  * ProcArrayLock and XidGenLock in exclusive mode (in that order). Both are
  * needed because the dense arrays (see below) are accessed from
- * GetNewTransactionId() and GetSnapshotData(), and we don't want to add
+ * GetNewTransactionId() and GetMVCCSnapshotData(), and we don't want to add
  * further contention by both using the same lock. Adding/Removing a procarray
  * entry is much less frequent.
  *
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 7f5727c2586..8eedc2d6b9f 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -44,7 +44,7 @@ extern void KnownAssignedTransactionIdsIdleMaintenance(void);
 extern int	GetMaxSnapshotXidCount(void);
 extern int	GetMaxSnapshotSubxidCount(void);
 
-extern MVCCSnapshot GetSnapshotData(MVCCSnapshot snapshot);
+extern MVCCSnapshotShared GetMVCCSnapshotData(void);
 
 extern bool ProcArrayInstallImportedXmin(TransactionId xmin,
 										 VirtualTransactionId *sourcevxid);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 1f627ff966d..36c6043740f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -56,6 +56,13 @@ extern PGDLLIMPORT SnapshotData SnapshotToastData;
 	((snapshot)->snapshot_type == SNAPSHOT_MVCC || \
 	 (snapshot)->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
 
+/* exported so that GetMVCCSnapshotData() can access these */
+extern MVCCSnapshotShared latestSnapshotShared;
+extern MVCCSnapshotShared spareSnapshotShared;
+
+extern MVCCSnapshotShared AllocMVCCSnapshotShared(void);
+extern void FreeMVCCSnapshotShared(MVCCSnapshotShared shared);
+
 extern Snapshot GetTransactionSnapshot(void);
 extern Snapshot GetLatestSnapshot(void);
 extern void SnapshotSetCommandId(CommandId curcid);
@@ -89,7 +96,7 @@ extern void WaitForOlderSnapshots(TransactionId limitXmin, bool progress);
 extern bool ThereAreNoPriorRegisteredSnapshots(void);
 extern bool HaveRegisteredOrActiveSnapshot(void);
 
-extern char *ExportSnapshot(MVCCSnapshot snapshot);
+extern char *ExportSnapshot(MVCCSnapshotShared snapshot);
 
 /*
  * These live in procarray.c because they're intimately linked to the
@@ -105,7 +112,7 @@ extern bool GlobalVisCheckRemovableFullXid(Relation rel, FullTransactionId fxid)
 /*
  * Utility functions for implementing visibility routines in table AMs.
  */
-extern bool XidInMVCCSnapshot(TransactionId xid, MVCCSnapshot snapshot);
+extern bool XidInMVCCSnapshot(TransactionId xid, MVCCSnapshotShared snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 44b3b20f73c..193366ce052 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -119,17 +119,44 @@ typedef enum SnapshotType
 	SNAPSHOT_NON_VACUUMABLE,
 } SnapshotType;
 
+typedef struct MVCCSnapshotSharedData *MVCCSnapshotShared;
+
+typedef enum MVCCSnapshotKind
+{
+	SNAPSHOT_STATIC,
+	SNAPSHOT_ACTIVE,
+	SNAPSHOT_REGISTERED,
+} MVCCSnapshotKind;
+
 /*
  * Struct representing a normal MVCC snapshot.
  *
  * MVCC snapshots come in two variants: those taken during recovery in hot
  * standby mode, and "normal" MVCC snapshots.  They are distinguished by
- * takenDuringRecovery.
+ * shared->takenDuringRecovery.
  */
 typedef struct MVCCSnapshotData
 {
 	SnapshotType snapshot_type; /* type of snapshot, must be first */
 
+	/*
+	 * Most fields are in this separate struct which can be reused and shared
+	 * between snapshots that only differ in the command ID.  It is reference
+	 * counted separately.
+	 */
+	MVCCSnapshotShared shared;
+
+	CommandId	curcid;			/* in my xact, CID < curcid are visible */
+
+	/*
+	 * Book-keeping information, used by the snapshot manager
+	 */
+	MVCCSnapshotKind kind;
+	bool		valid;
+} MVCCSnapshotData;
+
+typedef struct MVCCSnapshotSharedData
+{
 	/*
 	 * An MVCC snapshot can never see the effects of XIDs >= xmax. It can see
 	 * the effects of all older XIDs except those listed in the snapshot. xmin
@@ -160,25 +187,17 @@ typedef struct MVCCSnapshotData
 	bool		suboverflowed;	/* has the subxip array overflowed? */
 
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
-	bool		copied;			/* false if it's a static snapshot */
-	bool		valid;			/* is this snapshot valid? */
-
-	CommandId	curcid;			/* in my xact, CID < curcid are visible */
-
-	/*
-	 * Book-keeping information, used by the snapshot manager
-	 */
-	uint32		active_count;	/* refcount on ActiveSnapshot stack */
-	uint32		regd_count;		/* refcount of registrations in resowners */
-	dlist_node	node;			/* link in ValidSnapshots */
 
 	/*
-	 * The transaction completion count at the time GetSnapshotData() built
-	 * this snapshot. Allows to avoid re-computing static snapshots when no
-	 * transactions completed since the last GetSnapshotData().
+	 * The transaction completion count at the time GetMVCCSnapshotData()
+	 * built this snapshot. Allows to avoid re-computing static snapshots when
+	 * no transactions completed since the last GetMVCCSnapshotData().
 	 */
 	uint64		snapXactCompletionCount;
-} MVCCSnapshotData;
+
+	uint32		refcount;
+	dlist_node	node;			/* link in ValidSnapshots */
+} MVCCSnapshotSharedData;
 
 typedef struct MVCCSnapshotData *MVCCSnapshot;
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c8ed18cf580..990c83c902a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1636,6 +1636,8 @@ MINIDUMP_TYPE
 MJEvalResult
 MTTargetRelLookup
 MVCCSnapshotData
+MVCCSnapshotKind
+MVCCSnapshotSharedData
 MVDependencies
 MVDependency
 MVNDistinct
-- 
2.39.5

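For readers following the snapshot.h change in the patch above: the split moves most fields into a separately reference-counted struct, so that snapshots differing only in command ID can share it. Below is a minimal standalone sketch of that idea. It is not the patch's code: SharedSnap, Snap, shared_snap_alloc, snap_attach and snap_release are hypothetical names made up for illustration, and plain malloc/free stands in for whatever allocation the real snapshot manager uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared, reference-counted part. */
typedef struct SharedSnap
{
	uint32_t	xmin;
	uint32_t	xmax;
	uint32_t	refcount;		/* number of Snap wrappers pointing here */
} SharedSnap;

/* Hypothetical stand-in for the thin per-use wrapper. */
typedef struct Snap
{
	SharedSnap *shared;
	uint32_t	curcid;			/* the only field that differs per wrapper */
} Snap;

/* Allocate a shared part with no references yet. */
static SharedSnap *
shared_snap_alloc(uint32_t xmin, uint32_t xmax)
{
	SharedSnap *s = malloc(sizeof(SharedSnap));

	if (s == NULL)
		abort();
	s->xmin = xmin;
	s->xmax = xmax;
	s->refcount = 0;
	return s;
}

/* Build a wrapper with its own command ID, taking one reference. */
static Snap
snap_attach(SharedSnap *shared, uint32_t curcid)
{
	Snap		snap;

	snap.shared = shared;
	snap.curcid = curcid;
	shared->refcount++;
	return snap;
}

/*
 * Drop one reference.  Returns 1 if this was the last reference and the
 * shared part was freed, 0 otherwise.
 */
static int
snap_release(Snap *snap)
{
	SharedSnap *shared = snap->shared;

	assert(shared != NULL && shared->refcount > 0);
	snap->shared = NULL;
	if (--shared->refcount == 0)
	{
		free(shared);
		return 1;
	}
	return 0;
}
```

Two wrappers attached to the same shared part with different curcid values keep it alive until both are released; only the second release frees it.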
v6-0008-XXX-add-perf-test.patch (text/x-patch; charset=UTF-8)

From 511f67bc9579c5fcec923fa0fcb20370547561f2 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 31 Mar 2025 22:29:44 +0300
Subject: [PATCH v6 08/12] XXX: add perf test

This is not intended to be merged. But it's been useful to have this
in the tree for some quick perf testing during development.

To run it, I've used:

(cd build-release && ninja &&  rm -rf tmp_install && meson test --suite setup --suite test_misc; grep TEST testrun/test_misc/000_csn_perf/log/regress_log_000_csn_perf )

It runs the other test_misc tests concurrently, but they finish a lot
faster so they don't affect the results much.
---
 src/test/modules/test_misc/meson.build       |   1 +
 src/test/modules/test_misc/t/000_csn_perf.pl | 337 +++++++++++++++++++
 2 files changed, 338 insertions(+)
 create mode 100644 src/test/modules/test_misc/t/000_csn_perf.pl

diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 9c50de7efb0..1c385123448 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -9,6 +9,7 @@ tests += {
        'enable_injection_points': get_option('injection_points') ? 'yes' : 'no',
     },
     'tests': [
+      't/000_csn_perf.pl',
       't/001_constraint_validation.pl',
       't/002_tablespace.pl',
       't/003_check_guc.pl',
diff --git a/src/test/modules/test_misc/t/000_csn_perf.pl b/src/test/modules/test_misc/t/000_csn_perf.pl
new file mode 100644
index 00000000000..3915878a407
--- /dev/null
+++ b/src/test/modules/test_misc/t/000_csn_perf.pl
@@ -0,0 +1,337 @@
+
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Quick performance test for snapshot visibility checks on a hot standby
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(time);
+
+my $duration = 15; # seconds
+my $miniterations = 3;
+
+# Initialize a test cluster
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init();
+# Enable streaming replication and allow enough connections for the tests
+$primary->append_conf('postgresql.conf', 'max_wal_senders = 5');
+$primary->append_conf('postgresql.conf', 'wal_level=replica');
+$primary->append_conf('postgresql.conf', 'max_connections = 1005');
+$primary->start;
+$primary->backup('bkp');
+
+my $replica = PostgreSQL::Test::Cluster->new('replica');
+$replica->init_from_backup($primary, 'bkp', has_streaming => 1);
+$replica->append_conf('postgresql.conf', "shared_buffers='1 GB'");
+$replica->start;
+
+sub wait_catchup
+{
+	my ($primary, $replica) = @_;
+	
+	my $primary_lsn =
+	  $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+	my $caughtup_query =
+	  "SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()";
+	$replica->poll_query_until('postgres', $caughtup_query)
+	  or die "Timed out while waiting for standby to catch up";
+}
+
+sub repeat_and_time_sql
+{
+  	my ($name, $node, $sql) = @_;
+
+	my $session =  $node->background_psql('postgres', on_error_die => 1);
+	$session->query_safe("SET max_parallel_workers_per_gather=0");
+
+	my $iterations = 0;
+
+	my $now;
+	my $elapsed;
+    my $begin_time = time();
+	while (1) {
+		$session->query_safe($sql);
+		$now = time();
+		$iterations = $iterations + 1;
+
+		$elapsed = $now - $begin_time;
+		if ($elapsed > $duration && $iterations >= $miniterations) {
+			last;
+		}
+	}
+
+	my $periter = $elapsed / $iterations;
+
+	pass ("TEST $name: $elapsed s, $iterations iterations, $periter s / iteration");
+}
+
+
+$primary->safe_psql('postgres', "CREATE TABLE little (i int);");
+$primary->safe_psql('postgres', "INSERT INTO little VALUES (1);");
+
+sub consume_xids
+{
+	my ($node) = @_;
+
+	my $session = $node->background_psql('postgres', on_error_die => 1);
+	for(my $i = 0; $i < 20; $i++) {
+		$session->query_safe(q{do $$
+  begin
+    for i in 1..50 loop
+      begin
+        DELETE from little;
+        perform 1 / 0;
+      exception
+        when division_by_zero then perform 0 /* do nothing */;
+        when others then raise 'fail: %', sqlerrm;
+      end;
+    end loop;
+  end
+$$;});
+	}
+	$session->quit;
+}
+
+# TEST few-xacts
+#
+# Cycle through 4 different top-level XIDs
+#
+# 1001, 1002, 1003, 1004, 1001, 1002, 1003, 1004, ...
+#
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 4;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("few-xacts", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST many-xacts
+#
+# like few-xacts, but we cycle through 100 different XIDs instead of 4.
+#
+# 1001, 1002, 1003, ... 1100, 1001, 1002, 1003, ... 1100  ....
+#
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 100;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-xacts", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST many-xacts-wide-apart
+#
+# like many-xacts, but the XIDs are more spread out, so that they don't fit in the
+# SLRU caches.
+#
+# 1000, 2000, 3000, 4000, ....
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 100;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+
+		consume_xids($primary);
+
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-xacts-wide-apart", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: few-subxacts
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 4;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("few-subxacts", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+
+# TEST: many-subxacts
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-subxacts", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: many-subxacts-wide-apart
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		consume_xids($primary);
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-subxacts-wide-apart", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: insert-all-different-xids
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+
+	my @primary_sessions = ();
+	my $num_connections = 1000;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("INSERT INTO tbl VALUES ($i)");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("insert-all-different-xids", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: insert-all-different-subxids
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i; INSERT INTO tbl VALUES($i); release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("insert-all-different-subxids", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+done_testing();
-- 
2.39.5

v6-0009-Use-CSN-snapshots-during-Hot-Standby.patch (text/x-patch; charset=UTF-8)

From 7fec26347c80d42f0243f0d3328b38c69105a41f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 1 Apr 2025 00:16:17 +0300
Subject: [PATCH v6 09/12] Use CSN snapshots during Hot Standby

Replace the known-assigned-XIDs mechanism with a CSN log. The CSN log
(pg_csn) tracks the commit LSN of each transaction, when replaying the
WAL on a standby. It's only used on the standby, and is initialized
from scratch at server startup like pg_subtrans.

Based on 0001-CSN-base-snapshot.patch from
https://www.postgresql.org/message-id/2020081009525213277261%40highgo.ca.
This patch has a long lineage: various CSN patches have been posted,
with parts from Stas Kelvich, Movead Li, Ants Aasma, Heikki
Linnakangas, and Alexander Kuzmenkov.
---
 contrib/pg_visibility/pg_visibility.c         |    1 +
 src/backend/access/rmgrdesc/xactdesc.c        |   26 -
 src/backend/access/transam/Makefile           |    1 +
 src/backend/access/transam/csn_log.c          |  469 +++++
 src/backend/access/transam/meson.build        |    1 +
 src/backend/access/transam/transam.c          |    3 +
 src/backend/access/transam/twophase.c         |   34 +-
 src/backend/access/transam/varsup.c           |    1 +
 src/backend/access/transam/xact.c             |  138 +-
 src/backend/access/transam/xlog.c             |  118 +-
 src/backend/access/transam/xlogrecovery.c     |   13 +-
 src/backend/access/transam/xlogutils.c        |    2 +-
 src/backend/backup/basebackup.c               |    3 +
 src/backend/postmaster/startup.c              |    2 +-
 src/backend/replication/logical/decode.c      |    8 -
 src/backend/replication/logical/snapbuild.c   |    2 +-
 src/backend/storage/ipc/ipci.c                |    3 +
 src/backend/storage/ipc/procarray.c           | 1538 ++---------------
 src/backend/storage/ipc/standby.c             |  102 +-
 src/backend/storage/lmgr/lwlock.c             |    2 +
 .../utils/activity/wait_event_names.txt       |    1 +
 src/backend/utils/probes.d                    |    2 +
 src/backend/utils/time/snapmgr.c              |   34 +-
 src/bin/initdb/initdb.c                       |    3 +-
 src/bin/pg_rewind/filemap.c                   |    3 +
 src/include/access/csn_log.h                  |   30 +
 src/include/access/transam.h                  |    3 +
 src/include/access/twophase.h                 |    3 +-
 src/include/access/xact.h                     |   12 +-
 src/include/access/xlogutils.h                |   33 +-
 src/include/storage/lwlock.h                  |    2 +
 src/include/storage/procarray.h               |   13 +-
 src/include/utils/snapshot.h                  |    8 +
 33 files changed, 821 insertions(+), 1793 deletions(-)
 create mode 100644 src/backend/access/transam/csn_log.c
 create mode 100644 src/include/access/csn_log.h

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index d79ef35006b..c5c7a4dd2c3 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -607,6 +607,7 @@ collect_visibility_data(Oid relid, bool include_pd)
  *    now perform minimal checking on a standby by always using nextXid, this
  *    approach is better than nothing and will at least catch extremely broken
  *    cases where a xid is in the future.
+ *    XXX KnownAssignedXids is gone.
  * 3. Ignore walsender xmin, because it could go backward if some replication
  *    connections don't use replication slots.
  *
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 715cc1f7bad..56f7bd81780 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -422,17 +422,6 @@ xact_desc_prepare(StringInfo buf, uint8 info, xl_xact_prepare *xlrec, RepOriginI
 						 timestamptz_to_str(parsed.origin_timestamp));
 }
 
-static void
-xact_desc_assignment(StringInfo buf, xl_xact_assignment *xlrec)
-{
-	int			i;
-
-	appendStringInfoString(buf, "subxacts:");
-
-	for (i = 0; i < xlrec->nsubxacts; i++)
-		appendStringInfo(buf, " %u", xlrec->xsub[i]);
-}
-
 void
 xact_desc(StringInfo buf, XLogReaderState *record)
 {
@@ -460,18 +449,6 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		xact_desc_prepare(buf, XLogRecGetInfo(record), xlrec,
 						  XLogRecGetOrigin(record));
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) rec;
-
-		/*
-		 * Note that we ignore the WAL record's xid, since we're more
-		 * interested in the top-level xid that issued the record and which
-		 * xids are being reported here.
-		 */
-		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
-		xact_desc_assignment(buf, xlrec);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		xl_xact_invals *xlrec = (xl_xact_invals *) rec;
@@ -503,9 +480,6 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ABORT_PREPARED:
 			id = "ABORT_PREPARED";
 			break;
-		case XLOG_XACT_ASSIGNMENT:
-			id = "ASSIGNMENT";
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			id = "INVALIDATION";
 			break;
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..2520d77c7c8 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	clog.o \
 	commit_ts.o \
+	csn_log.o \
 	generic_xlog.o \
 	multixact.o \
 	parallel.o \
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
new file mode 100644
index 00000000000..40673c8579f
--- /dev/null
+++ b/src/backend/access/transam/csn_log.c
@@ -0,0 +1,469 @@
+/*-----------------------------------------------------------------------------
+ *
+ * csn_log.c
+ *		Track commit record LSNs of finished transactions
+ *
+ * This module provides an SLRU to store the LSN of the commit record of each
+ * transaction. CSN stands for Commit Sequence Number, and in principle we
+ * could use a separate counter that is incremented at every commit. For
+ * simplicity, though, we use the commit record's LSN as the sequence number.
+ *
+ * Like pg_subtrans, this mapping needs to be kept only for XIDs greater than
+ * oldestXmin, and doesn't need to be preserved over crashes.  Also, this is
+ * only needed in hot standby mode, and immediately after exiting hot standby
+ * mode, until all old snapshots taken during standby mode are gone.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_log.c
+ *
+ *-----------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/clog.h"
+#include "access/csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+/*
+ * Defines for CSNLog page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE, and CSNLog segment numbering at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCSNLog (see CSNLogPagePrecedes).
+ */
+
+/* We store the commit CSN for each xid */
+#define CSN_LOG_XACTS_PER_PAGE (BLCKSZ / sizeof(XLogRecPtr))
+
+#define TransactionIdToPage(xid)	((xid) / (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+
+#define PgIndexToTransactionId(pageno, idx) (CSN_LOG_XACTS_PER_PAGE * (pageno) + idx)
+
+
+
+/*
+ * Link to shared-memory data structures for CSNLog control
+ */
+static SlruCtlData CSNLogCtlData;
+#define CsnlogCtl (&CSNLogCtlData)
+
+static int	ZeroCSNLogPage(int pageno);
+static bool CSNLogPagePrecedes(int64 page1, int64 page2);
+static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+								TransactionId *subxids,
+								XLogRecPtr csn, int pageno);
+static void CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn,
+							   int slotno);
+
+
+/*
+ * Record commit LSN of a transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transaction ID for a top level commit.
+ *
+ * subxids is an array of xids of length nsubxids, in logical XID order,
+ * representing subtransactions in the tree of XIDs. In various cases nsubxids
+ * may be zero.
+ *
+ * commitLsn is the LSN of the commit record.  This is currently never called
+ * for aborted transactions.
+ */
+void
+CSNLogSetCSN(TransactionId xid, int nsubxids, TransactionId *subxids,
+			 XLogRecPtr commitLsn)
+{
+	int			pageno;
+	int			i = 0;
+	int			offset = 0;
+
+	Assert(TransactionIdIsValid(xid));
+
+	pageno = TransactionIdToPage(xid);	/* get page of parent */
+	for (;;)
+	{
+		int			num_on_page = 0;
+
+		while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+		{
+			num_on_page++;
+			i++;
+		}
+
+		CSNLogSetPageStatus(xid,
+							num_on_page, subxids + offset,
+							commitLsn, pageno);
+		if (i >= nsubxids)
+			break;
+
+		offset = i;
+		pageno = TransactionIdToPage(subxids[offset]);
+		xid = InvalidTransactionId;
+	}
+}
+
+/*
+ * Record the final state of transaction entries in the CSN log for all
+ * entries on a single page.  Atomic only on this page.
+ *
+ * Otherwise API is same as CSNLogSetCSN()
+ */
+static void
+CSNLogSetPageStatus(TransactionId xid, int nsubxids, TransactionId *subxids,
+					XLogRecPtr commitLsn, int pageno)
+{
+	int			slotno;
+	int			i;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CsnlogCtl, pageno, true, xid);
+
+	/* Subtransactions first, if needed ... */
+	for (i = 0; i < nsubxids; i++)
+	{
+		Assert(CsnlogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+		CSNLogSetCSNInSlot(subxids[i], commitLsn, slotno);
+	}
+
+	/* ... then the main transaction */
+	if (TransactionIdIsValid(xid))
+		CSNLogSetCSNInSlot(xid, commitLsn, slotno);
+
+	CsnlogCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Sets the CSN of a single transaction.
+ */
+static void
+CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn, int slotno)
+{
+	int			entryno = TransactionIdToPgIndex(xid);
+	XLogRecPtr *ptr;
+
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+
+	*ptr = csn;
+}
+
+/*
+ * Interrogate the state of a transaction in the log.
+ *
+ * NB: this is a low-level routine and is NOT the preferred entry point
+ * for most uses; TransactionIdGetXidCSN() in csn_snapshot.c is the
+ * intended caller.
+ */
+XLogRecPtr
+CSNLogGetCSNByXid(TransactionId xid)
+{
+	int			pageno = TransactionIdToPage(xid);
+	int			entryno = TransactionIdToPgIndex(xid);
+	int			slotno;
+	XLogRecPtr *ptr;
+	XLogRecPtr	xid_csn;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Can't ask about stuff that might not be around anymore */
+	Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+
+	slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+	xid_csn = *ptr;
+
+	LWLockRelease(SimpleLruGetBankLock(CsnlogCtl, pageno));
+
+	return xid_csn;
+}
+
+/*
+ * Number of shared CSNLog buffers.
+ */
+static Size
+CSNLogShmemBuffers(void)
+{
+	return Min(32, Max(16, NBuffers / 512));
+}
+
+/*
+ * Reserve shared memory for CsnlogCtl.
+ */
+Size
+CSNLogShmemSize(void)
+{
+	/* FIXME: skip if not InHotStandby? */
+	return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
+}
+
+/*
+ * Initialization of shared memory for CSNLog.
+ */
+void
+CSNLogShmemInit(void)
+{
+	CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
+	SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
+				  "pg_csn", LWTRANCHE_CSN_LOG_BUFFER,
+				  LWTRANCHE_CSN_LOG_SLRU, SYNC_HANDLER_NONE, false);
+	SlruPagePrecedesUnitTests(CsnlogCtl, CSN_LOG_XACTS_PER_PAGE);
+}
+
+/*
+ * This func must be called ONCE on system install.  It creates the initial
+ * CSNLog segment.  The pg_csn directory is assumed to have been
+ * created by initdb, and CSNLogShmemInit must have been called already.
+ *
+ * Note: it's not really necessary to create the initial segment now,
+ * since slru.c would create it on first write anyway.  But we may as well
+ * do it to be sure the directory is set up correctly.
+ */
+void
+BootStrapCSNLog(void)
+{
+	int			slotno;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, 0);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Create and zero the first page of the CSN log */
+	slotno = ZeroCSNLogPage(0);
+
+	/* Make sure it's written out */
+	SimpleLruWritePage(CsnlogCtl, slotno);
+	Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Initialize (or reinitialize) a page of CSNLog to zeroes.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * The bank lock for the page must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCSNLogPage(int pageno)
+{
+	return SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * Initialize a page of CSNLog based on pg_xact.
+ *
+ * All committed transactions are stamped with 'csn'
+ */
+static void
+InitCSNLogPage(int pageno, TransactionId *xid, TransactionId nextXid, XLogRecPtr csn)
+{
+	XLogRecPtr	dummy;
+	int			slotno;
+
+	slotno = ZeroCSNLogPage(pageno);
+
+	while (TransactionIdPrecedes(*xid, nextXid) && TransactionIdToPage(*xid) == pageno)
+	{
+		XidStatus	status = TransactionIdGetStatus(*xid, &dummy);
+
+		/* Only committed XIDs are stamped; aborted ones keep no CSN */
+		if (status == TRANSACTION_STATUS_COMMITTED)
+			CSNLogSetCSNInSlot(*xid, csn, slotno);
+
+		TransactionIdAdvance(*xid);
+	}
+	SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * after StartupXLOG has initialized TransamVariables->nextXid, and after
+ * initializing the CLOG.
+ *
+ * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
+ * if there are none.
+ *
+ * All transactions that have already completed are marked with 'csn'. ('csn'
+ * is supposed to be older than anything we'll ever need to compare with.)
+ */
+void
+StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn)
+{
+	TransactionId xid;
+	FullTransactionId nextXid;
+	int			startPage;
+	int			endPage;
+	LWLock	   *prevlock = NULL;
+	LWLock	   *lock;
+
+	/*
+	 * Since we don't expect pg_csn to be valid across crashes, we initialize
+	 * the currently-active page(s) to zeroes during startup. Whenever we
+	 * advance into a new page, ExtendCSNLog will likewise zero the new page
+	 * without regard to whatever was previously on disk.
+	 */
+	startPage = TransactionIdToPage(oldestActiveXID);
+	nextXid = TransamVariables->nextXid;
+	endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid));
+
+	Assert(TransactionIdIsValid(oldestActiveXID));
+	Assert(FullTransactionIdIsValid(nextXid));
+
+	xid = oldestActiveXID;
+	for (;;)
+	{
+		lock = SimpleLruGetBankLock(CsnlogCtl, startPage);
+		if (prevlock != lock)
+		{
+			if (prevlock)
+				LWLockRelease(prevlock);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			prevlock = lock;
+		}
+
+		InitCSNLogPage(startPage, &xid, XidFromFullTransactionId(nextXid), csn);
+		if (startPage == endPage)
+			break;
+
+		startPage++;
+		/* must account for wraparound */
+		if (startPage > TransactionIdToPage(MaxTransactionId))
+			startPage = 0;
+	}
+
+	LWLockRelease(lock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely as a debugging aid.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(false);
+	SimpleLruWriteAll(CsnlogCtl, false);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely to improve the odds that writing of dirty pages is done by
+	 * the checkpoint process and not by backends.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(true);
+	SimpleLruWriteAll(CsnlogCtl, true);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(true);
+}
+
+/*
+ * Make sure that CSNLog has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CSNLog page to make room
+ * in shared memory.
+ */
+void
+ExtendCSNLog(TransactionId newestXact)
+{
+	int64		pageno;
+	LWLock	   *lock;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToPgIndex(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToPage(newestXact);
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Zero the page (no XLOG entry is made for it) */
+	ZeroCSNLogPage(pageno);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Remove all CSNLog segments before the one holding the passed
+ * transaction ID.
+ *
+ * This is normally called during checkpoint, with oldestXact being the
+ * oldest TransactionXmin of any running transaction.
+ */
+void
+TruncateCSNLog(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate. We step
+	 * back one transaction to avoid passing a cutoff page that hasn't been
+	 * created yet in the rare case that oldestXact would be the first item on
+	 * a page and oldestXact == next XID.  In that case, if we didn't subtract
+	 * one, we'd trigger SimpleLruTruncate's wraparound detection.
+	 */
+	TransactionIdRetreat(oldestXact);
+	cutoffPage = TransactionIdToPage(oldestXact);
+
+	SimpleLruTruncate(CsnlogCtl, cutoffPage);
+}
+
+/*
+ * Decide which of two CSNLog page numbers is "older" for truncation purposes.
+ * Analogous to CLOGPagePrecedes() and SubTransPagePrecedes().
+ */
+static bool
+CSNLogPagePrecedes(int64 page1, int64 page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * CSN_LOG_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId + 1;
+	xid2 = ((TransactionId) page2) * CSN_LOG_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId + 1;
+
+	return (TransactionIdPrecedes(xid1, xid2) &&
+			TransactionIdPrecedes(xid1, xid2 + CSN_LOG_XACTS_PER_PAGE - 1));
+}
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..e2a3419fc22 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -2,6 +2,7 @@
 
 backend_sources += files(
   'clog.c',
   'commit_ts.c',
+  'csn_log.c',
   'generic_xlog.c',
   'multixact.c',
diff --git a/src/backend/access/transam/transam.c b/src/backend/access/transam/transam.c
index 9a39451a29a..b4c42c0f156 100644
--- a/src/backend/access/transam/transam.c
+++ b/src/backend/access/transam/transam.c
@@ -377,6 +377,9 @@ TransactionIdLatest(TransactionId mainxid,
  * Also, because we group transactions on the same clog page to conserve
  * storage, we might return the LSN of a later transaction that falls into
  * the same group.
+ *
+ * XXX: Now that we have the CSN log, should we use that during recovery? Or
+ * should we rename this function to reduce confusion?
  */
 XLogRecPtr
 TransactionIdGetCommitLSN(TransactionId xid)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 73a80559194..2330632e569 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1943,20 +1944,13 @@ restoreTwoPhaseData(void)
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return TransamVariables->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
- *
- * If xids_p and nxids_p are not NULL, pointer to a palloc'd array of all
- * top-level xids is stored in *xids_p. The number of entries in the array
- * is returned in *nxids_p.
  */
 TransactionId
-PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
+PrescanPreparedTransactions(void)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
 	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId result = origNextXid;
-	TransactionId *xids = NULL;
-	int			nxids = 0;
-	int			allocsize = 0;
 	int			i;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
@@ -1984,34 +1978,10 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		if (TransactionIdPrecedes(xid, result))
 			result = xid;
 
-		if (xids_p)
-		{
-			if (nxids == allocsize)
-			{
-				if (nxids == 0)
-				{
-					allocsize = 10;
-					xids = palloc(allocsize * sizeof(TransactionId));
-				}
-				else
-				{
-					allocsize = allocsize * 2;
-					xids = repalloc(xids, allocsize * sizeof(TransactionId));
-				}
-			}
-			xids[nxids++] = xid;
-		}
-
 		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 
-	if (xids_p)
-	{
-		*xids_p = xids;
-		*nxids_p = nxids;
-	}
-
 	return result;
 }
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index fe895787cb7..a495f1d7899 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -15,6 +15,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..5250a158145 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -210,7 +211,6 @@ typedef struct TransactionStateData
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;	/* entry-time xact r/o state */
 	bool		startedInRecovery;	/* did we start in recovery? */
-	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		parallelChildXact;	/* is any parent transaction parallel? */
 	bool		chain;			/* start a new block after this one */
@@ -250,13 +250,6 @@ static TransactionStateData TopTransactionStateData = {
 	.topXidLogged = false,
 };
 
-/*
- * unreportedXids holds XIDs of all subtransactions that have not yet been
- * reported in an XLOG_XACT_ASSIGNMENT record.
- */
-static int	nUnreportedXids;
-static TransactionId unreportedXids[PGPROC_MAX_CACHED_SUBXIDS];
-
 static TransactionState CurrentTransactionState = &TopTransactionStateData;
 
 /*
@@ -532,18 +525,6 @@ GetCurrentFullTransactionIdIfAny(void)
 	return CurrentTransactionState->fullTransactionId;
 }
 
-/*
- *	MarkCurrentTransactionIdLoggedIfAny
- *
- * Remember that the current xid - if it is assigned - now has been wal logged.
- */
-void
-MarkCurrentTransactionIdLoggedIfAny(void)
-{
-	if (FullTransactionIdIsValid(CurrentTransactionState->fullTransactionId))
-		CurrentTransactionState->didLogXid = true;
-}
-
 /*
  * IsSubxactTopXidLogPending
  *
@@ -636,7 +617,6 @@ AssignTransactionId(TransactionState s)
 {
 	bool		isSubXact = (s->parent != NULL);
 	ResourceOwner currentOwner;
-	bool		log_unknown_top = false;
 
 	/* Assert that caller didn't screw up */
 	Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -680,20 +660,6 @@ AssignTransactionId(TransactionState s)
 		pfree(parents);
 	}
 
-	/*
-	 * When wal_level=logical, guarantee that a subtransaction's xid can only
-	 * be seen in the WAL stream if its toplevel xid has been logged before.
-	 * If necessary we log an xact_assignment record with fewer than
-	 * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
-	 * for a transaction even though it appears in a WAL record, we just might
-	 * superfluously log something. That can happen when an xid is included
-	 * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
-	 * xl_standby_locks.
-	 */
-	if (isSubXact && XLogLogicalInfoActive() &&
-		!TopTransactionStateData.didLogXid)
-		log_unknown_top = true;
-
 	/*
 	 * Generate a new FullTransactionId and record its xid in PGPROC and
 	 * pg_subtrans.
@@ -729,59 +695,6 @@ AssignTransactionId(TransactionState s)
 	XactLockTableInsert(XidFromFullTransactionId(s->fullTransactionId));
 
 	CurrentResourceOwner = currentOwner;
-
-	/*
-	 * Every PGPROC_MAX_CACHED_SUBXIDS assigned transaction ids within each
-	 * top-level transaction we issue a WAL record for the assignment. We
-	 * include the top-level xid and all the subxids that have not yet been
-	 * reported using XLOG_XACT_ASSIGNMENT records.
-	 *
-	 * This is required to limit the amount of shared memory required in a hot
-	 * standby server to keep track of in-progress XIDs. See notes for
-	 * RecordKnownAssignedTransactionIds().
-	 *
-	 * We don't keep track of the immediate parent of each subxid, only the
-	 * top-level transaction that each subxact belongs to. This is correct in
-	 * recovery only because aborted subtransactions are separately WAL
-	 * logged.
-	 *
-	 * This is correct even for the case where several levels above us didn't
-	 * have an xid assigned as we recursed up to them beforehand.
-	 */
-	if (isSubXact && XLogStandbyInfoActive())
-	{
-		unreportedXids[nUnreportedXids] = XidFromFullTransactionId(s->fullTransactionId);
-		nUnreportedXids++;
-
-		/*
-		 * ensure this test matches similar one in
-		 * RecoverPreparedTransactions()
-		 */
-		if (nUnreportedXids >= PGPROC_MAX_CACHED_SUBXIDS ||
-			log_unknown_top)
-		{
-			xl_xact_assignment xlrec;
-
-			/*
-			 * xtop is always set by now because we recurse up transaction
-			 * stack to the highest unassigned xid and then come back down
-			 */
-			xlrec.xtop = GetTopTransactionId();
-			Assert(TransactionIdIsValid(xlrec.xtop));
-			xlrec.nsubxacts = nUnreportedXids;
-
-			XLogBeginInsert();
-			XLogRegisterData(&xlrec, MinSizeOfXactAssignment);
-			XLogRegisterData(unreportedXids,
-							 nUnreportedXids * sizeof(TransactionId));
-
-			(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT);
-
-			nUnreportedXids = 0;
-			/* mark top, not current xact as having been logged */
-			TopTransactionStateData.didLogXid = true;
-		}
-	}
 }
 
 /*
@@ -1481,11 +1394,11 @@ RecordTransactionCommit(void)
 	 * temp tables will be lost anyway, unlogged tables will be truncated and
 	 * HOT pruning will be done again later. (Given the foregoing, you might
 	 * think that it would be unnecessary to emit the XLOG record at all in
-	 * this case, but we don't currently try to do that.  It would certainly
-	 * cause problems at least in Hot Standby mode, where the
-	 * KnownAssignedXids machinery requires tracking every XID assignment.  It
-	 * might be OK to skip it only when wal_level < replica, but for now we
-	 * don't.)
+	 * this case, but we don't currently try to do that.  It might cause
+	 * inefficiencies in Hot Standby mode, if nothing else, where the
+	 * commit/abort records allow advancing the xmin horizon for new
+	 * snapshots. It might be OK to skip it only when wal_level < replica, but
+	 * for now we don't.)
 	 *
 	 * However, if we're doing cleanup of any non-temp rels or committing any
 	 * command that wanted to force sync commit, then we must flush XLOG
@@ -1953,13 +1866,6 @@ AtSubAbort_childXids(void)
 	s->childXids = NULL;
 	s->nChildXids = 0;
 	s->maxChildXids = 0;
-
-	/*
-	 * We could prune the unreportedXids array here. But we don't bother. That
-	 * would potentially reduce number of XLOG_XACT_ASSIGNMENT records but it
-	 * would likely introduce more CPU time into the more common paths, so we
-	 * choose not to do that.
-	 */
 }
 
 /* ----------------------------------------------------------------
@@ -2142,12 +2048,6 @@ StartTransaction(void)
 	currentCommandId = FirstCommandId;
 	currentCommandIdUsed = false;
 
-	/*
-	 * initialize reported xid accounting
-	 */
-	nUnreportedXids = 0;
-	s->didLogXid = false;
-
 	/*
 	 * must initialize resource-management stuff first
 	 */
@@ -6154,7 +6054,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 	TransactionTreeSetCommitTsData(xid, parsed->nsubxacts, parsed->subxacts,
 								   commit_time, origin_id);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/*
 		 * Mark the transaction committed in pg_xact.
@@ -6174,6 +6074,12 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		 */
 		RecordKnownAssignedTransactionIds(max_xid);
 
+		/*
+		 * Mark the CSNLOG first.  The transaction won't become visible to new
+		 * snapshots until the call to ProcArrayRecoveryEndTransaction().
+		 */
+		CSNLogSetCSN(xid, parsed->nsubxacts, parsed->subxacts, lsn);
+
 		/*
 		 * Mark the transaction committed in pg_xact. We use async commit
 		 * protocol during recovery to provide information on database
@@ -6186,9 +6092,9 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		TransactionIdAsyncCommitTree(xid, parsed->nsubxacts, parsed->subxacts, lsn);
 
 		/*
-		 * We must mark clog before we update the ProcArray.
+		 * Make the commit visible to new snapshots in the ProcArray.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ProcArrayRecoveryEndTransaction(max_xid, lsn);
 
 		/*
 		 * Send any cache invalidations attached to the commit. We must
@@ -6294,7 +6200,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 								  parsed->subxacts);
 	AdvanceNextFullTransactionIdPastXid(max_xid);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/* Mark the transaction aborted in pg_xact, no need for async stuff */
 		TransactionIdAbortTree(xid, parsed->nsubxacts, parsed->subxacts);
@@ -6312,13 +6218,15 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 		 */
 		RecordKnownAssignedTransactionIds(max_xid);
 
+		/* Note: we don't need to update the CSN log on abort. */
+
 		/* Mark the transaction aborted in pg_xact, no need for async stuff */
 		TransactionIdAbortTree(xid, parsed->nsubxacts, parsed->subxacts);
 
 		/*
 		 * We must update the ProcArray after we have marked clog.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ProcArrayRecoveryEndTransaction(max_xid, lsn);
 
 		/*
 		 * There are no invalidation messages to send or undo.
@@ -6426,14 +6334,6 @@ xact_redo(XLogReaderState *record)
 					   XLogRecGetOrigin(record));
 		LWLockRelease(TwoPhaseStateLock);
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record);
-
-		if (standbyState >= STANDBY_INITIALIZED)
-			ProcArrayApplyXidAssignment(xlrec->xtop,
-										xlrec->nsubxacts, xlrec->xsub);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fc30a52d496..cbeac223e1c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -48,6 +48,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
@@ -951,8 +952,6 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	END_CRIT_SECTION();
 
-	MarkCurrentTransactionIdLoggedIfAny();
-
 	/*
 	 * Mark top transaction id is logged (if needed) so that we should not try
 	 * to log it again with the next WAL record in the current subtransaction.
@@ -5230,6 +5229,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCSNLog();
 	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
@@ -5831,16 +5831,16 @@ StartupXLOG(void)
 		 */
 		if (ArchiveRecoveryRequested && EnableHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
+			FullTransactionId latestCompletedXid;
 
 			ereport(DEBUG1,
 					(errmsg_internal("initializing for hot standby")));
+			InHotStandby = true;
 
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
-				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanPreparedTransactions();
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -5855,39 +5855,17 @@ StartupXLOG(void)
 			 */
 			StartupSUBTRANS(oldestActiveXID);
 
-			/*
-			 * If we're beginning at a shutdown checkpoint, we know that
-			 * nothing was running on the primary at this point. So fake-up an
-			 * empty running-xacts record and use that here and now. Recover
-			 * additional standby state for prepared transactions.
-			 */
-			if (wasShutdown)
-			{
-				RunningTransactionsData running;
-				TransactionId latestCompletedXid;
+			latestCompletedXid = checkPoint.nextXid;
+			FullTransactionIdRetreat(&latestCompletedXid);
+			TransamVariables->latestCompletedXid = latestCompletedXid;
 
-				/* Update pg_subtrans entries for any prepared transactions */
-				StandbyRecoverPreparedTransactions();
+			StartupCSNLog(oldestActiveXID, RedoRecPtr);
 
-				/*
-				 * Construct a RunningTransactions snapshot representing a
-				 * shut down server, with only prepared transactions still
-				 * alive. We're never overflowed at this point because all
-				 * subxids are listed with their parent prepared transactions.
-				 */
-				running.xcnt = nxids;
-				running.subxcnt = 0;
-				running.subxid_status = SUBXIDS_IN_SUBTRANS;
-				running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-				running.oldestRunningXid = oldestActiveXID;
-				latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-				TransactionIdRetreat(latestCompletedXid);
-				Assert(TransactionIdIsNormal(latestCompletedXid));
-				running.latestCompletedXid = latestCompletedXid;
-				running.xids = xids;
-
-				ProcArrayApplyRecoveryInfo(&running);
-			}
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
+
+			/* Update pg_subtrans entries for any prepared transactions */
+			if (wasShutdown)
+				StandbyRecoverPreparedTransactions();
 		}
 
 		/*
@@ -5971,7 +5949,7 @@ StartupXLOG(void)
 	 * This information is not quite needed yet, but it is positioned here so
 	 * as potential problems are detected before any on-disk change is done.
 	 */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanPreparedTransactions();
 
 	/*
 	 * Allow ordinary WAL segment creation before possibly switching to a new
@@ -6137,9 +6115,18 @@ StartupXLOG(void)
 	 * Start up subtrans, if not already done for hot standby.  (commit
 	 * timestamps are started below, if necessary.)
 	 */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
+	{
 		StartupSUBTRANS(oldestActiveXID);
 
+		/*
+		 * TODO: we don't need to update CSN log from now on, but it's still
+		 * required by snapshots that were taken before recovery ended.  We
+		 * just let it be, but it would be nice to truncate it to 0 after all
+		 * the snapshots are gone.
+		 */
+	}
+
 	/*
 	 * Perform end of recovery actions for any SLRUs that need it.
 	 */
@@ -6225,12 +6212,12 @@ StartupXLOG(void)
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
 	 * and after switching SharedRecoveryState to RECOVERY_STATE_DONE so as
-	 * any session building a snapshot will not rely on KnownAssignedXids as
+	 * any session building a snapshot will not rely on the CSN log as
 	 * RecoveryInProgress() would return false at this stage.  This is
 	 * particularly critical for prepared 2PC transactions, that would still
 	 * need to be included in snapshots once recovery has ended.
 	 */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/*
@@ -7002,7 +6989,7 @@ CreateCheckPoint(int flags)
 	 * starting snapshot of locks and transactions.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		checkPoint.oldestActiveXid = GetOldestActiveTransactionId();
+		checkPoint.oldestActiveXid = GetOldestActiveTransactionId(true);
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -7396,6 +7383,9 @@ CreateCheckPoint(int flags)
 	if (!RecoveryInProgress())
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
 
+	if (shutdown)
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(false);
 
@@ -7567,6 +7557,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
 	CheckPointCLOG();
+	CheckPointCSNLog();
 	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
@@ -7863,7 +7854,10 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
+	{
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+	}
 
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(true);
@@ -8348,41 +8342,17 @@ xlog_redo(XLogReaderState *record)
 
 		/*
 		 * If we see a shutdown checkpoint, we know that nothing was running
-		 * on the primary at this point. So fake-up an empty running-xacts
-		 * record and use that here and now. Recover additional standby state
-		 * for prepared transactions.
+		 * on the primary at this point, except for prepared transactions.
 		 */
-		if (standbyState >= STANDBY_INITIALIZED)
+		if (InHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
 			TransactionId oldestActiveXID;
-			TransactionId latestCompletedXid;
-			RunningTransactionsData running;
 
-			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanPreparedTransactions();
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
 
 			/* Update pg_subtrans entries for any prepared transactions */
 			StandbyRecoverPreparedTransactions();
-
-			/*
-			 * Construct a RunningTransactions snapshot representing a shut
-			 * down server, with only prepared transactions still alive. We're
-			 * never overflowed at this point because all subxids are listed
-			 * with their parent prepared transactions.
-			 */
-			running.xcnt = nxids;
-			running.subxcnt = 0;
-			running.subxid_status = SUBXIDS_IN_SUBTRANS;
-			running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-			running.oldestRunningXid = oldestActiveXID;
-			latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-			TransactionIdRetreat(latestCompletedXid);
-			Assert(TransactionIdIsNormal(latestCompletedXid));
-			running.latestCompletedXid = latestCompletedXid;
-			running.xids = xids;
-
-			ProcArrayApplyRecoveryInfo(&running);
 		}
 
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
@@ -8446,6 +8416,16 @@ xlog_redo(XLogReaderState *record)
 								  checkPoint.oldestXid))
 			SetTransactionIdLimit(checkPoint.oldestXid,
 								  checkPoint.oldestXidDB);
+
+		/*
+		 * Remember the oldest XID that was running at the time.  Normally,
+		 * all transaction aborts and commits are WAL-logged, so our
+		 * oldestRunningXid value should be up-to-date, but if not, this
+		 * allows us to resynchronize.
+		 */
+		if (InHotStandby)
+			ProcArrayUpdateOldestRunningXid(checkPoint.oldestActiveXid);
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0aa3ab59085..b213b8a74dc 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1978,10 +1978,9 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	SpinLockRelease(&XLogRecoveryCtl->info_lck);
 
 	/*
-	 * If we are attempting to enter Hot Standby mode, process XIDs we see
+	 * In Hot Standby mode, process XIDs we see
 	 */
-	if (standbyState >= STANDBY_INITIALIZED &&
-		TransactionIdIsValid(record->xl_xid))
+	if (InHotStandby && TransactionIdIsValid(record->xl_xid))
 		RecordKnownAssignedTransactionIds(record->xl_xid);
 
 	/*
@@ -2258,7 +2257,7 @@ CheckRecoveryConsistency(void)
 	 * run? If so, we can tell postmaster that the database is consistent now,
 	 * enabling connections.
 	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY &&
+	if (InHotStandby &&
 		!LocalHotStandbyActive &&
 		reachedConsistency &&
 		IsUnderPostmaster)
@@ -3715,9 +3714,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						elog(LOG, "waiting for WAL to become available at %X/%X",
 							 LSN_FORMAT_ARGS(RecPtr));
 
-						/* Do background tasks that might benefit us later. */
-						KnownAssignedTransactionIdsIdleMaintenance();
-
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,
@@ -3983,9 +3979,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						streaming_reply_sent = true;
 					}
 
-					/* Do any background tasks that might benefit us later. */
-					KnownAssignedTransactionIdsIdleMaintenance();
-
 					/* Update pg_stat_recovery_prefetch before sleeping. */
 					XLogPrefetcherComputeStats(xlogprefetcher);
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index c389b27f77d..775e1a926d8 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -50,7 +50,7 @@ bool		ignore_invalid_pages = false;
 bool		InRecovery = false;
 
 /* Are we in Hot Standby mode? Only valid in startup process, see xlogutils.h */
-HotStandbyState standbyState = STANDBY_DISABLED;
+bool		InHotStandby = false;
 
 /*
  * During XLOG replay, we may see XLOG records for incremental updates of
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 891637e3a44..f1307ed714c 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -181,6 +181,9 @@ static const char *const excludeDirContents[] =
 	/* Contents zeroed on startup, see StartupSUBTRANS(). */
 	"pg_subtrans",
 
+	/* Contents zeroed on startup, see StartupCSNLog(). */
+	"pg_csn",
+
 	/* end of list */
 	NULL
 };
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 27e86cf393f..d04286ab270 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -203,7 +203,7 @@ static void
 StartupProcExit(int code, Datum arg)
 {
 	/* Shutdown the recovery environment */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 }
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6a428e9720e..808b1d85379 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -270,14 +270,6 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
-		case XLOG_XACT_ASSIGNMENT:
-
-			/*
-			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here. See
-			 * LogicalDecodingProcessRecord.
-			 */
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			{
 				TransactionId xid;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 3c94a62cdf6..97d278052df 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -27,7 +27,7 @@
  * removed. This is achieved by using the replication slot mechanism.
  *
  * As the percentage of transactions modifying the catalog normally is fairly
- * small in comparisons to ones only manipulating user data, we keep track of
+ * small in comparison to ones only manipulating user data, we keep track of
  * the committed catalog modifying ones inside [xmin, xmax) instead of keeping
  * track of all running transactions like it's done in a normal snapshot. Note
  * that we're generally only looking at transactions that have acquired an
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..fc9804b2eab 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/subtrans.h"
@@ -122,6 +123,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, XLogRecoveryShmemSize());
 	size = add_size(size, CLOGShmemSize());
+	size = add_size(size, CSNLogShmemSize());
 	size = add_size(size, CommitTsShmemSize());
 	size = add_size(size, SUBTRANSShmemSize());
 	size = add_size(size, TwoPhaseShmemSize());
@@ -287,6 +289,7 @@ CreateOrAttachShmemStructs(void)
 	XLogPrefetchShmemInit();
 	XLogRecoveryShmemInit();
 	CLOGShmemInit();
+	CSNLogShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 819649741f6..3418ddf5304 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -19,20 +19,10 @@
  * myProcLocks lists.  They can be distinguished from regular backend PGPROCs
  * at need by checking for pid == 0.
  *
- * During hot standby, we also keep a list of XIDs representing transactions
- * that are known to be running on the primary (or more precisely, were running
- * as of the current point in the WAL stream).  This list is kept in the
- * KnownAssignedXids array, and is updated by watching the sequence of
- * arriving XIDs.  This is necessary because if we leave those XIDs out of
- * snapshots taken for standby queries, then they will appear to be already
- * complete, leading to MVCC failures.  Note that in hot standby, the PGPROC
- * array represents standby processes, which by definition are not running
- * transactions that have XIDs.
- *
- * It is perhaps possible for a backend on the primary to terminate without
- * writing an abort record for its transaction.  While that shouldn't really
- * happen, it would tie up KnownAssignedXids indefinitely, so we protect
- * ourselves by pruning the array when a valid list of running XIDs arrives.
+ * During hot standby, we don't have PGPROC entries representing transactions
+ * running in the primary.  Instead, snapshots taken during recovery contain a
+ * Commit-Sequence Number (CSN), which is used to determine which XIDs the
+ * snapshot still considers to be running.
  *
  * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -47,6 +37,7 @@
 
 #include <signal.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -74,22 +65,8 @@ typedef struct ProcArrayStruct
 	int			numProcs;		/* number of valid procs entries */
 	int			maxProcs;		/* allocated size of procs array */
 
-	/*
-	 * Known assigned XIDs handling
-	 */
-	int			maxKnownAssignedXids;	/* allocated size of array */
-	int			numKnownAssignedXids;	/* current # of valid entries */
-	int			tailKnownAssignedXids;	/* index of oldest valid element */
-	int			headKnownAssignedXids;	/* index of newest element, + 1 */
-
-	/*
-	 * Highest subxid that has been removed from KnownAssignedXids array to
-	 * prevent overflow; or InvalidTransactionId if none.  We track this for
-	 * similar reasons to tracking overflowing cached subxids in PGPROC
-	 * entries.  Must hold exclusive ProcArrayLock to change this, and shared
-	 * lock to read it.
-	 */
-	TransactionId lastOverflowedXid;
+	/* In recovery, oldest XID that could still be running in the primary */
+	TransactionId oldest_running_primary_xid;
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
@@ -100,6 +77,21 @@ typedef struct ProcArrayStruct
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
+#define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
+
+/*
+ * TOTAL_MAX_CACHED_SUBXIDS is the total number of XIDs that fits in the proc
+ * array, as top XIDs and in the subxids caches.
+ *
+ * Local data structures are also created in various backends during
+ * GetSnapshotData(), TransactionIdIsInProgress() and
+ * GetRunningTransactionData(). All of the main structures created in those
+ * functions must be identically sized, since we may at times copy the whole
+ * of the data structures around.
+ */
+#define TOTAL_MAX_CACHED_SUBXIDS \
+	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
+
 /*
  * State for the GlobalVisTest* family of functions. Those functions can
  * e.g. be used to decide if a deleted row can be removed without violating
@@ -255,17 +247,6 @@ typedef enum GlobalVisHorizonKind
 	VISHORIZON_TEMP,
 } GlobalVisHorizonKind;
 
-/*
- * Reason codes for KnownAssignedXidsCompress().
- */
-typedef enum KAXCompressReason
-{
-	KAX_NO_SPACE,				/* need to free up space at array end */
-	KAX_PRUNE,					/* we just pruned old entries */
-	KAX_TRANSACTION_END,		/* we just committed/removed some XIDs */
-	KAX_STARTUP_PROCESS_IDLE,	/* startup process is about to sleep */
-} KAXCompressReason;
-
 
 static ProcArrayStruct *procArray;
 
@@ -277,19 +258,10 @@ static PGPROC *allProcs;
 static TransactionId cachedXidIsNotInProgress = InvalidTransactionId;
 
 /*
- * Bookkeeping for tracking emulated transactions in recovery
+ * Bookkeeping for tracking transactions seen during recovery
  */
-static TransactionId *KnownAssignedXids;
-static bool *KnownAssignedXidsValid;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
-/*
- * If we're in STANDBY_SNAPSHOT_PENDING state, standbySnapshotPendingXmin is
- * the highest xid that might still be running that we don't have in
- * KnownAssignedXids.
- */
-static TransactionId standbySnapshotPendingXmin;
-
 /*
  * State for visibility checks on different types of relations. See struct
  * GlobalVisState for details. As shared, catalog, normal and temporary
@@ -316,7 +288,7 @@ static long xc_by_my_xact = 0;
 static long xc_by_latest_xid = 0;
 static long xc_by_main_xid = 0;
 static long xc_by_child_xid = 0;
-static long xc_by_known_assigned = 0;
+static long xc_during_recovery = 0;
 static long xc_no_overflow = 0;
 static long xc_slow_answer = 0;
 
@@ -326,7 +298,7 @@ static long xc_slow_answer = 0;
 #define xc_by_latest_xid_inc()		(xc_by_latest_xid++)
 #define xc_by_main_xid_inc()		(xc_by_main_xid++)
 #define xc_by_child_xid_inc()		(xc_by_child_xid++)
-#define xc_by_known_assigned_inc()	(xc_by_known_assigned++)
+#define xc_during_recovery_inc()	(xc_during_recovery++)
 #define xc_no_overflow_inc()		(xc_no_overflow++)
 #define xc_slow_answer_inc()		(xc_slow_answer++)
 
@@ -339,28 +311,12 @@ static void DisplayXidCache(void);
 #define xc_by_latest_xid_inc()		((void) 0)
 #define xc_by_main_xid_inc()		((void) 0)
 #define xc_by_child_xid_inc()		((void) 0)
-#define xc_by_known_assigned_inc()	((void) 0)
+#define xc_during_recovery_inc()	((void) 0)
 #define xc_no_overflow_inc()		((void) 0)
 #define xc_slow_answer_inc()		((void) 0)
 #endif							/* XIDCACHE_DEBUG */
 
-/* Primitives for KnownAssignedXids array handling for standby */
-static void KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock);
-static void KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-								 bool exclusive_lock);
-static bool KnownAssignedXidsSearch(TransactionId xid, bool remove);
-static bool KnownAssignedXidExists(TransactionId xid);
-static void KnownAssignedXidsRemove(TransactionId xid);
-static void KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-										TransactionId *subxids);
-static void KnownAssignedXidsRemovePreceding(TransactionId removeXid);
-static int	KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax);
-static int	KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
-										   TransactionId *xmin,
-										   TransactionId xmax);
-static TransactionId KnownAssignedXidsGetOldestXmin(void);
-static void KnownAssignedXidsDisplay(int trace_level);
-static void KnownAssignedXidsReset(void);
+
 static inline void ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 static void MaintainLatestCompletedXid(TransactionId latestXid);
@@ -384,31 +340,6 @@ ProcArrayShmemSize(void)
 	size = offsetof(ProcArrayStruct, pgprocnos);
 	size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS));
 
-	/*
-	 * During Hot Standby processing we have a data structure called
-	 * KnownAssignedXids, created in shared memory. Local data structures are
-	 * also created in various backends during GetMVCCSnapshotData(),
-	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
-	 * main structures created in those functions must be identically sized,
-	 * since we may at times copy the whole of the data structures around. We
-	 * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
-	 *
-	 * Ideally we'd only create this structure if we were actually doing hot
-	 * standby in the current run, but we don't know that yet at the time
-	 * shared memory is being set up.
-	 */
-#define TOTAL_MAX_CACHED_SUBXIDS \
-	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
-
-	if (EnableHotStandby)
-	{
-		size = add_size(size,
-						mul_size(sizeof(TransactionId),
-								 TOTAL_MAX_CACHED_SUBXIDS));
-		size = add_size(size,
-						mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS));
-	}
-
 	return size;
 }
 
@@ -435,31 +366,12 @@ ProcArrayShmemInit(void)
 		 */
 		procArray->numProcs = 0;
 		procArray->maxProcs = PROCARRAY_MAXPROCS;
-		procArray->maxKnownAssignedXids = TOTAL_MAX_CACHED_SUBXIDS;
-		procArray->numKnownAssignedXids = 0;
-		procArray->tailKnownAssignedXids = 0;
-		procArray->headKnownAssignedXids = 0;
-		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
 		TransamVariables->xactCompletionCount = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
-
-	/* Create or attach to the KnownAssignedXids arrays too, if needed */
-	if (EnableHotStandby)
-	{
-		KnownAssignedXids = (TransactionId *)
-			ShmemInitStruct("KnownAssignedXids",
-							mul_size(sizeof(TransactionId),
-									 TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-		KnownAssignedXidsValid = (bool *)
-			ShmemInitStruct("KnownAssignedXidsValid",
-							mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-	}
 }
 
 /*
@@ -1023,355 +935,35 @@ MaintainLatestCompletedXidRecovery(TransactionId latestXid)
 void
 ProcArrayInitRecovery(TransactionId initializedUptoXID)
 {
-	Assert(standbyState == STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsNormal(initializedUptoXID));
 
 	/*
-	 * we set latestObservedXid to the xid SUBTRANS has been initialized up
-	 * to, so we can extend it from that point onwards in
-	 * RecordKnownAssignedTransactionIds, and when we get consistent in
-	 * ProcArrayApplyRecoveryInfo().
+	 * we set latestObservedXid to the xid that SUBTRANS and the CSN log have
+	 * been initialized up to, so we can extend it from that point onwards
+	 * whenever we observe new XIDs.
 	 */
 	latestObservedXid = initializedUptoXID;
 	TransactionIdRetreat(latestObservedXid);
 }
 
 /*
- * ProcArrayApplyRecoveryInfo -- apply recovery info about xids
- *
- * Takes us through 3 states: Initialized, Pending and Ready.
- * Normal case is to go all the way to Ready straight away, though there
- * are atypical cases where we need to take it in steps.
- *
- * Use the data about running transactions on the primary to create the initial
- * state of KnownAssignedXids. We also use these records to regularly prune
- * KnownAssignedXids because we know it is possible that some transactions
- * with FATAL errors fail to write abort records, which could cause eventual
- * overflow.
- *
- * See comments for LogStandbySnapshot().
+ * Update the oldest running XID from a checkpoint record. This allows
+ * truncating SUBTRANS and the CSN log.
  */
 void
-ProcArrayApplyRecoveryInfo(RunningTransactions running)
+ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID)
 {
-	TransactionId *xids;
-	TransactionId advanceNextXid;
-	int			nxids;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-	Assert(TransactionIdIsValid(running->nextXid));
-	Assert(TransactionIdIsValid(running->oldestRunningXid));
-	Assert(TransactionIdIsNormal(running->latestCompletedXid));
-
-	/*
-	 * Remove stale transactions, if any.
-	 */
-	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
-
-	/*
-	 * Adjust TransamVariables->nextXid before StandbyReleaseOldLocks(),
-	 * because we will need it up to date for accessing two-phase transactions
-	 * in StandbyReleaseOldLocks().
-	 */
-	advanceNextXid = running->nextXid;
-	TransactionIdRetreat(advanceNextXid);
-	AdvanceNextFullTransactionIdPastXid(advanceNextXid);
-	Assert(FullTransactionIdIsValid(TransamVariables->nextXid));
-
 	/*
 	 * Remove stale locks, if any.
 	 */
-	StandbyReleaseOldLocks(running->oldestRunningXid);
-
-	/*
-	 * If our snapshot is already valid, nothing else to do...
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		return;
-
-	/*
-	 * If our initial RunningTransactionsData had an overflowed snapshot then
-	 * we knew we were missing some subxids from our snapshot. If we continue
-	 * to see overflowed snapshots then we might never be able to start up, so
-	 * we make another test to see if our snapshot is now valid. We know that
-	 * the missing subxids are equal to or earlier than nextXid. After we
-	 * initialise we continue to apply changes during recovery, so once the
-	 * oldestRunningXid is later than the nextXid from the initial snapshot we
-	 * know that we no longer have missing information and can mark the
-	 * snapshot as valid.
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_PENDING)
-	{
-		/*
-		 * If the snapshot isn't overflowed or if its empty we can reset our
-		 * pending state and use this snapshot instead.
-		 */
-		if (running->subxid_status != SUBXIDS_MISSING || running->xcnt == 0)
-		{
-			/*
-			 * If we have already collected known assigned xids, we need to
-			 * throw them away before we apply the recovery snapshot.
-			 */
-			KnownAssignedXidsReset();
-			standbyState = STANDBY_INITIALIZED;
-		}
-		else
-		{
-			if (TransactionIdPrecedes(standbySnapshotPendingXmin,
-									  running->oldestRunningXid))
-			{
-				standbyState = STANDBY_SNAPSHOT_READY;
-				elog(DEBUG1,
-					 "recovery snapshots are now enabled");
-			}
-			else
-				elog(DEBUG1,
-					 "recovery snapshot waiting for non-overflowed snapshot or "
-					 "until oldest active xid on standby is at least %u (now %u)",
-					 standbySnapshotPendingXmin,
-					 running->oldestRunningXid);
-			return;
-		}
-	}
-
-	Assert(standbyState == STANDBY_INITIALIZED);
-
-	/*
-	 * NB: this can be reached at least twice, so make sure new code can deal
-	 * with that.
-	 */
+	StandbyReleaseOldLocks(oldestRunningXID);
 
-	/*
-	 * Nobody else is running yet, but take locks anyhow
-	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * KnownAssignedXids is sorted so we cannot just add the xids, we have to
-	 * sort them first.
-	 *
-	 * Some of the new xids are top-level xids and some are subtransactions.
-	 * We don't call SubTransSetParent because it doesn't matter yet. If we
-	 * aren't overflowed then all xids will fit in snapshot and so we don't
-	 * need subtrans. If we later overflow, an xid assignment record will add
-	 * xids to subtrans. If RunningTransactionsData is overflowed then we
-	 * don't have enough information to correctly update subtrans anyway.
-	 */
-
-	/*
-	 * Allocate a temporary array to avoid modifying the array passed as
-	 * argument.
-	 */
-	xids = palloc(sizeof(TransactionId) * (running->xcnt + running->subxcnt));
-
-	/*
-	 * Add to the temp array any xids which have not already completed.
-	 */
-	nxids = 0;
-	for (i = 0; i < running->xcnt + running->subxcnt; i++)
-	{
-		TransactionId xid = running->xids[i];
-
-		/*
-		 * The running-xacts snapshot can contain xids that were still visible
-		 * in the procarray when the snapshot was taken, but were already
-		 * WAL-logged as completed. They're not running anymore, so ignore
-		 * them.
-		 */
-		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-			continue;
-
-		xids[nxids++] = xid;
-	}
-
-	if (nxids > 0)
-	{
-		if (procArray->numKnownAssignedXids != 0)
-		{
-			LWLockRelease(ProcArrayLock);
-			elog(ERROR, "KnownAssignedXids is not empty");
-		}
-
-		/*
-		 * Sort the array so that we can add them safely into
-		 * KnownAssignedXids.
-		 *
-		 * We have to sort them logically, because in KnownAssignedXidsAdd we
-		 * call TransactionIdFollowsOrEquals and so on. But we know these XIDs
-		 * come from RUNNING_XACTS, which means there are only normal XIDs
-		 * from the same epoch, so this is safe.
-		 */
-		qsort(xids, nxids, sizeof(TransactionId), xidLogicalComparator);
-
-		/*
-		 * Add the sorted snapshot into KnownAssignedXids.  The running-xacts
-		 * snapshot may include duplicated xids because of prepared
-		 * transactions, so ignore them.
-		 */
-		for (i = 0; i < nxids; i++)
-		{
-			if (i > 0 && TransactionIdEquals(xids[i - 1], xids[i]))
-			{
-				elog(DEBUG1,
-					 "found duplicated transaction %u for KnownAssignedXids insertion",
-					 xids[i]);
-				continue;
-			}
-			KnownAssignedXidsAdd(xids[i], xids[i], true);
-		}
-
-		KnownAssignedXidsDisplay(DEBUG3);
-	}
-
-	pfree(xids);
-
-	/*
-	 * latestObservedXid is at least set to the point where SUBTRANS was
-	 * started up to (cf. ProcArrayInitRecovery()) or to the biggest xid
-	 * RecordKnownAssignedTransactionIds() was called for.  Initialize
-	 * subtrans from thereon, up to nextXid - 1.
-	 *
-	 * We need to duplicate parts of RecordKnownAssignedTransactionId() here,
-	 * because we've just added xids to the known assigned xids machinery that
-	 * haven't gone through RecordKnownAssignedTransactionId().
-	 */
-	Assert(TransactionIdIsNormal(latestObservedXid));
-	TransactionIdAdvance(latestObservedXid);
-	while (TransactionIdPrecedes(latestObservedXid, running->nextXid))
-	{
-		ExtendSUBTRANS(latestObservedXid);
-		TransactionIdAdvance(latestObservedXid);
-	}
-	TransactionIdRetreat(latestObservedXid);	/* = running->nextXid - 1 */
-
-	/* ----------
-	 * Now we've got the running xids we need to set the global values that
-	 * are used to track snapshots as they evolve further.
-	 *
-	 * - latestCompletedXid which will be the xmax for snapshots
-	 * - lastOverflowedXid which shows whether snapshots overflow
-	 * - nextXid
-	 *
-	 * If the snapshot overflowed, then we still initialise with what we know,
-	 * but the recovery snapshot isn't fully valid yet because we know there
-	 * are some subxids missing. We don't know the specific subxids that are
-	 * missing, so conservatively assume the last one is latestObservedXid.
-	 * ----------
-	 */
-	if (running->subxid_status == SUBXIDS_MISSING)
-	{
-		standbyState = STANDBY_SNAPSHOT_PENDING;
-
-		standbySnapshotPendingXmin = latestObservedXid;
-		procArray->lastOverflowedXid = latestObservedXid;
-	}
-	else
-	{
-		standbyState = STANDBY_SNAPSHOT_READY;
-
-		standbySnapshotPendingXmin = InvalidTransactionId;
-
-		/*
-		 * If the 'xids' array didn't include all subtransactions, we have to
-		 * mark any snapshots taken as overflowed.
-		 */
-		if (running->subxid_status == SUBXIDS_IN_SUBTRANS)
-			procArray->lastOverflowedXid = latestObservedXid;
-		else
-		{
-			Assert(running->subxid_status == SUBXIDS_IN_ARRAY);
-			procArray->lastOverflowedXid = InvalidTransactionId;
-		}
-	}
-
-	/*
-	 * If a transaction wrote a commit record in the gap between taking and
-	 * logging the snapshot then latestCompletedXid may already be higher than
-	 * the value from the snapshot, so check before we use the incoming value.
-	 * It also might not yet be set at all.
-	 */
-	MaintainLatestCompletedXidRecovery(running->latestCompletedXid);
-
-	/*
-	 * NB: No need to increment TransamVariables->xactCompletionCount here,
-	 * nobody can see it yet.
-	 */
-
+	procArray->oldest_running_primary_xid = oldestRunningXID;
 	LWLockRelease(ProcArrayLock);
-
-	KnownAssignedXidsDisplay(DEBUG3);
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		elog(DEBUG1, "recovery snapshots are now enabled");
-	else
-		elog(DEBUG1,
-			 "recovery snapshot waiting for non-overflowed snapshot or "
-			 "until oldest active xid on standby is at least %u (now %u)",
-			 standbySnapshotPendingXmin,
-			 running->oldestRunningXid);
 }
 
-/*
- * ProcArrayApplyXidAssignment
- *		Process an XLOG_XACT_ASSIGNMENT WAL record
- */
-void
-ProcArrayApplyXidAssignment(TransactionId topxid,
-							int nsubxids, TransactionId *subxids)
-{
-	TransactionId max_xid;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-
-	max_xid = TransactionIdLatest(topxid, nsubxids, subxids);
-
-	/*
-	 * Mark all the subtransactions as observed.
-	 *
-	 * NOTE: This will fail if the subxid contains too many previously
-	 * unobserved xids to fit into known-assigned-xids. That shouldn't happen
-	 * as the code stands, because xid-assignment records should never contain
-	 * more than PGPROC_MAX_CACHED_SUBXIDS entries.
-	 */
-	RecordKnownAssignedTransactionIds(max_xid);
-
-	/*
-	 * Notice that we update pg_subtrans with the top-level xid, rather than
-	 * the parent xid. This is a difference between normal processing and
-	 * recovery, yet is still correct in all cases. The reason is that
-	 * subtransaction commit is not marked in clog until commit processing, so
-	 * all aborted subtransactions have already been clearly marked in clog.
-	 * As a result we are able to refer directly to the top-level
-	 * transaction's state rather than skipping through all the intermediate
-	 * states in the subtransaction tree. This should be the first time we
-	 * have attempted to SubTransSetParent().
-	 */
-	for (i = 0; i < nsubxids; i++)
-		SubTransSetParent(subxids[i], topxid);
-
-	/* KnownAssignedXids isn't maintained yet, so we're done for now */
-	if (standbyState == STANDBY_INITIALIZED)
-		return;
-
-	/*
-	 * Uses same locking as transaction commit
-	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * Remove subxids from known-assigned-xacts.
-	 */
-	KnownAssignedXidsRemoveTree(InvalidTransactionId, nsubxids, subxids);
-
-	/*
-	 * Advance lastOverflowedXid to be at least the last of these subxids.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, max_xid))
-		procArray->lastOverflowedXid = max_xid;
-
-	LWLockRelease(ProcArrayLock);
-}
 
 /*
  * TransactionIdIsInProgress -- is given transaction running in some backend
@@ -1379,23 +971,24 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * Aside from some shortcuts such as checking RecentXmin and our own Xid,
  * there are four possibilities for finding a running transaction:
  *
- * 1. The given Xid is a main transaction Id.  We will find this out cheaply
+ * 1. In Hot Standby mode, there are no transactions with XIDs active in the
+ * standby. Check pg_xact to see if the transaction is known to have committed
+ * or aborted; otherwise it's considered to be running.
+ *
+ * 2. The given Xid is a main transaction Id.  We will find this out cheaply
  * by looking at ProcGlobal->xids.
  *
- * 2. The given Xid is one of the cached subxact Xids in the PGPROC array.
+ * 3. The given Xid is one of the cached subxact Xids in the PGPROC array.
  * We can find this out cheaply too.
  *
- * 3. In Hot Standby mode, we must search the KnownAssignedXids list to see
- * if the Xid is running on the primary.
- *
  * 4. Search the SubTrans tree to find the Xid's topmost parent, and then see
- * if that is running according to ProcGlobal->xids[] or KnownAssignedXids.
+ * if that is running according to ProcGlobal->xids[].
  * This is the slowest way, but sadly it has to be done always if the others
  * failed, unless we see that the cached subxact sets are complete (none have
  * overflowed).
  *
- * ProcArrayLock has to be held while we do 1, 2, 3.  If we save the top Xids
- * while doing 1 and 3, we can release the ProcArrayLock while we do 4.
+ * ProcArrayLock has to be held while we do 2 and 3.  If we save the top Xids
+ * while doing 2 and 3, we can release the ProcArrayLock while we do 4.
  * This buys back some concurrency (and we can't retrieve the main Xids from
  * ProcGlobal->xids[] again anyway; see GetNewTransactionId).
  */
@@ -1436,6 +1029,28 @@ TransactionIdIsInProgress(TransactionId xid)
 		return false;
 	}
 
+	/*
+	 * In hot standby mode, check pg_xact.
+	 *
+	 * With normal non-CSN snapshots, you must be careful to check
+	 * TransactionIdIsInProgress() before checking pg_xact, because a
+	 * transaction is marked as committed before it's removed from PGPROC. But
+	 * during recovery, we now use CSN snapshots so I think that's OK. See the
+	 * "NOTE" at the top of heapam_visibility.c.
+	 *
+	 * During recovery, the XID cannot be our own transaction, and the CSN
+	 * check handles subtransactions too, so we can skip the rest of the
+	 * function.
+	 */
+	if (RecoveryInProgress())
+	{
+		xc_during_recovery_inc();
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+			return false;
+		else
+			return true;
+	}
+
 	/*
 	 * Also, we can handle our own transaction (and subtransactions) without
 	 * any access to shared memory.
@@ -1452,12 +1067,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	 */
 	if (xids == NULL)
 	{
-		/*
-		 * In hot standby mode, reserve enough space to hold all xids in the
-		 * known-assigned list. If we later finish recovery, we no longer need
-		 * the bigger array, but we don't bother to shrink it.
-		 */
-		int			maxxids = RecoveryInProgress() ? TOTAL_MAX_CACHED_SUBXIDS : arrayP->maxProcs;
+		int			maxxids = arrayP->maxProcs;
 
 		xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId));
 		if (xids == NULL)
@@ -1552,33 +1162,6 @@ TransactionIdIsInProgress(TransactionId xid)
 			xids[nxids++] = pxid;
 	}
 
-	/*
-	 * Step 3: in hot standby mode, check the known-assigned-xids list.  XIDs
-	 * in the list must be treated as running.
-	 */
-	if (RecoveryInProgress())
-	{
-		/* none of the PGPROC entries should have XIDs in hot standby mode */
-		Assert(nxids == 0);
-
-		if (KnownAssignedXidExists(xid))
-		{
-			LWLockRelease(ProcArrayLock);
-			xc_by_known_assigned_inc();
-			return true;
-		}
-
-		/*
-		 * If the KnownAssignedXids overflowed, we have to check pg_subtrans
-		 * too.  Fetch all xids from KnownAssignedXids that are lower than
-		 * xid, since if xid is a subtransaction its parent will always have a
-		 * lower value.  Note we will collect both main and subXIDs here, but
-		 * there's no help for it.
-		 */
-		if (TransactionIdPrecedesOrEquals(xid, procArray->lastOverflowedXid))
-			nxids = KnownAssignedXidsGet(xids, xid);
-	}
-
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -1852,8 +1435,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 * can't be tied to a specific database.)
 		 *
 		 * Also, while in recovery we cannot compute an accurate per-database
-		 * horizon, as all xids are managed via the KnownAssignedXids
-		 * machinery.
+		 * horizon, as all xids are managed via the CSN log machinery.
 		 */
 		if (proc->databaseId == MyDatabaseId ||
 			MyDatabaseId == InvalidOid ||
@@ -1866,11 +1448,14 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	}
 
 	/*
-	 * If in recovery fetch oldest xid in KnownAssignedXids, will be applied
-	 * after lock is released.
+	 * If in recovery, fetch the oldest xid from the last checkpoint.
+	 *
+	 * XXX: that can be much older than what we had previously with the
+	 * known-assigned-xids machinery. I think that's OK, given what this
+	 * function is used for during recovery?
 	 */
 	if (in_recovery)
-		kaxmin = KnownAssignedXidsGetOldestXmin();
+		kaxmin = procArray->oldest_running_primary_xid;
 
 	/*
 	 * No other information from shared state is needed, release the lock
@@ -2181,7 +1766,7 @@ GetMVCCSnapshotData(void)
 	TransactionId myxid;
 	uint64		curXactCompletionCount;
 	MVCCSnapshotShared snapshot;
-
+	XLogRecPtr	csn = InvalidXLogRecPtr;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -2355,27 +1940,8 @@ GetMVCCSnapshotData(void)
 	else
 	{
 		/*
-		 * We're in hot standby, so get XIDs from KnownAssignedXids.
-		 *
-		 * We store all xids directly into subxip[]. Here's why:
-		 *
-		 * In recovery we don't know which xids are top-level and which are
-		 * subxacts, a design choice that greatly simplifies xid processing.
-		 *
-		 * It seems like we would want to try to put xids into xip[] only, but
-		 * that is fairly small. We would either need to make that bigger or
-		 * to increase the rate at which we WAL-log xid assignment; neither is
-		 * an appealing choice.
-		 *
-		 * We could try to store xids into xip[] first and then into subxip[]
-		 * if there are too many xids. That only works if the snapshot doesn't
-		 * overflow because we do not search subxip[] in that case. A simpler
-		 * way is to just store all xids in the subxip array because this is
-		 * by far the bigger array. We just leave the xip array empty.
-		 *
-		 * Either way we need to change the way XidInMVCCSnapshot() works
-		 * depending upon when the snapshot was taken, or change normal
-		 * snapshot processing so it matches.
+		 * We're in hot standby, so get the current CSN. That's used to
+		 * determine which transactions committed before this snapshot.
 		 *
 		 * Note: It is possible for recovery to end before we finish taking
 		 * the snapshot, and for newly assigned transaction ids to be added to
@@ -2383,14 +1949,17 @@ GetMVCCSnapshotData(void)
 		 * those newly added transaction ids would be filtered away, so we
 		 * need not be concerned about them.
 		 */
-		subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
-												  xmax);
+		xmin = procArray->oldest_running_primary_xid;
 
-		if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
-			suboverflowed = true;
+		/*
+		 * Take CSN under ProcArrayLock so the snapshot stays synchronized.
+		 * (XXX: not sure that's strictly required.) This is what determines
+		 * which transactions we consider finished and which are still in
+		 * progress.
+		 */
+		csn = TransamVariables->latestCommitLSN;
 	}
 
-
 	/*
 	 * Fetch into local variable while ProcArrayLock is held - the
 	 * LWLockRelease below is a barrier, ensuring this happens inside the
@@ -2507,6 +2076,8 @@ GetMVCCSnapshotData(void)
 		latestSnapshotShared = snapshot;
 	}
 
+	snapshot->snapshotCsn = csn;
+
 	return snapshot;
 }
 
@@ -2662,9 +2233,6 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * The returned data structure is statically allocated; caller should not
  * modify it, and must not assume it is valid past the next call.
  *
- * This is never executed during recovery so there is no need to look at
- * KnownAssignedXids.
- *
  * Dummy PGPROCs from prepared transaction are included, meaning that this
  * may return entries with duplicated TransactionId values coming from
  * transaction finishing to prepare.  Nothing is done about duplicated
@@ -2695,6 +2263,7 @@ GetRunningTransactionData(void)
 	int			subcount;
 	bool		suboverflowed;
 
+	/* This is never executed during recovery */
 	Assert(!RecoveryInProgress());
 
 	/*
@@ -2861,15 +2430,16 @@ GetRunningTransactionData(void)
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
- * This is never executed during recovery so there is no need to look at
- * KnownAssignedXids.
+ * If allDbs is false, skip processes attached to other databases.
+ *
+ * This is never executed during recovery.
  *
  * We don't worry about updating other counters, we want to keep this as
  * simple as possible and leave GetMVCCSnapshotData() as the primary code for
  * that bookkeeping.
  */
 TransactionId
-GetOldestActiveTransactionId(void)
+GetOldestActiveTransactionId(bool allDbs)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId *other_xids = ProcGlobal->xids;
@@ -2890,11 +2460,13 @@ GetOldestActiveTransactionId(void)
 	LWLockRelease(XidGenLock);
 
 	/*
-	 * Spin over procArray collecting all xids and subxids.
+	 * Spin over procArray checking each xid.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		PGPROC	   *proc = &allProcs[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -2903,6 +2475,9 @@ GetOldestActiveTransactionId(void)
 		if (!TransactionIdIsNormal(xid))
 			continue;
 
+		if (!allDbs && proc->databaseId != MyDatabaseId)
+			continue;
+
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
@@ -2981,8 +2556,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 *
 	 * In recovery we can't lower the safe value besides what we've computed
 	 * above, so we'll have to wait a bit longer there. We unfortunately can
-	 * *not* use KnownAssignedXidsGetOldestXmin() since the KnownAssignedXids
-	 * machinery can miss values and return an older value than is safe.
+	 * *not* use oldest_running_primary_xid since the XID tracking machinery
+	 * can miss values and return an older value than is safe.
 	 */
 	if (!recovery_in_progress)
 	{
@@ -3400,6 +2975,9 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
  * but that would not be true in the case of FATAL errors lagging in array,
  * but we already know those are bogus anyway, so we skip that test.
  *
+ * XXX: KnownAssignedXids is gone so the above comment needs updating. Is
+ * the code still correct? I think so but need to double-check.
+ *
  * If dbOid is valid we skip backends attached to other databases.
  *
  * Be careful to *not* pfree the result from this function. We reuse
@@ -4071,14 +3649,14 @@ static void
 DisplayXidCache(void)
 {
 	fprintf(stderr,
-			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, knownassigned: %ld, nooflo: %ld, slow: %ld\n",
+			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, during_recovery: %ld, nooflo: %ld, slow: %ld\n",
 			xc_by_recent_xmin,
 			xc_by_known_xact,
 			xc_by_my_xact,
 			xc_by_latest_xid,
 			xc_by_main_xid,
 			xc_by_child_xid,
-			xc_by_known_assigned,
+			xc_during_recovery,
 			xc_no_overflow,
 			xc_slow_answer);
 }
@@ -4325,61 +3903,6 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 }
 
 
-/* ----------------------------------------------
- *		KnownAssignedTransactionIds sub-module
- * ----------------------------------------------
- */
-
-/*
- * In Hot Standby mode, we maintain a list of transactions that are (or were)
- * running on the primary at the current point in WAL.  These XIDs must be
- * treated as running by standby transactions, even though they are not in
- * the standby server's PGPROC array.
- *
- * We record all XIDs that we know have been assigned.  That includes all the
- * XIDs seen in WAL records, plus all unobserved XIDs that we can deduce have
- * been assigned.  We can deduce the existence of unobserved XIDs because we
- * know XIDs are assigned in sequence, with no gaps.  The KnownAssignedXids
- * list expands as new XIDs are observed or inferred, and contracts when
- * transaction completion records arrive.
- *
- * During hot standby we do not fret too much about the distinction between
- * top-level XIDs and subtransaction XIDs. We store both together in the
- * KnownAssignedXids list.  In backends, this is copied into snapshots in
- * GetMVCCSnapshotData(), taking advantage of the fact that XidInMVCCSnapshot()
- * doesn't care about the distinction either.  Subtransaction XIDs are
- * effectively treated as top-level XIDs and in the typical case pg_subtrans
- * links are *not* maintained (which does not affect visibility).
- *
- * We have room in KnownAssignedXids and in snapshots to hold maxProcs *
- * (1 + PGPROC_MAX_CACHED_SUBXIDS) XIDs, so every primary transaction must
- * report its subtransaction XIDs in a WAL XLOG_XACT_ASSIGNMENT record at
- * least every PGPROC_MAX_CACHED_SUBXIDS.  When we receive one of these
- * records, we mark the subXIDs as children of the top XID in pg_subtrans,
- * and then remove them from KnownAssignedXids.  This prevents overflow of
- * KnownAssignedXids and snapshots, at the cost that status checks for these
- * subXIDs will take a slower path through TransactionIdIsInProgress().
- * This means that KnownAssignedXids is not necessarily complete for subXIDs,
- * though it should be complete for top-level XIDs; this is the same situation
- * that holds with respect to the PGPROC entries in normal running.
- *
- * When we throw away subXIDs from KnownAssignedXids, we need to keep track of
- * that, similarly to tracking overflow of a PGPROC's subxids array.  We do
- * that by remembering the lastOverflowedXid, ie the last thrown-away subXID.
- * As long as that is within the range of interesting XIDs, we have to assume
- * that subXIDs are missing from snapshots.  (Note that subXID overflow occurs
- * on primary when 65th subXID arrives, whereas on standby it occurs when 64th
- * subXID arrives - that is not an error.)
- *
- * Should a backend on primary somehow disappear before it can write an abort
- * record, then we just leave those XIDs in KnownAssignedXids. They actually
- * aborted but we think they were running; the distinction is irrelevant
- * because either way any changes done by the transaction are not visible to
- * backends in the standby.  We prune KnownAssignedXids when
- * XLOG_RUNNING_XACTS arrives, to forestall possible overflow of the
- * array due to such dead XIDs.
- */
-
 /*
  * RecordKnownAssignedTransactionIds
  *		Record the given XID in KnownAssignedXids, as well as any preceding
@@ -4394,7 +3917,7 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 void
 RecordKnownAssignedTransactionIds(TransactionId xid)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsValid(xid));
 	Assert(TransactionIdIsValid(latestObservedXid));
 
@@ -4412,38 +3935,19 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 
 		/*
 		 * Extend subtrans like we do in GetNewTransactionId() during normal
-		 * operation using individual extend steps. Note that we do not need
-		 * to extend clog since its extensions are WAL logged.
-		 *
-		 * This part has to be done regardless of standbyState since we
-		 * immediately start assigning subtransactions to their toplevel
-		 * transactions.
+		 * operation using individual extend steps. Extend the CSN log, too.
+		 * Note that we do not need to extend clog since its extensions are
+		 * WAL logged.
 		 */
 		next_expected_xid = latestObservedXid;
 		while (TransactionIdPrecedes(next_expected_xid, xid))
 		{
 			TransactionIdAdvance(next_expected_xid);
 			ExtendSUBTRANS(next_expected_xid);
+			ExtendCSNLog(next_expected_xid);
 		}
 		Assert(next_expected_xid == xid);
 
-		/*
-		 * If the KnownAssignedXids machinery isn't up yet, there's nothing
-		 * more to do since we don't track assigned xids yet.
-		 */
-		if (standbyState <= STANDBY_INITIALIZED)
-		{
-			latestObservedXid = xid;
-			return;
-		}
-
-		/*
-		 * Add (latestObservedXid, xid] onto the KnownAssignedXids array.
-		 */
-		next_expected_xid = latestObservedXid;
-		TransactionIdAdvance(next_expected_xid);
-		KnownAssignedXidsAdd(next_expected_xid, xid, false);
-
 		/*
 		 * Now we can advance latestObservedXid
 		 */
@@ -4455,805 +3959,61 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 }
 
 /*
- * ExpireTreeKnownAssignedTransactionIds
- *		Remove the given XIDs from KnownAssignedXids.
+ * ProcArrayRecoveryEndTransaction
+ *
+ * Called during recovery in analogy with and in place of
+ * ProcArrayEndTransaction(). The transaction becomes visible to any new
+ * snapshots taken after this. 'max_xid' is the highest (sub)XID of the
+ * committed transaction, and 'lsn' is the LSN of the commit record.
  *
- * Called during recovery in analogy with and in place of ProcArrayEndTransaction()
+ * The transaction and all its subtransactions have already been marked as
+ * committed in the CLOG and in the CSNLOG.
  */
 void
-ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
-									  TransactionId *subxids, TransactionId max_xid)
+ProcArrayRecoveryEndTransaction(TransactionId max_xid, XLogRecPtr lsn)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	TransactionId oldest_running_primary_xid;
+
+	Assert(InHotStandby);
+
+	/*
+	 * If this was the oldest XID that was still running, advance it. This is
+	 * important for advancing the global xmin, which avoids unnecessary
+	 * recovery conflicts.
+	 *
+	 * No locking required because this runs in the startup process.
+	 *
+	 * XXX: the caller actually has a list of XIDs that just committed. We
+	 * could save some clog lookups by taking advantage of that list.
+	 */
+	oldest_running_primary_xid = procArray->oldest_running_primary_xid;
+	while (TransactionIdPrecedes(oldest_running_primary_xid, max_xid))
+	{
+		if (!TransactionIdDidCommit(oldest_running_primary_xid) &&
+			!TransactionIdDidAbort(oldest_running_primary_xid))
+		{
+			break;
+		}
+		TransactionIdAdvance(oldest_running_primary_xid);
+	}
+	if (max_xid == oldest_running_primary_xid)
+		TransactionIdAdvance(oldest_running_primary_xid);
 
 	/*
 	 * Uses same locking as transaction commit
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
-	KnownAssignedXidsRemoveTree(xid, nsubxids, subxids);
-
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
 	/* ... and xactCompletionCount */
 	TransamVariables->xactCompletionCount++;
 
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireAllKnownAssignedTransactionIds
- *		Remove all entries in KnownAssignedXids and reset lastOverflowedXid.
- */
-void
-ExpireAllKnownAssignedTransactionIds(void)
-{
-	FullTransactionId latestXid;
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	KnownAssignedXidsRemovePreceding(InvalidTransactionId);
-
-	/* Reset latestCompletedXid to nextXid - 1 */
-	Assert(FullTransactionIdIsValid(TransamVariables->nextXid));
-	latestXid = TransamVariables->nextXid;
-	FullTransactionIdRetreat(&latestXid);
-	TransamVariables->latestCompletedXid = latestXid;
-
-	/*
-	 * Any transactions that were in-progress were effectively aborted, so
-	 * advance xactCompletionCount.
-	 */
-	TransamVariables->xactCompletionCount++;
-
-	/*
-	 * Reset lastOverflowedXid.  Currently, lastOverflowedXid has no use after
-	 * the call of this function.  But do this for unification with what
-	 * ExpireOldKnownAssignedTransactionIds() do.
-	 */
-	procArray->lastOverflowedXid = InvalidTransactionId;
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireOldKnownAssignedTransactionIds
- *		Remove KnownAssignedXids entries preceding the given XID and
- *		potentially reset lastOverflowedXid.
- */
-void
-ExpireOldKnownAssignedTransactionIds(TransactionId xid)
-{
-	TransactionId latestXid;
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
-	latestXid = xid;
-	TransactionIdRetreat(latestXid);
-	MaintainLatestCompletedXidRecovery(latestXid);
-
-	/* ... and xactCompletionCount */
-	TransamVariables->xactCompletionCount++;
-
-	/*
-	 * Reset lastOverflowedXid if we know all transactions that have been
-	 * possibly running are being gone.  Not doing so could cause an incorrect
-	 * lastOverflowedXid value, which makes extra snapshots be marked as
-	 * suboverflowed.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, xid))
-		procArray->lastOverflowedXid = InvalidTransactionId;
-	KnownAssignedXidsRemovePreceding(xid);
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * KnownAssignedTransactionIdsIdleMaintenance
- *		Opportunistically do maintenance work when the startup process
- *		is about to go idle.
- */
-void
-KnownAssignedTransactionIdsIdleMaintenance(void)
-{
-	KnownAssignedXidsCompress(KAX_STARTUP_PROCESS_IDLE, false);
-}
-
-
-/*
- * Private module functions to manipulate KnownAssignedXids
- *
- * There are 5 main uses of the KnownAssignedXids data structure:
- *
- *	* backends taking snapshots - all valid XIDs need to be copied out
- *	* backends seeking to determine presence of a specific XID
- *	* startup process adding new known-assigned XIDs
- *	* startup process removing specific XIDs as transactions end
- *	* startup process pruning array when special WAL records arrive
- *
- * This data structure is known to be a hot spot during Hot Standby, so we
- * go to some lengths to make these operations as efficient and as concurrent
- * as possible.
- *
- * The XIDs are stored in an array in sorted order --- TransactionIdPrecedes
- * order, to be exact --- to allow binary search for specific XIDs.  Note:
- * in general TransactionIdPrecedes would not provide a total order, but
- * we know that the entries present at any instant should not extend across
- * a large enough fraction of XID space to wrap around (the primary would
- * shut down for fear of XID wrap long before that happens).  So it's OK to
- * use TransactionIdPrecedes as a binary-search comparator.
- *
- * It's cheap to maintain the sortedness during insertions, since new known
- * XIDs are always reported in XID order; we just append them at the right.
- *
- * To keep individual deletions cheap, we need to allow gaps in the array.
- * This is implemented by marking array elements as valid or invalid using
- * the parallel boolean array KnownAssignedXidsValid[].  A deletion is done
- * by setting KnownAssignedXidsValid[i] to false, *without* clearing the
- * XID entry itself.  This preserves the property that the XID entries are
- * sorted, so we can do binary searches easily.  Periodically we compress
- * out the unused entries; that's much cheaper than having to compress the
- * array immediately on every deletion.
- *
- * The actually valid items in KnownAssignedXids[] and KnownAssignedXidsValid[]
- * are those with indexes tail <= i < head; items outside this subscript range
- * have unspecified contents.  When head reaches the end of the array, we
- * force compression of unused entries rather than wrapping around, since
- * allowing wraparound would greatly complicate the search logic.  We maintain
- * an explicit tail pointer so that pruning of old XIDs can be done without
- * immediately moving the array contents.  In most cases only a small fraction
- * of the array contains valid entries at any instant.
- *
- * Although only the startup process can ever change the KnownAssignedXids
- * data structure, we still need interlocking so that standby backends will
- * not observe invalid intermediate states.  The convention is that backends
- * must hold shared ProcArrayLock to examine the array.  To remove XIDs from
- * the array, the startup process must hold ProcArrayLock exclusively, for
- * the usual transactional reasons (compare commit/abort of a transaction
- * during normal running).  Compressing unused entries out of the array
- * likewise requires exclusive lock.  To add XIDs to the array, we just insert
- * them into slots to the right of the head pointer and then advance the head
- * pointer.  This doesn't require any lock at all, but on machines with weak
- * memory ordering, we need to be careful that other processors see the array
- * element changes before they see the head pointer change.  We handle this by
- * using memory barriers when reading or writing the head/tail pointers (unless
- * the caller holds ProcArrayLock exclusively).
- *
- * Algorithmic analysis:
- *
- * If we have a maximum of M slots, with N XIDs currently spread across
- * S elements then we have N <= S <= M always.
- *
- *	* Adding a new XID is O(1) and needs no lock (unless compression must
- *		happen)
- *	* Compressing the array is O(S) and requires exclusive lock
- *	* Removing an XID is O(logS) and requires exclusive lock
- *	* Taking a snapshot is O(S) and requires shared lock
- *	* Checking for an XID is O(logS) and requires shared lock
- *
- * In comparison, using a hash table for KnownAssignedXids would mean that
- * taking snapshots would be O(M). If we can maintain S << M then the
- * sorted array technique will deliver significantly faster snapshots.
- * If we try to keep S too small then we will spend too much time compressing,
- * so there is an optimal point for any workload mix. We use a heuristic to
- * decide when to compress the array, though trimming also helps reduce
- * frequency of compressing. The heuristic requires us to track the number of
- * currently valid XIDs in the array (N).  Except in special cases, we'll
- * compress when S >= 2N.  Bounding S at 2N in turn bounds the time for
- * taking a snapshot to be O(N), which it would have to be anyway.
- */
-
-
-/*
- * Compress KnownAssignedXids by shifting valid data down to the start of the
- * array, removing any gaps.
- *
- * A compression step is forced if "reason" is KAX_NO_SPACE, otherwise
- * we do it only if a heuristic indicates it's a good time to do it.
- *
- * Compression requires holding ProcArrayLock in exclusive mode.
- * Caller must pass haveLock = true if it already holds the lock.
- */
-static void
-KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			head,
-				tail,
-				nelements;
-	int			compress_index;
-	int			i;
-
-	/* Counters for compression heuristics */
-	static unsigned int transactionEndsCounter;
-	static TimestampTz lastCompressTs;
-
-	/* Tuning constants */
-#define KAX_COMPRESS_FREQUENCY 128	/* in transactions */
-#define KAX_COMPRESS_IDLE_INTERVAL 1000 /* in ms */
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-	nelements = head - tail;
-
-	/*
-	 * If we can choose whether to compress, use a heuristic to avoid
-	 * compressing too often or not often enough.  "Compress" here simply
-	 * means moving the values to the beginning of the array, so it is not as
-	 * complex or costly as typical data compression algorithms.
-	 */
-	if (nelements == pArray->numKnownAssignedXids)
-	{
-		/*
-		 * When there are no gaps between head and tail, don't bother to
-		 * compress, except in the KAX_NO_SPACE case where we must compress to
-		 * create some space after the head.
-		 */
-		if (reason != KAX_NO_SPACE)
-			return;
-	}
-	else if (reason == KAX_TRANSACTION_END)
-	{
-		/*
-		 * Consider compressing only once every so many commits.  Frequency
-		 * determined by benchmarks.
-		 */
-		if ((transactionEndsCounter++) % KAX_COMPRESS_FREQUENCY != 0)
-			return;
-
-		/*
-		 * Furthermore, compress only if the used part of the array is less
-		 * than 50% full (see comments above).
-		 */
-		if (nelements < 2 * pArray->numKnownAssignedXids)
-			return;
-	}
-	else if (reason == KAX_STARTUP_PROCESS_IDLE)
-	{
-		/*
-		 * We're about to go idle for lack of new WAL, so we might as well
-		 * compress.  But not too often, to avoid ProcArray lock contention
-		 * with readers.
-		 */
-		if (lastCompressTs != 0)
-		{
-			TimestampTz compress_after;
-
-			compress_after = TimestampTzPlusMilliseconds(lastCompressTs,
-														 KAX_COMPRESS_IDLE_INTERVAL);
-			if (GetCurrentTimestamp() < compress_after)
-				return;
-		}
-	}
-
-	/* Need to compress, so get the lock if we don't have it. */
-	if (!haveLock)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * We compress the array by reading the valid values from tail to head,
-	 * re-aligning data to 0th element.
-	 */
-	compress_index = 0;
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			KnownAssignedXids[compress_index] = KnownAssignedXids[i];
-			KnownAssignedXidsValid[compress_index] = true;
-			compress_index++;
-		}
-	}
-	Assert(compress_index == pArray->numKnownAssignedXids);
-
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = compress_index;
-
-	if (!haveLock)
-		LWLockRelease(ProcArrayLock);
-
-	/* Update timestamp for maintenance.  No need to hold lock for this. */
-	lastCompressTs = GetCurrentTimestamp();
-}
-
-/*
- * Add xids into KnownAssignedXids at the head of the array.
- *
- * xids from from_xid to to_xid, inclusive, are added to the array.
- *
- * If exclusive_lock is true then caller already holds ProcArrayLock in
- * exclusive mode, so we need no extra locking here.  Else caller holds no
- * lock, so we need to be sure we maintain sufficient interlocks against
- * concurrent readers.  (Only the startup process ever calls this, so no need
- * to worry about concurrent writers.)
- */
-static void
-KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-					 bool exclusive_lock)
-{
-	ProcArrayStruct *pArray = procArray;
-	TransactionId next_xid;
-	int			head,
-				tail;
-	int			nxids;
-	int			i;
-
-	Assert(TransactionIdPrecedesOrEquals(from_xid, to_xid));
-
-	/*
-	 * Calculate how many array slots we'll need.  Normally this is cheap; in
-	 * the unusual case where the XIDs cross the wrap point, we do it the hard
-	 * way.
-	 */
-	if (to_xid >= from_xid)
-		nxids = to_xid - from_xid + 1;
-	else
-	{
-		nxids = 1;
-		next_xid = from_xid;
-		while (TransactionIdPrecedes(next_xid, to_xid))
-		{
-			nxids++;
-			TransactionIdAdvance(next_xid);
-		}
-	}
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-
-	Assert(head >= 0 && head <= pArray->maxKnownAssignedXids);
-	Assert(tail >= 0 && tail < pArray->maxKnownAssignedXids);
-
-	/*
-	 * Verify that insertions occur in TransactionId sequence.  Note that even
-	 * if the last existing element is marked invalid, it must still have a
-	 * correctly sequenced XID value.
-	 */
-	if (head > tail &&
-		TransactionIdFollowsOrEquals(KnownAssignedXids[head - 1], from_xid))
-	{
-		KnownAssignedXidsDisplay(LOG);
-		elog(ERROR, "out-of-order XID insertion in KnownAssignedXids");
-	}
-
-	/*
-	 * If our xids won't fit in the remaining space, compress out free space
-	 */
-	if (head + nxids > pArray->maxKnownAssignedXids)
-	{
-		KnownAssignedXidsCompress(KAX_NO_SPACE, exclusive_lock);
-
-		head = pArray->headKnownAssignedXids;
-		/* note: we no longer care about the tail pointer */
-
-		/*
-		 * If it still won't fit then we're out of memory
-		 */
-		if (head + nxids > pArray->maxKnownAssignedXids)
-			elog(ERROR, "too many KnownAssignedXids");
-	}
-
-	/* Now we can insert the xids into the space starting at head */
-	next_xid = from_xid;
-	for (i = 0; i < nxids; i++)
-	{
-		KnownAssignedXids[head] = next_xid;
-		KnownAssignedXidsValid[head] = true;
-		TransactionIdAdvance(next_xid);
-		head++;
-	}
-
-	/* Adjust count of number of valid entries */
-	pArray->numKnownAssignedXids += nxids;
-
-	/*
-	 * Now update the head pointer.  We use a write barrier to ensure that
-	 * other processors see the above array updates before they see the head
-	 * pointer change.  The barrier isn't required if we're holding
-	 * ProcArrayLock exclusively.
-	 */
-	if (!exclusive_lock)
-		pg_write_barrier();
-
-	pArray->headKnownAssignedXids = head;
-}
-
-/*
- * KnownAssignedXidsSearch
- *
- * Searches KnownAssignedXids for a specific xid and optionally removes it.
- * Returns true if it was found, false if not.
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- * Exclusive lock must be held for remove = true.
- */
-static bool
-KnownAssignedXidsSearch(TransactionId xid, bool remove)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			first,
-				last;
-	int			head;
-	int			tail;
-	int			result_index = -1;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	/*
-	 * Only the startup process removes entries, so we don't need the read
-	 * barrier in that case.
-	 */
-	if (!remove)
-		pg_read_barrier();		/* pairs with KnownAssignedXidsAdd */
-
-	/*
-	 * Standard binary search.  Note we can ignore the KnownAssignedXidsValid
-	 * array here, since even invalid entries will contain sorted XIDs.
-	 */
-	first = tail;
-	last = head - 1;
-	while (first <= last)
-	{
-		int			mid_index;
-		TransactionId mid_xid;
-
-		mid_index = (first + last) / 2;
-		mid_xid = KnownAssignedXids[mid_index];
-
-		if (xid == mid_xid)
-		{
-			result_index = mid_index;
-			break;
-		}
-		else if (TransactionIdPrecedes(xid, mid_xid))
-			last = mid_index - 1;
-		else
-			first = mid_index + 1;
-	}
-
-	if (result_index < 0)
-		return false;			/* not in array */
-
-	if (!KnownAssignedXidsValid[result_index])
-		return false;			/* in array, but invalid */
-
-	if (remove)
-	{
-		KnownAssignedXidsValid[result_index] = false;
-
-		pArray->numKnownAssignedXids--;
-		Assert(pArray->numKnownAssignedXids >= 0);
-
-		/*
-		 * If we're removing the tail element then advance tail pointer over
-		 * any invalid elements.  This will speed future searches.
-		 */
-		if (result_index == tail)
-		{
-			tail++;
-			while (tail < head && !KnownAssignedXidsValid[tail])
-				tail++;
-			if (tail >= head)
-			{
-				/* Array is empty, so we can reset both pointers */
-				pArray->headKnownAssignedXids = 0;
-				pArray->tailKnownAssignedXids = 0;
-			}
-			else
-			{
-				pArray->tailKnownAssignedXids = tail;
-			}
-		}
-	}
-
-	return true;
-}
-
-/*
- * Is the specified XID present in KnownAssignedXids[]?
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- */
-static bool
-KnownAssignedXidExists(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	return KnownAssignedXidsSearch(xid, false);
-}
-
-/*
- * Remove the specified XID from KnownAssignedXids[].
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemove(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	elog(DEBUG4, "remove KnownAssignedXid %u", xid);
-
-	/*
-	 * Note: we cannot consider it an error to remove an XID that's not
-	 * present.  We intentionally remove subxact IDs while processing
-	 * XLOG_XACT_ASSIGNMENT, to avoid array overflow.  Then those XIDs will be
-	 * removed again when the top-level xact commits or aborts.
-	 *
-	 * It might be possible to track such XIDs to distinguish this case from
-	 * actual errors, but it would be complicated and probably not worth it.
-	 * So, just ignore the search result.
-	 */
-	(void) KnownAssignedXidsSearch(xid, true);
-}
-
-/*
- * KnownAssignedXidsRemoveTree
- *		Remove xid (if it's not InvalidTransactionId) and all the subxids.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-							TransactionId *subxids)
-{
-	int			i;
-
-	if (TransactionIdIsValid(xid))
-		KnownAssignedXidsRemove(xid);
-
-	for (i = 0; i < nsubxids; i++)
-		KnownAssignedXidsRemove(subxids[i]);
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_TRANSACTION_END, true);
-}
-
-/*
- * Prune KnownAssignedXids up to, but *not* including xid. If xid is invalid
- * then clear the whole table.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemovePreceding(TransactionId removeXid)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			count = 0;
-	int			head,
-				tail,
-				i;
-
-	if (!TransactionIdIsValid(removeXid))
-	{
-		elog(DEBUG4, "removing all KnownAssignedXids");
-		pArray->numKnownAssignedXids = 0;
-		pArray->headKnownAssignedXids = pArray->tailKnownAssignedXids = 0;
-		return;
-	}
-
-	elog(DEBUG4, "prune KnownAssignedXids to %u", removeXid);
-
-	/*
-	 * Mark entries invalid starting at the tail.  Since array is sorted, we
-	 * can stop as soon as we reach an entry >= removeXid.
-	 */
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			if (TransactionIdFollowsOrEquals(knownXid, removeXid))
-				break;
-
-			if (!StandbyTransactionIdIsPrepared(knownXid))
-			{
-				KnownAssignedXidsValid[i] = false;
-				count++;
-			}
-		}
-	}
-
-	pArray->numKnownAssignedXids -= count;
-	Assert(pArray->numKnownAssignedXids >= 0);
-
-	/*
-	 * Advance the tail pointer if we've marked the tail item invalid.
-	 */
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-			break;
-	}
-	if (i >= head)
-	{
-		/* Array is empty, so we can reset both pointers */
-		pArray->headKnownAssignedXids = 0;
-		pArray->tailKnownAssignedXids = 0;
-	}
-	else
-	{
-		pArray->tailKnownAssignedXids = i;
-	}
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_PRUNE, true);
-}
-
-/*
- * KnownAssignedXidsGet - Get an array of xids by scanning KnownAssignedXids.
- * We filter out anything >= xmax.
- *
- * Returns the number of XIDs stored into xarray[].  Caller is responsible
- * that array is large enough.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax)
-{
-	TransactionId xtmp = InvalidTransactionId;
-
-	return KnownAssignedXidsGetAndSetXmin(xarray, &xtmp, xmax);
-}
-
-/*
- * KnownAssignedXidsGetAndSetXmin - as KnownAssignedXidsGet, plus
- * we reduce *xmin to the lowest xid value seen if not already lower.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
-							   TransactionId xmax)
-{
-	int			count = 0;
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop. We can stop
-	 * once we reach the initially seen head, since we are certain that an xid
-	 * cannot enter and then leave the array while we hold ProcArrayLock.  We
-	 * might miss newly-added xids, but they should be >= xmax so irrelevant
-	 * anyway.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			/*
-			 * Update xmin if required.  Only the first XID need be checked,
-			 * since the array is sorted.
-			 */
-			if (count == 0 &&
-				TransactionIdPrecedes(knownXid, *xmin))
-				*xmin = knownXid;
-
-			/*
-			 * Filter out anything >= xmax, again relying on sorted property
-			 * of array.
-			 */
-			if (TransactionIdIsValid(xmax) &&
-				TransactionIdFollowsOrEquals(knownXid, xmax))
-				break;
-
-			/* Add knownXid into output array */
-			xarray[count++] = knownXid;
-		}
-	}
-
-	return count;
-}
-
-/*
- * Get oldest XID in the KnownAssignedXids array, or InvalidTransactionId
- * if nothing there.
- */
-static TransactionId
-KnownAssignedXidsGetOldestXmin(void)
-{
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-			return KnownAssignedXids[i];
-	}
-
-	return InvalidTransactionId;
-}
-
-/*
- * Display KnownAssignedXids to provide debug trail
- *
- * Currently this is only called within startup process, so we need no
- * special locking.
- *
- * Note this is pretty expensive, and much of the expense will be incurred
- * even if the elog message will get discarded.  It's not currently called
- * in any performance-critical places, however, so no need to be tenser.
- */
-static void
-KnownAssignedXidsDisplay(int trace_level)
-{
-	ProcArrayStruct *pArray = procArray;
-	StringInfoData buf;
-	int			head,
-				tail,
-				i;
-	int			nxids = 0;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	initStringInfo(&buf);
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			nxids++;
-			appendStringInfo(&buf, "[%d]=%u ", i, KnownAssignedXids[i]);
-		}
-	}
-
-	elog(trace_level, "%d KnownAssignedXids (num=%d tail=%d head=%d) %s",
-		 nxids,
-		 pArray->numKnownAssignedXids,
-		 pArray->tailKnownAssignedXids,
-		 pArray->headKnownAssignedXids,
-		 buf.data);
-
-	pfree(buf.data);
-}
-
-/*
- * KnownAssignedXidsReset
- *		Resets KnownAssignedXids to be empty
- */
-static void
-KnownAssignedXidsReset(void)
-{
-	ProcArrayStruct *pArray = procArray;
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(lsn > TransamVariables->latestCommitLSN);
+	TransamVariables->latestCommitLSN = lsn;
 
-	pArray->numKnownAssignedXids = 0;
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = 0;
+	procArray->oldest_running_primary_xid = oldest_running_primary_xid;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 5acb4508f85..217b1670f5b 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -139,8 +139,6 @@ InitRecoveryTransactionEnvironment(void)
 	vxid.procNumber = MyProcNumber;
 	vxid.localTransactionId = GetNextLocalTransactionId();
 	VirtualXactLockTableInsert(vxid);
-
-	standbyState = STANDBY_INITIALIZED;
 }
 
 /*
@@ -168,9 +166,6 @@ ShutdownRecoveryTransactionEnvironment(void)
 	if (RecoveryLockHash == NULL)
 		return;
 
-	/* Mark all tracked in-progress transactions as finished. */
-	ExpireAllKnownAssignedTransactionIds();
-
 	/* Release all locks the tracked transactions were holding */
 	StandbyReleaseAllLocks();
 
@@ -1167,7 +1162,7 @@ standby_redo(XLogReaderState *record)
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
 	/* Do nothing if we're not in hot standby mode */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 		return;
 
 	if (info == XLOG_STANDBY_LOCK)
@@ -1182,18 +1177,21 @@ standby_redo(XLogReaderState *record)
 	}
 	else if (info == XLOG_RUNNING_XACTS)
 	{
+		/*
+		 * XXX: running-xacts records were previously used to update
+		 * known-assigned xids, but now we only need them for the logical
+		 * replication snapshot builder, and for the
+		 * pgstat_report_stat(true) call below.
+		 */
 		xl_running_xacts *xlrec = (xl_running_xacts *) XLogRecGetData(record);
-		RunningTransactionsData running;
 
-		running.xcnt = xlrec->xcnt;
-		running.subxcnt = xlrec->subxcnt;
-		running.subxid_status = xlrec->subxid_overflow ? SUBXIDS_MISSING : SUBXIDS_IN_ARRAY;
-		running.nextXid = xlrec->nextXid;
-		running.latestCompletedXid = xlrec->latestCompletedXid;
-		running.oldestRunningXid = xlrec->oldestRunningXid;
-		running.xids = xlrec->xids;
-
-		ProcArrayApplyRecoveryInfo(&running);
+		/*
+		 * Remember the oldest XID that was running at the time. Normally, all
+		 * transaction aborts and commits are WAL-logged, so our
+		 * oldestRunningXid value should be up-to-date, but if not, this
+		 * allows us to resynchronize.
+		 */
+		ProcArrayUpdateOldestRunningXid(xlrec->oldestRunningXid);
 
 		/*
 		 * The startup process currently has no convenient way to schedule
@@ -1224,50 +1222,46 @@ standby_redo(XLogReaderState *record)
  *
  * This is used for Hot Standby as follows:
  *
- * We can move directly to STANDBY_SNAPSHOT_READY at startup if we
- * start from a shutdown checkpoint because we know nothing was running
- * at that time and our recovery snapshot is known empty. In the more
- * typical case of an online checkpoint we need to jump through a few
- * hoops to get a correct recovery snapshot and this requires a two or
- * sometimes a three stage process.
+ * We can enter hot standby mode and start accepting read-only queries
+ * immediately at startup if we start from a shutdown checkpoint, because we
+ * know nothing was running at that time and our recovery snapshot is known
+ * empty. In the more typical case of an online checkpoint, the checkpoint
+ * record doesn't contain all the necessary information about running
+ * transaction state, and we need to jump through a few hoops to get a correct
+ * recovery snapshot.
  *
- * The initial snapshot must contain all running xids and all current
- * AccessExclusiveLocks at a point in time on the standby. Assembling
- * that information while the server is running requires many and
- * various LWLocks, so we choose to derive that information piece by
- * piece and then re-assemble that info on the standby. When that
- * information is fully assembled we move to STANDBY_SNAPSHOT_READY.
+ * The initial snapshot must contain all current AccessExclusiveLocks at a
+ * point in time on the standby. Assembling that information while the server
+ * is running requires many and various LWLocks, so we choose to derive that
+ * information piece by piece and then re-assemble that info on the standby.
  *
- * Since locking on the primary when we derive the information is not
- * strict, we note that there is a time window between the derivation and
- * writing to WAL of the derived information. That allows race conditions
- * that we must resolve, since xids and locks may enter or leave the
- * snapshot during that window. This creates the issue that an xid or
- * lock may start *after* the snapshot has been derived yet *before* the
- * snapshot is logged in the running xacts WAL record. We resolve this by
- * starting to accumulate changes at a point just prior to when we derive
- * the snapshot on the primary, then ignore duplicates when we later apply
- * the snapshot from the running xacts record. This is implemented during
- * CreateCheckPoint() where we use the logical checkpoint location as
- * our starting point and then write the running xacts record immediately
- * before writing the main checkpoint WAL record. Since we always start
- * up from a checkpoint and are immediately at our starting point, we
- * unconditionally move to STANDBY_INITIALIZED. After this point we
- * must do 4 things:
+ * Since locking on the primary when we derive the information is not strict,
+ * there is a time window between the derivation and writing to WAL of the
+ * derived information. That allows race conditions that we must resolve,
+ * since xids and locks may enter or leave the snapshot during that
+ * window. This creates the issue that an xid or lock may start *after* the
+ * snapshot has been derived yet *before* the snapshot is logged in the
+ * running xacts WAL record. We resolve this by starting to accumulate changes
+ * at a point just prior to when we collect the lock information on the
+ * primary, then ignore duplicates when we later apply the snapshot from the
+ * running xacts record. This is implemented during CreateCheckPoint() where
+ * we use the logical checkpoint location as our starting point and then write
+ * the running xacts record immediately before writing the main checkpoint WAL
+ * record. Since we always start up from a checkpoint's redo pointer, we will
+ * always see a running-xacts record before reaching the checkpoint
+ * record, and can immediately enter hot standby mode. After this point we
+ * must do 3 things:
  *	* move shared nextXid forwards as we see new xids
  *	* extend the clog and subtrans with each new xid
- *	* keep track of uncommitted known assigned xids
  *	* keep track of uncommitted AccessExclusiveLocks
  *
- * When we see a commit/abort we must remove known assigned xids and locks
- * from the completing transaction. Attempted removals that cannot locate
- * an entry are expected and must not cause an error when we are in state
- * STANDBY_INITIALIZED. This is implemented in StandbyReleaseLocks() and
- * KnownAssignedXidsRemove().
- *
- * Later, when we apply the running xact data we must be careful to ignore
- * transactions already committed, since those commits raced ahead when
- * making WAL entries.
+ * When we see a commit/abort we must advance oldest_running_primary_xid and
+ * remove locks from the completing transaction. Attempted removals that
+ * cannot locate an entry are expected and must not cause an error until we
+ * have seen the running-xacts record. (We don't throw an error even after
+ * that, because whatever the reason was, after the transaction has completed
+ * the issue has already been resolved anyway.) This is implemented in
+ * StandbyReleaseLocks().
  *
  * For logical decoding only the running xacts information is needed;
  * there's no need to look at the locking information, but it's logged anyway,
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 3df29658f18..aadec36dc15 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -140,6 +140,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_XACT_BUFFER] = "XactBuffer",
 	[LWTRANCHE_COMMITTS_BUFFER] = "CommitTsBuffer",
 	[LWTRANCHE_SUBTRANS_BUFFER] = "SubtransBuffer",
+	[LWTRANCHE_CSN_LOG_BUFFER] = "CsnLogBuffer",
 	[LWTRANCHE_MULTIXACTOFFSET_BUFFER] = "MultiXactOffsetBuffer",
 	[LWTRANCHE_MULTIXACTMEMBER_BUFFER] = "MultiXactMemberBuffer",
 	[LWTRANCHE_NOTIFY_BUFFER] = "NotifyBuffer",
@@ -178,6 +179,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_XACT_SLRU] = "XactSLRU",
 	[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
 	[LWTRANCHE_AIO_URING_COMPLETION] = "AioUringCompletion",
+	[LWTRANCHE_CSN_LOG_SLRU] = "CsnLogSLRU",
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4f44648aca8..95e248b2c88 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -363,6 +363,7 @@ AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
 XactBuffer	"Waiting for I/O on a transaction status SLRU buffer."
 CommitTsBuffer	"Waiting for I/O on a commit timestamp SLRU buffer."
 SubtransBuffer	"Waiting for I/O on a sub-transaction SLRU buffer."
+CsnLogBuffer	"Waiting for I/O on a CSN log SLRU buffer."
 MultiXactOffsetBuffer	"Waiting for I/O on a multixact offset SLRU buffer."
 MultiXactMemberBuffer	"Waiting for I/O on a multixact member SLRU buffer."
 NotifyBuffer	"Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e9e413477ba..d8ff9cfdb36 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -77,6 +77,8 @@ provider postgresql {
 	probe clog__checkpoint__done(bool);
 	probe subtrans__checkpoint__start(bool);
 	probe subtrans__checkpoint__done(bool);
+	probe csnlog__checkpoint__start(bool);
+	probe csnlog__checkpoint__done(bool);
 	probe multixact__checkpoint__start(bool);
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5f9f2b9d8b2..049c706f2cf 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -97,6 +97,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -1888,36 +1889,11 @@ XidInMVCCSnapshot(TransactionId xid, MVCCSnapshotShared snapshot)
 	}
 	else
 	{
-		/*
-		 * In recovery we store all xids in the subxip array because it is by
-		 * far the bigger array, and we mostly don't know which xids are
-		 * top-level and which are subxacts. The xip array is empty.
-		 *
-		 * We start by searching subtrans, if we overflowed.
-		 */
-		if (snapshot->suboverflowed)
-		{
-			/*
-			 * Snapshot overflowed, so convert xid to top-level.  This is safe
-			 * because we eliminated too-old XIDs above.
-			 */
-			xid = SubTransGetTopmostTransaction(xid);
-
-			/*
-			 * If xid was indeed a subxact, we might now have an xid < xmin,
-			 * so recheck to avoid an array scan.  No point in rechecking
-			 * xmax.
-			 */
-			if (TransactionIdPrecedes(xid, snapshot->xmin))
-				return false;
-		}
+		XLogRecPtr	csn = CSNLogGetCSNByXid(xid);
 
-		/*
-		 * We now have either a top-level xid higher than xmin or an
-		 * indeterminate xid. We don't know whether it's top level or subxact
-		 * but it doesn't matter. If it's present, the xid is visible.
-		 */
-		if (pg_lfind32(xid, snapshot->subxip, snapshot->subxcnt))
+		if (csn != InvalidXLogRecPtr && csn <= snapshot->snapshotCsn)
+			return false;
+		else
 			return true;
 	}
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index c17fda2bc81..f52817e218f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -251,7 +251,8 @@ static const char *const subdirs[] = {
 	"pg_xact",
 	"pg_logical",
 	"pg_logical/snapshots",
-	"pg_logical/mappings"
+	"pg_logical/mappings",
+	"pg_csn"
 };
 
 
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index a28d1667d4c..64fdd139173 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -146,6 +146,9 @@ static const char *const excludeDirContents[] =
 	/* Contents zeroed on startup, see StartupSUBTRANS(). */
 	"pg_subtrans",
 
+	/* Contents zeroed on startup, see StartupCSNLog(). */
+	"pg_csn",
+
 	/* end of list */
 	NULL
 };
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
new file mode 100644
index 00000000000..f8cdf573aef
--- /dev/null
+++ b/src/include/access/csn_log.h
@@ -0,0 +1,30 @@
+/*
+ * csn_log.h
+ *
+ * Mapping from XID to commit record's LSN (Commit Sequence Number).
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_log.h
+ */
+#ifndef CSNLOG_H
+#define CSNLOG_H
+
+#include "access/xlog.h"
+#include "utils/snapshot.h"
+
+extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
+						 TransactionId *subxids, XLogRecPtr csn);
+extern XLogRecPtr CSNLogGetCSNByXid(TransactionId xid);
+
+extern Size CSNLogShmemSize(void);
+extern void CSNLogShmemInit(void);
+extern void BootStrapCSNLog(void);
+extern void StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn);
+extern void ShutdownCSNLog(void);
+extern void CheckPointCSNLog(void);
+extern void ExtendCSNLog(TransactionId newestXact);
+extern void TruncateCSNLog(TransactionId oldestXact);
+
+#endif							/* CSNLOG_H */
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index e71c660118e..76411cca178 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -238,6 +238,9 @@ typedef struct TransamVariablesData
 	FullTransactionId latestCompletedXid;	/* newest full XID that has
 											 * committed or aborted */
 
+	/* During recovery, LSN of latest replayed commit record */
+	XLogRecPtr	latestCommitLSN;
+
 	/*
 	 * Number of top-level transactions with xids (i.e. which may have
 	 * modified the database) that completed in some form since the start of
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 9fa82355033..9527695886f 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -47,8 +47,7 @@ extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
 
-extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
-												 int *nxids_p);
+extern TransactionId PrescanPreparedTransactions(void);
 extern void StandbyRecoverPreparedTransactions(void);
 extern void RecoverPreparedTransactions(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index b2bc10ee041..b31944d0e6c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -171,7 +171,7 @@ typedef struct SavedTransactionCharacteristics
 #define XLOG_XACT_ABORT				0x20
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
-#define XLOG_XACT_ASSIGNMENT		0x50
+/* 0x50 is unused, was XLOG_XACT_ASSIGNMENT */
 #define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
@@ -215,15 +215,6 @@ typedef struct SavedTransactionCharacteristics
 #define XactCompletionForceSyncCommit(xinfo) \
 	((xinfo & XACT_COMPLETION_FORCE_SYNC_COMMIT) != 0)
 
-typedef struct xl_xact_assignment
-{
-	TransactionId xtop;			/* assigned XID's top-level XID */
-	int			nsubxacts;		/* number of subtransaction XIDs */
-	TransactionId xsub[FLEXIBLE_ARRAY_MEMBER];	/* assigned subxids */
-} xl_xact_assignment;
-
-#define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
-
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -448,7 +439,6 @@ extern FullTransactionId GetTopFullTransactionId(void);
 extern FullTransactionId GetTopFullTransactionIdIfAny(void);
 extern FullTransactionId GetCurrentFullTransactionId(void);
 extern FullTransactionId GetCurrentFullTransactionIdIfAny(void);
-extern void MarkCurrentTransactionIdLoggedIfAny(void);
 extern bool SubTransactionIsActive(SubTransactionId subxid);
 extern CommandId GetCurrentCommandId(bool used);
 extern void SetParallelStartTimestamps(TimestampTz xact_ts, TimestampTz stmt_ts);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index a1870d8e5aa..2ab20fcae2f 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -27,37 +27,10 @@ extern PGDLLIMPORT bool ignore_invalid_pages;
 extern PGDLLIMPORT bool InRecovery;
 
 /*
- * Like InRecovery, standbyState is only valid in the startup process.
- * In all other processes it will have the value STANDBY_DISABLED (so
- * InHotStandby will read as false).
- *
- * In DISABLED state, we're performing crash recovery or hot standby was
- * disabled in postgresql.conf.
- *
- * In INITIALIZED state, we've run InitRecoveryTransactionEnvironment, but
- * we haven't yet processed a RUNNING_XACTS or shutdown-checkpoint WAL record
- * to initialize our primary-transaction tracking system.
- *
- * When the transaction tracking is initialized, we enter the SNAPSHOT_PENDING
- * state. The tracked information might still be incomplete, so we can't allow
- * connections yet, but redo functions must update the in-memory state when
- * appropriate.
- *
- * In SNAPSHOT_READY mode, we have full knowledge of transactions that are
- * (or were) running on the primary at the current WAL location. Snapshots
- * can be taken, and read-only queries can be run.
+ * Like InRecovery, InHotStandby is only valid in the startup process.
+ * In all other processes it will be false.
  */
-typedef enum
-{
-	STANDBY_DISABLED,
-	STANDBY_INITIALIZED,
-	STANDBY_SNAPSHOT_PENDING,
-	STANDBY_SNAPSHOT_READY,
-} HotStandbyState;
-
-extern PGDLLIMPORT HotStandbyState standbyState;
-
-#define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)
+extern PGDLLIMPORT bool InHotStandby;
 
 
 extern bool XLogHaveInvalidPages(void);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 4df1d25c045..457c5511c5e 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -181,6 +181,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_XACT_BUFFER = NUM_INDIVIDUAL_LWLOCKS,
 	LWTRANCHE_COMMITTS_BUFFER,
 	LWTRANCHE_SUBTRANS_BUFFER,
+	LWTRANCHE_CSN_LOG_BUFFER,
 	LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 	LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 	LWTRANCHE_NOTIFY_BUFFER,
@@ -219,6 +220,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_XACT_SLRU,
 	LWTRANCHE_PARALLEL_VACUUM_DSA,
 	LWTRANCHE_AIO_URING_COMPLETION,
+	LWTRANCHE_CSN_LOG_SLRU,
 	LWTRANCHE_FIRST_USER_DEFINED,
 }			BuiltinTrancheIds;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 8eedc2d6b9f..57071d1e0f4 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -28,18 +28,11 @@ extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayClearTransaction(PGPROC *proc);
 
+extern void ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID);
 extern void ProcArrayInitRecovery(TransactionId initializedUptoXID);
-extern void ProcArrayApplyRecoveryInfo(RunningTransactions running);
-extern void ProcArrayApplyXidAssignment(TransactionId topxid,
-										int nsubxids, TransactionId *subxids);
 
 extern void RecordKnownAssignedTransactionIds(TransactionId xid);
-extern void ExpireTreeKnownAssignedTransactionIds(TransactionId xid,
-												  int nsubxids, TransactionId *subxids,
-												  TransactionId max_xid);
-extern void ExpireAllKnownAssignedTransactionIds(void);
-extern void ExpireOldKnownAssignedTransactionIds(TransactionId xid);
-extern void KnownAssignedTransactionIdsIdleMaintenance(void);
+extern void ProcArrayRecoveryEndTransaction(TransactionId max_xid, XLogRecPtr lsn);
 
 extern int	GetMaxSnapshotXidCount(void);
 extern int	GetMaxSnapshotSubxidCount(void);
@@ -56,7 +49,7 @@ extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
 extern TransactionId GetOldestNonRemovableTransactionId(Relation rel);
 extern TransactionId GetOldestTransactionIdConsideredRunning(void);
-extern TransactionId GetOldestActiveTransactionId(void);
+extern TransactionId GetOldestActiveTransactionId(bool allDbs);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *xmin, TransactionId *catalog_xmin);
 
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 193366ce052..14ff80904c8 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -13,6 +13,7 @@
 #ifndef SNAPSHOT_H
 #define SNAPSHOT_H
 
+#include "access/xlogdefs.h"
 #include "lib/ilist.h"
 
 
@@ -186,6 +187,13 @@ typedef struct MVCCSnapshotSharedData
 	int32		subxcnt;		/* # of xact ids in subxip[] */
 	bool		suboverflowed;	/* has the subxip array overflowed? */
 
+	/*
+	 * MVCC snapshots taken during recovery use this CSN instead of the xip
+	 * and subxip arrays. Any transactions that committed at or before this
+	 * LSN are considered as visible.
+	 */
+	XLogRecPtr	snapshotCsn;
+
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 
 	/*
-- 
2.39.5

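To make the visibility rule in the snapmgr.c hunk above easier to follow, here is a minimal standalone sketch of it. It assumes, as described at the top of the thread, that the CSN of a transaction is the LSN of its commit record. The toy_* names and the fixed-size array standing in for the CSN log are hypothetical and only for illustration; the patch itself uses CSNLogGetCSNByXid() on an SLRU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL types; simplified for illustration. */
typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical toy CSN log: slot = XID, value = commit LSN (0 = none). */
#define TOY_CSNLOG_SIZE 16
static XLogRecPtr toy_csnlog[TOY_CSNLOG_SIZE];

/* Record a replayed commit: the CSN is the LSN of the commit record. */
static void
toy_set_csn(TransactionId xid, XLogRecPtr commit_lsn)
{
	toy_csnlog[xid % TOY_CSNLOG_SIZE] = commit_lsn;
}

static XLogRecPtr
toy_get_csn(TransactionId xid)
{
	return toy_csnlog[xid % TOY_CSNLOG_SIZE];
}

/*
 * Same test as the new branch in XidInMVCCSnapshot: returns true if the
 * XID is still "in the snapshot", i.e. its effects are not visible.
 */
static bool
toy_xid_in_snapshot(TransactionId xid, XLogRecPtr snapshot_csn)
{
	XLogRecPtr	csn = toy_get_csn(xid);

	if (csn != InvalidXLogRecPtr && csn <= snapshot_csn)
		return false;			/* committed at or before the snapshot */
	return true;				/* no CSN recorded, or committed later */
}
```

With a snapshot CSN of 1500, an XID whose commit record was replayed at LSN 1000 is visible, while one that committed at LSN 2000, or that has no CSN recorded at all, is still treated as in progress. No XID array is involved, so the size of the snapshot does not depend on how many transactions or subtransactions were running.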
v6-0010-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch (text/x-patch; charset=UTF-8)

From 2565b8554e321e8ca9a87f36a48f9ab7f86ab247 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 20:01:07 +0300
Subject: [PATCH v6 10/12] Make SnapBuildWaitSnapshot work without
 xl_running_xacts.xids array

SnapBuildWaitSnapshot looped through all the XIDs in the
xl_running_xacts, waiting for them to finish. Change it to grab the
list of running XIDs from the proc array instead. This removes the
last usage of the XIDs array in the xl_running_xacts record, allowing
it to be removed in the next commit.

When SnapBuildWaitSnapshot() is called with running->nextXid as the
'cutoff' point, the new code should wait for exactly the same set of
transactions as before. But when called with initial_xmin_horizon as
the 'cutoff', this might wait for more transactions than before: those
between running->nextXid and initial_xmin_horizon. For example,
imagine that we see a running-xacts record with nextXid 100, and
initial_xmin_horizon is 200. Before, we would wait for all XIDs < 100
to complete, and then log the standby snapshot and proceed, but now we
will wait for all XIDs < 200. I believe that's a good thing, because
we won't actually be able to move to the next state in the snapshot
building until all transactions < 200 have completed. The
running-xacts snapshot that we logged after waiting up to XID 100
would not be useful to us either, if there are still XIDs between 100
and 200 running.

SnapBuildWaitSnapshot() used to do useless work when called in a
standby, because in a standby, there are no XID locks and the
XactLockTableWait() calls returned immediately, even if the XIDs were
in fact still running in the primary. But as the comment says, the
waiting isn't necessary for correctness, so that was harmless. In any
case, stop doing the futile work on a standby.
---
 src/backend/replication/logical/snapbuild.c | 50 ++++++++++++++-------
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 97d278052df..252526ecf91 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -164,7 +164,7 @@ static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, Transaction
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
-static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
+static void SnapBuildWaitSnapshot(TransactionId cutoff);
 
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
@@ -1192,14 +1192,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		NormalTransactionIdPrecedes(running->oldestRunningXid,
 									builder->initial_xmin_horizon))
 	{
+		TransactionId cutoff;
+
 		ereport(DEBUG1,
 				(errmsg_internal("skipping snapshot at %X/%X while building logical decoding snapshot, xmin horizon too low",
 								 LSN_FORMAT_ARGS(lsn)),
 				 errdetail_internal("initial xmin horizon of %u vs the snapshot's %u",
 									builder->initial_xmin_horizon, running->oldestRunningXid)));
 
-
-		SnapBuildWaitSnapshot(running, builder->initial_xmin_horizon);
+		cutoff = builder->initial_xmin_horizon;
+		TransactionIdRetreat(cutoff);
+		SnapBuildWaitSnapshot(cutoff);
 
 		return true;
 	}
@@ -1286,7 +1289,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
 						   running->xcnt, running->nextXid)));
 
-		SnapBuildWaitSnapshot(running, running->nextXid);
+		SnapBuildWaitSnapshot(running->nextXid);
 	}
 
 	/*
@@ -1310,7 +1313,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
 						   running->xcnt, running->nextXid)));
 
-		SnapBuildWaitSnapshot(running, running->nextXid);
+		SnapBuildWaitSnapshot(running->nextXid);
 	}
 
 	/*
@@ -1343,8 +1346,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 }
 
 /* ---
- * Iterate through xids in record, wait for all older than the cutoff to
- * finish.  Then, if possible, log a new xl_running_xacts record.
+ * Wait for all transactions older than the cutoff to finish.
+ * Then, if possible, log a new xl_running_xacts record.
  *
  * This isn't required for the correctness of decoding, but to:
  * a) allow isolationtester to notice that we're currently waiting for
@@ -1354,13 +1357,31 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
  * ---
  */
 static void
-SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
+SnapBuildWaitSnapshot(TransactionId cutoff)
 {
-	int			off;
+	RunningTransactions running;
+
+	if (RecoveryInProgress())
+	{
+		/*
+		 * During recovery, we have no mechanism for waiting for an XID to
+		 * finish, and we cannot create new running-xacts records either.
+		 */
+		return;
+	}
+
+	running = GetRunningTransactionData();
+
+	/*
+	 * GetRunningTransactionData returns with XidGenLock and ProcArrayLock
+	 * held, but we don't need them.
+	 */
+	LWLockRelease(XidGenLock);
+	LWLockRelease(ProcArrayLock);
 
-	for (off = 0; off < running->xcnt; off++)
+	for (int i = 0; i < running->xcnt; i++)
 	{
-		TransactionId xid = running->xids[off];
+		TransactionId xid = running->xids[i];
 
 		/*
 		 * Upper layers should prevent that we ever need to wait on ourselves.
@@ -1370,7 +1391,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
 		if (TransactionIdIsCurrentTransactionId(xid))
 			elog(ERROR, "waiting for ourselves");
 
-		if (TransactionIdFollows(xid, cutoff))
+		if (TransactionIdFollowsOrEquals(xid, cutoff))
 			continue;
 
 		XactLockTableWait(xid, NULL, NULL, XLTW_None);
@@ -1382,10 +1403,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
 	 * wait for bgwriter or checkpointer to log one.  During recovery we can't
 	 * enforce that, so we'll have to wait.
 	 */
-	if (!RecoveryInProgress())
-	{
-		LogStandbySnapshot();
-	}
+	LogStandbySnapshot();
 }
 
 #define SnapBuildOnDiskConstantSize \
-- 
2.39.5

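The effect of the cutoff in the patch above can be illustrated with a small standalone sketch. It only mirrors the filtering in the loop of SnapBuildWaitSnapshot() as written in the diff: XIDs at or past the cutoff are skipped, everything older is waited for. The toy_* names are hypothetical, the XID comparison is a simplified stand-in for TransactionIdFollowsOrEquals() that ignores special XIDs, and counting replaces the XactLockTableWait() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Simplified stand-in for TransactionIdFollowsOrEquals(): modulo-2^32
 * comparison of two normal XIDs.  Special XIDs are ignored here.
 */
static bool
toy_xid_follows_or_equals(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) >= 0;
}

/*
 * Count the XIDs the loop would wait for: everything in xids[] that is
 * strictly older than cutoff.  The real code calls XactLockTableWait()
 * on each such XID instead of counting.
 */
static size_t
toy_count_waited(const TransactionId *xids, size_t nxids, TransactionId cutoff)
{
	size_t		nwaited = 0;

	for (size_t i = 0; i < nxids; i++)
	{
		if (toy_xid_follows_or_equals(xids[i], cutoff))
			continue;			/* at or past the cutoff: not waited for */
		nwaited++;
	}
	return nwaited;
}
```

Taking the numbers from the commit message, with running XIDs 90, 99, 100, 150, 199 and 200, a cutoff of 100 (nextXid) waits for two transactions, while a cutoff of 200 waits for five. Note that the caller in the diff applies TransactionIdRetreat() to initial_xmin_horizon before passing it in, so the cutoff it actually uses is one less than the horizon.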
v6-0011-Remove-the-now-unused-xids-array-from-xl_running_.patch (text/x-patch; charset=UTF-8)

From 51212a4f053edb5e4ceef65e3ce5e722fbc3844b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 16:40:57 +0300
Subject: [PATCH v6 11/12] Remove the now-unused xids array from
 xl_running_xacts

We still generate running-xacts records, because they are still needed
to initialize the snapshot in logical decoding.
---
 src/backend/access/rmgrdesc/standbydesc.c   | 18 ------------
 src/backend/replication/logical/snapbuild.c |  8 +++---
 src/backend/storage/ipc/standby.c           | 32 +++++----------------
 src/include/storage/standby.h               |  2 --
 src/include/storage/standbydefs.h           | 16 +++++++----
 5 files changed, 21 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index 81eff5f31c4..5e6812396de 100644
--- a/src/backend/access/rmgrdesc/standbydesc.c
+++ b/src/backend/access/rmgrdesc/standbydesc.c
@@ -19,28 +19,10 @@
 static void
 standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
 {
-	int			i;
-
 	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u",
 					 xlrec->nextXid,
 					 xlrec->latestCompletedXid,
 					 xlrec->oldestRunningXid);
-	if (xlrec->xcnt > 0)
-	{
-		appendStringInfo(buf, "; %d xacts:", xlrec->xcnt);
-		for (i = 0; i < xlrec->xcnt; i++)
-			appendStringInfo(buf, " %u", xlrec->xids[i]);
-	}
-
-	if (xlrec->subxid_overflow)
-		appendStringInfoString(buf, "; subxid overflowed");
-
-	if (xlrec->subxcnt > 0)
-	{
-		appendStringInfo(buf, "; %d subxacts:", xlrec->subxcnt);
-		for (i = 0; i < xlrec->subxcnt; i++)
-			appendStringInfo(buf, " %u", xlrec->xids[xlrec->xcnt + i]);
-	}
 }
 
 void
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 252526ecf91..eada641d2a4 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1286,8 +1286,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		ereport(LOG,
 				(errmsg("logical decoding found initial starting point at %X/%X",
 						LSN_FORMAT_ARGS(lsn)),
-				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
-						   running->xcnt, running->nextXid)));
+				 errdetail("Waiting for transactions older than %u to end.",
+						   running->nextXid)));
 
 		SnapBuildWaitSnapshot(running->nextXid);
 	}
@@ -1310,8 +1310,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		ereport(LOG,
 				(errmsg("logical decoding found initial consistent point at %X/%X",
 						LSN_FORMAT_ARGS(lsn)),
-				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
-						   running->xcnt, running->nextXid)));
+				 errdetail("Waiting for transactions older than %u to end.",
+						   running->nextXid)));
 
 		SnapBuildWaitSnapshot(running->nextXid);
 	}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 217b1670f5b..0f8a9aa0fea 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1337,9 +1337,6 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	xl_running_xacts xlrec;
 	XLogRecPtr	recptr;
 
-	xlrec.xcnt = CurrRunningXacts->xcnt;
-	xlrec.subxcnt = CurrRunningXacts->subxcnt;
-	xlrec.subxid_overflow = (CurrRunningXacts->subxid_status != SUBXIDS_IN_ARRAY);
 	xlrec.nextXid = CurrRunningXacts->nextXid;
 	xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid;
 	xlrec.latestCompletedXid = CurrRunningXacts->latestCompletedXid;
@@ -1347,31 +1344,16 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	/* Header */
 	XLogBeginInsert();
 	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
-	XLogRegisterData(&xlrec, MinSizeOfXactRunningXacts);
-
-	/* array of TransactionIds */
-	if (xlrec.xcnt > 0)
-		XLogRegisterData(CurrRunningXacts->xids,
-						 (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
+	XLogRegisterData(&xlrec, SizeOfXactRunningXacts);
 
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
-	if (xlrec.subxid_overflow)
-		elog(DEBUG2,
-			 "snapshot of %d running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
-			 CurrRunningXacts->xcnt,
-			 LSN_FORMAT_ARGS(recptr),
-			 CurrRunningXacts->oldestRunningXid,
-			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
-	else
-		elog(DEBUG2,
-			 "snapshot of %d+%d running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
-			 CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
-			 LSN_FORMAT_ARGS(recptr),
-			 CurrRunningXacts->oldestRunningXid,
-			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
+	elog(DEBUG2,
+		 "logging running transaction bounds (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
+		 LSN_FORMAT_ARGS(recptr),
+		 CurrRunningXacts->oldestRunningXid,
+		 CurrRunningXacts->latestCompletedXid,
+		 CurrRunningXacts->nextXid);
 
 	/*
 	 * Ensure running_xacts information is synced to disk not too far in the
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 24e2f5082bc..d73a8f58a73 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -60,8 +60,6 @@ extern void StandbyReleaseLockTree(TransactionId xid,
 extern void StandbyReleaseAllLocks(void);
 extern void StandbyReleaseOldLocks(TransactionId oldxid);
 
-#define MinSizeOfXactRunningXacts offsetof(xl_running_xacts, xids)
-
 
 /*
  * Declarations for GetRunningTransactionData(). Similar to Snapshots, but
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index 71e5ae878b5..3d182b66e74 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -42,20 +42,24 @@ typedef struct xl_standby_locks
 } xl_standby_locks;
 
 /*
- * When we write running xact data to WAL, we use this structure.
+ * Data included in an XLOG_RUNNING_XACTS record.
+ *
+ * This used to include a list of running XIDs, hence the name, but nowadays
+ * this only contains the min and max bounds of the transactions that were
+ * running when the record was written.  They are needed to initialize logical
+ * decoding.  They are also used in hot standby to prune information about old
+ * running transactions, in case the primary didn't write a COMMIT/ABORT
+ * record for some reason.
  */
 typedef struct xl_running_xacts
 {
-	int			xcnt;			/* # of xact ids in xids[] */
-	int			subxcnt;		/* # of subxact ids in xids[] */
-	bool		subxid_overflow;	/* snapshot overflowed, subxids missing */
 	TransactionId nextXid;		/* xid from TransamVariables->nextXid */
 	TransactionId oldestRunningXid; /* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
-
-	TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
 } xl_running_xacts;
 
+#define SizeOfXactRunningXacts sizeof(xl_running_xacts)
+
 /*
  * Invalidations for standby, currently only when transactions without an
  * assigned xid commit.
-- 
2.39.5

v6-0012-Add-a-cache-to-Snapshot-to-avoid-repeated-CSN-loo.patch (text/x-patch; charset=UTF-8)

From 6b8e856c15750f89f9d559ae9f9fbd7f3f2db125 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 1 Apr 2025 00:18:14 +0300
Subject: [PATCH v6 12/12] Add a cache to Snapshot to avoid repeated CSN
 lookups

Cache the status of all XIDs that have been looked up in the CSN log
in the SnapshotData. This avoids having to go to the CSN log in the
common case that the same XIDs are looked up over and over again.
---
 src/backend/utils/time/snapmgr.c | 111 +++++++++++++++++++++++++++++--
 src/include/utils/snapshot.h     |   9 +++
 2 files changed, 116 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 049c706f2cf..250ba1650e4 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -114,6 +114,35 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Define a radix tree implementation to cache CSN lookups in a snapshot.
+ *
+ * We need only one bit of information for each XID stored in the cache: was
+ * the XID running or not.  However, the radix tree implementation uses 8
+ * bytes for each entry (on 64-bit machines) even if the value type is smaller
+ * than that.  To reduce memory usage, we use uint64 as the value type, and
+ * store multiple XIDs in each value.
+ *
+ * The 64-bit value word holds two bits for each XID: whether the XID is
+ * present in the cache or not, and if it's present, whether it's considered
+ * as in-progress by the snapshot or not.  So each entry in the radix tree
+ * holds the status for 32 XIDs.
+ */
+#define RT_PREFIX inprogress_cache
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define INPROGRESS_CACHE_BITS 2
+#define INPROGRESS_CACHE_XIDS_PER_WORD 32
+
+#define INPROGRESS_CACHE_XID_IS_CACHED(word, slotno) \
+	((((word) & (UINT64CONST(1) << (slotno)))) != 0)
+
+#define INPROGRESS_CACHE_XID_IS_IN_PROGRESS(word, slotno) \
+	((((word) & (UINT64CONST(1) << ((slotno) + 1)))) != 0)
 
 /*
  * CurrentSnapshot points to the only snapshot taken in transaction-snapshot
@@ -240,6 +269,7 @@ typedef struct SerializedSnapshotData
 	int32		subxcnt;
 	bool		suboverflowed;
 	bool		takenDuringRecovery;
+	XLogRecPtr	snapshotCsn;
 	CommandId	curcid;
 } SerializedSnapshotData;
 
@@ -1177,6 +1207,7 @@ ExportSnapshot(MVCCSnapshotShared snapshot)
 			appendStringInfo(&buf, "sxp:%u\n", children[i]);
 	}
 	appendStringInfo(&buf, "rec:%u\n", snapshot->takenDuringRecovery);
+	appendStringInfo(&buf, "snapshotcsn:%X/%X\n", LSN_FORMAT_ARGS(snapshot->snapshotCsn));
 
 	/*
 	 * Now write the text representation into a file.  We first write to a
@@ -1449,6 +1480,7 @@ ImportSnapshot(const char *idstr)
 	}
 
 	snapshot->takenDuringRecovery = parseIntFromText("rec:", &filebuf, path);
+	snapshot->snapshotCsn = parseIntFromText("snapshotcsn:", &filebuf, path);
 
 	snapshot->refcount = 1;
 	valid_snapshots_push_out_of_order(snapshot);
@@ -1702,6 +1734,7 @@ SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
 	serialized_snapshot.subxcnt = snapshot->shared->subxcnt;
 	serialized_snapshot.suboverflowed = snapshot->shared->suboverflowed;
 	serialized_snapshot.takenDuringRecovery = snapshot->shared->takenDuringRecovery;
+	serialized_snapshot.snapshotCsn = snapshot->shared->snapshotCsn;
 	serialized_snapshot.curcid = snapshot->curcid;
 
 	/*
@@ -1776,6 +1809,9 @@ RestoreSnapshot(char *start_address)
 	snapshot->shared->subxcnt = serialized_snapshot.subxcnt;
 	snapshot->shared->suboverflowed = serialized_snapshot.suboverflowed;
 	snapshot->shared->takenDuringRecovery = serialized_snapshot.takenDuringRecovery;
+	snapshot->shared->snapshotCsn = serialized_snapshot.snapshotCsn;
+	snapshot->shared->inprogress_cache = NULL;
+	snapshot->shared->inprogress_cache_cxt = NULL;
 	snapshot->shared->snapXactCompletionCount = 0;
 
 	snapshot->shared->refcount = 1;
@@ -1889,12 +1925,62 @@ XidInMVCCSnapshot(TransactionId xid, MVCCSnapshotShared snapshot)
 	}
 	else
 	{
-		XLogRecPtr	csn = CSNLogGetCSNByXid(xid);
+		XLogRecPtr	csn;
+		bool		inprogress;
+		uint64	   *cache_entry;
+		uint64		cache_word = 0;
 
-		if (csn != InvalidXLogRecPtr && csn <= snapshot->snapshotCsn)
-			return false;
+		/*
+		 * Calculate the word and bit slot for the XID in the cache. We use an
+		 * offset from xmax as the key instead of the XID directly, because
+		 * the radix tree can compact away leading zeros and is thus more
+		 * efficient with keys closer to 0.
+		 */
+		uint32		cache_idx = snapshot->xmax - xid;
+		uint64		wordno = cache_idx / INPROGRESS_CACHE_XIDS_PER_WORD;
+		uint64		slotno = (cache_idx % INPROGRESS_CACHE_XIDS_PER_WORD) * INPROGRESS_CACHE_BITS;
+
+		if (snapshot->inprogress_cache)
+		{
+			cache_entry = inprogress_cache_find(snapshot->inprogress_cache, wordno);
+			if (cache_entry)
+			{
+				cache_word = *cache_entry;
+				if (INPROGRESS_CACHE_XID_IS_CACHED(cache_word, slotno))
+					return INPROGRESS_CACHE_XID_IS_IN_PROGRESS(cache_word, slotno);
+			}
+		}
 		else
-			return true;
+		{
+			MemoryContext save_cxt;
+
+			save_cxt = MemoryContextSwitchTo(TopMemoryContext);
+
+			if (snapshot->inprogress_cache_cxt == NULL)
+				snapshot->inprogress_cache_cxt =
+					AllocSetContextCreate(TopMemoryContext,
+										  "snapshot inprogress cache context",
+										  ALLOCSET_SMALL_SIZES);
+			snapshot->inprogress_cache = inprogress_cache_create(snapshot->inprogress_cache_cxt);
+			cache_entry = NULL;
+			MemoryContextSwitchTo(save_cxt);
+		}
+
+		/* Not found in cache, look up the CSN */
+		csn = CSNLogGetCSNByXid(xid);
+		inprogress = (csn == InvalidXLogRecPtr || csn > snapshot->snapshotCsn);
+
+		/* Update the cache word, and store it back to the radix tree */
+		cache_word |= UINT64CONST(1) << slotno; /* cached */
+		if (inprogress)
+			cache_word |= UINT64CONST(1) << (slotno + 1);	/* in-progress */
+
+		if (cache_entry)
+			*cache_entry = cache_word;
+		else
+			inprogress_cache_set(snapshot->inprogress_cache, wordno, &cache_word);
+
+		return inprogress;
 	}
 
 	return false;
@@ -1944,6 +2030,9 @@ AllocMVCCSnapshotShared(void)
 
 	shared->snapXactCompletionCount = 0;
 	shared->refcount = 0;
+	shared->snapshotCsn = InvalidXLogRecPtr;
+	shared->inprogress_cache = NULL;
+	shared->inprogress_cache_cxt = NULL;
 
 	MemoryContextSwitchTo(save_cxt);
 
@@ -1972,8 +2061,22 @@ void
 FreeMVCCSnapshotShared(MVCCSnapshotShared shared)
 {
 	Assert(shared->refcount == 0);
+
+	if (shared->inprogress_cache)
+	{
+		inprogress_cache_free(shared->inprogress_cache);
+		shared->inprogress_cache = NULL;
+	}
+	if (shared->inprogress_cache_cxt)
+	{
+		MemoryContextDelete(shared->inprogress_cache_cxt);
+		shared->inprogress_cache_cxt = NULL;
+	}
+
 	if (spareSnapshotShared == NULL)
+	{
 		spareSnapshotShared = shared;
+	}
 	else
 		pfree(shared);
 }
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 14ff80904c8..edf5bf1ba0a 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -129,6 +129,8 @@ typedef enum MVCCSnapshotKind
 	SNAPSHOT_REGISTERED,
 } MVCCSnapshotKind;
 
+struct inprogress_cache_radix_tree; /* private to snapmgr.c */
+
 /*
  * Struct representing a normal MVCC snapshot.
  *
@@ -194,6 +196,13 @@ typedef struct MVCCSnapshotSharedData
 	 */
 	XLogRecPtr	snapshotCsn;
 
+	/*
+	 * Cache of XIDs known to be running or not according to the snapshot.
+	 * Used in snapshots taken during recovery.
+	 */
+	struct inprogress_cache_radix_tree *inprogress_cache;
+	MemoryContext inprogress_cache_cxt;
+
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 
 	/*
-- 
2.39.5

#14

贾明伟

wei19860922@163.com

9 months ago

In reply to: Heikki Linnakangas (#13)

13 attachment(s)

Re: CSN snapshots in hot standby

Hi,

Thanks for the proposal, it's an interesting approach.

I have a question regarding the xid visibility during standby startup.
If the checkpoint’s `oldestActiveXid` is smaller than `nextXid`, then there may be already-committed transactions
in that range which will not be replayed on standby. In that case, I believe clog needs to be used for visibility
checks within that xid range — is that correct?
On top of your previous discussion, I wrote a test case and attempted to fix the issue.
Patches 0001–0012 are your original commits, unchanged. Patch 0013 contains my own ideas.
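
To make the question above concrete, here is a minimal sketch of one way to read it. It is not the code from patch 0013. All names (toy_xid_in_snapshot, startup_next_xid, committed_in_clog) are hypothetical, CSNs and XIDs are plain integers, and XID wraparound is ignored. The assumption is that an XID below the nextXid seen at standby startup, with no CSN recorded but marked committed in clog, must have committed before replay began and is therefore visible to every standby snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_INVALID_CSN UINT64_C(0)

/*
 * Returns true if the snapshot should treat xid as still in progress.
 *
 * csn               - commit LSN recorded for xid during replay, or
 *                     TOY_INVALID_CSN if no commit record was replayed
 * snapshot_csn      - replay position at which the snapshot was taken
 * startup_next_xid  - nextXid from the checkpoint the standby started at
 * committed_in_clog - what clog says about xid
 */
static bool
toy_xid_in_snapshot(uint32_t xid, uint64_t csn, uint64_t snapshot_csn,
					uint32_t startup_next_xid, bool committed_in_clog)
{
	/* A commit record was replayed: compare its LSN with the snapshot. */
	if (csn != TOY_INVALID_CSN)
		return csn > snapshot_csn;

	/*
	 * No CSN is known.  If the XID predates standby startup and clog says
	 * it committed, the commit happened before replay began.
	 */
	if (xid < startup_next_xid && committed_in_clog)
		return false;

	/* Otherwise treat it as still running (or aborted). */
	return true;
}
```

Under these assumptions, only the second branch consults clog, and only for XIDs in the range that the standby never saw commit records for.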

Since this is my first time replying to the mailing list, I was worried about breaking the thread, so I’ve included everything as attachments instead.

Looking forward to your thoughts.
Best regards,
Mingwei Jia


On April 1, 2025, at 05:31, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Here's a new patchset version. Not much has changed in the actual CSN patches. But I spent a lot of time refactoring the snapshot management code, so that there is a simple place to add the "inprogress XID cache" for the CSN snapshots, in a way that avoids duplicating the cache if a snapshot is copied around.

Patches 0001-0002 are the patches I posted on a separate thread earlier. See /messages/by-id/ec10d398-c9b3-4542-8095-5fc6408b17d1@iki.fi.

Patches 0003-0006 contain more snapshot manager changes. The end state is that an MVCC snapshot consists of two structs: a shared "inner" struct that contains xmin, xmax and the XID lists, and an "outer" struct that contains a pointer to the shared struct and the current command ID. As a snapshot is copied around, all the copies share the same shared, reference-counted struct.
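
The inner/outer split described above can be illustrated with a small standalone sketch. The type and function names here (ToySnapshotShared, ToySnapshot, toy_snapshot_*) are hypothetical and use malloc/free rather than PostgreSQL memory contexts; the real structs carry more fields. The point is only that copies share one reference-counted inner struct while each copy keeps its own command ID.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Shared "inner" part: immutable once built, reference-counted. */
typedef struct ToySnapshotShared
{
	uint32_t	xmin;
	uint32_t	xmax;
	int			refcount;
} ToySnapshotShared;

/* "Outer" part: pointer to the shared struct plus per-copy state. */
typedef struct ToySnapshot
{
	ToySnapshotShared *shared;
	uint32_t	curcid;
} ToySnapshot;

static ToySnapshot
toy_snapshot_create(uint32_t xmin, uint32_t xmax, uint32_t curcid)
{
	ToySnapshot snap;

	snap.shared = (ToySnapshotShared *) malloc(sizeof(ToySnapshotShared));
	if (snap.shared == NULL)
		abort();
	snap.shared->xmin = xmin;
	snap.shared->xmax = xmax;
	snap.shared->refcount = 1;
	snap.curcid = curcid;
	return snap;
}

/* Copying shares the inner struct; only the command ID differs. */
static ToySnapshot
toy_snapshot_copy(const ToySnapshot *src, uint32_t curcid)
{
	ToySnapshot copy;

	copy.shared = src->shared;
	copy.shared->refcount++;
	copy.curcid = curcid;
	return copy;
}

/* Drops one reference; returns 1 if the shared struct was freed. */
static int
toy_snapshot_release(ToySnapshot *snap)
{
	ToySnapshotShared *shared = snap->shared;

	snap->shared = NULL;
	if (--shared->refcount > 0)
		return 0;
	free(shared);
	return 1;
}
```

With this shape, anything attached to the shared struct (such as a lookup cache) exists once no matter how many times the snapshot is copied.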

The rest of the patches are the same CSN patches I posted before, rebased over the snapshot manager changes.

There's one thing that hasn't been discussed yet: The ProcArrayRecoveryEndTransaction() function, which replaces ExpireTreeKnownAssignedTransactionIds() and is called on replay of every commit/abort record, does this:

	/*
	 * If this was the oldest XID that was still running, advance it. This is
	 * important for advancing the global xmin, which avoids unnecessary
	 * recovery conflicts
	 *
	 * No locking required because this runs in the startup process.
	 *
	 * XXX: the caller actually has a list of XIDs that just committed. We
	 * could save some clog lookups by taking advantage of that list.
	 */
	oldest_running_primary_xid = procArray->oldest_running_primary_xid;
	while (oldest_running_primary_xid < max_xid)
	{
		if (!TransactionIdDidCommit(oldest_running_primary_xid) &&
			!TransactionIdDidAbort(oldest_running_primary_xid))
		{
			break;
		}
		TransactionIdAdvance(oldest_running_primary_xid);
	}
	if (max_xid == oldest_running_primary_xid)
		TransactionIdAdvance(oldest_running_primary_xid);

The point is to maintain an "oldest xmin" value based on the WAL records that are being replayed. Whenever the currently oldest running XID finishes, we scan the CLOG to find the next oldest XID that hasn't completed yet.

That adds approximately one or two CLOG lookups to every commit record replay on average. I haven't tried measuring that, but it seems like it could slow down recovery. There are ways to improve that, for example by doing it in larger batches.
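
As a rough illustration of where those lookups come from, the loop quoted above can be modelled with a toy status array in place of the CLOG. Everything here is an assumption made for the sketch: the names (ToyXidStatus, toy_xid_is_finished, advance_oldest_running) are hypothetical, XIDs are plain array indexes with no wraparound handling, and each status check counts as one lookup, whereas the real code calls both TransactionIdDidCommit() and TransactionIdDidAbort().

```c
#include <assert.h>
#include <stdint.h>

typedef enum
{
	XID_IN_PROGRESS,
	XID_COMMITTED,
	XID_ABORTED
} ToyXidStatus;

/* Counts how many status lookups the advance loop performs. */
static int	toy_lookups = 0;

static int
toy_xid_is_finished(const ToyXidStatus *status, uint32_t xid)
{
	toy_lookups++;
	return status[xid] != XID_IN_PROGRESS;
}

/*
 * Advance "oldest" past every finished XID below max_xid, where max_xid
 * is the newest XID named in the commit/abort record being replayed.
 * If we reach max_xid itself, step past it too, since it just finished.
 */
static uint32_t
advance_oldest_running(const ToyXidStatus *status, uint32_t oldest,
					   uint32_t max_xid)
{
	while (oldest < max_xid)
	{
		if (!toy_xid_is_finished(status, oldest))
			break;
		oldest++;
	}
	if (oldest == max_xid)
		oldest++;
	return oldest;
}
```

In this model, the scan stops at the first XID that is still running, so the cost per replayed record depends on how many finished XIDs sit between the old and new oldest-running values.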

A bunch of other small XXX comments remain, but they're just markers for comments that need to be adjusted, or for further cleanups that are now possible.

There are also several ways the inprogress cache could be made more efficient, which I haven't explored:

- For each XID in the cache radix tree, we store one bit to indicate whether the lookup has been performed, i.e. if the cache is valid for the XID, and another bit to indicate if the XID is visible or not. With 64-bit cache words stored in the radix tree, each cache word can store the status of 32 transactions. It would probably be better to work in bigger chunks. For example, when doing a lookup in the cache, check the status of 64 transactions at once. Assuming they're all stored on the same CSN page, it would not be much more expensive than a single XID lookup. That would make the cache 2x more compact, and save on future lookups of XIDs falling on the same cache word.

- Initializing the radix tree cache is fairly expensive, with several memory allocations. Many of those allocations could be done lazily with some effort in radixtree.h.

- Or start the cache as a small array of XIDs, and switch to the radix tree only after it fills up.
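
The two-bits-per-XID layout from the first item above can be sketched in isolation. This is a simplified model, not the code in snapmgr.c: the helper names (xid_slot, cache_word_lookup, cache_word_store) are hypothetical, and the radix tree that maps a word number to its 64-bit word is left out. As in the patch, the key is the offset from xmax rather than the XID itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_BITS_PER_XID	2
#define CACHE_XIDS_PER_WORD	32

typedef struct XidSlot
{
	uint64_t	wordno;		/* which 64-bit cache word */
	unsigned	bitno;		/* position of the "cached" bit in that word */
} XidSlot;

/* Map an XID to its cache word and bit position, keyed by xmax - xid. */
static XidSlot
xid_slot(uint32_t xmax, uint32_t xid)
{
	uint32_t	idx = xmax - xid;
	XidSlot		slot;

	slot.wordno = idx / CACHE_XIDS_PER_WORD;
	slot.bitno = (idx % CACHE_XIDS_PER_WORD) * CACHE_BITS_PER_XID;
	return slot;
}

/* Returns -1 if not cached, 0 if cached as finished, 1 if in progress. */
static int
cache_word_lookup(uint64_t word, unsigned bitno)
{
	if ((word & (UINT64_C(1) << bitno)) == 0)
		return -1;
	return (word & (UINT64_C(1) << (bitno + 1))) != 0 ? 1 : 0;
}

/* Returns the word with the XID's "cached" and "in-progress" bits set. */
static uint64_t
cache_word_store(uint64_t word, unsigned bitno, bool inprogress)
{
	word |= UINT64_C(1) << bitno;
	if (inprogress)
		word |= UINT64_C(1) << (bitno + 1);
	return word;
}
```

Widening the lookup to 64 XIDs at once, as suggested above, would let one word carry a single status bit per XID, since every slot in a fetched word would be known to be valid.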

--
Heikki Linnakangas
Neon (https://neon.tech)
<v6-0001-Split-SnapshotData-into-separate-structs-for-each.patch><v6-0002-Simplify-historic-snapshot-refcounting.patch><v6-0003-Add-an-explicit-valid-flag-to-MVCCSnapshotData.patch><v6-0004-Replace-static-snapshot-pointers-with-the-valid-f.patch><v6-0005-Make-RestoreSnapshot-register-the-snapshot-with-c.patch><v6-0006-Replace-the-RegisteredSnapshot-pairing-heap-with-.patch><v6-0007-Split-MVCCSnapshot-into-inner-and-outer-parts.patch><v6-0008-XXX-add-perf-test.patch><v6-0009-Use-CSN-snapshots-during-Hot-Standby.patch><v6-0010-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch><v6-0011-Remove-the-now-unused-xids-array-from-xl_running_.patch><v6-0012-Add-a-cache-to-Snapshot-to-avoid-repeated-CSN-loo.patch>

Attachments:

v7-0001-Split-SnapshotData-into-separate-structs-for-each.patch (application/octet-stream)

From c2b5bc5f1f2cd959c695a91bd2eec047440426fc Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 20 Dec 2024 00:36:33 +0200
Subject: [PATCH v6 01/12] Split SnapshotData into separate structs for each
 kind of snapshot

The SnapshotData fields were repurposed for different uses depending
on the kind of snapshot. Split it into separate structs for different
kinds of snapshots, so that it is more clear which fields are used
with which snapshot kind, and the fields can have more descriptive
names.
---
 contrib/amcheck/verify_heapam.c               |   2 +-
 contrib/amcheck/verify_nbtree.c               |   2 +-
 src/backend/access/heap/heapam.c              |   3 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/heap/heapam_visibility.c   |  24 +--
 src/backend/access/index/indexam.c            |  11 +-
 src/backend/access/nbtree/nbtinsert.c         |   4 +-
 src/backend/access/spgist/spgvacuum.c         |   2 +-
 src/backend/access/table/tableam.c            |   8 +-
 src/backend/access/transam/parallel.c         |  14 +-
 src/backend/catalog/pg_inherits.c             |   2 +-
 src/backend/commands/async.c                  |   4 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/commands/tablecmds.c              |   2 +-
 src/backend/executor/execIndexing.c           |   4 +-
 src/backend/executor/execReplication.c        |   8 +-
 src/backend/partitioning/partdesc.c           |   2 +-
 src/backend/replication/logical/decode.c      |   2 +-
 src/backend/replication/logical/origin.c      |   4 +-
 .../replication/logical/reorderbuffer.c       | 114 +++++-----
 src/backend/replication/logical/snapbuild.c   | 114 +++++-----
 src/backend/replication/walsender.c           |   2 +-
 src/backend/storage/ipc/procarray.c           |   6 +-
 src/backend/storage/lmgr/predicate.c          |  32 +--
 src/backend/utils/adt/xid8funcs.c             |   4 +-
 src/backend/utils/time/snapmgr.c              | 198 +++++++++++-------
 src/include/access/heapam.h                   |   2 +-
 src/include/access/relscan.h                  |   6 +-
 src/include/replication/reorderbuffer.h       |  12 +-
 src/include/replication/snapbuild.h           |   6 +-
 src/include/replication/snapbuild_internal.h  |   2 +-
 src/include/storage/predicate.h               |   4 +-
 src/include/storage/procarray.h               |   2 +-
 src/include/utils/snapmgr.h                   |  16 +-
 src/include/utils/snapshot.h                  | 155 +++++++++-----
 src/tools/pgindent/typedefs.list              |   4 +
 36 files changed, 451 insertions(+), 336 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 1970fc8620a..6665cafc179 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -310,7 +310,7 @@ verify_heapam(PG_FUNCTION_ARGS)
 	 * Any xmin newer than the xmin of our snapshot can't become all-visible
 	 * while we're running.
 	 */
-	ctx.safe_xmin = GetTransactionSnapshot()->xmin;
+	ctx.safe_xmin = GetTransactionSnapshot()->mvcc.xmin;
 
 	/*
 	 * If we report corruption when not examining some individual attribute,
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f11c43a0ed7..e90b4a2ad5a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -458,7 +458,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 			 */
 			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
 				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
+									   snapshot->mvcc.xmin))
 				ereport(ERROR,
 						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6e433db039e..0cfa100cbd1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -605,7 +605,8 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	 * full page write. Until we can prove that beyond doubt, let's check each
 	 * tuple for visibility the hard way.
 	 */
-	all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery;
+	all_visible = PageIsAllVisible(page) &&
+		(snapshot->snapshot_type != SNAPSHOT_MVCC || !snapshot->mvcc.takenDuringRecovery);
 	check_serializable =
 		CheckForSerializableConflictOutNeeded(scan->rs_base.rs_rd, snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 24d3765aa20..fce657f00f6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -390,7 +390,7 @@ tuple_lock_retry:
 
 		if (!ItemPointerEquals(&tmfd->ctid, &tuple->t_self))
 		{
-			SnapshotData SnapshotDirty;
+			DirtySnapshotData SnapshotDirty;
 			TransactionId priorXmax;
 
 			/* it was updated, so look at the updated version */
@@ -415,7 +415,7 @@ tuple_lock_retry:
 							 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
 
 				tuple->t_self = *tid;
-				if (heap_fetch(relation, &SnapshotDirty, tuple, &buffer, true))
+				if (heap_fetch(relation, (Snapshot) &SnapshotDirty, tuple, &buffer, true))
 				{
 					/*
 					 * If xmin isn't what we're expecting, the slot must have
@@ -2308,7 +2308,7 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
 
 	page = (Page) BufferGetPage(hscan->rs_cbuf);
 	all_visible = PageIsAllVisible(page) &&
-		!scan->rs_snapshot->takenDuringRecovery;
+		(scan->rs_snapshot->snapshot_type != SNAPSHOT_MVCC || !scan->rs_snapshot->mvcc.takenDuringRecovery);
 	maxoffset = PageGetMaxOffsetNumber(page);
 
 	for (;;)
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 05f6946fe60..f5d69b558f1 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -740,7 +740,7 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
  * token is also returned in snapshot->speculativeToken.
  */
 static bool
-HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
+HeapTupleSatisfiesDirty(HeapTuple htup, DirtySnapshotData *snapshot,
 						Buffer buffer)
 {
 	HeapTupleHeader tuple = htup->t_data;
@@ -957,7 +957,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
  * and more contention on ProcArrayLock.
  */
 static bool
-HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
+HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 					   Buffer buffer)
 {
 	HeapTupleHeader tuple = htup->t_data;
@@ -1435,7 +1435,7 @@ HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *de
  *	snapshot->vistest must have been set up with the horizon to use.
  */
 static bool
-HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
+HeapTupleSatisfiesNonVacuumable(HeapTuple htup, NonVacuumableSnapshotData *snapshot,
 								Buffer buffer)
 {
 	TransactionId dead_after = InvalidTransactionId;
@@ -1593,7 +1593,7 @@ TransactionIdInArray(TransactionId xid, TransactionId *xip, Size num)
  * complicated than when dealing "only" with the present.
  */
 static bool
-HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
+HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, HistoricMVCCSnapshot snapshot,
 							   Buffer buffer)
 {
 	HeapTupleHeader tuple = htup->t_data;
@@ -1610,7 +1610,7 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 		return false;
 	}
 	/* check if it's one of our txids, toplevel is also in there */
-	else if (TransactionIdInArray(xmin, snapshot->subxip, snapshot->subxcnt))
+	else if (TransactionIdInArray(xmin, snapshot->curxip, snapshot->curxcnt))
 	{
 		bool		resolved;
 		CommandId	cmin = HeapTupleHeaderGetRawCommandId(tuple);
@@ -1669,7 +1669,7 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 		return false;
 	}
 	/* check if it's a committed transaction in [xmin, xmax) */
-	else if (TransactionIdInArray(xmin, snapshot->xip, snapshot->xcnt))
+	else if (TransactionIdInArray(xmin, snapshot->committed_xids, snapshot->xcnt))
 	{
 		/* fall through */
 	}
@@ -1702,7 +1702,7 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 	}
 
 	/* check if it's one of our txids, toplevel is also in there */
-	if (TransactionIdInArray(xmax, snapshot->subxip, snapshot->subxcnt))
+	if (TransactionIdInArray(xmax, snapshot->curxip, snapshot->curxcnt))
 	{
 		bool		resolved;
 		CommandId	cmin;
@@ -1755,7 +1755,7 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 	else if (TransactionIdFollowsOrEquals(xmax, snapshot->xmax))
 		return true;
 	/* xmax is between [xmin, xmax), check known committed array */
-	else if (TransactionIdInArray(xmax, snapshot->xip, snapshot->xcnt))
+	else if (TransactionIdInArray(xmax, snapshot->committed_xids, snapshot->xcnt))
 		return false;
 	/* xmax is between [xmin, xmax), but known not to have committed yet */
 	else
@@ -1778,7 +1778,7 @@ HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot, Buffer buffer)
 	switch (snapshot->snapshot_type)
 	{
 		case SNAPSHOT_MVCC:
-			return HeapTupleSatisfiesMVCC(htup, snapshot, buffer);
+			return HeapTupleSatisfiesMVCC(htup, &snapshot->mvcc, buffer);
 		case SNAPSHOT_SELF:
 			return HeapTupleSatisfiesSelf(htup, snapshot, buffer);
 		case SNAPSHOT_ANY:
@@ -1786,11 +1786,11 @@ HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot, Buffer buffer)
 		case SNAPSHOT_TOAST:
 			return HeapTupleSatisfiesToast(htup, snapshot, buffer);
 		case SNAPSHOT_DIRTY:
-			return HeapTupleSatisfiesDirty(htup, snapshot, buffer);
+			return HeapTupleSatisfiesDirty(htup, &snapshot->dirty, buffer);
 		case SNAPSHOT_HISTORIC_MVCC:
-			return HeapTupleSatisfiesHistoricMVCC(htup, snapshot, buffer);
+			return HeapTupleSatisfiesHistoricMVCC(htup, &snapshot->historic_mvcc, buffer);
 		case SNAPSHOT_NON_VACUUMABLE:
-			return HeapTupleSatisfiesNonVacuumable(htup, snapshot, buffer);
+			return HeapTupleSatisfiesNonVacuumable(htup, &snapshot->nonvacuumable, buffer);
 	}
 
 	return false;				/* keep compiler quiet */
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 55ec4c10352..769170a37d5 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -469,7 +469,7 @@ index_parallelscan_estimate(Relation indexRelation, int nkeys, int norderbys,
 	RELATION_CHECKS;
 
 	nbytes = offsetof(ParallelIndexScanDescData, ps_snapshot_data);
-	nbytes = add_size(nbytes, EstimateSnapshotSpace(snapshot));
+	nbytes = add_size(nbytes, EstimateSnapshotSpace(&snapshot->mvcc));
 	nbytes = MAXALIGN(nbytes);
 
 	if (instrument)
@@ -517,16 +517,17 @@ index_parallelscan_initialize(Relation heapRelation, Relation indexRelation,
 	Assert(instrument || parallel_aware);
 
 	RELATION_CHECKS;
+	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
 
 	offset = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
-					  EstimateSnapshotSpace(snapshot));
+					  EstimateSnapshotSpace((MVCCSnapshot) snapshot));
 	offset = MAXALIGN(offset);
 
 	target->ps_locator = heapRelation->rd_locator;
 	target->ps_indexlocator = indexRelation->rd_locator;
 	target->ps_offset_ins = 0;
 	target->ps_offset_am = 0;
-	SerializeSnapshot(snapshot, target->ps_snapshot_data);
+	SerializeSnapshot((MVCCSnapshot) snapshot, target->ps_snapshot_data);
 
 	if (instrument)
 	{
@@ -590,8 +591,8 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
 	Assert(RelFileLocatorEquals(heaprel->rd_locator, pscan->ps_locator));
 	Assert(RelFileLocatorEquals(indexrel->rd_locator, pscan->ps_indexlocator));
 
-	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
-	RegisterSnapshot(snapshot);
+	snapshot = (Snapshot) RestoreSnapshot(pscan->ps_snapshot_data);
+	snapshot = RegisterSnapshot(snapshot);
 	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
 									pscan, true);
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index aa82cede30a..714e4ee3f0b 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -413,7 +413,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	IndexTuple	curitup = NULL;
 	ItemId		curitemid = NULL;
 	BTScanInsert itup_key = insertstate->itup_key;
-	SnapshotData SnapshotDirty;
+	DirtySnapshotData SnapshotDirty;
 	OffsetNumber offset;
 	OffsetNumber maxoff;
 	Page		page;
@@ -558,7 +558,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				 * index entry for the entire chain.
 				 */
 				else if (table_index_fetch_tuple_check(heapRel, &htid,
-													   &SnapshotDirty,
+													   (Snapshot) &SnapshotDirty,
 													   &all_dead))
 				{
 					TransactionId xwait;
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index b3df2d89074..850ad36cd0a 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -808,7 +808,7 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
+	bds->myXmin = GetActiveSnapshot()->mvcc.xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..4eb81e40d99 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -133,7 +133,7 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 	Size		sz = 0;
 
 	if (IsMVCCSnapshot(snapshot))
-		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
+		sz = add_size(sz, EstimateSnapshotSpace((MVCCSnapshot) snapshot));
 	else
 		Assert(snapshot == SnapshotAny);
 
@@ -152,7 +152,7 @@ table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
 
 	if (IsMVCCSnapshot(snapshot))
 	{
-		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
+		SerializeSnapshot((MVCCSnapshot) snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
 	}
 	else
@@ -174,8 +174,8 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 	if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
-		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
-		RegisterSnapshot(snapshot);
+		snapshot = (Snapshot) RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
+		snapshot = RegisterSnapshot(snapshot);
 		flags |= SO_TEMP_SNAPSHOT;
 	}
 	else
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..8046e14abf7 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -275,10 +275,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		shm_toc_estimate_chunk(&pcxt->estimator, combocidlen);
 		if (IsolationUsesXactSnapshot())
 		{
-			tsnaplen = EstimateSnapshotSpace(transaction_snapshot);
+			tsnaplen = EstimateSnapshotSpace((MVCCSnapshot) transaction_snapshot);
 			shm_toc_estimate_chunk(&pcxt->estimator, tsnaplen);
 		}
-		asnaplen = EstimateSnapshotSpace(active_snapshot);
+		asnaplen = EstimateSnapshotSpace((MVCCSnapshot) active_snapshot);
 		shm_toc_estimate_chunk(&pcxt->estimator, asnaplen);
 		tstatelen = EstimateTransactionStateSpace();
 		shm_toc_estimate_chunk(&pcxt->estimator, tstatelen);
@@ -400,14 +400,14 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		if (IsolationUsesXactSnapshot())
 		{
 			tsnapspace = shm_toc_allocate(pcxt->toc, tsnaplen);
-			SerializeSnapshot(transaction_snapshot, tsnapspace);
+			SerializeSnapshot((MVCCSnapshot) transaction_snapshot, tsnapspace);
 			shm_toc_insert(pcxt->toc, PARALLEL_KEY_TRANSACTION_SNAPSHOT,
 						   tsnapspace);
 		}
 
 		/* Serialize the active snapshot. */
 		asnapspace = shm_toc_allocate(pcxt->toc, asnaplen);
-		SerializeSnapshot(active_snapshot, asnapspace);
+		SerializeSnapshot((MVCCSnapshot) active_snapshot, asnapspace);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ACTIVE_SNAPSHOT, asnapspace);
 
 		/* Provide the handle for per-session segment. */
@@ -1493,9 +1493,9 @@ ParallelWorkerMain(Datum main_arg)
 	 */
 	asnapspace = shm_toc_lookup(toc, PARALLEL_KEY_ACTIVE_SNAPSHOT, false);
 	tsnapspace = shm_toc_lookup(toc, PARALLEL_KEY_TRANSACTION_SNAPSHOT, true);
-	asnapshot = RestoreSnapshot(asnapspace);
-	tsnapshot = tsnapspace ? RestoreSnapshot(tsnapspace) : asnapshot;
-	RestoreTransactionSnapshot(tsnapshot,
+	asnapshot = (Snapshot) RestoreSnapshot(asnapspace);
+	tsnapshot = tsnapspace ? (Snapshot) RestoreSnapshot(tsnapspace) : asnapshot;
+	RestoreTransactionSnapshot((MVCCSnapshot) tsnapshot,
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 929bb53b620..b658601bf77 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -148,7 +148,7 @@ find_inheritance_children_extended(Oid parentrelId, bool omit_detached,
 				xmin = HeapTupleHeaderGetXmin(inheritsTuple->t_data);
 				snap = GetActiveSnapshot();
 
-				if (!XidInMVCCSnapshot(xmin, snap))
+				if (!XidInMVCCSnapshot(xmin, (MVCCSnapshot) snap))
 				{
 					if (detached_xmin)
 					{
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..1ffb6f5fa70 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2022,6 +2022,8 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 	bool		reachedEndOfPage;
 	AsyncQueueEntry *qe;
 
+	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
 	do
 	{
 		QueuePosition thisentry = *current;
@@ -2041,7 +2043,7 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 		/* Ignore messages destined for other databases */
 		if (qe->dboid == MyDatabaseId)
 		{
-			if (XidInMVCCSnapshot(qe->xid, snapshot))
+			if (XidInMVCCSnapshot(qe->xid, (MVCCSnapshot) snapshot))
 			{
 				/*
 				 * The source transaction is still in progress, so we can't
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 33c2106c17c..da3e02398bb 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1761,7 +1761,7 @@ DefineIndex(Oid tableId,
 	 * they must wait for.  But first, save the snapshot's xmin to use as
 	 * limitXmin for GetCurrentVirtualXIDs().
 	 */
-	limitXmin = snapshot->xmin;
+	limitXmin = snapshot->mvcc.xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
@@ -4156,7 +4156,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * We can now do away with our active snapshot, we still need to save
 		 * the xmin limit to wait for older snapshots.
 		 */
-		limitXmin = snapshot->xmin;
+		limitXmin = snapshot->mvcc.xmin;
 
 		PopActiveSnapshot();
 		UnregisterSnapshot(snapshot);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 10624353b0a..c55b5a7a014 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -20797,7 +20797,7 @@ ATExecDetachPartitionFinalize(Relation rel, RangeVar *name)
 	 * all such queries are complete (otherwise we would present them with an
 	 * inconsistent view of catalogs).
 	 */
-	WaitForOlderSnapshots(snap->xmin, false);
+	WaitForOlderSnapshots(snap->mvcc.xmin, false);
 
 	DetachPartitionFinalize(rel, partRel, true, InvalidOid);
 
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index e3fe9b78bb5..a3955792729 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -717,7 +717,7 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(index);
 	IndexScanDesc index_scan;
 	ScanKeyData scankeys[INDEX_MAX_KEYS];
-	SnapshotData DirtySnapshot;
+	DirtySnapshotData DirtySnapshot;
 	int			i;
 	bool		conflict;
 	bool		found_self;
@@ -816,7 +816,7 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, NULL, indnkeyatts, 0);
+	index_scan = index_beginscan(heap, index, (Snapshot) &DirtySnapshot, NULL, indnkeyatts, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
 	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index ede89ea3cf9..84aa7c3268c 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -184,7 +184,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	ScanKeyData skey[INDEX_MAX_KEYS];
 	int			skey_attoff;
 	IndexScanDesc scan;
-	SnapshotData snap;
+	DirtySnapshotData snap;
 	TransactionId xwait;
 	Relation	idxrel;
 	bool		found;
@@ -202,7 +202,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
 
 	/* Start an index scan. */
-	scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+	scan = index_beginscan(rel, idxrel, (Snapshot) &snap, NULL, skey_attoff, 0);
 
 retry:
 	found = false;
@@ -357,7 +357,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
 {
 	TupleTableSlot *scanslot;
 	TableScanDesc scan;
-	SnapshotData snap;
+	DirtySnapshotData snap;
 	TypeCacheEntry **eq;
 	TransactionId xwait;
 	bool		found;
@@ -369,7 +369,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
 
 	/* Start a heap scan. */
 	InitDirtySnapshot(snap);
-	scan = table_beginscan(rel, &snap, 0, NULL);
+	scan = table_beginscan(rel, (Snapshot) &snap, 0, NULL);
 	scanslot = table_slot_create(rel, NULL);
 
 retry:
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 328b4d450e4..7c15c634181 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -102,7 +102,7 @@ RelationGetPartitionDesc(Relation rel, bool omit_detached)
 		Assert(TransactionIdIsValid(rel->rd_partdesc_nodetached_xmin));
 		activesnap = GetActiveSnapshot();
 
-		if (!XidInMVCCSnapshot(rel->rd_partdesc_nodetached_xmin, activesnap))
+		if (!XidInMVCCSnapshot(rel->rd_partdesc_nodetached_xmin, &activesnap->mvcc))
 			return rel->rd_partdesc_nodetached;
 	}
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 78f9a0a11c4..6a428e9720e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -586,7 +586,7 @@ logicalmsg_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	TransactionId xid = XLogRecGetXid(r);
 	uint8		info = XLogRecGetInfo(r) & ~XLR_INFO_MASK;
 	RepOriginId origin_id = XLogRecGetOrigin(r);
-	Snapshot	snapshot = NULL;
+	HistoricMVCCSnapshot snapshot = NULL;
 	xl_logical_message *message;
 
 	if (info != XLOG_LOGICAL_MESSAGE)
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 6583dd497da..51fc6460251 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -260,7 +260,7 @@ replorigin_create(const char *roname)
 	HeapTuple	tuple = NULL;
 	Relation	rel;
 	Datum		roname_d;
-	SnapshotData SnapshotDirty;
+	DirtySnapshotData SnapshotDirty;
 	SysScanDesc scan;
 	ScanKeyData key;
 
@@ -302,7 +302,7 @@ replorigin_create(const char *roname)
 
 		scan = systable_beginscan(rel, ReplicationOriginIdentIndex,
 								  true /* indexOK */ ,
-								  &SnapshotDirty,
+								  (Snapshot) &SnapshotDirty,
 								  1, &key);
 
 		collides = HeapTupleIsValid(systable_getnext(scan));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 977fbcd2474..e8196a8d5d5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -268,9 +268,9 @@ static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
 static int	ReorderBufferTXNSizeCompare(const pairingheap_node *a, const pairingheap_node *b, void *arg);
 
-static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
-static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
-									  ReorderBufferTXN *txn, CommandId cid);
+static void ReorderBufferFreeSnap(ReorderBuffer *rb, HistoricMVCCSnapshot snap);
+static HistoricMVCCSnapshot ReorderBufferCopySnap(ReorderBuffer *rb, HistoricMVCCSnapshot orig_snap,
+												  ReorderBufferTXN *txn, CommandId cid);
 
 /*
  * ---------------------------------------
@@ -852,7 +852,7 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
  */
 void
 ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
-						  Snapshot snap, XLogRecPtr lsn,
+						  HistoricMVCCSnapshot snap, XLogRecPtr lsn,
 						  bool transactional, const char *prefix,
 						  Size message_size, const char *message)
 {
@@ -886,7 +886,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 	else
 	{
 		ReorderBufferTXN *txn = NULL;
-		volatile Snapshot snapshot_now = snap;
+		volatile HistoricMVCCSnapshot snapshot_now = snap;
 
 		/* Non-transactional changes require a valid snapshot. */
 		Assert(snapshot_now);
@@ -1886,55 +1886,55 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * that catalog modifying transactions can look into intermediate catalog
  * states.
  */
-static Snapshot
-ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
+static HistoricMVCCSnapshot
+ReorderBufferCopySnap(ReorderBuffer *rb, HistoricMVCCSnapshot orig_snap,
 					  ReorderBufferTXN *txn, CommandId cid)
 {
-	Snapshot	snap;
+	HistoricMVCCSnapshot snap;
 	dlist_iter	iter;
 	int			i = 0;
 	Size		size;
 
-	size = sizeof(SnapshotData) +
+	size = sizeof(HistoricMVCCSnapshotData) +
 		sizeof(TransactionId) * orig_snap->xcnt +
 		sizeof(TransactionId) * (txn->nsubtxns + 1);
 
 	snap = MemoryContextAllocZero(rb->context, size);
-	memcpy(snap, orig_snap, sizeof(SnapshotData));
+	memcpy(snap, orig_snap, sizeof(HistoricMVCCSnapshotData));
 
 	snap->copied = true;
-	snap->active_count = 1;		/* mark as active so nobody frees it */
+	snap->refcount = 1;			/* mark as active so nobody frees it */
 	snap->regd_count = 0;
-	snap->xip = (TransactionId *) (snap + 1);
+	snap->committed_xids = (TransactionId *) (snap + 1);
 
-	memcpy(snap->xip, orig_snap->xip, sizeof(TransactionId) * snap->xcnt);
+	memcpy(snap->committed_xids, orig_snap->committed_xids, sizeof(TransactionId) * snap->xcnt);
 
 	/*
-	 * snap->subxip contains all txids that belong to our transaction which we
+	 * snap->curxip contains all txids that belong to our transaction which we
 	 * need to check via cmin/cmax. That's why we store the toplevel
 	 * transaction in there as well.
 	 */
-	snap->subxip = snap->xip + snap->xcnt;
-	snap->subxip[i++] = txn->xid;
+	snap->curxip = snap->committed_xids + snap->xcnt;
+	snap->curxip[i++] = txn->xid;
 
 	/*
 	 * txn->nsubtxns isn't decreased when subtransactions abort, so count
 	 * manually. Since it's an upper boundary it is safe to use it for the
 	 * allocation above.
 	 */
-	snap->subxcnt = 1;
+	snap->curxcnt = 1;
 
 	dlist_foreach(iter, &txn->subtxns)
 	{
 		ReorderBufferTXN *sub_txn;
 
 		sub_txn = dlist_container(ReorderBufferTXN, node, iter.cur);
-		snap->subxip[i++] = sub_txn->xid;
-		snap->subxcnt++;
+		snap->curxip[i++] = sub_txn->xid;
+		snap->curxcnt++;
 	}
 
 	/* sort so we can bsearch() later */
-	qsort(snap->subxip, snap->subxcnt, sizeof(TransactionId), xidComparator);
+	qsort(snap->curxip, snap->curxcnt, sizeof(TransactionId), xidComparator);
 
 	/* store the specified current CommandId */
 	snap->curcid = cid;
@@ -1946,7 +1946,7 @@ ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
  * Free a previously ReorderBufferCopySnap'ed snapshot
  */
 static void
-ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
+ReorderBufferFreeSnap(ReorderBuffer *rb, HistoricMVCCSnapshot snap)
 {
 	if (snap->copied)
 		pfree(snap);
@@ -2099,7 +2099,7 @@ ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
  */
 static inline void
 ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
-							 Snapshot snapshot_now, CommandId command_id)
+							 HistoricMVCCSnapshot snapshot_now, CommandId command_id)
 {
 	txn->command_id = command_id;
 
@@ -2144,7 +2144,7 @@ ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn)
  */
 static void
 ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
-					  Snapshot snapshot_now,
+					  HistoricMVCCSnapshot snapshot_now,
 					  CommandId command_id,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
@@ -2191,7 +2191,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void
 ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn,
-						volatile Snapshot snapshot_now,
+						volatile HistoricMVCCSnapshot snapshot_now,
 						volatile CommandId command_id,
 						bool streaming)
 {
@@ -2779,7 +2779,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 					TimestampTz commit_time,
 					RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	Snapshot	snapshot_now;
+	HistoricMVCCSnapshot snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
 	txn->final_lsn = commit_lsn;
@@ -3251,7 +3251,7 @@ ReorderBufferProcessXid(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
  */
 void
 ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
-						 XLogRecPtr lsn, Snapshot snap)
+						 XLogRecPtr lsn, HistoricMVCCSnapshot snap)
 {
 	ReorderBufferChange *change = ReorderBufferAllocChange(rb);
 
@@ -3269,7 +3269,7 @@ ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
  */
 void
 ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
-							 XLogRecPtr lsn, Snapshot snap)
+							 XLogRecPtr lsn, HistoricMVCCSnapshot snap)
 {
 	ReorderBufferTXN *txn;
 	bool		is_new;
@@ -4043,14 +4043,14 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
-				Snapshot	snap;
+				HistoricMVCCSnapshot snap;
 				char	   *data;
 
 				snap = change->data.snapshot;
 
-				sz += sizeof(SnapshotData) +
+				sz += sizeof(HistoricMVCCSnapshotData) +
 					sizeof(TransactionId) * snap->xcnt +
-					sizeof(TransactionId) * snap->subxcnt;
+					sizeof(TransactionId) * snap->curxcnt;
 
 				/* make sure we have enough space */
 				ReorderBufferSerializeReserve(rb, sz);
@@ -4058,21 +4058,21 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				/* might have been reallocated above */
 				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
 
-				memcpy(data, snap, sizeof(SnapshotData));
-				data += sizeof(SnapshotData);
+				memcpy(data, snap, sizeof(HistoricMVCCSnapshotData));
+				data += sizeof(HistoricMVCCSnapshotData);
 
 				if (snap->xcnt)
 				{
-					memcpy(data, snap->xip,
+					memcpy(data, snap->committed_xids,
 						   sizeof(TransactionId) * snap->xcnt);
 					data += sizeof(TransactionId) * snap->xcnt;
 				}
 
-				if (snap->subxcnt)
+				if (snap->curxcnt)
 				{
-					memcpy(data, snap->subxip,
-						   sizeof(TransactionId) * snap->subxcnt);
-					data += sizeof(TransactionId) * snap->subxcnt;
+					memcpy(data, snap->curxip,
+						   sizeof(TransactionId) * snap->curxcnt);
+					data += sizeof(TransactionId) * snap->curxcnt;
 				}
 				break;
 			}
@@ -4177,7 +4177,7 @@ ReorderBufferCanStartStreaming(ReorderBuffer *rb)
 static void
 ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
-	Snapshot	snapshot_now;
+	HistoricMVCCSnapshot snapshot_now;
 	CommandId	command_id;
 	Size		stream_bytes;
 	bool		txn_is_streamed;
@@ -4196,10 +4196,10 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * After that we need to reuse the snapshot from the previous run.
 	 *
 	 * Unlike DecodeCommit which adds xids of all the subtransactions in
-	 * snapshot's xip array via SnapBuildCommitTxn, we can't do that here but
-	 * we do add them to subxip array instead via ReorderBufferCopySnap. This
-	 * allows the catalog changes made in subtransactions decoded till now to
-	 * be visible.
+	 * snapshot's committed_xids array via SnapBuildCommitTxn, we can't do
+	 * that here but we do add them to curxip array instead via
+	 * ReorderBufferCopySnap. This allows the catalog changes made in
+	 * subtransactions decoded till now to be visible.
 	 */
 	if (txn->snapshot_now == NULL)
 	{
@@ -4345,13 +4345,13 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
-				Snapshot	snap;
+				HistoricMVCCSnapshot snap;
 
 				snap = change->data.snapshot;
 
-				sz += sizeof(SnapshotData) +
+				sz += sizeof(HistoricMVCCSnapshotData) +
 					sizeof(TransactionId) * snap->xcnt +
-					sizeof(TransactionId) * snap->subxcnt;
+					sizeof(TransactionId) * snap->curxcnt;
 
 				break;
 			}
@@ -4629,24 +4629,24 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			{
-				Snapshot	oldsnap;
-				Snapshot	newsnap;
+				HistoricMVCCSnapshot oldsnap;
+				HistoricMVCCSnapshot newsnap;
 				Size		size;
 
-				oldsnap = (Snapshot) data;
+				oldsnap = (HistoricMVCCSnapshot) data;
 
-				size = sizeof(SnapshotData) +
+				size = sizeof(HistoricMVCCSnapshotData) +
 					sizeof(TransactionId) * oldsnap->xcnt +
-					sizeof(TransactionId) * (oldsnap->subxcnt + 0);
+					sizeof(TransactionId) * (oldsnap->curxcnt + 0);
 
 				change->data.snapshot = MemoryContextAllocZero(rb->context, size);
 
 				newsnap = change->data.snapshot;
 
 				memcpy(newsnap, data, size);
-				newsnap->xip = (TransactionId *)
-					(((char *) newsnap) + sizeof(SnapshotData));
-				newsnap->subxip = newsnap->xip + newsnap->xcnt;
+				newsnap->committed_xids = (TransactionId *)
+					(((char *) newsnap) + sizeof(HistoricMVCCSnapshotData));
+				newsnap->curxip = newsnap->committed_xids + newsnap->xcnt;
 				newsnap->copied = true;
 				break;
 			}
@@ -5316,7 +5316,7 @@ file_sort_by_lsn(const ListCell *a_p, const ListCell *b_p)
  * transaction for relid.
  */
 static void
-UpdateLogicalMappings(HTAB *tuplecid_data, Oid relid, Snapshot snapshot)
+UpdateLogicalMappings(HTAB *tuplecid_data, Oid relid, HistoricMVCCSnapshot snapshot)
 {
 	DIR		   *mapping_dir;
 	struct dirent *mapping_de;
@@ -5364,7 +5364,7 @@ UpdateLogicalMappings(HTAB *tuplecid_data, Oid relid, Snapshot snapshot)
 			continue;
 
 		/* not for our transaction */
-		if (!TransactionIdInArray(f_mapped_xid, snapshot->subxip, snapshot->subxcnt))
+		if (!TransactionIdInArray(f_mapped_xid, snapshot->curxip, snapshot->curxcnt))
 			continue;
 
 		/* ok, relevant, queue for apply */
@@ -5383,7 +5383,7 @@ UpdateLogicalMappings(HTAB *tuplecid_data, Oid relid, Snapshot snapshot)
 		RewriteMappingFile *f = (RewriteMappingFile *) lfirst(file);
 
 		elog(DEBUG1, "applying mapping: \"%s\" in %u", f->fname,
-			 snapshot->subxip[0]);
+			 snapshot->curxip[0]);
 		ApplyLogicalMappingFile(tuplecid_data, relid, f->fname);
 		pfree(f);
 	}
@@ -5395,7 +5395,7 @@ UpdateLogicalMappings(HTAB *tuplecid_data, Oid relid, Snapshot snapshot)
  */
 bool
 ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
-							  Snapshot snapshot,
+							  HistoricMVCCSnapshot snapshot,
 							  HeapTuple htup, Buffer buffer,
 							  CommandId *cmin, CommandId *cmax)
 {
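
Reviewer note on the allocation layout used by ReorderBufferCopySnap() and the serialize/restore paths above: the struct, the committed_xids array and the curxip array live in one contiguous allocation, and the two pointers are set to point into it. The sketch below reproduces only that layout in standalone C. The `Demo*` names, the reduced field set, and the use of calloc() in place of MemoryContextAllocZero() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t DemoXid;		/* hypothetical stand-in for TransactionId */

typedef struct DemoHistoricSnapshot
{
	DemoXid    *committed_xids;	/* points just past the struct */
	uint32_t	xcnt;
	DemoXid    *curxip;			/* points just past committed_xids */
	uint32_t	curxcnt;
} DemoHistoricSnapshot;

/* Plain numeric ordering, so the array can be searched with bsearch(). */
static int
demo_xid_cmp(const void *a, const void *b)
{
	DemoXid		x = *(const DemoXid *) a;
	DemoXid		y = *(const DemoXid *) b;

	return (x > y) - (x < y);
}

/*
 * Copy orig into a single allocation and fill curxip with the top-level
 * xid followed by the subtransaction xids, sorted.
 */
static DemoHistoricSnapshot *
demo_copy_snapshot(const DemoHistoricSnapshot *orig, DemoXid top_xid,
				   const DemoXid *sub_xids, uint32_t nsub)
{
	size_t		size;
	DemoHistoricSnapshot *snap;

	assert(orig != NULL);

	size = sizeof(DemoHistoricSnapshot) +
		sizeof(DemoXid) * orig->xcnt +
		sizeof(DemoXid) * (nsub + 1);

	snap = calloc(1, size);
	if (snap == NULL)
		return NULL;
	memcpy(snap, orig, sizeof(DemoHistoricSnapshot));

	/* first trailing array: committed xids */
	snap->committed_xids = (DemoXid *) (snap + 1);
	if (snap->xcnt > 0)
		memcpy(snap->committed_xids, orig->committed_xids,
			   sizeof(DemoXid) * snap->xcnt);

	/* second trailing array: xids of the current transaction */
	snap->curxip = snap->committed_xids + snap->xcnt;
	snap->curxip[0] = top_xid;
	if (nsub > 0)
		memcpy(snap->curxip + 1, sub_xids, sizeof(DemoXid) * nsub);
	snap->curxcnt = nsub + 1;

	qsort(snap->curxip, snap->curxcnt, sizeof(DemoXid), demo_xid_cmp);
	return snap;
}
```

Because the arrays are addressed through pointers into the same block, a byte-wise copy of the block (as in the restore path) must re-point committed_xids and curxip afterwards, which is what the ReorderBufferRestoreChange() hunk does.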
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index b64e53de017..7a341418a74 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -155,11 +155,11 @@ static bool ExportInProgress = false;
 static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static HistoricMVCCSnapshot SnapBuildBuildSnapshot(SnapBuild *builder);
 
-static void SnapBuildFreeSnapshot(Snapshot snap);
+static void SnapBuildFreeSnapshot(HistoricMVCCSnapshot snap);
 
-static void SnapBuildSnapIncRefcount(Snapshot snap);
+static void SnapBuildSnapIncRefcount(HistoricMVCCSnapshot snap);
 
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
@@ -249,23 +249,21 @@ FreeSnapshotBuilder(SnapBuild *builder)
  * Free an unreferenced snapshot that has previously been built by us.
  */
 static void
-SnapBuildFreeSnapshot(Snapshot snap)
+SnapBuildFreeSnapshot(HistoricMVCCSnapshot snap)
 {
 	/* make sure we don't get passed an external snapshot */
 	Assert(snap->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
 
 	/* make sure nobody modified our snapshot */
 	Assert(snap->curcid == FirstCommandId);
-	Assert(!snap->suboverflowed);
-	Assert(!snap->takenDuringRecovery);
 	Assert(snap->regd_count == 0);
 
 	/* slightly more likely, so it's checked even without c-asserts */
 	if (snap->copied)
 		elog(ERROR, "cannot free a copied snapshot");
 
-	if (snap->active_count)
-		elog(ERROR, "cannot free an active snapshot");
+	if (snap->refcount)
+		elog(ERROR, "cannot free a snapshot that's in use");
 
 	pfree(snap);
 }
@@ -313,9 +311,9 @@ SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
  * adding a Snapshot as builder->snapshot.
  */
 static void
-SnapBuildSnapIncRefcount(Snapshot snap)
+SnapBuildSnapIncRefcount(HistoricMVCCSnapshot snap)
 {
-	snap->active_count++;
+	snap->refcount++;
 }
 
 /*
@@ -325,26 +323,23 @@ SnapBuildSnapIncRefcount(Snapshot snap)
  * IncRef'ed Snapshot can adjust its refcount easily.
  */
 void
-SnapBuildSnapDecRefcount(Snapshot snap)
+SnapBuildSnapDecRefcount(HistoricMVCCSnapshot snap)
 {
 	/* make sure we don't get passed an external snapshot */
 	Assert(snap->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
 
 	/* make sure nobody modified our snapshot */
 	Assert(snap->curcid == FirstCommandId);
-	Assert(!snap->suboverflowed);
-	Assert(!snap->takenDuringRecovery);
 
+	Assert(snap->refcount > 0);
 	Assert(snap->regd_count == 0);
 
-	Assert(snap->active_count > 0);
-
 	/* slightly more likely, so it's checked even without casserts */
 	if (snap->copied)
 		elog(ERROR, "cannot free a copied snapshot");
 
-	snap->active_count--;
-	if (snap->active_count == 0)
+	snap->refcount--;
+	if (snap->refcount == 0)
 		SnapBuildFreeSnapshot(snap);
 }
 
@@ -356,15 +351,15 @@ SnapBuildSnapDecRefcount(Snapshot snap)
  * these snapshots; they have to copy them and fill in appropriate ->curcid
- * and ->subxip/subxcnt values.
+ * and ->curxip/curxcnt values.
  */
-static Snapshot
+static HistoricMVCCSnapshot
 SnapBuildBuildSnapshot(SnapBuild *builder)
 {
-	Snapshot	snapshot;
+	HistoricMVCCSnapshot snapshot;
 	Size		ssize;
 
 	Assert(builder->state >= SNAPBUILD_FULL_SNAPSHOT);
 
-	ssize = sizeof(SnapshotData)
+	ssize = sizeof(HistoricMVCCSnapshotData)
 		+ sizeof(TransactionId) * builder->committed.xcnt
 		+ sizeof(TransactionId) * 1 /* toplevel xid */ ;
 
@@ -400,31 +395,28 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->xmax = builder->xmax;
 
 	/* store all transactions to be treated as committed by this snapshot */
-	snapshot->xip =
-		(TransactionId *) ((char *) snapshot + sizeof(SnapshotData));
+	snapshot->committed_xids =
+		(TransactionId *) ((char *) snapshot + sizeof(HistoricMVCCSnapshotData));
 	snapshot->xcnt = builder->committed.xcnt;
-	memcpy(snapshot->xip,
+	memcpy(snapshot->committed_xids,
 		   builder->committed.xip,
 		   builder->committed.xcnt * sizeof(TransactionId));
 
 	/* sort so we can bsearch() */
-	qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
+	qsort(snapshot->committed_xids, snapshot->xcnt, sizeof(TransactionId), xidComparator);
 
 	/*
-	 * Initially, subxip is empty, i.e. it's a snapshot to be used by
+	 * Initially, curxip is empty, i.e. it's a snapshot to be used by
 	 * transactions that don't modify the catalog. Will be filled by
 	 * ReorderBufferCopySnap() if necessary.
 	 */
-	snapshot->subxcnt = 0;
-	snapshot->subxip = NULL;
+	snapshot->curxcnt = 0;
+	snapshot->curxip = NULL;
 
-	snapshot->suboverflowed = false;
-	snapshot->takenDuringRecovery = false;
 	snapshot->copied = false;
 	snapshot->curcid = FirstCommandId;
-	snapshot->active_count = 0;
+	snapshot->refcount = 0;
 	snapshot->regd_count = 0;
-	snapshot->snapXactCompletionCount = 0;
 
 	return snapshot;
 }
@@ -436,13 +428,13 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
  * The snapshot will be usable directly in current transaction or exported
  * for loading in different transaction.
  */
-Snapshot
+MVCCSnapshot
 SnapBuildInitialSnapshot(SnapBuild *builder)
 {
-	Snapshot	snap;
+	HistoricMVCCSnapshot historicsnap;
+	MVCCSnapshot mvccsnap;
 	TransactionId xid;
 	TransactionId safeXid;
-	TransactionId *newxip;
 	int			newxcnt = 0;
 
 	Assert(XactIsoLevel == XACT_REPEATABLE_READ);
@@ -464,10 +456,10 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	if (TransactionIdIsValid(MyProc->xmin))
 		elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
 
-	snap = SnapBuildBuildSnapshot(builder);
+	historicsnap = SnapBuildBuildSnapshot(builder);
 
 	/*
-	 * We know that snap->xmin is alive, enforced by the logical xmin
+	 * We know that historicsnap->xmin is alive, enforced by the logical xmin
 	 * mechanism. Due to that we can do this without locks, we're only
 	 * changing our own value.
 	 *
@@ -479,15 +471,18 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	safeXid = GetOldestSafeDecodingTransactionId(false);
 	LWLockRelease(ProcArrayLock);
 
-	if (TransactionIdFollows(safeXid, snap->xmin))
+	if (TransactionIdFollows(safeXid, historicsnap->xmin))
 		elog(ERROR, "cannot build an initial slot snapshot as oldest safe xid %u follows snapshot's xmin %u",
-			 safeXid, snap->xmin);
+			 safeXid, historicsnap->xmin);
 
-	MyProc->xmin = snap->xmin;
+	MyProc->xmin = historicsnap->xmin;
 
 	/* allocate in transaction context */
-	newxip = (TransactionId *)
-		palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
+	mvccsnap = palloc(sizeof(MVCCSnapshotData) + sizeof(TransactionId) * GetMaxSnapshotXidCount());
+	mvccsnap->snapshot_type = SNAPSHOT_MVCC;
+	mvccsnap->xmin = historicsnap->xmin;
+	mvccsnap->xmax = historicsnap->xmax;
+	mvccsnap->xip = (TransactionId *) ((char *) mvccsnap + sizeof(MVCCSnapshotData));
 
 	/*
 	 * snapbuild.c builds transactions in an "inverted" manner, which means it
@@ -495,15 +490,15 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	 * classical snapshot by marking all non-committed transactions as
 	 * in-progress. This can be expensive.
 	 */
-	for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+	for (xid = historicsnap->xmin; NormalTransactionIdPrecedes(xid, historicsnap->xmax);)
 	{
 		void	   *test;
 
 		/*
-		 * Check whether transaction committed using the decoding snapshot
-		 * meaning of ->xip.
+		 * Check whether transaction committed using the decoding snapshot's
+		 * committed_xids array.
 		 */
-		test = bsearch(&xid, snap->xip, snap->xcnt,
+		test = bsearch(&xid, historicsnap->committed_xids, historicsnap->xcnt,
 					   sizeof(TransactionId), xidComparator);
 
 		if (test == NULL)
@@ -513,18 +508,27 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 						 errmsg("initial slot snapshot too large")));
 
-			newxip[newxcnt++] = xid;
+			mvccsnap->xip[newxcnt++] = xid;
 		}
 
 		TransactionIdAdvance(xid);
 	}
-
-	/* adjust remaining snapshot fields as needed */
-	snap->snapshot_type = SNAPSHOT_MVCC;
-	snap->xcnt = newxcnt;
-	snap->xip = newxip;
-
-	return snap;
+	mvccsnap->xcnt = newxcnt;
+
+	/* Initialize remaining MVCCSnapshot fields */
+	mvccsnap->subxip = NULL;
+	mvccsnap->subxcnt = 0;
+	mvccsnap->suboverflowed = false;
+	mvccsnap->takenDuringRecovery = false;
+	mvccsnap->copied = true;
+	mvccsnap->curcid = FirstCommandId;
+	mvccsnap->active_count = 0;
+	mvccsnap->regd_count = 0;
+	mvccsnap->snapXactCompletionCount = 0;
+
+	pfree(historicsnap);
+
+	return mvccsnap;
 }
 
 /*
@@ -538,7 +542,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 const char *
 SnapBuildExportSnapshot(SnapBuild *builder)
 {
-	Snapshot	snap;
+	MVCCSnapshot snap;
 	char	   *snapname;
 
 	if (IsTransactionOrTransactionBlock())
@@ -575,7 +579,7 @@ SnapBuildExportSnapshot(SnapBuild *builder)
 /*
  * Ensure there is a snapshot and if not build one for current transaction.
  */
-Snapshot
+HistoricMVCCSnapshot
 SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
 {
 	Assert(builder->state == SNAPBUILD_CONSISTENT);
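
Reviewer note on the conversion loop in SnapBuildInitialSnapshot() above: the historic snapshot lists the XIDs that committed, while a regular MVCC snapshot lists the XIDs still in progress, so the function walks every XID between xmin and xmax and records the ones that are not found in committed_xids. The sketch below shows that inversion in standalone C. It is a simplification under stated assumptions: the `demo_*` names are hypothetical, XIDs are compared as plain integers, and wraparound and special XIDs (which the real code handles through NormalTransactionIdPrecedes() and TransactionIdAdvance()) are ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t DemoXid;		/* hypothetical stand-in for TransactionId */

/* Plain numeric ordering; the committed array must be sorted this way. */
static int
demo_xid_cmp(const void *a, const void *b)
{
	DemoXid		x = *(const DemoXid *) a;
	DemoXid		y = *(const DemoXid *) b;

	return (x > y) - (x < y);
}

/*
 * Collect every xid in [xmin, xmax) that is absent from the sorted
 * committed array.  Returns the number of xids written to in_progress,
 * or -1 if they do not fit in capacity (the "snapshot too large" case).
 */
static int
demo_invert_snapshot(DemoXid xmin, DemoXid xmax,
					 const DemoXid *committed, size_t ncommitted,
					 DemoXid *in_progress, size_t capacity)
{
	size_t		n = 0;

	assert(committed != NULL && in_progress != NULL);

	for (DemoXid xid = xmin; xid < xmax; xid++)
	{
		/* not found among the committed xids: treat as in progress */
		if (bsearch(&xid, committed, ncommitted,
					sizeof(DemoXid), demo_xid_cmp) == NULL)
		{
			if (n >= capacity)
				return -1;
			in_progress[n++] = xid;
		}
	}
	return (int) n;
}
```

The cost is one binary search per XID in the range, which is why the comment in the patch warns that this can be expensive when xmin and xmax are far apart.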
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1028919aecb..1a7a35e25eb 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1307,7 +1307,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		}
 		else if (snapshot_action == CRS_USE_SNAPSHOT)
 		{
-			Snapshot	snap;
+			MVCCSnapshot snap;
 
 			snap = SnapBuildInitialSnapshot(ctx->snapshot_builder);
 			RestoreTransactionSnapshot(snap, MyProc);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e5b945a9ee3..535755614a9 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2092,7 +2092,7 @@ GetMaxSnapshotSubxidCount(void)
  * least in the case we already hold a snapshot), but that's for another day.
  */
 static bool
-GetSnapshotDataReuse(Snapshot snapshot)
+GetSnapshotDataReuse(MVCCSnapshot snapshot)
 {
 	uint64		curXactCompletionCount;
 
@@ -2171,8 +2171,8 @@ GetSnapshotDataReuse(Snapshot snapshot)
  * Note: this function should probably not be called with an argument that's
  * not statically allocated (see xip allocation below).
  */
-Snapshot
-GetSnapshotData(Snapshot snapshot)
+MVCCSnapshot
+GetSnapshotData(MVCCSnapshot snapshot)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId *other_xids = ProcGlobal->xids;
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a053981..dd52782ff22 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -449,10 +449,10 @@ static void SerialSetActiveSerXmin(TransactionId xid);
 
 static uint32 predicatelock_hash(const void *key, Size keysize);
 static void SummarizeOldestCommittedSxact(void);
-static Snapshot GetSafeSnapshot(Snapshot origSnapshot);
-static Snapshot GetSerializableTransactionSnapshotInt(Snapshot snapshot,
-													  VirtualTransactionId *sourcevxid,
-													  int sourcepid);
+static MVCCSnapshot GetSafeSnapshot(MVCCSnapshot origSnapshot);
+static MVCCSnapshot GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
+														  VirtualTransactionId *sourcevxid,
+														  int sourcepid);
 static bool PredicateLockExists(const PREDICATELOCKTARGETTAG *targettag);
 static bool GetParentPredicateLockTag(const PREDICATELOCKTARGETTAG *tag,
 									  PREDICATELOCKTARGETTAG *parent);
@@ -1544,10 +1544,10 @@ SummarizeOldestCommittedSxact(void)
  *		for), the passed-in Snapshot pointer should reference a static data
  *		area that can safely be passed to GetSnapshotData.
  */
-static Snapshot
-GetSafeSnapshot(Snapshot origSnapshot)
+static MVCCSnapshot
+GetSafeSnapshot(MVCCSnapshot origSnapshot)
 {
-	Snapshot	snapshot;
+	MVCCSnapshot snapshot;
 
 	Assert(XactReadOnly && XactDeferrable);
 
@@ -1668,8 +1668,8 @@ GetSafeSnapshotBlockingPids(int blocked_pid, int *output, int output_size)
  * always this same pointer; no new snapshot data structure is allocated
  * within this function.
  */
-Snapshot
-GetSerializableTransactionSnapshot(Snapshot snapshot)
+MVCCSnapshot
+GetSerializableTransactionSnapshot(MVCCSnapshot snapshot)
 {
 	Assert(IsolationIsSerializable());
 
@@ -1709,7 +1709,7 @@ GetSerializableTransactionSnapshot(Snapshot snapshot)
  * read-only.
  */
 void
-SetSerializableTransactionSnapshot(Snapshot snapshot,
+SetSerializableTransactionSnapshot(MVCCSnapshot snapshot,
 								   VirtualTransactionId *sourcevxid,
 								   int sourcepid)
 {
@@ -1750,8 +1750,8 @@ SetSerializableTransactionSnapshot(Snapshot snapshot,
  * source xact is still running after we acquire SerializableXactHashLock.
  * We do that by calling ProcArrayInstallImportedXmin.
  */
-static Snapshot
-GetSerializableTransactionSnapshotInt(Snapshot snapshot,
+static MVCCSnapshot
+GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
 									  VirtualTransactionId *sourcevxid,
 									  int sourcepid)
 {
@@ -3961,12 +3961,12 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
 static bool
 XidIsConcurrent(TransactionId xid)
 {
-	Snapshot	snap;
+	MVCCSnapshot snap;
 
 	Assert(TransactionIdIsValid(xid));
 	Assert(!TransactionIdEquals(xid, GetTopTransactionIdIfAny()));
 
-	snap = GetTransactionSnapshot();
+	snap = (MVCCSnapshot) GetTransactionSnapshot();
 
 	if (TransactionIdPrecedes(xid, snap->xmin))
 		return false;
@@ -4214,7 +4214,7 @@ CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
 		}
 		else if (!SxactIsDoomed(sxact)
 				 && (!SxactIsCommitted(sxact)
-					 || TransactionIdPrecedes(GetTransactionSnapshot()->xmin,
+					 || TransactionIdPrecedes(TransactionXmin,
 											  sxact->finishedBefore))
 				 && !RWConflictExists(sxact, MySerializableXact))
 		{
@@ -4227,7 +4227,7 @@ CheckTargetForConflictsIn(PREDICATELOCKTARGETTAG *targettag)
 			 */
 			if (!SxactIsDoomed(sxact)
 				&& (!SxactIsCommitted(sxact)
-					|| TransactionIdPrecedes(GetTransactionSnapshot()->xmin,
+					|| TransactionIdPrecedes(TransactionXmin,
 											 sxact->finishedBefore))
 				&& !RWConflictExists(sxact, MySerializableXact))
 			{
diff --git a/src/backend/utils/adt/xid8funcs.c b/src/backend/utils/adt/xid8funcs.c
index 1da3964ca6f..d4aa8ef9e4e 100644
--- a/src/backend/utils/adt/xid8funcs.c
+++ b/src/backend/utils/adt/xid8funcs.c
@@ -372,10 +372,10 @@ pg_current_snapshot(PG_FUNCTION_ARGS)
 	pg_snapshot *snap;
 	uint32		nxip,
 				i;
-	Snapshot	cur;
+	MVCCSnapshot cur;
 	FullTransactionId next_fxid = ReadNextFullTransactionId();
 
-	cur = GetActiveSnapshot();
+	cur = (MVCCSnapshot) GetActiveSnapshot();
 	if (cur == NULL)
 		elog(ERROR, "no active snapshot set");
 
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..78adb6d575a 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -137,18 +137,18 @@
  * These SnapshotData structs are static to simplify memory allocation
  * (see the hack in GetSnapshotData to avoid repeated malloc/free).
  */
-static SnapshotData CurrentSnapshotData = {SNAPSHOT_MVCC};
-static SnapshotData SecondarySnapshotData = {SNAPSHOT_MVCC};
-static SnapshotData CatalogSnapshotData = {SNAPSHOT_MVCC};
+static MVCCSnapshotData CurrentSnapshotData = {SNAPSHOT_MVCC};
+static MVCCSnapshotData SecondarySnapshotData = {SNAPSHOT_MVCC};
+static MVCCSnapshotData CatalogSnapshotData = {SNAPSHOT_MVCC};
 SnapshotData SnapshotSelfData = {SNAPSHOT_SELF};
 SnapshotData SnapshotAnyData = {SNAPSHOT_ANY};
 SnapshotData SnapshotToastData = {SNAPSHOT_TOAST};
 
 /* Pointers to valid snapshots */
-static Snapshot CurrentSnapshot = NULL;
-static Snapshot SecondarySnapshot = NULL;
-static Snapshot CatalogSnapshot = NULL;
-static Snapshot HistoricSnapshot = NULL;
+static MVCCSnapshot CurrentSnapshot = NULL;
+static MVCCSnapshot SecondarySnapshot = NULL;
+static MVCCSnapshot CatalogSnapshot = NULL;
+static HistoricMVCCSnapshot HistoricSnapshot = NULL;
 
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
@@ -171,7 +171,7 @@ static HTAB *tuplecid_data = NULL;
  */
 typedef struct ActiveSnapshotElt
 {
-	Snapshot	as_snap;
+	MVCCSnapshot as_snap;
 	int			as_level;
 	struct ActiveSnapshotElt *as_next;
 } ActiveSnapshotElt;
@@ -196,7 +196,7 @@ bool		FirstSnapshotSet = false;
  * FirstSnapshotSet in combination with IsolationUsesXactSnapshot(), because
  * GUC may be reset before us, changing the value of IsolationUsesXactSnapshot.
  */
-static Snapshot FirstXactSnapshot = NULL;
+static MVCCSnapshot FirstXactSnapshot = NULL;
 
 /* Define pathname of exported-snapshot files */
 #define SNAPSHOT_EXPORT_DIR "pg_snapshots"
@@ -205,16 +205,16 @@ static Snapshot FirstXactSnapshot = NULL;
 typedef struct ExportedSnapshot
 {
 	char	   *snapfile;
-	Snapshot	snapshot;
+	MVCCSnapshot snapshot;
 } ExportedSnapshot;
 
 /* Current xact's exported snapshots (a list of ExportedSnapshot structs) */
 static List *exportedSnapshots = NIL;
 
 /* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
+static MVCCSnapshot CopyMVCCSnapshot(MVCCSnapshot snapshot);
 static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
+static void FreeMVCCSnapshot(MVCCSnapshot snapshot);
 static void SnapshotResetXmin(void);
 
 /* ResourceOwner callbacks to track snapshot references */
@@ -308,8 +308,9 @@ GetTransactionSnapshot(void)
 				CurrentSnapshot = GetSerializableTransactionSnapshot(&CurrentSnapshotData);
 			else
 				CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
+
 			/* Make a saved copy */
-			CurrentSnapshot = CopySnapshot(CurrentSnapshot);
+			CurrentSnapshot = CopyMVCCSnapshot(CurrentSnapshot);
 			FirstXactSnapshot = CurrentSnapshot;
 			/* Mark it as "registered" in FirstXactSnapshot */
 			FirstXactSnapshot->regd_count++;
@@ -319,18 +320,18 @@ GetTransactionSnapshot(void)
 			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
 
 		FirstSnapshotSet = true;
-		return CurrentSnapshot;
+		return (Snapshot) CurrentSnapshot;
 	}
 
 	if (IsolationUsesXactSnapshot())
-		return CurrentSnapshot;
+		return (Snapshot) CurrentSnapshot;
 
 	/* Don't allow catalog snapshot to be older than xact snapshot. */
 	InvalidateCatalogSnapshot();
 
 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
 
-	return CurrentSnapshot;
+	return (Snapshot) CurrentSnapshot;
 }
 
 /*
@@ -361,7 +362,7 @@ GetLatestSnapshot(void)
 
 	SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData);
 
-	return SecondarySnapshot;
+	return (Snapshot) SecondarySnapshot;
 }
 
 /*
@@ -380,7 +381,7 @@ GetCatalogSnapshot(Oid relid)
 	 * finishing decoding.
 	 */
 	if (HistoricSnapshotActive())
-		return HistoricSnapshot;
+		return (Snapshot) HistoricSnapshot;
 
 	return GetNonHistoricCatalogSnapshot(relid);
 }
@@ -426,7 +427,7 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 		pairingheap_add(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 	}
 
-	return CatalogSnapshot;
+	return (Snapshot) CatalogSnapshot;
 }
 
 /*
@@ -495,7 +496,7 @@ SnapshotSetCommandId(CommandId curcid)
  * in GetTransactionSnapshot.
  */
 static void
-SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
+SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid,
 					   int sourcepid, PGPROC *sourceproc)
 {
 	/* Caller should have checked this already */
@@ -574,7 +575,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 			SetSerializableTransactionSnapshot(CurrentSnapshot, sourcevxid,
 											   sourcepid);
 		/* Make a saved copy */
-		CurrentSnapshot = CopySnapshot(CurrentSnapshot);
+		CurrentSnapshot = CopyMVCCSnapshot(CurrentSnapshot);
 		FirstXactSnapshot = CurrentSnapshot;
 		/* Mark it as "registered" in FirstXactSnapshot */
 		FirstXactSnapshot->regd_count++;
@@ -585,29 +586,27 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 }
 
 /*
- * CopySnapshot
+ * CopyMVCCSnapshot
  *		Copy the given snapshot.
  *
  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
  * to 0.  The returned snapshot has the copied flag set.
  */
-static Snapshot
-CopySnapshot(Snapshot snapshot)
+static MVCCSnapshot
+CopyMVCCSnapshot(MVCCSnapshot snapshot)
 {
-	Snapshot	newsnap;
+	MVCCSnapshot newsnap;
 	Size		subxipoff;
 	Size		size;
 
-	Assert(snapshot != InvalidSnapshot);
-
 	/* We allocate any XID arrays needed in the same palloc block. */
-	size = subxipoff = sizeof(SnapshotData) +
+	size = subxipoff = sizeof(MVCCSnapshotData) +
 		snapshot->xcnt * sizeof(TransactionId);
 	if (snapshot->subxcnt > 0)
 		size += snapshot->subxcnt * sizeof(TransactionId);
 
-	newsnap = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
-	memcpy(newsnap, snapshot, sizeof(SnapshotData));
+	newsnap = (MVCCSnapshot) MemoryContextAlloc(TopTransactionContext, size);
+	memcpy(newsnap, snapshot, sizeof(MVCCSnapshotData));
 
 	newsnap->regd_count = 0;
 	newsnap->active_count = 0;
@@ -644,11 +643,11 @@ CopySnapshot(Snapshot snapshot)
 }
 
 /*
- * FreeSnapshot
+ * FreeMVCCSnapshot
  *		Free the memory associated with a snapshot.
  */
 static void
-FreeSnapshot(Snapshot snapshot)
+FreeMVCCSnapshot(MVCCSnapshot snapshot)
 {
 	Assert(snapshot->regd_count == 0);
 	Assert(snapshot->active_count == 0);
@@ -664,6 +663,8 @@ FreeSnapshot(Snapshot snapshot)
  * If the passed snapshot is a statically-allocated one, or it is possibly
  * subject to a future command counter update, create a new long-lived copy
  * with active refcount=1.  Otherwise, only increment the refcount.
+ *
+ * Only regular MVCC snapshots can be used as the active snapshot.
  */
 void
 PushActiveSnapshot(Snapshot snapshot)
@@ -682,9 +683,12 @@ PushActiveSnapshot(Snapshot snapshot)
 void
 PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 {
+	MVCCSnapshot origsnap;
 	ActiveSnapshotElt *newactive;
 
-	Assert(snapshot != InvalidSnapshot);
+	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+	origsnap = &snapshot->mvcc;
+
 	Assert(ActiveSnapshot == NULL || snap_level >= ActiveSnapshot->as_level);
 
 	newactive = MemoryContextAlloc(TopTransactionContext, sizeof(ActiveSnapshotElt));
@@ -693,11 +697,11 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 	 * Checking SecondarySnapshot is probably useless here, but it seems
 	 * better to be sure.
 	 */
-	if (snapshot == CurrentSnapshot || snapshot == SecondarySnapshot ||
-		!snapshot->copied)
-		newactive->as_snap = CopySnapshot(snapshot);
+	if (origsnap == CurrentSnapshot || origsnap == SecondarySnapshot ||
+		!origsnap->copied)
+		newactive->as_snap = CopyMVCCSnapshot(origsnap);
 	else
-		newactive->as_snap = snapshot;
+		newactive->as_snap = origsnap;
 
 	newactive->as_next = ActiveSnapshot;
 	newactive->as_level = snap_level;
@@ -718,7 +722,8 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 void
 PushCopiedSnapshot(Snapshot snapshot)
 {
-	PushActiveSnapshot(CopySnapshot(snapshot));
+	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+	PushActiveSnapshot((Snapshot) CopyMVCCSnapshot(&snapshot->mvcc));
 }
 
 /*
@@ -771,7 +776,7 @@ PopActiveSnapshot(void)
 
 	if (ActiveSnapshot->as_snap->active_count == 0 &&
 		ActiveSnapshot->as_snap->regd_count == 0)
-		FreeSnapshot(ActiveSnapshot->as_snap);
+		FreeMVCCSnapshot(ActiveSnapshot->as_snap);
 
 	pfree(ActiveSnapshot);
 	ActiveSnapshot = newstack;
@@ -788,7 +793,7 @@ GetActiveSnapshot(void)
 {
 	Assert(ActiveSnapshot != NULL);
 
-	return ActiveSnapshot->as_snap;
+	return (Snapshot) ActiveSnapshot->as_snap;
 }
 
 /*
@@ -805,7 +810,8 @@ ActiveSnapshotSet(void)
  * RegisterSnapshot
  *		Register a snapshot as being in use by the current resource owner
  *
- * If InvalidSnapshot is passed, it is not registered.
+ * Only regular MVCC snapshots and "historic" MVCC snapshots can be registered.
+ * InvalidSnapshot is also accepted, as a no-op.
  */
 Snapshot
 RegisterSnapshot(Snapshot snapshot)
@@ -821,25 +827,39 @@ RegisterSnapshot(Snapshot snapshot)
  *		As above, but use the specified resource owner
  */
 Snapshot
-RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner)
+RegisterSnapshotOnOwner(Snapshot orig_snapshot, ResourceOwner owner)
 {
-	Snapshot	snap;
+	MVCCSnapshot snapshot;
 
-	if (snapshot == InvalidSnapshot)
+	if (orig_snapshot == InvalidSnapshot)
 		return InvalidSnapshot;
 
+	if (orig_snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
+	{
+		HistoricMVCCSnapshot historicsnap = &orig_snapshot->historic_mvcc;
+
+		ResourceOwnerEnlarge(owner);
+		historicsnap->regd_count++;
+		ResourceOwnerRememberSnapshot(owner, (Snapshot) historicsnap);
+
+		return (Snapshot) historicsnap;
+	}
+
+	Assert(orig_snapshot->snapshot_type == SNAPSHOT_MVCC);
+	snapshot = &orig_snapshot->mvcc;
+
 	/* Static snapshot?  Create a persistent copy */
-	snap = snapshot->copied ? snapshot : CopySnapshot(snapshot);
+	snapshot = snapshot->copied ? snapshot : CopyMVCCSnapshot(snapshot);
 
 	/* and tell resowner.c about it */
 	ResourceOwnerEnlarge(owner);
-	snap->regd_count++;
-	ResourceOwnerRememberSnapshot(owner, snap);
+	snapshot->regd_count++;
+	ResourceOwnerRememberSnapshot(owner, (Snapshot) snapshot);
 
-	if (snap->regd_count == 1)
-		pairingheap_add(&RegisteredSnapshots, &snap->ph_node);
+	if (snapshot->regd_count == 1)
+		pairingheap_add(&RegisteredSnapshots, &snapshot->ph_node);
 
-	return snap;
+	return (Snapshot) snapshot;
 }
 
 /*
@@ -875,18 +895,41 @@ UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner)
 static void
 UnregisterSnapshotNoOwner(Snapshot snapshot)
 {
-	Assert(snapshot->regd_count > 0);
-	Assert(!pairingheap_is_empty(&RegisteredSnapshots));
+	if (snapshot->snapshot_type == SNAPSHOT_MVCC)
+	{
+		MVCCSnapshot mvccsnap = &snapshot->mvcc;
+
+		Assert(mvccsnap->regd_count > 0);
+		Assert(!pairingheap_is_empty(&RegisteredSnapshots));
 
-	snapshot->regd_count--;
-	if (snapshot->regd_count == 0)
-		pairingheap_remove(&RegisteredSnapshots, &snapshot->ph_node);
+		mvccsnap->regd_count--;
+		if (mvccsnap->regd_count == 0)
+			pairingheap_remove(&RegisteredSnapshots, &mvccsnap->ph_node);
 
-	if (snapshot->regd_count == 0 && snapshot->active_count == 0)
+		if (mvccsnap->regd_count == 0 && mvccsnap->active_count == 0)
+		{
+			FreeMVCCSnapshot(mvccsnap);
+			SnapshotResetXmin();
+		}
+	}
+	else if (snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
 	{
-		FreeSnapshot(snapshot);
-		SnapshotResetXmin();
+		HistoricMVCCSnapshot historicsnap = &snapshot->historic_mvcc;
+
+		/*
+		 * Historic snapshots don't rely on the resource owner machinery for
+		 * cleanup; the snapbuild.c machinery ensures that whenever a historic
+		 * snapshot is in use, it has a non-zero refcount.  Registration is
+		 * only supported so that the callers don't need to treat regular MVCC
+		 * catalog snapshots and historic snapshots differently.
+		 */
+		Assert(historicsnap->refcount > 0);
+
+		Assert(historicsnap->regd_count > 0);
+		historicsnap->regd_count--;
 	}
+	else
+		elog(ERROR, "registered snapshot has unexpected type");
 }
 
 /*
@@ -896,8 +939,8 @@ UnregisterSnapshotNoOwner(Snapshot snapshot)
 static int
 xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 {
-	const SnapshotData *asnap = pairingheap_const_container(SnapshotData, ph_node, a);
-	const SnapshotData *bsnap = pairingheap_const_container(SnapshotData, ph_node, b);
+	const MVCCSnapshotData *asnap = pairingheap_const_container(MVCCSnapshotData, ph_node, a);
+	const MVCCSnapshotData *bsnap = pairingheap_const_container(MVCCSnapshotData, ph_node, b);
 
 	if (TransactionIdPrecedes(asnap->xmin, bsnap->xmin))
 		return 1;
@@ -923,7 +966,7 @@ xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 static void
 SnapshotResetXmin(void)
 {
-	Snapshot	minSnapshot;
+	MVCCSnapshot minSnapshot;
 
 	if (ActiveSnapshot != NULL)
 		return;
@@ -934,7 +977,7 @@ SnapshotResetXmin(void)
 		return;
 	}
 
-	minSnapshot = pairingheap_container(SnapshotData, ph_node,
+	minSnapshot = pairingheap_container(MVCCSnapshotData, ph_node,
 										pairingheap_first(&RegisteredSnapshots));
 
 	if (TransactionIdPrecedes(MyProc->xmin, minSnapshot->xmin))
@@ -984,7 +1027,7 @@ AtSubAbort_Snapshot(int level)
 
 		if (ActiveSnapshot->as_snap->active_count == 0 &&
 			ActiveSnapshot->as_snap->regd_count == 0)
-			FreeSnapshot(ActiveSnapshot->as_snap);
+			FreeMVCCSnapshot(ActiveSnapshot->as_snap);
 
 		/* and free the stack element */
 		pfree(ActiveSnapshot);
@@ -1006,7 +1049,7 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	 * In transaction-snapshot mode we must release our privately-managed
 	 * reference to the transaction snapshot.  We must remove it from
 	 * RegisteredSnapshots to keep the check below happy.  But we don't bother
-	 * to do FreeSnapshot, for two reasons: the memory will go away with
+	 * to do FreeMVCCSnapshot, for two reasons: the memory will go away with
 	 * TopTransactionContext anyway, and if someone has left the snapshot
 	 * stacked as active, we don't want the code below to be chasing through a
 	 * dangling pointer.
@@ -1099,7 +1142,7 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
  *		snapshot.
  */
 char *
-ExportSnapshot(Snapshot snapshot)
+ExportSnapshot(MVCCSnapshot snapshot)
 {
 	TransactionId topXid;
 	TransactionId *children;
@@ -1163,7 +1206,7 @@ ExportSnapshot(Snapshot snapshot)
 	 * ensure that the snapshot's xmin is honored for the rest of the
 	 * transaction.
 	 */
-	snapshot = CopySnapshot(snapshot);
+	snapshot = CopyMVCCSnapshot(snapshot);
 
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 	esnap = (ExportedSnapshot *) palloc(sizeof(ExportedSnapshot));
@@ -1280,7 +1323,7 @@ pg_export_snapshot(PG_FUNCTION_ARGS)
 {
 	char	   *snapshotName;
 
-	snapshotName = ExportSnapshot(GetActiveSnapshot());
+	snapshotName = ExportSnapshot((MVCCSnapshot) GetActiveSnapshot());
 	PG_RETURN_TEXT_P(cstring_to_text(snapshotName));
 }
 
@@ -1384,7 +1427,7 @@ ImportSnapshot(const char *idstr)
 	Oid			src_dbid;
 	int			src_isolevel;
 	bool		src_readonly;
-	SnapshotData snapshot;
+	MVCCSnapshotData snapshot;
 
 	/*
 	 * Must be at top level of a fresh transaction.  Note in particular that
@@ -1653,7 +1696,7 @@ HaveRegisteredOrActiveSnapshot(void)
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(HistoricMVCCSnapshot historic_snapshot, HTAB *tuplecids)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -1696,11 +1739,10 @@ HistoricSnapshotGetTupleCids(void)
  * SerializedSnapshotData.
  */
 Size
-EstimateSnapshotSpace(Snapshot snapshot)
+EstimateSnapshotSpace(MVCCSnapshot snapshot)
 {
 	Size		size;
 
-	Assert(snapshot != InvalidSnapshot);
 	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
 
 	/* We allocate any XID arrays needed in the same palloc block. */
@@ -1720,7 +1762,7 @@ EstimateSnapshotSpace(Snapshot snapshot)
  *		memory location at start_address.
  */
 void
-SerializeSnapshot(Snapshot snapshot, char *start_address)
+SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
 {
 	SerializedSnapshotData serialized_snapshot;
 
@@ -1776,12 +1818,12 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
  * to 0.  The returned snapshot has the copied flag set.
  */
-Snapshot
+MVCCSnapshot
 RestoreSnapshot(char *start_address)
 {
 	SerializedSnapshotData serialized_snapshot;
 	Size		size;
-	Snapshot	snapshot;
+	MVCCSnapshot snapshot;
 	TransactionId *serialized_xids;
 
 	memcpy(&serialized_snapshot, start_address,
@@ -1790,12 +1832,12 @@ RestoreSnapshot(char *start_address)
 		(start_address + sizeof(SerializedSnapshotData));
 
 	/* We allocate any XID arrays needed in the same palloc block. */
-	size = sizeof(SnapshotData)
+	size = sizeof(MVCCSnapshotData)
 		+ serialized_snapshot.xcnt * sizeof(TransactionId)
 		+ serialized_snapshot.subxcnt * sizeof(TransactionId);
 
 	/* Copy all required fields */
-	snapshot = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
+	snapshot = (MVCCSnapshot) MemoryContextAlloc(TopTransactionContext, size);
 	snapshot->snapshot_type = SNAPSHOT_MVCC;
 	snapshot->xmin = serialized_snapshot.xmin;
 	snapshot->xmax = serialized_snapshot.xmax;
@@ -1840,7 +1882,7 @@ RestoreSnapshot(char *start_address)
  * the declaration for PGPROC.
  */
 void
-RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
+RestoreTransactionSnapshot(MVCCSnapshot snapshot, void *source_pgproc)
 {
 	SetTransactionSnapshot(snapshot, NULL, InvalidPid, source_pgproc);
 }
@@ -1856,7 +1898,7 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
  * XID could not be ours anyway.
  */
 bool
-XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
+XidInMVCCSnapshot(TransactionId xid, MVCCSnapshot snapshot)
 {
 	/*
 	 * Make a quick range check to eliminate most XIDs without looking at the
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..3d3ea109a4c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -431,7 +431,7 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
  */
 struct HTAB;
 extern bool ResolveCminCmaxDuringDecoding(struct HTAB *tuplecid_data,
-										  Snapshot snapshot,
+										  HistoricMVCCSnapshot snapshot,
 										  HeapTuple htup,
 										  Buffer buffer,
 										  CommandId *cmin, CommandId *cmax);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..2626f2996d8 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -34,7 +34,7 @@ typedef struct TableScanDescData
 {
 	/* scan parameters */
 	Relation	rs_rd;			/* heap relation descriptor */
-	struct SnapshotData *rs_snapshot;	/* snapshot to see */
+	union SnapshotData *rs_snapshot;	/* snapshot to see */
 	int			rs_nkeys;		/* number of scan keys */
 	struct ScanKeyData *rs_key; /* array of scan key descriptors */
 
@@ -135,7 +135,7 @@ typedef struct IndexScanDescData
 	/* scan parameters */
 	Relation	heapRelation;	/* heap relation descriptor, or NULL */
 	Relation	indexRelation;	/* index relation descriptor */
-	struct SnapshotData *xs_snapshot;	/* snapshot to see */
+	union SnapshotData *xs_snapshot;	/* snapshot to see */
 	int			numberOfKeys;	/* number of index qualifier conditions */
 	int			numberOfOrderBys;	/* number of ordering operators */
 	struct ScanKeyData *keyData;	/* array of index qualifier descriptors */
@@ -210,7 +210,7 @@ typedef struct SysScanDescData
 	Relation	irel;			/* NULL if doing heap scan */
 	struct TableScanDescData *scan; /* only valid in storage-scan case */
 	struct IndexScanDescData *iscan;	/* only valid in index-scan case */
-	struct SnapshotData *snapshot;	/* snapshot to unregister at end of scan */
+	union SnapshotData *snapshot;	/* snapshot to unregister at end of scan */
 	struct TupleTableSlot *slot;
 }			SysScanDescData;
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3be0cbd7ebe..8bf72c64c94 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -127,7 +127,7 @@ typedef struct ReorderBufferChange
 		}			msg;
 
 		/* New snapshot, set when action == *_INTERNAL_SNAPSHOT */
-		Snapshot	snapshot;
+		HistoricMVCCSnapshot snapshot;
 
 		/*
 		 * New command id for existing snapshot in a catalog changing tx. Set
@@ -359,7 +359,7 @@ typedef struct ReorderBufferTXN
 	 * transaction modifies the catalog, or another catalog-modifying
 	 * transaction commits.
 	 */
-	Snapshot	base_snapshot;
+	HistoricMVCCSnapshot base_snapshot;
 	XLogRecPtr	base_snapshot_lsn;
 	dlist_node	base_snapshot_node; /* link in txns_by_base_snapshot_lsn */
 
@@ -367,7 +367,7 @@ typedef struct ReorderBufferTXN
 	 * Snapshot/CID from the previous streaming run. Only valid for already
 	 * streamed transactions (NULL/InvalidCommandId otherwise).
 	 */
-	Snapshot	snapshot_now;
+	HistoricMVCCSnapshot snapshot_now;
 	CommandId	command_id;
 
 	/*
@@ -703,7 +703,7 @@ extern void ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid,
 									 XLogRecPtr lsn, ReorderBufferChange *change,
 									 bool toast_insert);
 extern void ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
-									  Snapshot snap, XLogRecPtr lsn,
+									  HistoricMVCCSnapshot snap, XLogRecPtr lsn,
 									  bool transactional, const char *prefix,
 									  Size message_size, const char *message);
 extern void ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
@@ -727,9 +727,9 @@ extern void ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr
 extern void ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn);
 
 extern void ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
-										 XLogRecPtr lsn, Snapshot snap);
+										 XLogRecPtr lsn, HistoricMVCCSnapshot snap);
 extern void ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
-									 XLogRecPtr lsn, Snapshot snap);
+									 XLogRecPtr lsn, HistoricMVCCSnapshot snap);
 extern void ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 										 XLogRecPtr lsn, CommandId cid);
 extern void ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 44031dcf6e3..5930ffb55a8 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -70,15 +70,15 @@ extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
 										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *builder);
 
-extern void SnapBuildSnapDecRefcount(Snapshot snap);
+extern void SnapBuildSnapDecRefcount(HistoricMVCCSnapshot snap);
 
-extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern MVCCSnapshot SnapBuildInitialSnapshot(SnapBuild *builder);
 extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
 extern void SnapBuildClearExportedSnapshot(void);
 extern void SnapBuildResetExportedSnapshotState(void);
 
 extern SnapBuildState SnapBuildCurrentState(SnapBuild *builder);
-extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder);
+extern HistoricMVCCSnapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr);
 extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
diff --git a/src/include/replication/snapbuild_internal.h b/src/include/replication/snapbuild_internal.h
index 3b915dc8793..9bed20efa31 100644
--- a/src/include/replication/snapbuild_internal.h
+++ b/src/include/replication/snapbuild_internal.h
@@ -74,7 +74,7 @@ struct SnapBuild
 	/*
 	 * Snapshot that's valid to see the catalog state seen at this moment.
 	 */
-	Snapshot	snapshot;
+	HistoricMVCCSnapshot snapshot;
 
 	/*
 	 * LSN of the last location we are sure a snapshot has been serialized to.
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 267d5d90e94..6a78dfeac96 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -47,8 +47,8 @@ extern void CheckPointPredicate(void);
 extern bool PageIsPredicateLocked(Relation relation, BlockNumber blkno);
 
 /* predicate lock maintenance */
-extern Snapshot GetSerializableTransactionSnapshot(Snapshot snapshot);
-extern void SetSerializableTransactionSnapshot(Snapshot snapshot,
+extern MVCCSnapshot GetSerializableTransactionSnapshot(MVCCSnapshot snapshot);
+extern void SetSerializableTransactionSnapshot(MVCCSnapshot snapshot,
 											   VirtualTransactionId *sourcevxid,
 											   int sourcepid);
 extern void RegisterPredicateLockingXid(TransactionId xid);
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index ef0b733ebe8..7f5727c2586 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -44,7 +44,7 @@ extern void KnownAssignedTransactionIdsIdleMaintenance(void);
 extern int	GetMaxSnapshotXidCount(void);
 extern int	GetMaxSnapshotSubxidCount(void);
 
-extern Snapshot GetSnapshotData(Snapshot snapshot);
+extern MVCCSnapshot GetSnapshotData(MVCCSnapshot snapshot);
 
 extern bool ProcArrayInstallImportedXmin(TransactionId xmin,
 										 VirtualTransactionId *sourcevxid);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index d346be71642..1f627ff966d 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -49,7 +49,7 @@ extern PGDLLIMPORT SnapshotData SnapshotToastData;
  */
 #define InitNonVacuumableSnapshot(snapshotdata, vistestp)  \
 	((snapshotdata).snapshot_type = SNAPSHOT_NON_VACUUMABLE, \
-	 (snapshotdata).vistest = (vistestp))
+	 (snapshotdata).nonvacuumable.vistest = (vistestp))
 
 /* This macro encodes the knowledge of which snapshots are MVCC-safe */
 #define IsMVCCSnapshot(snapshot)  \
@@ -89,7 +89,7 @@ extern void WaitForOlderSnapshots(TransactionId limitXmin, bool progress);
 extern bool ThereAreNoPriorRegisteredSnapshots(void);
 extern bool HaveRegisteredOrActiveSnapshot(void);
 
-extern char *ExportSnapshot(Snapshot snapshot);
+extern char *ExportSnapshot(MVCCSnapshot snapshot);
 
 /*
  * These live in procarray.c because they're intimately linked to the
@@ -105,18 +105,18 @@ extern bool GlobalVisCheckRemovableFullXid(Relation rel, FullTransactionId fxid)
 /*
  * Utility functions for implementing visibility routines in table AMs.
  */
-extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
+extern bool XidInMVCCSnapshot(TransactionId xid, MVCCSnapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot historic_snapshot, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(HistoricMVCCSnapshot historic_snapshot, struct HTAB *tuplecids);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-extern Size EstimateSnapshotSpace(Snapshot snapshot);
-extern void SerializeSnapshot(Snapshot snapshot, char *start_address);
-extern Snapshot RestoreSnapshot(char *start_address);
-extern void RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc);
+extern Size EstimateSnapshotSpace(MVCCSnapshot snapshot);
+extern void SerializeSnapshot(MVCCSnapshot snapshot, char *start_address);
+extern MVCCSnapshot RestoreSnapshot(char *start_address);
+extern void RestoreTransactionSnapshot(MVCCSnapshot snapshot, void *source_pgproc);
 
 #endif							/* SNAPMGR_H */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec1497..93c1f51784f 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -17,7 +17,7 @@
 
 
 /*
- * The different snapshot types.  We use SnapshotData structures to represent
+ * The different snapshot types.  We use the SnapshotData union to represent
  * both "regular" (MVCC) snapshots and "special" snapshots that have non-MVCC
  * semantics.  The specific semantics of a snapshot are encoded by its type.
  *
@@ -27,6 +27,9 @@
  * The reason the snapshot type rather than a callback as it used to be is
  * that that allows to use the same snapshot for different table AMs without
  * having one callback per AM.
+ *
+ * The executor deals with MVCC snapshots, but the table AM and some other
+ * parts of the system also support the special snapshots.
  */
 typedef enum SnapshotType
 {
@@ -100,7 +103,9 @@ typedef enum SnapshotType
 	/*
 	 * A tuple is visible iff it follows the rules of SNAPSHOT_MVCC, but
 	 * supports being called in timetravel context (for decoding catalog
-	 * contents in the context of logical decoding).
+	 * contents in the context of logical decoding).  A historic MVCC snapshot
+	 * should only be used on catalog tables, as we only track XIDs that
+	 * modify catalogs during logical decoding.
 	 */
 	SNAPSHOT_HISTORIC_MVCC,
 
@@ -114,37 +119,18 @@ typedef enum SnapshotType
 	SNAPSHOT_NON_VACUUMABLE,
 } SnapshotType;
 
-typedef struct SnapshotData *Snapshot;
-
-#define InvalidSnapshot		((Snapshot) NULL)
-
 /*
- * Struct representing all kind of possible snapshots.
+ * Struct representing a normal MVCC snapshot.
  *
- * There are several different kinds of snapshots:
- * * Normal MVCC snapshots
- * * MVCC snapshots taken during recovery (in Hot-Standby mode)
- * * Historic MVCC snapshots used during logical decoding
- * * snapshots passed to HeapTupleSatisfiesDirty()
- * * snapshots passed to HeapTupleSatisfiesNonVacuumable()
- * * snapshots used for SatisfiesAny, Toast, Self where no members are
- *	 accessed.
- *
- * TODO: It's probably a good idea to split this struct using a NodeTag
- * similar to how parser and executor nodes are handled, with one type for
- * each different kind of snapshot to avoid overloading the meaning of
- * individual fields.
+ * MVCC snapshots come in two variants: those taken during recovery in hot
+ * standby mode, and "normal" MVCC snapshots.  They are distinguished by
+ * takenDuringRecovery.
  */
-typedef struct SnapshotData
+typedef struct MVCCSnapshotData
 {
-	SnapshotType snapshot_type; /* type of snapshot */
+	SnapshotType snapshot_type; /* type of snapshot, must be first */
 
 	/*
-	 * The remaining fields are used only for MVCC snapshots, and are normally
-	 * just zeroes in special snapshots.  (But xmin and xmax are used
-	 * specially by HeapTupleSatisfiesDirty, and xmin is used specially by
-	 * HeapTupleSatisfiesNonVacuumable.)
-	 *
 	 * An MVCC snapshot can never see the effects of XIDs >= xmax. It can see
 	 * the effects of all older XIDs except those listed in the snapshot. xmin
 	 * is stored as an optimization to avoid needing to search the XID arrays
@@ -154,10 +140,8 @@ typedef struct SnapshotData
 	TransactionId xmax;			/* all XID >= xmax are invisible to me */
 
 	/*
-	 * For normal MVCC snapshot this contains the all xact IDs that are in
-	 * progress, unless the snapshot was taken during recovery in which case
-	 * it's empty. For historic MVCC snapshots, the meaning is inverted, i.e.
-	 * it contains *committed* transactions between xmin and xmax.
+	 * xip contains all the xact IDs that are in progress, unless the snapshot
+	 * was taken during recovery in which case it's empty.
 	 *
 	 * note: all ids in xip[] satisfy xmin <= xip[i] < xmax
 	 */
@@ -165,10 +149,8 @@ typedef struct SnapshotData
 	uint32		xcnt;			/* # of xact ids in xip[] */
 
 	/*
-	 * For non-historic MVCC snapshots, this contains subxact IDs that are in
-	 * progress (and other transactions that are in progress if taken during
-	 * recovery). For historic snapshot it contains *all* xids assigned to the
-	 * replayed transaction, including the toplevel xid.
+	 * subxip contains subxact IDs that are in progress (and other
+	 * transactions that are in progress if taken during recovery).
 	 *
 	 * note: all ids in subxip[] are >= xmin, but we don't bother filtering
 	 * out any that are >= xmax
@@ -182,18 +164,6 @@ typedef struct SnapshotData
 
 	CommandId	curcid;			/* in my xact, CID < curcid are visible */
 
-	/*
-	 * An extra return value for HeapTupleSatisfiesDirty, not used in MVCC
-	 * snapshots.
-	 */
-	uint32		speculativeToken;
-
-	/*
-	 * For SNAPSHOT_NON_VACUUMABLE (and hopefully more in the future) this is
-	 * used to determine whether row could be vacuumed.
-	 */
-	struct GlobalVisState *vistest;
-
 	/*
 	 * Book-keeping information, used by the snapshot manager
 	 */
@@ -207,6 +177,97 @@ typedef struct SnapshotData
 	 * transactions completed since the last GetSnapshotData().
 	 */
 	uint64		snapXactCompletionCount;
+} MVCCSnapshotData;
+
+typedef struct MVCCSnapshotData *MVCCSnapshot;
+
+#define InvalidMVCCSnapshot ((MVCCSnapshot) NULL)
+
+/*
+ * Struct representing a "historic" MVCC snapshot during logical decoding.
+ * These are constructed by src/backend/replication/logical/snapbuild.c.
+ */
+typedef struct HistoricMVCCSnapshotData
+{
+	SnapshotType snapshot_type; /* type of snapshot, must be first */
+
+	/*
+	 * xmin and xmax like in a normal MVCC snapshot.
+	 */
+	TransactionId xmin;			/* all XID < xmin are visible to me */
+	TransactionId xmax;			/* all XID >= xmax are invisible to me */
+
+	/*
+	 * committed_xids contains *committed* transactions between xmin and xmax.
+	 * (This is the inverse of 'xip' in normal MVCC snapshots, which contains
+	 * all non-committed transactions.)  The array is sorted by XID to allow
+	 * binary search.
+	 *
+	 * note: all ids in committed_xids[] satisfy xmin <= committed_xids[i] <
+	 * xmax
+	 */
+	TransactionId *committed_xids;
+	uint32		xcnt;			/* # of xact ids in committed_xids[] */
+
+	/*
+	 * curxip contains *all* xids assigned to the replayed transaction,
+	 * including the toplevel xid.
+	 */
+	TransactionId *curxip;
+	int32		curxcnt;		/* # of xact ids in curxip[] */
+
+	CommandId	curcid;			/* in my xact, CID < curcid are visible */
+
+	bool		copied;			/* false if it's a "base" snapshot */
+
+	uint32		refcount;		/* refcount managed by snapbuild.c  */
+	uint32		regd_count;		/* refcount registered with resource owners */
+
+} HistoricMVCCSnapshotData;
+
+typedef struct HistoricMVCCSnapshotData *HistoricMVCCSnapshot;
+
+/*
+ * Struct representing a special "snapshot" which sees all tuples as visible
+ * if they are visible to anyone, i.e. if they are not vacuumable.  This is
+ * used for SNAPSHOT_NON_VACUUMABLE.
+ */
+typedef struct NonVacuumableSnapshotData
+{
+	SnapshotType snapshot_type; /* type of snapshot, must be first */
+
+	/* This is used to determine whether row could be vacuumed. */
+	struct GlobalVisState *vistest;
+} NonVacuumableSnapshotData;
+
+/*
+ * Return values to the caller of HeapTupleSatisfiesDirty.
+ */
+typedef struct DirtySnapshotData
+{
+	SnapshotType snapshot_type; /* type of snapshot, must be first */
+
+	TransactionId xmin;
+	TransactionId xmax;
+	uint32		speculativeToken;
+} DirtySnapshotData;
+
+/*
+ * Generic union representing all kinds of possible snapshots.  Some have
+ * type-specific structs.
+ */
+typedef union SnapshotData
+{
+	SnapshotType snapshot_type; /* type of snapshot */
+
+	MVCCSnapshotData mvcc;
+	DirtySnapshotData dirty;
+	HistoricMVCCSnapshotData historic_mvcc;
+	NonVacuumableSnapshotData nonvacuumable;
 } SnapshotData;
 
+typedef union SnapshotData *Snapshot;
+
+#define InvalidSnapshot		((Snapshot) NULL)
+
 #endif							/* SNAPSHOT_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b66cecd8799..c8ed18cf580 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -633,6 +633,7 @@ DictThesaurus
 DimensionInfo
 DirectoryMethodData
 DirectoryMethodFile
+DirtySnapshotData
 DisableTimeoutParams
 DiscardMode
 DiscardStmt
@@ -1183,6 +1184,7 @@ HeapTupleFreeze
 HeapTupleHeader
 HeapTupleHeaderData
 HeapTupleTableSlot
+HistoricMVCCSnapshotData
 HistControl
 HotStandbyState
 I32
@@ -1633,6 +1635,7 @@ MINIDUMPWRITEDUMP
 MINIDUMP_TYPE
 MJEvalResult
 MTTargetRelLookup
+MVCCSnapshotData
 MVDependencies
 MVDependency
 MVNDistinct
@@ -1732,6 +1735,7 @@ NextValueExpr
 Node
 NodeTag
 NonEmptyRange
+NonVacuumableSnapshotData
 Notification
 NotificationList
 NotifyStmt
-- 
2.39.5
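
To make the layout change in the snapshot.h hunk above easier to follow, here is a minimal standalone sketch of the same pattern: a union whose variants all start with the type tag, so generic code can read snapshot_type and then pick the right member. This is not PostgreSQL code. All Sketch* names are hypothetical stand-ins, and the field sets are heavily trimmed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for the types in the patch. */
typedef enum SketchSnapshotType
{
	SKETCH_SNAPSHOT_MVCC,
	SKETCH_SNAPSHOT_DIRTY
} SketchSnapshotType;

typedef struct SketchMVCCSnapshotData
{
	SketchSnapshotType snapshot_type;	/* type of snapshot, must be first */
	uint32_t	xmin;
	uint32_t	xmax;
} SketchMVCCSnapshotData;

typedef struct SketchDirtySnapshotData
{
	SketchSnapshotType snapshot_type;	/* type of snapshot, must be first */
	uint32_t	speculativeToken;
} SketchDirtySnapshotData;

/* Every member begins with the tag, so it can be read via any of them. */
typedef union SketchSnapshotData
{
	SketchSnapshotType snapshot_type;
	SketchMVCCSnapshotData mvcc;
	SketchDirtySnapshotData dirty;
} SketchSnapshotData;

typedef union SketchSnapshotData *SketchSnapshot;

/* Return xmax for an MVCC snapshot, or 0 for variants that have none. */
static uint32_t
sketch_snapshot_xmax(SketchSnapshot snap)
{
	assert(snap != NULL);

	switch (snap->snapshot_type)
	{
		case SKETCH_SNAPSHOT_MVCC:
			return snap->mvcc.xmax;
		case SKETCH_SNAPSHOT_DIRTY:
			return 0;
	}
	return 0;
}
```

Code that only ever handles one variant can take the specific struct pointer, as the MVCCSnapshot signatures in the diff do, while generic callers keep passing the union pointer and dispatch on the tag.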

Attachment: v7-0002-Simplify-historic-snapshot-refcounting.patch (application/octet-stream)

From 3228848876610c7b13216ffca6b42a9f5465e300 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 13 Mar 2025 16:45:12 +0200
Subject: [PATCH v6 02/12] Simplify historic snapshot refcounting

ReorderBufferProcessTXN() handled "copied" snapshots created with
ReorderBufferCopySnap() differently from "base" historic snapshots
created by snapbuild.c. The base snapshots used a reference count,
while copied snapshots did not. Simplify by using the reference count
for both.
---
 .../replication/logical/reorderbuffer.c       | 97 ++++++++-----------
 src/backend/replication/logical/snapbuild.c   | 48 +--------
 src/include/replication/snapbuild.h           |  1 +
 src/include/utils/snapshot.h                  |  2 -
 4 files changed, 46 insertions(+), 102 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e8196a8d5d5..e47970f1c82 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -103,7 +103,7 @@
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
 #include "replication/slot.h"
-#include "replication/snapbuild.h"	/* just for SnapBuildSnapDecRefcount */
+#include "replication/snapbuild.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
@@ -268,7 +268,6 @@ static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
 static int	ReorderBufferTXNSizeCompare(const pairingheap_node *a, const pairingheap_node *b, void *arg);
 
-static void ReorderBufferFreeSnap(ReorderBuffer *rb, HistoricMVCCSnapshot snap);
 static HistoricMVCCSnapshot ReorderBufferCopySnap(ReorderBuffer *rb, HistoricMVCCSnapshot orig_snap,
 												  ReorderBufferTXN *txn, CommandId cid);
 
@@ -543,7 +542,7 @@ ReorderBufferFreeChange(ReorderBuffer *rb, ReorderBufferChange *change,
 		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 			if (change->data.snapshot)
 			{
-				ReorderBufferFreeSnap(rb, change->data.snapshot);
+				SnapBuildSnapDecRefcount(change->data.snapshot);
 				change->data.snapshot = NULL;
 			}
 			break;
@@ -1593,7 +1592,8 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	if (txn->snapshot_now != NULL)
 	{
 		Assert(rbtxn_is_streamed(txn));
-		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		SnapBuildSnapDecRefcount(txn->snapshot_now);
+		txn->snapshot_now = NULL;
 	}
 
 	/*
@@ -1902,7 +1902,6 @@ ReorderBufferCopySnap(ReorderBuffer *rb, HistoricMVCCSnapshot orig_snap,
 	snap = MemoryContextAllocZero(rb->context, size);
 	memcpy(snap, orig_snap, sizeof(HistoricMVCCSnapshotData));
 
-	snap->copied = true;
 	snap->refcount = 1;			/* mark as active so nobody frees it */
 	snap->regd_count = 0;
 	snap->committed_xids = (TransactionId *) (snap + 1);
@@ -1942,18 +1941,6 @@ ReorderBufferCopySnap(ReorderBuffer *rb, HistoricMVCCSnapshot orig_snap,
 	return snap;
 }
 
-/*
- * Free a previously ReorderBufferCopySnap'ed snapshot
- */
-static void
-ReorderBufferFreeSnap(ReorderBuffer *rb, HistoricMVCCSnapshot snap)
-{
-	if (snap->copied)
-		pfree(snap);
-	else
-		SnapBuildSnapDecRefcount(snap);
-}
-
 /*
  * If the transaction was (partially) streamed, we need to prepare or commit
  * it in a 'streamed' way.  That is, we first stream the remaining part of the
@@ -2104,11 +2091,8 @@ ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	txn->command_id = command_id;
 
 	/* Avoid copying if it's already copied. */
-	if (snapshot_now->copied)
-		txn->snapshot_now = snapshot_now;
-	else
-		txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
-												  txn, command_id);
+	txn->snapshot_now = snapshot_now;
+	SnapBuildSnapIncRefcount(txn->snapshot_now);
 }
 
 /*
@@ -2208,6 +2192,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	/* setup the initial snapshot */
 	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	/* increase refcount for the installed historic snapshot */
+	SnapBuildSnapIncRefcount(snapshot_now);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -2511,33 +2497,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
 					/* get rid of the old */
 					TeardownHistoricSnapshot(false);
-
-					if (snapshot_now->copied)
-					{
-						ReorderBufferFreeSnap(rb, snapshot_now);
-						snapshot_now =
-							ReorderBufferCopySnap(rb, change->data.snapshot,
-												  txn, command_id);
-					}
-
-					/*
-					 * Restored from disk, need to be careful not to double
-					 * free. We could introduce refcounting for that, but for
-					 * now this seems infrequent enough not to care.
-					 */
-					else if (change->data.snapshot->copied)
-					{
-						snapshot_now =
-							ReorderBufferCopySnap(rb, change->data.snapshot,
-												  txn, command_id);
-					}
-					else
-					{
-						snapshot_now = change->data.snapshot;
-					}
+					SnapBuildSnapDecRefcount(snapshot_now);
 
 					/* and continue with the new one */
+					snapshot_now = change->data.snapshot;
 					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SnapBuildSnapIncRefcount(snapshot_now);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -2547,16 +2512,26 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					{
 						command_id = change->data.command_id;
 
-						if (!snapshot_now->copied)
+						TeardownHistoricSnapshot(false);
+
+						/*
+						 * Construct a new snapshot with the new command ID.
+						 *
+						 * If this is the only reference to the snapshot, and
+						 * it's a "copied" snapshot that already contains all
+						 * the replayed transaction's XIDs (curxcnt > 0), we
+						 * can take a shortcut and update the snapshot's
+						 * command ID in place.
+						 */
+						if (snapshot_now->refcount == 1 && snapshot_now->curxcnt > 0)
+							snapshot_now->curcid = command_id;
+						else
 						{
-							/* we don't use the global one anymore */
+							SnapBuildSnapDecRefcount(snapshot_now);
 							snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
 																 txn, command_id);
 						}
 
-						snapshot_now->curcid = command_id;
-
-						TeardownHistoricSnapshot(false);
 						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
 					}
 
@@ -2646,11 +2621,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 			ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
-		else if (snapshot_now->copied)
-			ReorderBufferFreeSnap(rb, snapshot_now);
 
 		/* cleanup */
 		TeardownHistoricSnapshot(false);
+		SnapBuildSnapDecRefcount(snapshot_now);
+		snapshot_now = NULL;
 
 		/*
 		 * Aborting the current (sub-)transaction as a whole has the right
@@ -2703,6 +2678,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		TeardownHistoricSnapshot(true);
 
+		/*
+		 * don't decrement the refcount on snapshot_now yet, we still use it
+		 * in the ReorderBufferResetTXN() call below.
+		 */
+
 		/*
 		 * Force cache invalidation to happen outside of a valid transaction
 		 * to prevent catalog access as we just caught an error.
@@ -2751,9 +2731,15 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
+
+			SnapBuildSnapDecRefcount(snapshot_now);
+			snapshot_now = NULL;
 		}
 		else
 		{
+			SnapBuildSnapDecRefcount(snapshot_now);
+			snapshot_now = NULL;
+
 			ReorderBufferCleanupTXN(rb, txn);
 			MemoryContextSwitchTo(ecxt);
 			PG_RE_THROW();
@@ -4256,8 +4242,7 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 											 txn, command_id);
 
 		/* Free the previously copied snapshot. */
-		Assert(txn->snapshot_now->copied);
-		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+		SnapBuildSnapDecRefcount(txn->snapshot_now);
 		txn->snapshot_now = NULL;
 	}
 
@@ -4647,7 +4632,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				newsnap->committed_xids = (TransactionId *)
 					(((char *) newsnap) + sizeof(HistoricMVCCSnapshotData));
 				newsnap->curxip = newsnap->committed_xids + newsnap->xcnt;
-				newsnap->copied = true;
+				newsnap->refcount = 1;
 				break;
 			}
 			/* the base struct contains all the data, easy peasy */
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 7a341418a74..50dca7cb758 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -157,10 +157,6 @@ static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 /* snapshot building/manipulation/distribution functions */
 static HistoricMVCCSnapshot SnapBuildBuildSnapshot(SnapBuild *builder);
 
-static void SnapBuildFreeSnapshot(HistoricMVCCSnapshot snap);
-
-static void SnapBuildSnapIncRefcount(HistoricMVCCSnapshot snap);
-
 static void SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
 static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
@@ -245,29 +241,6 @@ FreeSnapshotBuilder(SnapBuild *builder)
 	MemoryContextDelete(context);
 }
 
-/*
- * Free an unreferenced snapshot that has previously been built by us.
- */
-static void
-SnapBuildFreeSnapshot(HistoricMVCCSnapshot snap)
-{
-	/* make sure we don't get passed an external snapshot */
-	Assert(snap->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
-
-	/* make sure nobody modified our snapshot */
-	Assert(snap->curcid == FirstCommandId);
-	Assert(snap->regd_count == 0);
-
-	/* slightly more likely, so it's checked even without c-asserts */
-	if (snap->copied)
-		elog(ERROR, "cannot free a copied snapshot");
-
-	if (snap->refcount)
-		elog(ERROR, "cannot free a snapshot that's in use");
-
-	pfree(snap);
-}
-
 /*
  * In which state of snapshot building are we?
  */
@@ -310,7 +283,7 @@ SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
  * This is used when handing out a snapshot to some external resource or when
  * adding a Snapshot as builder->snapshot.
  */
-static void
+void
 SnapBuildSnapIncRefcount(HistoricMVCCSnapshot snap)
 {
 	snap->refcount++;
@@ -318,9 +291,6 @@ SnapBuildSnapIncRefcount(HistoricMVCCSnapshot snap)
 
 /*
  * Decrease refcount of a snapshot and free if the refcount reaches zero.
- *
- * Externally visible, so that external resources that have been handed an
- * IncRef'ed Snapshot can adjust its refcount easily.
  */
 void
 SnapBuildSnapDecRefcount(HistoricMVCCSnapshot snap)
@@ -328,19 +298,12 @@ SnapBuildSnapDecRefcount(HistoricMVCCSnapshot snap)
 	/* make sure we don't get passed an external snapshot */
 	Assert(snap->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
 
-	/* make sure nobody modified our snapshot */
-	Assert(snap->curcid == FirstCommandId);
-
 	Assert(snap->refcount > 0);
 	Assert(snap->regd_count == 0);
 
-	/* slightly more likely, so it's checked even without casserts */
-	if (snap->copied)
-		elog(ERROR, "cannot free a copied snapshot");
-
 	snap->refcount--;
 	if (snap->refcount == 0)
-		SnapBuildFreeSnapshot(snap);
+		pfree(snap);
 }
 
 /*
@@ -413,7 +376,6 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->curxcnt = 0;
 	snapshot->curxip = NULL;
 
-	snapshot->copied = false;
 	snapshot->curcid = FirstCommandId;
 	snapshot->refcount = 0;
 	snapshot->regd_count = 0;
@@ -1037,18 +999,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			SnapBuildSnapDecRefcount(builder->snapshot);
 
 		builder->snapshot = SnapBuildBuildSnapshot(builder);
+		SnapBuildSnapIncRefcount(builder->snapshot);
 
 		/* we might need to execute invalidations, add snapshot */
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
 		{
-			SnapBuildSnapIncRefcount(builder->snapshot);
 			ReorderBufferSetBaseSnapshot(builder->reorder, xid, lsn,
 										 builder->snapshot);
+			SnapBuildSnapIncRefcount(builder->snapshot);
 		}
 
-		/* refcount of the snapshot builder for the new snapshot */
-		SnapBuildSnapIncRefcount(builder->snapshot);
-
 		/* add a new catalog snapshot to all currently running transactions */
 		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
 	}
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 5930ffb55a8..6095013a299 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -70,6 +70,7 @@ extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *reorder,
 										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *builder);
 
+extern void SnapBuildSnapIncRefcount(HistoricMVCCSnapshot snap);
 extern void SnapBuildSnapDecRefcount(HistoricMVCCSnapshot snap);
 
 extern MVCCSnapshot SnapBuildInitialSnapshot(SnapBuild *builder);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 93c1f51784f..bca0ad16e68 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -218,8 +218,6 @@ typedef struct HistoricMVCCSnapshotData
 
 	CommandId	curcid;			/* in my xact, CID < curcid are visible */
 
-	bool		copied;			/* false if it's a "base" snapshot */
-
 	uint32		refcount;		/* refcount managed by snapbuild.c  */
 	uint32		regd_count;		/* refcount registered with resource owners */
 
-- 
2.39.5
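
The refcounting rule this patch settles on can be sketched in isolation: every holder owns one reference, the last decrement frees the snapshot, and a holder with the only reference may update the command ID in place instead of copying. The code below is a simplified illustration under those assumptions, not the reorderbuffer.c logic. The sketch_* names are hypothetical, malloc/free stand in for memory contexts, and the patch's additional curxcnt > 0 condition is left out. New snapshots here start with refcount 1, as in ReorderBufferCopySnap().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a historic snapshot; only the relevant fields. */
typedef struct SketchSnap
{
	uint32_t	refcount;
	uint32_t	curcid;
} SketchSnap;

static int	sketch_live_snaps = 0;	/* allocations not yet freed */

/* Allocate a snapshot; the caller owns the single initial reference. */
static SketchSnap *
sketch_snap_new(uint32_t curcid)
{
	SketchSnap *snap = malloc(sizeof(SketchSnap));

	if (snap == NULL)
		abort();
	snap->refcount = 1;
	snap->curcid = curcid;
	sketch_live_snaps++;
	return snap;
}

static void
sketch_snap_incref(SketchSnap *snap)
{
	assert(snap->refcount > 0);
	snap->refcount++;
}

/* Drop one reference and free the snapshot when the last one goes away. */
static void
sketch_snap_decref(SketchSnap *snap)
{
	assert(snap->refcount > 0);
	if (--snap->refcount == 0)
	{
		free(snap);
		sketch_live_snaps--;
	}
}

/*
 * Return a snapshot carrying the new command ID, consuming the caller's
 * reference to 'snap'.  If nobody else holds it, update it in place;
 * otherwise make a private copy and release the shared one.
 */
static SketchSnap *
sketch_snap_with_cid(SketchSnap *snap, uint32_t cid)
{
	SketchSnap *copy;

	if (snap->refcount == 1)
	{
		snap->curcid = cid;
		return snap;
	}

	copy = sketch_snap_new(cid);
	sketch_snap_decref(snap);
	return copy;
}
```

With a single ownership rule there is no need for a separate "copied" flag to decide who frees what, which is the simplification the commit message describes.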

Attachment: v7-0003-Add-an-explicit-valid-flag-to-MVCCSnapshotData.patch (application/octet-stream)

From 1705639a73555d9b3f5884c7fd90540c268d3db5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 31 Mar 2025 23:47:48 +0300
Subject: [PATCH v6 03/12] Add an explicit 'valid' flag to MVCCSnapshotData

The lifetime of the "static" snapshots returned by
GetTransactionSnapshot(), GetLatestSnapshot() and GetCatalogSnapshot()
is a bit vague. By adding an explicit 'valid' flag, we can make it
more clear when a function call updates a static snapshot, making it
valid, and when another function makes it invalid again. The flag is
currently only used in assertions, and can also be handy when
debugging.
---
 src/backend/storage/ipc/procarray.c |  2 ++
 src/backend/utils/time/snapmgr.c    | 15 +++++++++++++++
 src/include/utils/snapshot.h        |  1 +
 3 files changed, 18 insertions(+)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 535755614a9..ba5ed8960dd 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2135,6 +2135,7 @@ GetSnapshotDataReuse(MVCCSnapshot snapshot)
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
+	snapshot->valid = true;
 
 	return true;
 }
@@ -2514,6 +2515,7 @@ GetSnapshotData(MVCCSnapshot snapshot)
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
+	snapshot->valid = true;
 
 	return snapshot;
 }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 78adb6d575a..69ed86b2101 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -447,6 +447,7 @@ InvalidateCatalogSnapshot(void)
 	{
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
+		CatalogSnapshotData.valid = false;
 		SnapshotResetXmin();
 	}
 }
@@ -611,6 +612,7 @@ CopyMVCCSnapshot(MVCCSnapshot snapshot)
 	newsnap->regd_count = 0;
 	newsnap->active_count = 0;
 	newsnap->copied = true;
+	newsnap->valid = true;
 	newsnap->snapXactCompletionCount = 0;
 
 	/* setup XID array */
@@ -652,6 +654,7 @@ FreeMVCCSnapshot(MVCCSnapshot snapshot)
 	Assert(snapshot->regd_count == 0);
 	Assert(snapshot->active_count == 0);
 	Assert(snapshot->copied);
+	Assert(snapshot->valid);
 
 	pfree(snapshot);
 }
@@ -688,6 +691,7 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 
 	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
 	origsnap = &snapshot->mvcc;
+	Assert(origsnap->valid);
 
 	Assert(ActiveSnapshot == NULL || snap_level >= ActiveSnapshot->as_level);
 
@@ -847,6 +851,7 @@ RegisterSnapshotOnOwner(Snapshot orig_snapshot, ResourceOwner owner)
 
 	Assert(orig_snapshot->snapshot_type == SNAPSHOT_MVCC);
 	snapshot = &orig_snapshot->mvcc;
+	Assert(snapshot->valid);
 
 	/* Static snapshot?  Create a persistent copy */
 	snapshot = snapshot->copied ? snapshot : CopyMVCCSnapshot(snapshot);
@@ -968,6 +973,15 @@ SnapshotResetXmin(void)
 {
 	MVCCSnapshot minSnapshot;
 
+	/*
+	 * These static snapshots are not in the RegisteredSnapshots list, so we
+	 * might advance MyProc->xmin past their xmin. (Note that in case of
+	 * IsolationUsesXactSnapshot() == true, CurrentSnapshot points to the copy
+	 * in FirstXactSnapshot rather than CurrentSnapshotData.)
+	 */
+	CurrentSnapshotData.valid = false;
+	SecondarySnapshotData.valid = false;
+
 	if (ActiveSnapshot != NULL)
 		return;
 
@@ -1871,6 +1885,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->regd_count = 0;
 	snapshot->active_count = 0;
 	snapshot->copied = true;
+	snapshot->valid = true;
 
 	return snapshot;
 }
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index bca0ad16e68..1697c6df856 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -161,6 +161,7 @@ typedef struct MVCCSnapshotData
 
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 	bool		copied;			/* false if it's a static snapshot */
+	bool		valid;			/* is this snapshot valid? */
 
 	CommandId	curcid;			/* in my xact, CID < curcid are visible */
 
-- 
2.39.5
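
As a rough illustration of what the 'valid' flag buys, the sketch below models one statically allocated snapshot that is marked valid when it is filled in and invalid when the backend's xmin may move past it, with users asserting validity before reading it. This is a simplified model rather than the snapmgr.c code. The sketch_* names are hypothetical, and the real functions do much more than set a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a static MVCC snapshot. */
typedef struct SketchStaticSnapshot
{
	uint32_t	xmin;
	bool		valid;			/* is this snapshot valid? */
} SketchStaticSnapshot;

/* Static storage is zero-initialized, so the snapshot starts out invalid. */
static SketchStaticSnapshot sketch_current_snapshot;

/* Fill in the static snapshot and mark it valid (cf. GetSnapshotData). */
static SketchStaticSnapshot *
sketch_get_snapshot(uint32_t xmin)
{
	sketch_current_snapshot.xmin = xmin;
	sketch_current_snapshot.valid = true;
	return &sketch_current_snapshot;
}

/* The advertised xmin may advance now, so the static copy is stale. */
static void
sketch_reset_xmin(void)
{
	sketch_current_snapshot.valid = false;
}

/* Any consumer can cheaply check that it was not handed a stale snapshot. */
static uint32_t
sketch_use_snapshot(const SketchStaticSnapshot *snap)
{
	assert(snap->valid);
	return snap->xmin;
}
```

In an assert-enabled build, using the snapshot after sketch_reset_xmin() trips the assertion, which is the kind of lifetime bug the flag is meant to surface.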

Attachment: v7-0004-Replace-static-snapshot-pointers-with-the-valid-f.patch (application/octet-stream)

From 8cc814dc2e9fef8feda7cca9a0f2591c371b8ece Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 31 Mar 2025 21:44:43 +0300
Subject: [PATCH v6 04/12] Replace static snapshot pointers with the 'valid'
 flags

Previously, we used pointers like SecondarySnapshot and
CatalogSnapshot to indicate whether the corresponding static snapshot
is valid or not, but now that we have an explicit flag in
MVCCSnapshotData for that, we replace checks like "SecondarySnapshot
!= NULL" with "SecondarySnapshotData.valid", and get rid of the
separate pointer variables.

The situation with CurrentSnapshot was a bit more
complicated. Usually, it pointed to CurrentSnapshotData, but could
also point to the palloc'd FirstXactSnapshot. This gets rid of the
palloc'd FirstXactSnapshot, and instead we just refrain from modifying
CurrentSnapshotData when in a serializable transaction.
---
 src/backend/utils/time/snapmgr.c | 147 +++++++++++++++----------------
 1 file changed, 70 insertions(+), 77 deletions(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 69ed86b2101..ea1e7d17b04 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -67,8 +67,8 @@
  * In addition to snapshots pushed to the active snapshot stack, a snapshot
  * can be registered with a resource owner.
  *
- * The FirstXactSnapshot, if any, is treated a bit specially: we increment its
- * regd_count and list it in RegisteredSnapshots, but this reference is not
+ * If FirstXactSnapshotRegistered is set, we increment the static
+ * CurrentSnapshotData's regd_count and list it in RegisteredSnapshots, but this reference is not
  * tracked by a resource owner. We used to use the TopTransactionResourceOwner
  * to track this snapshot reference, but that introduces logical circularity
  * and thus makes it impossible to clean up in a sane fashion.  It's better to
@@ -145,9 +145,6 @@ SnapshotData SnapshotAnyData = {SNAPSHOT_ANY};
 SnapshotData SnapshotToastData = {SNAPSHOT_TOAST};
 
 /* Pointers to valid snapshots */
-static MVCCSnapshot CurrentSnapshot = NULL;
-static MVCCSnapshot SecondarySnapshot = NULL;
-static MVCCSnapshot CatalogSnapshot = NULL;
 static HistoricMVCCSnapshot HistoricSnapshot = NULL;
 
 /*
@@ -196,7 +193,7 @@ bool		FirstSnapshotSet = false;
  * FirstSnapshotSet in combination with IsolationUsesXactSnapshot(), because
  * GUC may be reset before us, changing the value of IsolationUsesXactSnapshot.
  */
-static MVCCSnapshot FirstXactSnapshot = NULL;
+static bool FirstXactSnapshotRegistered = false;
 
 /* Define pathname of exported-snapshot files */
 #define SNAPSHOT_EXPORT_DIR "pg_snapshots"
@@ -288,7 +285,7 @@ GetTransactionSnapshot(void)
 		InvalidateCatalogSnapshot();
 
 		Assert(pairingheap_is_empty(&RegisteredSnapshots));
-		Assert(FirstXactSnapshot == NULL);
+		Assert(!FirstXactSnapshotRegistered);
 
 		if (IsInParallelMode())
 			elog(ERROR,
@@ -296,42 +293,44 @@ GetTransactionSnapshot(void)
 
 		/*
 		 * In transaction-snapshot mode, the first snapshot must live until
-		 * end of xact regardless of what the caller does with it, so we must
-		 * make a copy of it rather than returning CurrentSnapshotData
-		 * directly.  Furthermore, if we're running in serializable mode,
-		 * predicate.c needs to wrap the snapshot fetch in its own processing.
+		 * end of xact regardless of what the caller does with it, so we keep
+		 * it in RegisteredSnapshots even though it's not tracked by any
+		 * resource owner.  Furthermore, if we're running in serializable
+		 * mode, predicate.c needs to wrap the snapshot fetch in its own
+		 * processing.
 		 */
 		if (IsolationUsesXactSnapshot())
 		{
 			/* First, create the snapshot in CurrentSnapshotData */
 			if (IsolationIsSerializable())
-				CurrentSnapshot = GetSerializableTransactionSnapshot(&CurrentSnapshotData);
+				GetSerializableTransactionSnapshot(&CurrentSnapshotData);
 			else
-				CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
-
-			/* Make a saved copy */
-			CurrentSnapshot = CopyMVCCSnapshot(CurrentSnapshot);
-			FirstXactSnapshot = CurrentSnapshot;
-			/* Mark it as "registered" in FirstXactSnapshot */
-			FirstXactSnapshot->regd_count++;
-			pairingheap_add(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
+				GetSnapshotData(&CurrentSnapshotData);
+
+			/* Mark it as "registered" */
+			CurrentSnapshotData.regd_count++;
+			FirstXactSnapshotRegistered = true;
+			pairingheap_add(&RegisteredSnapshots, &CurrentSnapshotData.ph_node);
 		}
 		else
-			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
+			GetSnapshotData(&CurrentSnapshotData);
 
 		FirstSnapshotSet = true;
-		return (Snapshot) CurrentSnapshot;
+		return (Snapshot) &CurrentSnapshotData;
 	}
 
 	if (IsolationUsesXactSnapshot())
-		return (Snapshot) CurrentSnapshot;
+	{
+		Assert(CurrentSnapshotData.valid);
+		return (Snapshot) &CurrentSnapshotData;
+	}
 
 	/* Don't allow catalog snapshot to be older than xact snapshot. */
 	InvalidateCatalogSnapshot();
 
-	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
+	GetSnapshotData(&CurrentSnapshotData);
 
-	return (Snapshot) CurrentSnapshot;
+	return (Snapshot) &CurrentSnapshotData;
 }
 
 /*
@@ -360,9 +359,9 @@ GetLatestSnapshot(void)
 	if (!FirstSnapshotSet)
 		return GetTransactionSnapshot();
 
-	SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData);
+	GetSnapshotData(&SecondarySnapshotData);
 
-	return (Snapshot) SecondarySnapshot;
+	return (Snapshot) &SecondarySnapshotData;
 }
 
 /*
@@ -402,15 +401,15 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 	 * scan a relation for which neither catcache nor snapshot invalidations
 	 * are sent, we must refresh the snapshot every time.
 	 */
-	if (CatalogSnapshot &&
+	if (CatalogSnapshotData.valid &&
 		!RelationInvalidatesSnapshotsOnly(relid) &&
 		!RelationHasSysCache(relid))
 		InvalidateCatalogSnapshot();
 
-	if (CatalogSnapshot == NULL)
+	if (!CatalogSnapshotData.valid)
 	{
 		/* Get new snapshot. */
-		CatalogSnapshot = GetSnapshotData(&CatalogSnapshotData);
+		GetSnapshotData(&CatalogSnapshotData);
 
 		/*
 		 * Make sure the catalog snapshot will be accounted for in decisions
@@ -424,10 +423,10 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 		 * NB: it had better be impossible for this to throw error, since the
 		 * CatalogSnapshot pointer is already valid.
 		 */
-		pairingheap_add(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
+		pairingheap_add(&RegisteredSnapshots, &CatalogSnapshotData.ph_node);
 	}
 
-	return (Snapshot) CatalogSnapshot;
+	return (Snapshot) &CatalogSnapshotData;
 }
 
 /*
@@ -443,10 +442,9 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 void
 InvalidateCatalogSnapshot(void)
 {
-	if (CatalogSnapshot)
+	if (CatalogSnapshotData.valid)
 	{
-		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
-		CatalogSnapshot = NULL;
+		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshotData.ph_node);
 		CatalogSnapshotData.valid = false;
 		SnapshotResetXmin();
 	}
@@ -465,7 +463,7 @@ InvalidateCatalogSnapshot(void)
 void
 InvalidateCatalogSnapshotConditionally(void)
 {
-	if (CatalogSnapshot &&
+	if (CatalogSnapshotData.valid &&
 		ActiveSnapshot == NULL &&
 		pairingheap_is_singular(&RegisteredSnapshots))
 		InvalidateCatalogSnapshot();
@@ -481,10 +479,10 @@ SnapshotSetCommandId(CommandId curcid)
 	if (!FirstSnapshotSet)
 		return;
 
-	if (CurrentSnapshot)
-		CurrentSnapshot->curcid = curcid;
-	if (SecondarySnapshot)
-		SecondarySnapshot->curcid = curcid;
+	if (CurrentSnapshotData.valid)
+		CurrentSnapshotData.curcid = curcid;
+	if (SecondarySnapshotData.valid)
+		SecondarySnapshotData.curcid = curcid;
 	/* Should we do the same with CatalogSnapshot? */
 }
 
@@ -507,7 +505,7 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 	InvalidateCatalogSnapshot();
 
 	Assert(pairingheap_is_empty(&RegisteredSnapshots));
-	Assert(FirstXactSnapshot == NULL);
+	Assert(!FirstXactSnapshotRegistered);
 	Assert(!HistoricSnapshotActive());
 
 	/*
@@ -516,28 +514,28 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 	 * CurrentSnapshotData's XID arrays have been allocated, and (2) to update
 	 * the state for GlobalVis*.
 	 */
-	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
+	GetSnapshotData(&CurrentSnapshotData);
 
 	/*
 	 * Now copy appropriate fields from the source snapshot.
 	 */
-	CurrentSnapshot->xmin = sourcesnap->xmin;
-	CurrentSnapshot->xmax = sourcesnap->xmax;
-	CurrentSnapshot->xcnt = sourcesnap->xcnt;
+	CurrentSnapshotData.xmin = sourcesnap->xmin;
+	CurrentSnapshotData.xmax = sourcesnap->xmax;
+	CurrentSnapshotData.xcnt = sourcesnap->xcnt;
 	Assert(sourcesnap->xcnt <= GetMaxSnapshotXidCount());
 	if (sourcesnap->xcnt > 0)
-		memcpy(CurrentSnapshot->xip, sourcesnap->xip,
+		memcpy(CurrentSnapshotData.xip, sourcesnap->xip,
 			   sourcesnap->xcnt * sizeof(TransactionId));
-	CurrentSnapshot->subxcnt = sourcesnap->subxcnt;
+	CurrentSnapshotData.subxcnt = sourcesnap->subxcnt;
 	Assert(sourcesnap->subxcnt <= GetMaxSnapshotSubxidCount());
 	if (sourcesnap->subxcnt > 0)
-		memcpy(CurrentSnapshot->subxip, sourcesnap->subxip,
+		memcpy(CurrentSnapshotData.subxip, sourcesnap->subxip,
 			   sourcesnap->subxcnt * sizeof(TransactionId));
-	CurrentSnapshot->suboverflowed = sourcesnap->suboverflowed;
-	CurrentSnapshot->takenDuringRecovery = sourcesnap->takenDuringRecovery;
+	CurrentSnapshotData.suboverflowed = sourcesnap->suboverflowed;
+	CurrentSnapshotData.takenDuringRecovery = sourcesnap->takenDuringRecovery;
 	/* NB: curcid should NOT be copied, it's a local matter */
 
-	CurrentSnapshot->snapXactCompletionCount = 0;
+	CurrentSnapshotData.snapXactCompletionCount = 0;
 
 	/*
 	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
@@ -552,13 +550,13 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 	 */
 	if (sourceproc != NULL)
 	{
-		if (!ProcArrayInstallRestoredXmin(CurrentSnapshot->xmin, sourceproc))
+		if (!ProcArrayInstallRestoredXmin(CurrentSnapshotData.xmin, sourceproc))
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("could not import the requested snapshot"),
 					 errdetail("The source transaction is not running anymore.")));
 	}
-	else if (!ProcArrayInstallImportedXmin(CurrentSnapshot->xmin, sourcevxid))
+	else if (!ProcArrayInstallImportedXmin(CurrentSnapshotData.xmin, sourcevxid))
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("could not import the requested snapshot"),
@@ -567,20 +565,19 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 
 	/*
 	 * In transaction-snapshot mode, the first snapshot must live until end of
-	 * xact, so we must make a copy of it.  Furthermore, if we're running in
-	 * serializable mode, predicate.c needs to do its own processing.
+	 * xact, so we include it in RegisteredSnapshots.  Furthermore, if we're
+	 * running in serializable mode, predicate.c needs to do its own
+	 * processing.
 	 */
 	if (IsolationUsesXactSnapshot())
 	{
 		if (IsolationIsSerializable())
-			SetSerializableTransactionSnapshot(CurrentSnapshot, sourcevxid,
+			SetSerializableTransactionSnapshot(&CurrentSnapshotData, sourcevxid,
 											   sourcepid);
-		/* Make a saved copy */
-		CurrentSnapshot = CopyMVCCSnapshot(CurrentSnapshot);
-		FirstXactSnapshot = CurrentSnapshot;
-		/* Mark it as "registered" in FirstXactSnapshot */
-		FirstXactSnapshot->regd_count++;
-		pairingheap_add(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
+		/* Mark it as "registered" */
+		FirstXactSnapshotRegistered = true;
+		CurrentSnapshotData.regd_count++;
+		pairingheap_add(&RegisteredSnapshots, &CurrentSnapshotData.ph_node);
 	}
 
 	FirstSnapshotSet = true;
@@ -701,8 +698,7 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 	 * Checking SecondarySnapshot is probably useless here, but it seems
 	 * better to be sure.
 	 */
-	if (origsnap == CurrentSnapshot || origsnap == SecondarySnapshot ||
-		!origsnap->copied)
+	if (!origsnap->copied)
 		newactive->as_snap = CopyMVCCSnapshot(origsnap);
 	else
 		newactive->as_snap = origsnap;
@@ -974,12 +970,10 @@ SnapshotResetXmin(void)
 	MVCCSnapshot minSnapshot;
 
 	/*
-	 * These static snapshots are not in the RegisteredSnapshots list, so we
-	 * might advance MyProc->xmin past their xmin. (Note that in case of
-	 * IsolationUsesXactSnapshot() == true, CurrentSnapshot points to the copy
-	 * in FirstSnapshot rather than CurrentSnapshotData.)
+	 * Invalidate these static snapshots so that we can advance xmin.
 	 */
-	CurrentSnapshotData.valid = false;
+	if (!FirstXactSnapshotRegistered)
+		CurrentSnapshotData.valid = false;
 	SecondarySnapshotData.valid = false;
 
 	if (ActiveSnapshot != NULL)
@@ -1068,13 +1062,13 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	 * stacked as active, we don't want the code below to be chasing through a
 	 * dangling pointer.
 	 */
-	if (FirstXactSnapshot != NULL)
+	if (FirstXactSnapshotRegistered)
 	{
-		Assert(FirstXactSnapshot->regd_count > 0);
+		Assert(CurrentSnapshotData.regd_count > 0);
 		Assert(!pairingheap_is_empty(&RegisteredSnapshots));
-		pairingheap_remove(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
+		pairingheap_remove(&RegisteredSnapshots, &CurrentSnapshotData.ph_node);
+		FirstXactSnapshotRegistered = false;
 	}
-	FirstXactSnapshot = NULL;
 
 	/*
 	 * If we exported any snapshots, clean them up.
@@ -1132,9 +1126,8 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	ActiveSnapshot = NULL;
 	pairingheap_reset(&RegisteredSnapshots);
 
-	CurrentSnapshot = NULL;
-	SecondarySnapshot = NULL;
-
+	CurrentSnapshotData.valid = false;
+	SecondarySnapshotData.valid = false;
 	FirstSnapshotSet = false;
 
 	/*
@@ -1695,7 +1688,7 @@ HaveRegisteredOrActiveSnapshot(void)
 	 * removed at any time due to invalidation processing. If explicitly
 	 * registered more than one snapshot has to be in RegisteredSnapshots.
 	 */
-	if (CatalogSnapshot != NULL &&
+	if (CatalogSnapshotData.valid &&
 		pairingheap_is_singular(&RegisteredSnapshots))
 		return false;
 
-- 
2.39.5

v7-0005-Make-RestoreSnapshot-register-the-snapshot-with-c.patch (application/octet-stream)

From 34b92db816f87fb06d8eff3c07e60c81b322e44d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 31 Mar 2025 19:54:39 +0300
Subject: [PATCH v6 05/12] Make RestoreSnapshot register the snapshot with
 current resowner

Callers therefore no longer call RegisterSnapshot() on the restored
snapshot. This simplifies the next commit.
---
 src/backend/access/index/indexam.c    | 1 -
 src/backend/access/table/tableam.c    | 1 -
 src/backend/access/transam/parallel.c | 4 ++++
 src/backend/utils/time/snapmgr.c      | 8 +++++++-
 4 files changed, 11 insertions(+), 3 deletions(-)
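
A minimal sketch of the calling convention this patch moves to, using a toy reference-counted snapshot rather than the real snapmgr.c structures. The ToySnap type, the toy_* functions and the toy_live_snapshots counter are all hypothetical names made up for illustration. The assumption modelled is that a restored snapshot comes back already holding one registration, so a caller that only wants it on the active stack pushes it and then drops that registration, as ParallelWorkerMain does in the diff that follows.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a snapshot; NOT the PostgreSQL MVCCSnapshotData. */
typedef struct ToySnap
{
    int         regd_count;     /* registrations held on the snapshot */
    int         active_count;   /* references from the active stack */
} ToySnap;

/* Number of toy snapshots that are currently allocated. */
static int  toy_live_snapshots = 0;

/* Free the snapshot once nothing references it any more. */
static void
toy_release_if_unused(ToySnap *snap)
{
    if (snap->regd_count == 0 && snap->active_count == 0)
    {
        free(snap);
        toy_live_snapshots--;
    }
}

/* Models the new behaviour: the result is registered on return. */
static ToySnap *
toy_restore(void)
{
    ToySnap    *snap = calloc(1, sizeof(ToySnap));

    if (snap == NULL)
        abort();
    snap->regd_count = 1;
    toy_live_snapshots++;
    return snap;
}

/* Push onto the (implicit) active stack. */
static void
toy_push_active(ToySnap *snap)
{
    snap->active_count++;
}

/* Pop from the active stack; may free the snapshot. */
static void
toy_pop_active(ToySnap *snap)
{
    assert(snap->active_count > 0);
    snap->active_count--;
    toy_release_if_unused(snap);
}

/* Drop one registration; may free the snapshot. */
static void
toy_unregister(ToySnap *snap)
{
    assert(snap->regd_count > 0);
    snap->regd_count--;
    toy_release_if_unused(snap);
}
```

In this model, calling toy_restore(), then toy_push_active(), then toy_unregister() leaves the snapshot alive with only the active-stack reference, and a later toy_pop_active() frees it. Forgetting the toy_unregister() step would leave a registration behind, which is presumably why the patch adds the UnregisterSnapshot() calls to ParallelWorkerMain.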

diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 769170a37d5..8f0ae02221c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -592,7 +592,6 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel,
 	Assert(RelFileLocatorEquals(indexrel->rd_locator, pscan->ps_indexlocator));
 
 	snapshot = (Snapshot) RestoreSnapshot(pscan->ps_snapshot_data);
-	snapshot = RegisterSnapshot(snapshot);
 	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
 									pscan, true);
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4eb81e40d99..fc823cf84e5 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -175,7 +175,6 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = (Snapshot) RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
-		snapshot = RegisterSnapshot(snapshot);
 		flags |= SO_TEMP_SNAPSHOT;
 	}
 	else
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8046e14abf7..e13ea57efff 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -1499,6 +1499,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	UnregisterSnapshot(asnapshot);
+	if (tsnapshot != asnapshot)
+		UnregisterSnapshot(tsnapshot);
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea1e7d17b04..ef579128d3f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1823,7 +1823,7 @@ SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
  *		Restore a serialized snapshot from the specified address.
  *
  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
- * to 0.  The returned snapshot has the copied flag set.
+ * to 0.  The returned snapshot is registered with the current resource owner.
  */
 MVCCSnapshot
 RestoreSnapshot(char *start_address)
@@ -1880,6 +1880,12 @@ RestoreSnapshot(char *start_address)
 	snapshot->copied = true;
 	snapshot->valid = true;
 
+	/* and tell resowner.c about it, just like RegisterSnapshot() */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	snapshot->regd_count++;
+	ResourceOwnerRememberSnapshot(CurrentResourceOwner, (Snapshot) snapshot);
+	pairingheap_add(&RegisteredSnapshots, &snapshot->ph_node);
+
 	return snapshot;
 }
 
-- 
2.39.5

v7-0006-Replace-the-RegisteredSnapshot-pairing-heap-with-.patch (application/octet-stream)

From db70117e68b6f745c5ab9289e263aede7a068ac7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 31 Mar 2025 23:50:55 +0300
Subject: [PATCH v6 06/12] Replace the RegisteredSnapshot pairing heap with a
 linked list

Previously, we kept all the snapshots in a pairing heap, so that we
could cheaply find the snapshot with the smallest xmin. However, we
can easily use a doubly-linked list instead, which is a little
simpler. A newly acquired snapshot's xmin is always greater than or
equal to that of any previous snapshot, so we can simply push new
snapshots to the tail of the list, and the oldest xmin is always at
the head.

Previously, we would only push a snapshot to the heap when it was
registered or pushed to the active stack, not immediately when
GetSnapshotData() was called. Because of that, snapshots were
sometimes added to the heap out of order. But if we update the list
earlier, after each GetSnapshotData() call, it stays in order. That
means that the list now contains *all* valid snapshots, including the
snapshots that are in the active stack, and the static CurrentSnapshot
and SecondarySnapshot, whenever they are valid. (CatalogSnapshot was
already tracked by the heap.)
---
 src/backend/utils/time/snapmgr.c    | 279 +++++++++++++++++-----------
 src/include/access/spgist_private.h |   1 +
 src/include/utils/snapshot.h        |   6 +-
 3 files changed, 175 insertions(+), 111 deletions(-)
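
The ordering argument in the commit message can be illustrated with a small standalone model. This is only a sketch under stated assumptions: ToyXid, ToySnapshot, ToySnapshotList and the toy_* functions are hypothetical and are not the dlist API or the snapmgr.c code. XIDs are compared as plain integers, so wraparound is ignored, and 0 stands in for an invalid XID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins; NOT the PostgreSQL definitions. */
typedef uint32_t ToyXid;

typedef struct ToySnapshot
{
    ToyXid      xmin;
    struct ToySnapshot *prev;
    struct ToySnapshot *next;
} ToySnapshot;

typedef struct ToySnapshotList
{
    ToySnapshot *head;          /* snapshot with the oldest xmin */
    ToySnapshot *tail;          /* most recently taken snapshot */
} ToySnapshotList;

/*
 * Append a freshly taken snapshot.  Its xmin must not precede the xmin
 * of the current tail, which keeps the list sorted without any search.
 */
static void
toy_push_tail(ToySnapshotList *list, ToySnapshot *snap)
{
    assert(list->tail == NULL || list->tail->xmin <= snap->xmin);

    snap->prev = list->tail;
    snap->next = NULL;
    if (list->tail != NULL)
        list->tail->next = snap;
    else
        list->head = snap;
    list->tail = snap;
}

/* Unlink a snapshot from anywhere in the list in constant time. */
static void
toy_delete(ToySnapshotList *list, ToySnapshot *snap)
{
    if (snap->prev != NULL)
        snap->prev->next = snap->next;
    else
        list->head = snap->next;

    if (snap->next != NULL)
        snap->next->prev = snap->prev;
    else
        list->tail = snap->prev;

    snap->prev = NULL;
    snap->next = NULL;
}

/* Oldest xmin still needed, or 0 when no snapshot is valid. */
static ToyXid
toy_oldest_xmin(const ToySnapshotList *list)
{
    return list->head != NULL ? list->head->xmin : 0;
}
```

Because new snapshots are only ever appended, finding the oldest xmin is a read of the head, and removing any snapshot is a constant-time unlink. The price is that every insertion path has to preserve the order, which is presumably why the patch has a separate helper for restored snapshots that may arrive out of order.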

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ef579128d3f..1c39cc11609 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -67,32 +67,22 @@
  * In addition to snapshots pushed to the active snapshot stack, a snapshot
  * can be registered with a resource owner.
  *
- * If FirstXactSnapshotRegistered is set, we increment the static
- * CurrentSnapshotData's regd_count and list it in RegisteredSnapshots, but this reference is not
- * tracked by a resource owner. We used to use the TopTransactionResourceOwner
- * to track this snapshot reference, but that introduces logical circularity
- * and thus makes it impossible to clean up in a sane fashion.  It's better to
- * handle this reference as an internally-tracked registration, so that this
- * module is entirely lower-level than ResourceOwners.
+ * Xmin tracking
+ * -------------
  *
- * Likewise, any snapshots that have been exported by pg_export_snapshot
- * have regd_count = 1 and are listed in RegisteredSnapshots, but are not
- * tracked by any resource owner.
+ * All valid snapshots, whether they are "static", included in the active stack,
+ * or registered with a resource owner, are tracked in a doubly-linked list,
+ * ValidSnapshots.  Any snapshots that have been exported by
+ * pg_export_snapshot() are also listed there.  (They have regd_count = 1,
+ * even though they are not tracked by any resource owner).
  *
- * Likewise, the CatalogSnapshot is listed in RegisteredSnapshots when it
- * is valid, but is not tracked by any resource owner.
- *
- * The same is true for historic snapshots used during logical decoding,
- * their lifetime is managed separately (as they live longer than one xact.c
- * transaction).
- *
- * These arrangements let us reset MyProc->xmin when there are no snapshots
+ * The list is in xmin order, so that the head always contains the oldest
+ * snapshot.  That lets us reset MyProc->xmin when there are no snapshots
  * referenced by this transaction, and advance it when the one with oldest
- * Xmin is no longer referenced.  For simplicity however, only registered
- * snapshots not active snapshots participate in tracking which one is oldest;
- * we don't try to change MyProc->xmin except when the active-snapshot
- * stack is empty.
+ * Xmin is no longer referenced.
  *
+ * The lifetime of historic snapshots used during logical decoding is managed
+ * separately (as they live longer than one xact.c transaction).
  *
  * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -111,7 +101,6 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "datatype/timestamp.h"
-#include "lib/pairingheap.h"
 #include "miscadmin.h"
 #include "port/pg_lfind.h"
 #include "storage/fd.h"
@@ -177,13 +166,10 @@ typedef struct ActiveSnapshotElt
 static ActiveSnapshotElt *ActiveSnapshot = NULL;
 
 /*
- * Currently registered Snapshots.  Ordered in a heap by xmin, so that we can
+ * Currently valid Snapshots.  Ordered in a list by xmin, so that we can
  * quickly find the one with lowest xmin, to advance our MyProc->xmin.
  */
-static int	xmin_cmp(const pairingheap_node *a, const pairingheap_node *b,
-					 void *arg);
-
-static pairingheap RegisteredSnapshots = {&xmin_cmp, NULL, NULL};
+static dlist_head ValidSnapshots = DLIST_STATIC_INIT(ValidSnapshots);
 
 /* first GetTransactionSnapshot call in a transaction? */
 bool		FirstSnapshotSet = false;
@@ -213,6 +199,8 @@ static MVCCSnapshot CopyMVCCSnapshot(MVCCSnapshot snapshot);
 static void UnregisterSnapshotNoOwner(Snapshot snapshot);
 static void FreeMVCCSnapshot(MVCCSnapshot snapshot);
 static void SnapshotResetXmin(void);
+static void valid_snapshots_push_tail(MVCCSnapshot snapshot);
+static void valid_snapshots_push_out_of_order(MVCCSnapshot snapshot);
 
 /* ResourceOwner callbacks to track snapshot references */
 static void ResOwnerReleaseSnapshot(Datum res);
@@ -284,7 +272,7 @@ GetTransactionSnapshot(void)
 		 */
 		InvalidateCatalogSnapshot();
 
-		Assert(pairingheap_is_empty(&RegisteredSnapshots));
+		Assert(dlist_is_empty(&ValidSnapshots));
 		Assert(!FirstXactSnapshotRegistered);
 
 		if (IsInParallelMode())
@@ -308,12 +296,13 @@ GetTransactionSnapshot(void)
 				GetSnapshotData(&CurrentSnapshotData);
 
 			/* Mark it as "registered" */
-			CurrentSnapshotData.regd_count++;
 			FirstXactSnapshotRegistered = true;
-			pairingheap_add(&RegisteredSnapshots, &CurrentSnapshotData.ph_node);
 		}
 		else
+		{
 			GetSnapshotData(&CurrentSnapshotData);
+		}
+		valid_snapshots_push_tail(&CurrentSnapshotData);
 
 		FirstSnapshotSet = true;
 		return (Snapshot) &CurrentSnapshotData;
@@ -321,6 +310,7 @@ GetTransactionSnapshot(void)
 
 	if (IsolationUsesXactSnapshot())
 	{
+		Assert(FirstXactSnapshotRegistered);
 		Assert(CurrentSnapshotData.valid);
 		return (Snapshot) &CurrentSnapshotData;
 	}
@@ -328,7 +318,10 @@ GetTransactionSnapshot(void)
 	/* Don't allow catalog snapshot to be older than xact snapshot. */
 	InvalidateCatalogSnapshot();
 
+	if (CurrentSnapshotData.valid)
+		dlist_delete(&CurrentSnapshotData.node);
 	GetSnapshotData(&CurrentSnapshotData);
+	valid_snapshots_push_tail(&CurrentSnapshotData);
 
 	return (Snapshot) &CurrentSnapshotData;
 }
@@ -359,7 +352,10 @@ GetLatestSnapshot(void)
 	if (!FirstSnapshotSet)
 		return GetTransactionSnapshot();
 
+	if (SecondarySnapshotData.valid)
+		dlist_delete(&SecondarySnapshotData.node);
 	GetSnapshotData(&SecondarySnapshotData);
+	valid_snapshots_push_tail(&SecondarySnapshotData);
 
 	return (Snapshot) &SecondarySnapshotData;
 }
@@ -423,7 +419,7 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 		 * NB: it had better be impossible for this to throw error, since the
 		 * CatalogSnapshot pointer is already valid.
 		 */
-		pairingheap_add(&RegisteredSnapshots, &CatalogSnapshotData.ph_node);
+		valid_snapshots_push_tail(&CatalogSnapshotData);
 	}
 
 	return (Snapshot) &CatalogSnapshotData;
@@ -444,10 +440,21 @@ InvalidateCatalogSnapshot(void)
 {
 	if (CatalogSnapshotData.valid)
 	{
-		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshotData.ph_node);
+		dlist_delete(&CatalogSnapshotData.node);
 		CatalogSnapshotData.valid = false;
-		SnapshotResetXmin();
 	}
+	if (!FirstXactSnapshotRegistered && CurrentSnapshotData.valid)
+	{
+		dlist_delete(&CurrentSnapshotData.node);
+		CurrentSnapshotData.valid = false;
+	}
+	if (SecondarySnapshotData.valid)
+	{
+		dlist_delete(&SecondarySnapshotData.node);
+		SecondarySnapshotData.valid = false;
+	}
+
+	SnapshotResetXmin();
 }
 
 /*
@@ -464,8 +471,7 @@ void
 InvalidateCatalogSnapshotConditionally(void)
 {
 	if (CatalogSnapshotData.valid &&
-		ActiveSnapshot == NULL &&
-		pairingheap_is_singular(&RegisteredSnapshots))
+		dlist_head_node(&ValidSnapshots) == &CatalogSnapshotData.node)
 		InvalidateCatalogSnapshot();
 }
 
@@ -504,7 +510,6 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 	/* Better do this to ensure following Assert succeeds. */
 	InvalidateCatalogSnapshot();
 
-	Assert(pairingheap_is_empty(&RegisteredSnapshots));
 	Assert(!FirstXactSnapshotRegistered);
 	Assert(!HistoricSnapshotActive());
 
@@ -576,9 +581,8 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 											   sourcepid);
 		/* Mark it as "registered" */
 		FirstXactSnapshotRegistered = true;
-		CurrentSnapshotData.regd_count++;
-		pairingheap_add(&RegisteredSnapshots, &CurrentSnapshotData.ph_node);
 	}
+	valid_snapshots_push_tail(&CurrentSnapshotData);
 
 	FirstSnapshotSet = true;
 }
@@ -699,7 +703,10 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 	 * better to be sure.
 	 */
 	if (!origsnap->copied)
+	{
 		newactive->as_snap = CopyMVCCSnapshot(origsnap);
+		dlist_insert_after(&origsnap->node, &newactive->as_snap->node);
+	}
 	else
 		newactive->as_snap = origsnap;
 
@@ -722,8 +729,13 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 void
 PushCopiedSnapshot(Snapshot snapshot)
 {
+	MVCCSnapshot copy;
+
 	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
-	PushActiveSnapshot((Snapshot) CopyMVCCSnapshot(&snapshot->mvcc));
+
+	copy = CopyMVCCSnapshot(&snapshot->mvcc);
+	dlist_insert_after(&snapshot->mvcc.node, &copy->node);
+	PushActiveSnapshot((Snapshot) copy);
 }
 
 /*
@@ -776,7 +788,10 @@ PopActiveSnapshot(void)
 
 	if (ActiveSnapshot->as_snap->active_count == 0 &&
 		ActiveSnapshot->as_snap->regd_count == 0)
+	{
+		dlist_delete(&ActiveSnapshot->as_snap->node);
 		FreeMVCCSnapshot(ActiveSnapshot->as_snap);
+	}
 
 	pfree(ActiveSnapshot);
 	ActiveSnapshot = newstack;
@@ -850,16 +865,17 @@ RegisterSnapshotOnOwner(Snapshot orig_snapshot, ResourceOwner owner)
 	Assert(snapshot->valid);
 
 	/* Static snapshot?  Create a persistent copy */
-	snapshot = snapshot->copied ? snapshot : CopyMVCCSnapshot(snapshot);
+	if (!snapshot->copied)
+	{
+		snapshot = CopyMVCCSnapshot(snapshot);
+		dlist_insert_after(&orig_snapshot->mvcc.node, &snapshot->node);
+	}
 
 	/* and tell resowner.c about it */
 	ResourceOwnerEnlarge(owner);
 	snapshot->regd_count++;
 	ResourceOwnerRememberSnapshot(owner, (Snapshot) snapshot);
 
-	if (snapshot->regd_count == 1)
-		pairingheap_add(&RegisteredSnapshots, &snapshot->ph_node);
-
 	return (Snapshot) snapshot;
 }
 
@@ -901,14 +917,12 @@ UnregisterSnapshotNoOwner(Snapshot snapshot)
 		MVCCSnapshot mvccsnap = &snapshot->mvcc;
 
 		Assert(mvccsnap->regd_count > 0);
-		Assert(!pairingheap_is_empty(&RegisteredSnapshots));
+		Assert(!dlist_is_empty(&ValidSnapshots));
 
 		mvccsnap->regd_count--;
-		if (mvccsnap->regd_count == 0)
-			pairingheap_remove(&RegisteredSnapshots, &mvccsnap->ph_node);
-
 		if (mvccsnap->regd_count == 0 && mvccsnap->active_count == 0)
 		{
+			dlist_delete(&mvccsnap->node);
 			FreeMVCCSnapshot(mvccsnap);
 			SnapshotResetXmin();
 		}
@@ -933,24 +947,6 @@ UnregisterSnapshotNoOwner(Snapshot snapshot)
 		elog(ERROR, "registered snapshot has unexpected type");
 }
 
-/*
- * Comparison function for RegisteredSnapshots heap.  Snapshots are ordered
- * by xmin, so that the snapshot with smallest xmin is at the top.
- */
-static int
-xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
-{
-	const MVCCSnapshotData *asnap = pairingheap_const_container(MVCCSnapshotData, ph_node, a);
-	const MVCCSnapshotData *bsnap = pairingheap_const_container(MVCCSnapshotData, ph_node, b);
-
-	if (TransactionIdPrecedes(asnap->xmin, bsnap->xmin))
-		return 1;
-	else if (TransactionIdFollows(asnap->xmin, bsnap->xmin))
-		return -1;
-	else
-		return 0;
-}
-
 /*
  * SnapshotResetXmin
  *
@@ -972,21 +968,27 @@ SnapshotResetXmin(void)
 	/*
 	 * Invalidate these static snapshots so that we can advance xmin.
 	 */
-	if (!FirstXactSnapshotRegistered)
+	if (!FirstXactSnapshotRegistered && CurrentSnapshotData.valid)
+	{
+		dlist_delete(&CurrentSnapshotData.node);
 		CurrentSnapshotData.valid = false;
-	SecondarySnapshotData.valid = false;
+	}
+	if (SecondarySnapshotData.valid)
+	{
+		dlist_delete(&SecondarySnapshotData.node);
+		SecondarySnapshotData.valid = false;
+	}
 
 	if (ActiveSnapshot != NULL)
 		return;
 
-	if (pairingheap_is_empty(&RegisteredSnapshots))
+	if (dlist_is_empty(&ValidSnapshots))
 	{
 		MyProc->xmin = TransactionXmin = InvalidTransactionId;
 		return;
 	}
 
-	minSnapshot = pairingheap_container(MVCCSnapshotData, ph_node,
-										pairingheap_first(&RegisteredSnapshots));
+	minSnapshot = dlist_head_element(MVCCSnapshotData, node, &ValidSnapshots);
 
 	if (TransactionIdPrecedes(MyProc->xmin, minSnapshot->xmin))
 		MyProc->xmin = TransactionXmin = minSnapshot->xmin;
@@ -1035,7 +1037,10 @@ AtSubAbort_Snapshot(int level)
 
 		if (ActiveSnapshot->as_snap->active_count == 0 &&
 			ActiveSnapshot->as_snap->regd_count == 0)
+		{
+			dlist_delete(&ActiveSnapshot->as_snap->node);
 			FreeMVCCSnapshot(ActiveSnapshot->as_snap);
+		}
 
 		/* and free the stack element */
 		pfree(ActiveSnapshot);
@@ -1053,23 +1058,6 @@ AtSubAbort_Snapshot(int level)
 void
 AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 {
-	/*
-	 * In transaction-snapshot mode we must release our privately-managed
-	 * reference to the transaction snapshot.  We must remove it from
-	 * RegisteredSnapshots to keep the check below happy.  But we don't bother
-	 * to do FreeMVCCSnapshot, for two reasons: the memory will go away with
-	 * TopTransactionContext anyway, and if someone has left the snapshot
-	 * stacked as active, we don't want the code below to be chasing through a
-	 * dangling pointer.
-	 */
-	if (FirstXactSnapshotRegistered)
-	{
-		Assert(CurrentSnapshotData.regd_count > 0);
-		Assert(!pairingheap_is_empty(&RegisteredSnapshots));
-		pairingheap_remove(&RegisteredSnapshots, &CurrentSnapshotData.ph_node);
-		FirstXactSnapshotRegistered = false;
-	}
-
 	/*
 	 * If we exported any snapshots, clean them up.
 	 */
@@ -1082,8 +1070,8 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 		 * it's too late to abort the transaction, and (2) leaving a leaked
 		 * file around has little real consequence anyway.
 		 *
-		 * We also need to remove the snapshots from RegisteredSnapshots to
-		 * prevent a warning below.
+		 * We also need to remove the snapshots from ValidSnapshots to prevent
+		 * a warning below.
 		 *
 		 * As with the FirstXactSnapshot, we don't need to free resources of
 		 * the snapshot itself as it will go away with the memory context.
@@ -1096,22 +1084,35 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 				elog(WARNING, "could not unlink file \"%s\": %m",
 					 esnap->snapfile);
 
-			pairingheap_remove(&RegisteredSnapshots,
-							   &esnap->snapshot->ph_node);
+			dlist_delete(&esnap->snapshot->node);
 		}
 
 		exportedSnapshots = NIL;
 	}
 
-	/* Drop catalog snapshot if any */
-	InvalidateCatalogSnapshot();
+	/* Drop all static snapshots */
+	if (CatalogSnapshotData.valid)
+	{
+		dlist_delete(&CatalogSnapshotData.node);
+		CatalogSnapshotData.valid = false;
+	}
+	if (CurrentSnapshotData.valid)
+	{
+		dlist_delete(&CurrentSnapshotData.node);
+		CurrentSnapshotData.valid = false;
+	}
+	if (SecondarySnapshotData.valid)
+	{
+		dlist_delete(&SecondarySnapshotData.node);
+		SecondarySnapshotData.valid = false;
+	}
 
 	/* On commit, complain about leftover snapshots */
 	if (isCommit)
 	{
 		ActiveSnapshotElt *active;
 
-		if (!pairingheap_is_empty(&RegisteredSnapshots))
+		if (!dlist_is_empty(&ValidSnapshots))
 			elog(WARNING, "registered snapshots seem to remain after cleanup");
 
 		/* complain about unpopped active snapshots */
@@ -1124,11 +1125,12 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	 * it'll go away with TopTransactionContext.
 	 */
 	ActiveSnapshot = NULL;
-	pairingheap_reset(&RegisteredSnapshots);
+	dlist_init(&ValidSnapshots);
 
 	CurrentSnapshotData.valid = false;
 	SecondarySnapshotData.valid = false;
 	FirstSnapshotSet = false;
+	FirstXactSnapshotRegistered = false;
 
 	/*
 	 * During normal commit processing, we call ProcArrayEndTransaction() to
@@ -1151,6 +1153,7 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 char *
 ExportSnapshot(MVCCSnapshot snapshot)
 {
+	MVCCSnapshot orig_snapshot;
 	TransactionId topXid;
 	TransactionId *children;
 	ExportedSnapshot *esnap;
@@ -1213,7 +1216,8 @@ ExportSnapshot(MVCCSnapshot snapshot)
 	 * ensure that the snapshot's xmin is honored for the rest of the
 	 * transaction.
 	 */
-	snapshot = CopyMVCCSnapshot(snapshot);
+	orig_snapshot = snapshot;
+	snapshot = CopyMVCCSnapshot(orig_snapshot);
 
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 	esnap = (ExportedSnapshot *) palloc(sizeof(ExportedSnapshot));
@@ -1223,7 +1227,7 @@ ExportSnapshot(MVCCSnapshot snapshot)
 	MemoryContextSwitchTo(oldcxt);
 
 	snapshot->regd_count++;
-	pairingheap_add(&RegisteredSnapshots, &snapshot->ph_node);
+	dlist_insert_after(&orig_snapshot->node, &snapshot->node);
 
 	/*
 	 * Fill buf with a text serialization of the snapshot, plus identification
@@ -1653,7 +1657,7 @@ DeleteAllExportedSnapshotFiles(void)
 
 /*
  * ThereAreNoPriorRegisteredSnapshots
- *		Is the registered snapshot count less than or equal to one?
+ *		Are there no snapshots other than the current active snapshot?
  *
  * Don't use this to settle important decisions.  While zero registrations and
  * no ActiveSnapshot would confirm a certain idleness, the system makes no
@@ -1662,11 +1666,25 @@ DeleteAllExportedSnapshotFiles(void)
 bool
 ThereAreNoPriorRegisteredSnapshots(void)
 {
-	if (pairingheap_is_empty(&RegisteredSnapshots) ||
-		pairingheap_is_singular(&RegisteredSnapshots))
-		return true;
+	dlist_iter	iter;
 
-	return false;
+	dlist_foreach(iter, &ValidSnapshots)
+	{
+		MVCCSnapshot cur = dlist_container(MVCCSnapshotData, node, iter.cur);
+
+		if (FirstXactSnapshotRegistered)
+		{
+			Assert(CurrentSnapshotData.valid);
+			if (cur == &CurrentSnapshotData)
+				continue;
+		}
+		if (ActiveSnapshot && cur == ActiveSnapshot->as_snap)
+			continue;
+
+		return false;
+	}
+
+	return true;
 }
 
 /*
@@ -1684,15 +1702,18 @@ HaveRegisteredOrActiveSnapshot(void)
 		return true;
 
 	/*
-	 * The catalog snapshot is in RegisteredSnapshots when valid, but can be
+	 * The catalog snapshot is in ValidSnapshots when valid, but can be
 	 * removed at any time due to invalidation processing. If explicitly
-	 * registered more than one snapshot has to be in RegisteredSnapshots.
+	 * registered more than one snapshot has to be in ValidSnapshots.
 	 */
 	if (CatalogSnapshotData.valid &&
-		pairingheap_is_singular(&RegisteredSnapshots))
+		dlist_head_node(&ValidSnapshots) == &CatalogSnapshotData.node &&
+		dlist_tail_node(&ValidSnapshots) == &CatalogSnapshotData.node)
+	{
 		return false;
+	}
 
-	return !pairingheap_is_empty(&RegisteredSnapshots);
+	return !dlist_is_empty(&ValidSnapshots);
 }
 
 
@@ -1884,7 +1905,7 @@ RestoreSnapshot(char *start_address)
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 	snapshot->regd_count++;
 	ResourceOwnerRememberSnapshot(CurrentResourceOwner, (Snapshot) snapshot);
-	pairingheap_add(&RegisteredSnapshots, &snapshot->ph_node);
+	valid_snapshots_push_out_of_order(snapshot);
 
 	return snapshot;
 }
@@ -2015,3 +2036,45 @@ ResOwnerReleaseSnapshot(Datum res)
 {
 	UnregisterSnapshotNoOwner((Snapshot) DatumGetPointer(res));
 }
+
+
+/* Helper functions to manipulate the ValidSnapshots list */
+
+/* dlist_push_tail, with assertion that the list stays ordered by xmin */
+static void
+valid_snapshots_push_tail(MVCCSnapshot snapshot)
+{
+#ifdef USE_ASSERT_CHECKING
+	if (!dlist_is_empty(&ValidSnapshots))
+	{
+		MVCCSnapshot tail = dlist_tail_element(MVCCSnapshotData, node, &ValidSnapshots);
+
+		Assert(TransactionIdFollowsOrEquals(snapshot->xmin, tail->xmin));
+	}
+#endif
+	dlist_push_tail(&ValidSnapshots, &snapshot->node);
+}
+
+/*
+ * Add an entry to the right position in the list, keeping it ordered by xmin.
+ *
+ * This is O(n), but that's OK because it's only used in rare occasions, when
+ * the list is small.
+ */
+static void
+valid_snapshots_push_out_of_order(MVCCSnapshot snapshot)
+{
+	dlist_iter	iter;
+
+	dlist_reverse_foreach(iter, &ValidSnapshots)
+	{
+		MVCCSnapshot cur = dlist_container(MVCCSnapshotData, node, iter.cur);
+
+		if (TransactionIdFollowsOrEquals(snapshot->xmin, cur->xmin))
+		{
+			dlist_insert_after(&cur->node, &snapshot->node);
+			return;
+		}
+	}
+	dlist_push_head(&ValidSnapshots, &snapshot->node);
+}
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index cb43a278f46..27ed1d77c9b 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -17,6 +17,7 @@
 #include "access/itup.h"
 #include "access/spgist.h"
 #include "catalog/pg_am_d.h"
+#include "lib/pairingheap.h"
 #include "nodes/tidbitmap.h"
 #include "storage/buf.h"
 #include "utils/geo_decls.h"
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 1697c6df856..44b3b20f73c 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -13,7 +13,7 @@
 #ifndef SNAPSHOT_H
 #define SNAPSHOT_H
 
-#include "lib/pairingheap.h"
+#include "lib/ilist.h"
 
 
 /*
@@ -169,8 +169,8 @@ typedef struct MVCCSnapshotData
 	 * Book-keeping information, used by the snapshot manager
 	 */
 	uint32		active_count;	/* refcount on ActiveSnapshot stack */
-	uint32		regd_count;		/* refcount on RegisteredSnapshots */
-	pairingheap_node ph_node;	/* link in the RegisteredSnapshots heap */
+	uint32		regd_count;		/* refcount of registrations in resowners */
+	dlist_node	node;			/* link in ValidSnapshots */
 
 	/*
 	 * The transaction completion count at the time GetSnapshotData() built
-- 
2.39.5
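
The helpers in the patch above keep ValidSnapshots ordered by xmin, so that the oldest xmin can be read from the head of the list. As a rough standalone sketch of that idea (it does not use PostgreSQL's ilist.h; SnapNode, SnapList and the function names are hypothetical, and special XIDs are ignored), an ordered insert that walks backwards from the tail could look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;

/*
 * Wraparound-aware "a >= b", in the spirit of TransactionIdFollowsOrEquals.
 * Assumption: only normal XIDs are compared in this sketch.
 */
static int
xid_follows_or_equals(Xid a, Xid b)
{
	return (int32_t) (a - b) >= 0;
}

/* Hypothetical list node; stands in for a snapshot with an embedded link */
typedef struct SnapNode
{
	Xid			xmin;
	struct SnapNode *prev;
	struct SnapNode *next;
} SnapNode;

typedef struct SnapList
{
	SnapNode   *head;			/* entry with the oldest xmin */
	SnapNode   *tail;			/* entry with the newest xmin */
} SnapList;

/* Insert 'node' so that the list stays ordered by xmin, oldest first. */
static void
snaplist_insert_ordered(SnapList *list, SnapNode *node)
{
	SnapNode   *cur = list->tail;

	assert(node != NULL);

	/* Walk backwards until we find an entry that is not newer than node */
	while (cur != NULL && !xid_follows_or_equals(node->xmin, cur->xmin))
		cur = cur->prev;

	/* Link node right after 'cur', or at the head if cur is NULL */
	node->prev = cur;
	node->next = cur ? cur->next : list->head;
	if (node->next)
		node->next->prev = node;
	else
		list->tail = node;
	if (cur)
		cur->next = node;
	else
		list->head = node;
}

/* The oldest xmin is simply the head entry's xmin. */
static Xid
snaplist_oldest_xmin(const SnapList *list, Xid fallback)
{
	return list->head ? list->head->xmin : fallback;
}
```

Walking from the tail keeps the common case cheap, since new snapshots usually have the newest xmin and land at the end after a single comparison.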

v7-0007-Split-MVCCSnapshot-into-inner-and-outer-parts.patch (application/octet-stream)

From 05443030201d59216b3125d51c641b68decd4379 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 31 Mar 2025 21:46:54 +0300
Subject: [PATCH v6 07/12] Split MVCCSnapshot into inner and outer parts

Split MVCCSnapshot into two parts: inner struct to hold the xmin, xmax
and XID arrays that determine which transactions are visible, and an
outer shell that includes the command ID and a pointer to the inner
struct. That way, the inner struct can be shared by snapshots derived
from the same original snapshot, just with different command IDs.

The inner struct, MVCCSnapshotShared, is reference counted separately
so that we can avoid copying it when pushing or registering a snapshot
for the first time. Also, GetMVCCSnapshotData() can reuse it more
aggressively: we always keep a pointer to the latest shared struct
(latestSnapshotShared), and GetMVCCSnapshotData() always tries to
reuse the same latest snapshot, regardless of whether it was called
from GetTransactionSnapshot(), GetLatestSnapshot(), or
GetCatalogSnapshot(). That avoids unnecessary copying. Snapshots are
usually small, so the copying rarely matters, but avoiding it can help
in extreme cases where you have thousands of (sub-)XIDs in progress.

Now that the shared inner structs are reference counted, it seems
unnecessary to reference count the outer MVCCSnapshots
separately. That means that RegisterSnapshot() always makes a new
palloc'd copy of the outer struct, but that's pretty small. The
ActiveSnapshot stack entries now embed the outer struct directly, so
the 'active_count' is gone too.

The ValidSnapshots list now tracks the shared structs rather than the
outer snapshots. That's sufficient for finding the oldest xmin, but if
we ever wanted to also know the oldest command ID in use, we'd need to
track the outer structs instead.
---
 contrib/amcheck/verify_heapam.c             |   2 +-
 contrib/amcheck/verify_nbtree.c             |   2 +-
 src/backend/access/heap/heapam.c            |   2 +-
 src/backend/access/heap/heapam_handler.c    |   2 +-
 src/backend/access/heap/heapam_visibility.c |  18 +-
 src/backend/access/spgist/spgvacuum.c       |   2 +-
 src/backend/access/transam/README           |  26 +-
 src/backend/catalog/pg_inherits.c           |   6 +-
 src/backend/commands/async.c                |   2 +-
 src/backend/commands/indexcmds.c            |   4 +-
 src/backend/commands/tablecmds.c            |   2 +-
 src/backend/executor/execMain.c             |  12 +-
 src/backend/executor/execParallel.c         |   3 +-
 src/backend/partitioning/partdesc.c         |   2 +-
 src/backend/replication/logical/snapbuild.c |  40 +-
 src/backend/replication/walsender.c         |   2 +-
 src/backend/storage/ipc/procarray.c         | 138 +++--
 src/backend/storage/lmgr/predicate.c        | 109 ++--
 src/backend/utils/adt/xid8funcs.c           |   8 +-
 src/backend/utils/time/snapmgr.c            | 605 ++++++++++----------
 src/include/access/transam.h                |   4 +-
 src/include/storage/predicate.h             |   8 +-
 src/include/storage/proc.h                  |   2 +-
 src/include/storage/procarray.h             |   2 +-
 src/include/utils/snapmgr.h                 |  11 +-
 src/include/utils/snapshot.h                |  51 +-
 src/tools/pgindent/typedefs.list            |   2 +
 27 files changed, 536 insertions(+), 531 deletions(-)
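
The inner/outer split described in the commit message can be illustrated with a small standalone sketch. This is not the patch's code: SharedSnap, Snap and the snap_* functions are hypothetical names, malloc/free stand in for palloc and memory contexts, and the XID arrays are left out. It only shows a reference-counted inner struct shared by cheap outer shells that differ in command ID:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Xid;
typedef uint32_t CommandId;

/* Inner part: visibility information, shared and reference counted */
typedef struct SharedSnap
{
	Xid			xmin;
	Xid			xmax;
	uint32_t	refcount;
} SharedSnap;

/* Outer shell: small, so every holder gets its own copy */
typedef struct Snap
{
	SharedSnap *shared;
	CommandId	curcid;
} Snap;

/* Create a fresh snapshot with its own inner struct (refcount 1). */
static Snap *
snap_create(Xid xmin, Xid xmax, CommandId cid)
{
	SharedSnap *shared = malloc(sizeof(SharedSnap));
	Snap	   *snap = malloc(sizeof(Snap));

	if (shared == NULL || snap == NULL)
		abort();
	shared->xmin = xmin;
	shared->xmax = xmax;
	shared->refcount = 1;
	snap->shared = shared;
	snap->curcid = cid;
	return snap;
}

/* Derive a snapshot with another command ID; the inner struct is not copied. */
static Snap *
snap_derive(const Snap *src, CommandId cid)
{
	Snap	   *snap = malloc(sizeof(Snap));

	if (snap == NULL)
		abort();
	snap->shared = src->shared;
	snap->shared->refcount++;
	snap->curcid = cid;
	return snap;
}

/*
 * Release an outer shell.  Returns 1 if this was the last reference and the
 * inner struct was freed too, 0 otherwise.
 */
static int
snap_release(Snap *snap)
{
	SharedSnap *shared = snap->shared;
	int			freed_shared = 0;

	assert(shared->refcount > 0);
	if (--shared->refcount == 0)
	{
		free(shared);
		freed_shared = 1;
	}
	free(snap);
	return freed_shared;
}
```

In this sketch only the small outer shell is allocated per holder, so deriving a snapshot costs the same no matter how large the shared visibility data would be.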

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 6665cafc179..d7f0b772f94 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -310,7 +310,7 @@ verify_heapam(PG_FUNCTION_ARGS)
 	 * Any xmin newer than the xmin of our snapshot can't become all-visible
 	 * while we're running.
 	 */
-	ctx.safe_xmin = GetTransactionSnapshot()->mvcc.xmin;
+	ctx.safe_xmin = GetTransactionSnapshot()->mvcc.shared->xmin;
 
 	/*
 	 * If we report corruption when not examining some individual attribute,
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index e90b4a2ad5a..d77ded4cc40 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -458,7 +458,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 			 */
 			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
 				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->mvcc.xmin))
+									   snapshot->mvcc.shared->xmin))
 				ereport(ERROR,
 						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0cfa100cbd1..0615ffa2bd1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -606,7 +606,7 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	 * tuple for visibility the hard way.
 	 */
 	all_visible = PageIsAllVisible(page) &&
-		(snapshot->snapshot_type != SNAPSHOT_MVCC || !snapshot->mvcc.takenDuringRecovery);
+		(snapshot->snapshot_type != SNAPSHOT_MVCC || !snapshot->mvcc.shared->takenDuringRecovery);
 	check_serializable =
 		CheckForSerializableConflictOutNeeded(scan->rs_base.rs_rd, snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index fce657f00f6..b9a5b38dd08 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2308,7 +2308,7 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
 
 	page = (Page) BufferGetPage(hscan->rs_cbuf);
 	all_visible = PageIsAllVisible(page) &&
-		(scan->rs_snapshot->snapshot_type != SNAPSHOT_MVCC || !scan->rs_snapshot->mvcc.takenDuringRecovery);
+		(scan->rs_snapshot->snapshot_type != SNAPSHOT_MVCC || !scan->rs_snapshot->mvcc.shared->takenDuringRecovery);
 	maxoffset = PageGetMaxOffsetNumber(page);
 
 	for (;;)
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index f5d69b558f1..07f155498d4 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -19,7 +19,7 @@
  * That fixes that problem, but it also means there is a window where
  * TransactionIdIsInProgress and TransactionIdDidCommit will both return true.
  * If we check only TransactionIdDidCommit, we could consider a tuple
- * committed when a later GetSnapshotData call will still think the
+ * committed when a later GetMVCCSnapshotData call will still think the
  * originating transaction is in progress, which leads to application-level
  * inconsistency.  The upshot is that we gotta check TransactionIdIsInProgress
  * first in all code paths, except for a few cases where we are looking at
@@ -969,7 +969,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 	 * get invalidated while it's still in use, and this is a convenient place
 	 * to check for that.
 	 */
-	Assert(snapshot->regd_count > 0 || snapshot->active_count > 0);
+	Assert(snapshot->kind == SNAPSHOT_ACTIVE || snapshot->kind == SNAPSHOT_REGISTERED);
 
 	Assert(ItemPointerIsValid(&htup->t_self));
 	Assert(htup->t_tableOid != InvalidOid);
@@ -986,7 +986,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 
 			if (TransactionIdIsCurrentTransactionId(xvac))
 				return false;
-			if (!XidInMVCCSnapshot(xvac, snapshot))
+			if (!XidInMVCCSnapshot(xvac, snapshot->shared))
 			{
 				if (TransactionIdDidCommit(xvac))
 				{
@@ -1005,7 +1005,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 
 			if (!TransactionIdIsCurrentTransactionId(xvac))
 			{
-				if (XidInMVCCSnapshot(xvac, snapshot))
+				if (XidInMVCCSnapshot(xvac, snapshot->shared))
 					return false;
 				if (TransactionIdDidCommit(xvac))
 					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
@@ -1060,7 +1060,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 			else
 				return false;	/* deleted before scan started */
 		}
-		else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
+		else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot->shared))
 			return false;
 		else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
 			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
@@ -1077,7 +1077,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 	{
 		/* xmin is committed, but maybe not according to our snapshot */
 		if (!HeapTupleHeaderXminFrozen(tuple) &&
-			XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
+			XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot->shared))
 			return false;		/* treat as still in progress */
 	}
 
@@ -1108,7 +1108,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 			else
 				return false;	/* deleted before scan started */
 		}
-		if (XidInMVCCSnapshot(xmax, snapshot))
+		if (XidInMVCCSnapshot(xmax, snapshot->shared))
 			return true;
 		if (TransactionIdDidCommit(xmax))
 			return false;		/* updating transaction committed */
@@ -1126,7 +1126,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 				return false;	/* deleted before scan started */
 		}
 
-		if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot))
+		if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot->shared))
 			return true;
 
 		if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple)))
@@ -1144,7 +1144,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, MVCCSnapshot snapshot,
 	else
 	{
 		/* xmax is committed, but maybe not according to our snapshot */
-		if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot))
+		if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot->shared))
 			return true;		/* treat as still in progress */
 	}
 
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 850ad36cd0a..0a8d7b0a0d6 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -808,7 +808,7 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->mvcc.xmin;
+	bds->myXmin = GetActiveSnapshot()->mvcc.shared->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 231106270fd..81792f0eab3 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -231,7 +231,7 @@ we must ensure consistency about the commit order of transactions.
 For example, suppose an UPDATE in xact A is blocked by xact B's prior
 update of the same row, and xact B is doing commit while xact C gets a
 snapshot.  Xact A can complete and commit as soon as B releases its locks.
-If xact C's GetSnapshotData sees xact B as still running, then it had
+If xact C's GetMVCCSnapshotData sees xact B as still running, then it had
 better see xact A as still running as well, or it will be able to see two
 tuple versions - one deleted by xact B and one inserted by xact A.  Another
 reason why this would be bad is that C would see (in the row inserted by A)
@@ -248,8 +248,8 @@ with snapshot-taking: we do not allow any transaction to exit the set of
 running transactions while a snapshot is being taken.  (This rule is
 stronger than necessary for consistency, but is relatively simple to
 enforce, and it assists with some other issues as explained below.)  The
-implementation of this is that GetSnapshotData takes the ProcArrayLock in
-shared mode (so that multiple backends can take snapshots in parallel),
+implementation of this is that GetMVCCSnapshotData takes the ProcArrayLock
+in shared mode (so that multiple backends can take snapshots in parallel),
 but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
 while clearing the ProcGlobal->xids[] entry at transaction end (either
 commit or abort). (To reduce context switching, when multiple transactions
@@ -257,7 +257,7 @@ commit nearly simultaneously, we have one backend take ProcArrayLock and
 clear the XIDs of multiple processes at once.)
 
 ProcArrayEndTransaction also holds the lock while advancing the shared
-latestCompletedXid variable.  This allows GetSnapshotData to use
+latestCompletedXid variable.  This allows GetMVCCSnapshotData to use
 latestCompletedXid + 1 as xmax for its snapshot: there can be no
 transaction >= this xid value that the snapshot needs to consider as
 completed.
@@ -301,7 +301,7 @@ if it currently has no live snapshots (eg, if it's between transactions or
 hasn't yet set a snapshot for a new transaction).  ComputeXidHorizons takes
 the MIN() of the valid xmin fields.  It does this with only shared lock on
 ProcArrayLock, which means there is a potential race condition against other
-backends doing GetSnapshotData concurrently: we must be certain that a
+backends doing GetMVCCSnapshotData concurrently: we must be certain that a
 concurrent backend that is about to set its xmin does not compute an xmin
 less than what ComputeXidHorizons determines.  We ensure that by including
 all the active XIDs into the MIN() calculation, along with the valid xmins.
@@ -310,27 +310,27 @@ ensures that concurrent holders of shared ProcArrayLock will compute the
 same minimum of currently-active XIDs: no xact, in particular not the
 oldest, can exit while we hold shared ProcArrayLock.  So
 ComputeXidHorizons's view of the minimum active XID will be the same as that
-of any concurrent GetSnapshotData, and so it can't produce an overestimate.
+of any concurrent GetMVCCSnapshotData, and so it can't produce an overestimate.
 If there is no active transaction at all, ComputeXidHorizons uses
 latestCompletedXid + 1, which is a lower bound for the xmin that might
-be computed by concurrent or later GetSnapshotData calls.  (We know that no
+be computed by concurrent or later GetMVCCSnapshotData calls.  (We know that no
 XID less than this could be about to appear in the ProcArray, because of the
 XidGenLock interlock discussed above.)
 
-As GetSnapshotData is performance critical, it does not perform an accurate
+As GetMVCCSnapshotData is performance critical, it does not perform an accurate
 oldest-xmin calculation (it used to, until v14). The contents of a snapshot
 only depend on the xids of other backends, not their xmin. As backend's xmin
-changes much more often than its xid, having GetSnapshotData look at xmins
+changes much more often than its xid, having GetMVCCSnapshotData look at xmins
 can lead to a lot of unnecessary cacheline ping-pong.  Instead
-GetSnapshotData updates approximate thresholds (one that guarantees that all
-deleted rows older than it can be removed, another determining that deleted
+GetMVCCSnapshotData updates approximate thresholds (one that guarantees that
+all deleted rows older than it can be removed, another determining that deleted
 rows newer than it can not be removed). GlobalVisTest* uses those thresholds
 to make invisibility decision, falling back to ComputeXidHorizons if
 necessary.
 
 Note that while it is certain that two concurrent executions of
-GetSnapshotData will compute the same xmin for their own snapshots, there is
-no such guarantee for the horizons computed by ComputeXidHorizons.  This is
+GetMVCCSnapshotData will compute the same xmin for their own snapshots, there
+is no such guarantee for the horizons computed by ComputeXidHorizons.  This is
 because we allow XID-less transactions to clear their MyProc->xmin
 asynchronously (without taking ProcArrayLock), so one execution might see
 what had been the oldest xmin, and another not.  This is OK since the
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index b658601bf77..f1148dbe4a3 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -143,12 +143,12 @@ find_inheritance_children_extended(Oid parentrelId, bool omit_detached,
 			if (omit_detached && ActiveSnapshotSet())
 			{
 				TransactionId xmin;
-				Snapshot	snap;
+				MVCCSnapshot snap;
 
 				xmin = HeapTupleHeaderGetXmin(inheritsTuple->t_data);
-				snap = GetActiveSnapshot();
+				snap = (MVCCSnapshot) GetActiveSnapshot();
 
-				if (!XidInMVCCSnapshot(xmin, (MVCCSnapshot) snap))
+				if (!XidInMVCCSnapshot(xmin, snap->shared))
 				{
 					if (detached_xmin)
 					{
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 1ffb6f5fa70..037ca6c5444 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2043,7 +2043,7 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 		/* Ignore messages destined for other databases */
 		if (qe->dboid == MyDatabaseId)
 		{
-			if (XidInMVCCSnapshot(qe->xid, (MVCCSnapshot) snapshot))
+			if (XidInMVCCSnapshot(qe->xid, ((MVCCSnapshot) snapshot)->shared))
 			{
 				/*
 				 * The source transaction is still in progress, so we can't
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index da3e02398bb..7fa044f6f1c 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1761,7 +1761,7 @@ DefineIndex(Oid tableId,
 	 * they must wait for.  But first, save the snapshot's xmin to use as
 	 * limitXmin for GetCurrentVirtualXIDs().
 	 */
-	limitXmin = snapshot->mvcc.xmin;
+	limitXmin = snapshot->mvcc.shared->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
@@ -4156,7 +4156,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * We can now do away with our active snapshot, we still need to save
 		 * the xmin limit to wait for older snapshots.
 		 */
-		limitXmin = snapshot->mvcc.xmin;
+		limitXmin = snapshot->mvcc.shared->xmin;
 
 		PopActiveSnapshot();
 		UnregisterSnapshot(snapshot);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c55b5a7a014..9aca810f9d5 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -20797,7 +20797,7 @@ ATExecDetachPartitionFinalize(Relation rel, RangeVar *name)
 	 * all such queries are complete (otherwise we would present them with an
 	 * inconsistent view of catalogs).
 	 */
-	WaitForOlderSnapshots(snap->mvcc.xmin, false);
+	WaitForOlderSnapshots(snap->mvcc.shared->xmin, false);
 
 	DetachPartitionFinalize(rel, partRel, true, InvalidOid);
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 2da848970be..9ee10050873 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -157,8 +157,8 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	Assert(queryDesc != NULL);
 	Assert(queryDesc->estate == NULL);
 
-	/* caller must ensure the query's snapshot is active */
-	Assert(GetActiveSnapshot() == queryDesc->snapshot);
+	/* ensure the query's snapshot is active */
+	PushActiveSnapshot(queryDesc->snapshot);
 
 	/*
 	 * If the transaction is read-only, we need to check if any writes are
@@ -272,6 +272,8 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 
 	MemoryContextSwitchTo(oldcontext);
 
+	PopActiveSnapshot();
+
 	return ExecPlanStillValid(queryDesc->estate);
 }
 
@@ -390,8 +392,8 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	Assert(!estate->es_aborted);
 	Assert(!(estate->es_top_eflags & EXEC_FLAG_EXPLAIN_ONLY));
 
-	/* caller must ensure the query's snapshot is active */
-	Assert(GetActiveSnapshot() == estate->es_snapshot);
+	/* ensure the query's snapshot is active */
+	PushActiveSnapshot(estate->es_snapshot);
 
 	/*
 	 * Switch into per-query memory context
@@ -455,6 +457,8 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 		InstrStopNode(queryDesc->totaltime, estate->es_processed);
 
 	MemoryContextSwitchTo(oldcontext);
+
+	PopActiveSnapshot();
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 39c990ae638..af3f8f28144 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -737,7 +737,8 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	 * worker, which uses it to set es_snapshot.  Make sure we don't set
 	 * es_snapshot differently in the child.
 	 */
-	Assert(GetActiveSnapshot() == estate->es_snapshot);
+	Assert(((MVCCSnapshot) GetActiveSnapshot())->shared == ((MVCCSnapshot) estate->es_snapshot)->shared);
+	Assert(((MVCCSnapshot) GetActiveSnapshot())->curcid == ((MVCCSnapshot) estate->es_snapshot)->curcid);
 
 	/* Everyone's had a chance to ask for space, so now create the DSM. */
 	InitializeParallelDSM(pcxt);
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 7c15c634181..c5000b37b87 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -102,7 +102,7 @@ RelationGetPartitionDesc(Relation rel, bool omit_detached)
 		Assert(TransactionIdIsValid(rel->rd_partdesc_nodetached_xmin));
 		activesnap = GetActiveSnapshot();
 
-		if (!XidInMVCCSnapshot(rel->rd_partdesc_nodetached_xmin, &activesnap->mvcc))
+		if (!XidInMVCCSnapshot(rel->rd_partdesc_nodetached_xmin, activesnap->mvcc.shared))
 			return rel->rd_partdesc_nodetached;
 	}
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 50dca7cb758..3c94a62cdf6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -389,6 +389,12 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
  *
  * The snapshot will be usable directly in current transaction or exported
  * for loading in different transaction.
+ *
+ * XXX: The snapshot manager doesn't know anything about the returned
+ * snapshot.  It does not hold back MyProc->xmin, nor is it registered with
+ * any resource owner.  There's also no good way to free it, but leaking it is
+ * acceptable for the current usage where only one snapshot is built for the
+ * whole session.
  */
 MVCCSnapshot
 SnapBuildInitialSnapshot(SnapBuild *builder)
@@ -440,11 +446,14 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	MyProc->xmin = historicsnap->xmin;
 
 	/* allocate in transaction context */
-	mvccsnap = palloc(sizeof(MVCCSnapshotData) + sizeof(TransactionId) * GetMaxSnapshotXidCount());
+	mvccsnap = palloc(sizeof(MVCCSnapshotData));
+	mvccsnap->kind = SNAPSHOT_STATIC;
+	mvccsnap->shared = AllocMVCCSnapshotShared();
+	mvccsnap->shared->refcount = 1;
 	mvccsnap->snapshot_type = SNAPSHOT_MVCC;
-	mvccsnap->xmin = historicsnap->xmin;
-	mvccsnap->xmax = historicsnap->xmax;
-	mvccsnap->xip = (TransactionId *) ((char *) mvccsnap + sizeof(MVCCSnapshotData));
+	mvccsnap->shared->xmin = historicsnap->xmin;
+	mvccsnap->shared->xmax = historicsnap->xmax;
+	mvccsnap->shared->xip = (TransactionId *) ((char *) mvccsnap->shared + sizeof(MVCCSnapshotData));
 
 	/*
 	 * snapbuild.c builds transactions in an "inverted" manner, which means it
@@ -470,23 +479,20 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 						 errmsg("initial slot snapshot too large")));
 
-			mvccsnap->xip[newxcnt++] = xid;
+			mvccsnap->shared->xip[newxcnt++] = xid;
 		}
 
 		TransactionIdAdvance(xid);
 	}
-	mvccsnap->xcnt = newxcnt;
+	mvccsnap->shared->xcnt = newxcnt;
 
 	/* Initialize remaining MVCCSnapshot fields */
-	mvccsnap->subxip = NULL;
-	mvccsnap->subxcnt = 0;
-	mvccsnap->suboverflowed = false;
-	mvccsnap->takenDuringRecovery = false;
-	mvccsnap->copied = true;
+	mvccsnap->shared->subxip = NULL;
+	mvccsnap->shared->subxcnt = 0;
+	mvccsnap->shared->suboverflowed = false;
+	mvccsnap->shared->takenDuringRecovery = false;
+	mvccsnap->shared->snapXactCompletionCount = 0;
 	mvccsnap->curcid = FirstCommandId;
-	mvccsnap->active_count = 0;
-	mvccsnap->regd_count = 0;
-	mvccsnap->snapXactCompletionCount = 0;
 
 	pfree(historicsnap);
 
@@ -528,13 +534,13 @@ SnapBuildExportSnapshot(SnapBuild *builder)
 	 * now that we've built a plain snapshot, make it active and use the
 	 * normal mechanisms for exporting it
 	 */
-	snapname = ExportSnapshot(snap);
+	snapname = ExportSnapshot(snap->shared);
 
 	ereport(LOG,
 			(errmsg_plural("exported logical decoding snapshot: \"%s\" with %u transaction ID",
 						   "exported logical decoding snapshot: \"%s\" with %u transaction IDs",
-						   snap->xcnt,
-						   snapname, snap->xcnt)));
+						   snap->shared->xcnt,
+						   snapname, snap->shared->xcnt)));
 	return snapname;
 }
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1a7a35e25eb..513449ea9de 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2620,7 +2620,7 @@ ProcessStandbyHSFeedbackMessage(void)
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
-	 * the xmin will be taken into account by GetSnapshotData() /
+	 * the xmin will be taken into account by GetMVCCSnapshotData() /
 	 * ComputeXidHorizons().  This will hold back the removal of dead rows and
 	 * thereby prevent the generation of cleanup conflicts on the standby
 	 * server.
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ba5ed8960dd..819649741f6 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -62,6 +62,7 @@
 #include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
+#include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
@@ -105,7 +106,7 @@ typedef struct ProcArrayStruct
  * MVCC semantics: If the deleted row's xmax is not considered to be running
  * by anyone, the row can be removed.
  *
- * To avoid slowing down GetSnapshotData(), we don't calculate a precise
+ * To avoid slowing down GetMVCCSnapshotData(), we don't calculate a precise
  * cutoff XID while building a snapshot (looking at the frequently changing
  * xmins scales badly). Instead we compute two boundaries while building the
  * snapshot:
@@ -159,7 +160,7 @@ typedef struct ProcArrayStruct
  *
  * The boundaries are FullTransactionIds instead of TransactionIds to avoid
  * wraparound dangers. There e.g. would otherwise exist no procarray state to
- * prevent maybe_needed to become old enough after the GetSnapshotData()
+ * prevent maybe_needed to become old enough after the GetMVCCSnapshotData()
  * call.
  *
  * The typedef is in the header.
@@ -386,7 +387,7 @@ ProcArrayShmemSize(void)
 	/*
 	 * During Hot Standby processing we have a data structure called
 	 * KnownAssignedXids, created in shared memory. Local data structures are
-	 * also created in various backends during GetSnapshotData(),
+	 * also created in various backends during GetMVCCSnapshotData(),
 	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
 	 * main structures created in those functions must be identically sized,
 	 * since we may at times copy the whole of the data structures around. We
@@ -938,7 +939,7 @@ ProcArrayClearTransaction(PGPROC *proc)
 
 	/*
 	 * Need to increment completion count even though transaction hasn't
-	 * really committed yet. The reason for that is that GetSnapshotData()
+	 * really committed yet. The reason for that is that GetMVCCSnapshotData()
 	 * omits the xid of the current transaction, thus without the increment we
 	 * otherwise could end up reusing the snapshot later. Which would be bad,
 	 * because it might not count the prepared transaction as running.
@@ -2083,7 +2084,7 @@ GetMaxSnapshotSubxidCount(void)
 }
 
 /*
- * Helper function for GetSnapshotData() that checks if the bulk of the
+ * Helper function for GetMVCCSnapshotData() that checks if the bulk of the
  * visibility information in the snapshot is still valid. If so, it updates
  * the fields that need to change and returns true. Otherwise it returns
  * false.
@@ -2092,7 +2093,7 @@ GetMaxSnapshotSubxidCount(void)
  * least in the case we already hold a snapshot), but that's for another day.
  */
 static bool
-GetSnapshotDataReuse(MVCCSnapshot snapshot)
+GetMVCCSnapshotDataReuse(MVCCSnapshotShared snapshot)
 {
 	uint64		curXactCompletionCount;
 
@@ -2112,17 +2113,18 @@ GetSnapshotDataReuse(MVCCSnapshot snapshot)
 	 * contents:
 	 *
 	 * As explained in transam/README, the set of xids considered running by
-	 * GetSnapshotData() cannot change while ProcArrayLock is held. Snapshot
-	 * contents only depend on transactions with xids and xactCompletionCount
-	 * is incremented whenever a transaction with an xid finishes (while
-	 * holding ProcArrayLock exclusively). Thus the xactCompletionCount check
-	 * ensures we would detect if the snapshot would have changed.
+	 * GetMVCCSnapshotData() cannot change while ProcArrayLock is held.
+	 * Snapshot contents only depend on transactions with xids and
+	 * xactCompletionCount is incremented whenever a transaction with an xid
+	 * finishes (while holding ProcArrayLock exclusively). Thus the
+	 * xactCompletionCount check ensures we would detect if the snapshot would
+	 * have changed.
 	 *
 	 * As the snapshot contents are the same as it was before, it is safe to
 	 * re-enter the snapshot's xmin into the PGPROC array. None of the rows
 	 * visible under the snapshot could already have been removed (that'd
 	 * require the set of running transactions to change) and it fulfills the
-	 * requirement that concurrent GetSnapshotData() calls yield the same
+	 * requirement that concurrent GetMVCCSnapshotData() calls yield the same
 	 * xmin.
 	 */
 	if (!TransactionIdIsValid(MyProc->xmin))
@@ -2131,17 +2133,11 @@ GetSnapshotDataReuse(MVCCSnapshot snapshot)
 	RecentXmin = snapshot->xmin;
 	Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
 
-	snapshot->curcid = GetCurrentCommandId(false);
-	snapshot->active_count = 0;
-	snapshot->regd_count = 0;
-	snapshot->copied = false;
-	snapshot->valid = true;
-
 	return true;
 }
 
 /*
- * GetSnapshotData -- returns information about running transactions.
+ * GetMVCCSnapshotData -- returns information about running transactions.
  *
  * The returned snapshot includes xmin (lowest still-running xact ID),
  * xmax (highest completed xact ID + 1), and a list of running xact IDs
@@ -2168,12 +2164,9 @@ GetSnapshotDataReuse(MVCCSnapshot snapshot)
  *
  * And try to advance the bounds of GlobalVis{Shared,Catalog,Data,Temp}Rels
  * for the benefit of the GlobalVisTest* family of functions.
- *
- * Note: this function should probably not be called with an argument that's
- * not statically allocated (see xip allocation below).
  */
-MVCCSnapshot
-GetSnapshotData(MVCCSnapshot snapshot)
+MVCCSnapshotShared
+GetMVCCSnapshotData(void)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId *other_xids = ProcGlobal->xids;
@@ -2187,43 +2180,34 @@ GetSnapshotData(MVCCSnapshot snapshot)
 	int			mypgxactoff;
 	TransactionId myxid;
 	uint64		curXactCompletionCount;
+	MVCCSnapshotShared snapshot;
 
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
-	Assert(snapshot != NULL);
-
-	/*
-	 * Allocating space for maxProcs xids is usually overkill; numProcs would
-	 * be sufficient.  But it seems better to do the malloc while not holding
-	 * the lock, so we can't look at numProcs.  Likewise, we allocate much
-	 * more subxip storage than is probably needed.
+	/*---
+	 * Allocate an MVCCSnapshotShared struct.  There are three cases:
+	 *
+	 * 1. No transactions have completed since the last call: we can reuse the
+	 *    latest snapshot information.  See GetMVCCSnapshotDataReuse().
+	 *
+	 * 2. Need to recalculate the snapshot, and 'latestSnapshotShared' is not
+	 *    currently in use by any snapshot.  We can overwrite its contents.
+	 *
+	 * 3. Need to recalculate the snapshot, and 'latestSnapshotShared' is still
+	 *    in use.  We need to allocate a new MVCCSnapshotShared struct.
 	 *
-	 * This does open a possibility for avoiding repeated malloc/free: since
-	 * maxProcs does not change at runtime, we can simply reuse the previous
-	 * xip arrays if any.  (This relies on the fact that all callers pass
-	 * static SnapshotData structs.)
+	 * We don't know if 'latestSnapshotShared' can be reused before we acquire
+	 * the lock, but if we do need to allocate, we want to do it before
+	 * acquiring the lock.  Therefore, we always make the allocation if we
+	 * might need it, and if it turns out to have been unnecessary, we stash
+	 * away the allocated struct in 'spareSnapshotShared' to be reused on the
+	 * next call.  This way, the unnecessary allocation is very cheap.
 	 */
-	if (snapshot->xip == NULL)
-	{
-		/*
-		 * First call for this snapshot. Snapshot is same size whether or not
-		 * we are in recovery, see later comments.
-		 */
-		snapshot->xip = (TransactionId *)
-			malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId));
-		if (snapshot->xip == NULL)
-			ereport(ERROR,
-					(errcode(ERRCODE_OUT_OF_MEMORY),
-					 errmsg("out of memory")));
-		Assert(snapshot->subxip == NULL);
-		snapshot->subxip = (TransactionId *)
-			malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId));
-		if (snapshot->subxip == NULL)
-			ereport(ERROR,
-					(errcode(ERRCODE_OUT_OF_MEMORY),
-					 errmsg("out of memory")));
-	}
+	if (latestSnapshotShared && latestSnapshotShared->refcount == 0)
+		snapshot = latestSnapshotShared;	/* case 1 or 2 */
+	else
+		snapshot = AllocMVCCSnapshotShared();	/* case 1 or 3 */
 
 	/*
 	 * It is sufficient to get shared lock on ProcArrayLock, even if we are
@@ -2231,10 +2215,14 @@ GetSnapshotData(MVCCSnapshot snapshot)
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
-	if (GetSnapshotDataReuse(snapshot))
+	if (latestSnapshotShared && GetMVCCSnapshotDataReuse(latestSnapshotShared))
 	{
 		LWLockRelease(ProcArrayLock);
-		return snapshot;
+
+		/* if we made an allocation, stash it away for next call */
+		if (snapshot != latestSnapshotShared)
+			spareSnapshotShared = snapshot;
+		return latestSnapshotShared;
 	}
 
 	latest_completed = TransamVariables->latestCompletedXid;
@@ -2506,16 +2494,18 @@ GetSnapshotData(MVCCSnapshot snapshot)
 	snapshot->suboverflowed = suboverflowed;
 	snapshot->snapXactCompletionCount = curXactCompletionCount;
 
-	snapshot->curcid = GetCurrentCommandId(false);
-
 	/*
-	 * This is a new snapshot, so set both refcounts are zero, and mark it as
-	 * not copied in persistent memory.
+	 * If we allocated a new struct for this, remember that it is now the
+	 * latest one.  The old one is freed here only if nothing references it.
 	 */
-	snapshot->active_count = 0;
-	snapshot->regd_count = 0;
-	snapshot->copied = false;
-	snapshot->valid = true;
+	if (snapshot != latestSnapshotShared)
+	{
+		Assert(snapshot->refcount == 0);
+
+		if (latestSnapshotShared && latestSnapshotShared->refcount == 0)
+			FreeMVCCSnapshotShared(latestSnapshotShared);
+		latestSnapshotShared = snapshot;
+	}
 
 	return snapshot;
 }
@@ -2585,10 +2575,10 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 			continue;
 
 		/*
-		 * We're good.  Install the new xmin.  As in GetSnapshotData, set
+		 * We're good.  Install the new xmin.  As in GetMVCCSnapshotData, set
 		 * TransactionXmin too.  (Note that because snapmgr.c called
-		 * GetSnapshotData first, we'll be overwriting a valid xmin here, so
-		 * we don't check that.)
+		 * GetMVCCSnapshotData first, we'll be overwriting a valid xmin here,
+		 * so we don't check that.)
 		 */
 		MyProc->xmin = TransactionXmin = xmin;
 
@@ -2659,7 +2649,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 /*
  * GetRunningTransactionData -- returns information about running transactions.
  *
- * Similar to GetSnapshotData but returns more information. We include
+ * Similar to GetMVCCSnapshotData but returns more information. We include
  * all PGPROCs with an assigned TransactionId, even VACUUM processes and
  * prepared transactions.
  *
@@ -2681,7 +2671,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * entries here to not hold on ProcArrayLock more than necessary.
  *
  * We don't worry about updating other counters, we want to keep this as
- * simple as possible and leave GetSnapshotData() as the primary code for
+ * simple as possible and leave GetMVCCSnapshotData() as the primary code for
  * that bookkeeping.
  *
  * Note that if any transaction has overflowed its cached subtransactions
@@ -2866,8 +2856,8 @@ GetRunningTransactionData(void)
 /*
  * GetOldestActiveTransactionId()
  *
- * Similar to GetSnapshotData but returns just oldestActiveXid. We include
- * all PGPROCs with an assigned TransactionId, even VACUUM processes.
+ * Similar to GetMVCCSnapshotData but returns just oldestActiveXid. We
+ * include all PGPROCs with an assigned TransactionId, even VACUUM processes.
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
@@ -2875,7 +2865,7 @@ GetRunningTransactionData(void)
  * KnownAssignedXids.
  *
  * We don't worry about updating other counters, we want to keep this as
- * simple as possible and leave GetSnapshotData() as the primary code for
+ * simple as possible and leave GetMVCCSnapshotData() as the primary code for
  * that bookkeeping.
  */
 TransactionId
@@ -4356,7 +4346,7 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
  * During hot standby we do not fret too much about the distinction between
  * top-level XIDs and subtransaction XIDs. We store both together in the
  * KnownAssignedXids list.  In backends, this is copied into snapshots in
- * GetSnapshotData(), taking advantage of the fact that XidInMVCCSnapshot()
+ * GetMVCCSnapshotData(), taking advantage of the fact that XidInMVCCSnapshot()
  * doesn't care about the distinction either.  Subtransaction XIDs are
  * effectively treated as top-level XIDs and in the typical case pg_subtrans
  * links are *not* maintained (which does not affect visibility).
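The three allocation cases described in the GetMVCCSnapshotData() comment above can be modelled outside PostgreSQL with a small toy program. This is only a sketch under stated assumptions, not the patch's code: every toy_* name is hypothetical, the ProcArrayLock is left out, and a single counter stands in for the whole snapshot computation. That the allocator hands back the stashed spare struct first, and that a release frees a struct once it is unreferenced and no longer the latest, are guesses consistent with the comment rather than something this part of the diff shows.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for MVCCSnapshotShared: only what the sketch needs. */
typedef struct ToyShared
{
	uint64_t	completion_count;	/* global counter value when computed */
	int			refcount;			/* number of snapshots pointing at this */
} ToyShared;

ToyShared  *toy_latest = NULL;		/* models latestSnapshotShared */
ToyShared  *toy_spare = NULL;		/* models spareSnapshotShared */
uint64_t	toy_completion_count = 1;	/* bumped when a transaction ends */

/* Assumed allocator behaviour: hand out the stashed spare struct first. */
ToyShared *
toy_alloc(void)
{
	ToyShared  *s = toy_spare;

	if (s != NULL)
		toy_spare = NULL;
	else if ((s = malloc(sizeof(ToyShared))) == NULL)
		abort();
	s->completion_count = 0;
	s->refcount = 0;
	return s;
}

/* Assumed release rule: free once unreferenced, unless it is the latest. */
void
toy_release(ToyShared *s)
{
	assert(s->refcount > 0);
	if (--s->refcount == 0 && s != toy_latest)
		free(s);
}

ToyShared *
toy_get_snapshot_data(void)
{
	ToyShared  *snap;

	/* Decide about the allocation before the (omitted) lock is taken. */
	if (toy_latest != NULL && toy_latest->refcount == 0)
		snap = toy_latest;		/* case 1 or 2 */
	else
		snap = toy_alloc();		/* case 1 or 3 */

	/* Case 1: nothing has completed, so the latest struct is still exact. */
	if (toy_latest != NULL &&
		toy_latest->completion_count == toy_completion_count)
	{
		if (snap != toy_latest)
			toy_spare = snap;	/* unneeded allocation, keep for next call */
		return toy_latest;
	}

	/* Case 2 or 3: "recompute" the contents into snap. */
	assert(snap->refcount == 0);
	snap->completion_count = toy_completion_count;
	toy_latest = snap;
	return snap;
}
```

A caller that keeps the result is expected to bump refcount itself, which is the role UpdateStaticMVCCSnapshot() plays in the snapmgr.c part of the patch.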
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index dd52782ff22..edc6b9de7ca 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -449,10 +449,10 @@ static void SerialSetActiveSerXmin(TransactionId xid);
 
 static uint32 predicatelock_hash(const void *key, Size keysize);
 static void SummarizeOldestCommittedSxact(void);
-static MVCCSnapshot GetSafeSnapshot(MVCCSnapshot origSnapshot);
-static MVCCSnapshot GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
-														  VirtualTransactionId *sourcevxid,
-														  int sourcepid);
+static MVCCSnapshotShared GetSafeSnapshot(void);
+static MVCCSnapshotShared GetSerializableTransactionSnapshotInt(VirtualTransactionId *sourcevxid,
+																TransactionId sourcexmin,
+																int sourcepid);
 static bool PredicateLockExists(const PREDICATELOCKTARGETTAG *targettag);
 static bool GetParentPredicateLockTag(const PREDICATELOCKTARGETTAG *tag,
 									  PREDICATELOCKTARGETTAG *parent);
@@ -1542,25 +1542,20 @@ SummarizeOldestCommittedSxact(void)
  *
  *		As with GetSerializableTransactionSnapshot (which this is a subroutine
  *		for), the passed-in Snapshot pointer should reference a static data
- *		area that can safely be passed to GetSnapshotData.
+ *		area that can safely be passed to GetMVCCSnapshotData.
  */
-static MVCCSnapshot
-GetSafeSnapshot(MVCCSnapshot origSnapshot)
+static MVCCSnapshotShared
+GetSafeSnapshot(void)
 {
-	MVCCSnapshot snapshot;
+	MVCCSnapshotShared snapshot;
 
 	Assert(XactReadOnly && XactDeferrable);
 
 	while (true)
 	{
-		/*
-		 * GetSerializableTransactionSnapshotInt is going to call
-		 * GetSnapshotData, so we need to provide it the static snapshot area
-		 * our caller passed to us.  The pointer returned is actually the same
-		 * one passed to it, but we avoid assuming that here.
-		 */
-		snapshot = GetSerializableTransactionSnapshotInt(origSnapshot,
-														 NULL, InvalidPid);
+		snapshot = GetSerializableTransactionSnapshotInt(NULL,
+														 InvalidTransactionId,
+														 InvalidPid);
 
 		if (MySerializableXact == InvalidSerializableXact)
 			return snapshot;	/* no concurrent r/w xacts; it's safe */
@@ -1663,13 +1658,11 @@ GetSafeSnapshotBlockingPids(int blocked_pid, int *output, int output_size)
  * Make sure we have a SERIALIZABLEXACT reference in MySerializableXact.
  * It should be current for this process and be contained in PredXact.
  *
- * The passed-in Snapshot pointer should reference a static data area that
- * can safely be passed to GetSnapshotData.  The return value is actually
- * always this same pointer; no new snapshot data structure is allocated
- * within this function.
+ * This calls GetMVCCSnapshotData to do the heavy lifting, but also sets up
+ * shared memory data structures specific to serializable transactions.
  */
-MVCCSnapshot
-GetSerializableTransactionSnapshot(MVCCSnapshot snapshot)
+MVCCSnapshotShared
+GetSerializableTransactionSnapshotData(void)
 {
 	Assert(IsolationIsSerializable());
 
@@ -1692,26 +1685,25 @@ GetSerializableTransactionSnapshot(MVCCSnapshot snapshot)
 	 * thereby avoid all SSI overhead once it's running.
 	 */
 	if (XactReadOnly && XactDeferrable)
-		return GetSafeSnapshot(snapshot);
+		return GetSafeSnapshot();
 
-	return GetSerializableTransactionSnapshotInt(snapshot,
-												 NULL, InvalidPid);
+	return GetSerializableTransactionSnapshotInt(NULL, InvalidTransactionId, InvalidPid);
 }
 
 /*
  * Import a snapshot to be used for the current transaction.
  *
- * This is nearly the same as GetSerializableTransactionSnapshot, except that
- * we don't take a new snapshot, but rather use the data we're handed.
+ * This is nearly the same as GetSerializableTransactionSnapshotData, except
+ * that we don't take a new snapshot, but rather use the data we're handed.
  *
  * The caller must have verified that the snapshot came from a serializable
  * transaction; and if we're read-write, the source transaction must not be
  * read-only.
  */
 void
-SetSerializableTransactionSnapshot(MVCCSnapshot snapshot,
-								   VirtualTransactionId *sourcevxid,
-								   int sourcepid)
+SetSerializableTransactionSnapshotData(MVCCSnapshotShared snapshot,
+									   VirtualTransactionId *sourcevxid,
+									   int sourcepid)
 {
 	Assert(IsolationIsSerializable());
 
@@ -1737,28 +1729,29 @@ SetSerializableTransactionSnapshot(MVCCSnapshot snapshot,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("a snapshot-importing transaction must not be READ ONLY DEFERRABLE")));
 
-	(void) GetSerializableTransactionSnapshotInt(snapshot, sourcevxid,
-												 sourcepid);
+	(void) GetSerializableTransactionSnapshotInt(sourcevxid, snapshot->xmin, sourcepid);
 }
 
 /*
  * Guts of GetSerializableTransactionSnapshot
  *
  * If sourcevxid is valid, this is actually an import operation and we should
- * skip calling GetSnapshotData, because the snapshot contents are already
+ * skip calling GetMVCCSnapshotData, because the snapshot contents are already
  * loaded up.  HOWEVER: to avoid race conditions, we must check that the
  * source xact is still running after we acquire SerializableXactHashLock.
  * We do that by calling ProcArrayInstallImportedXmin.
  */
-static MVCCSnapshot
-GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
-									  VirtualTransactionId *sourcevxid,
+static MVCCSnapshotShared
+GetSerializableTransactionSnapshotInt(VirtualTransactionId *sourcevxid,
+									  TransactionId sourcexmin,
 									  int sourcepid)
 {
 	PGPROC	   *proc;
 	VirtualTransactionId vxid;
 	SERIALIZABLEXACT *sxact,
 			   *othersxact;
+	MVCCSnapshotShared snapshot;
+	TransactionId xmin;
 
 	/* We only do this for serializable transactions.  Once. */
 	Assert(MySerializableXact == InvalidSerializableXact);
@@ -1783,7 +1776,7 @@ GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
 	 *
 	 * We must hold SerializableXactHashLock when taking/checking the snapshot
 	 * to avoid race conditions, for much the same reasons that
-	 * GetSnapshotData takes the ProcArrayLock.  Since we might have to
+	 * GetMVCCSnapshotData takes the ProcArrayLock.  Since we might have to
 	 * release SerializableXactHashLock to call SummarizeOldestCommittedSxact,
 	 * this means we have to create the sxact first, which is a bit annoying
 	 * (in particular, an elog(ERROR) in procarray.c would cause us to leak
@@ -1807,16 +1800,24 @@ GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
 
 	/* Get the snapshot, or check that it's safe to use */
 	if (!sourcevxid)
-		snapshot = GetSnapshotData(snapshot);
-	else if (!ProcArrayInstallImportedXmin(snapshot->xmin, sourcevxid))
 	{
-		ReleasePredXact(sxact);
-		LWLockRelease(SerializableXactHashLock);
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("could not import the requested snapshot"),
-				 errdetail("The source process with PID %d is not running anymore.",
-						   sourcepid)));
+		snapshot = GetMVCCSnapshotData();
+		xmin = snapshot->xmin;
+	}
+	else
+	{
+		if (!ProcArrayInstallImportedXmin(sourcexmin, sourcevxid))
+		{
+			ReleasePredXact(sxact);
+			LWLockRelease(SerializableXactHashLock);
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("could not import the requested snapshot"),
+					 errdetail("The source process with PID %d is not running anymore.",
+							   sourcepid)));
+		}
+		snapshot = NULL;
+		xmin = sourcexmin;
 	}
 
 	/*
@@ -1848,7 +1849,7 @@ GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
 	dlist_init(&(sxact->possibleUnsafeConflicts));
 	sxact->topXid = GetTopTransactionIdIfAny();
 	sxact->finishedBefore = InvalidTransactionId;
-	sxact->xmin = snapshot->xmin;
+	sxact->xmin = xmin;
 	sxact->pid = MyProcPid;
 	sxact->pgprocno = MyProcNumber;
 	dlist_init(&sxact->predicateLocks);
@@ -1902,18 +1903,18 @@ GetSerializableTransactionSnapshotInt(MVCCSnapshot snapshot,
 	if (!TransactionIdIsValid(PredXact->SxactGlobalXmin))
 	{
 		Assert(PredXact->SxactGlobalXminCount == 0);
-		PredXact->SxactGlobalXmin = snapshot->xmin;
+		PredXact->SxactGlobalXmin = xmin;
 		PredXact->SxactGlobalXminCount = 1;
-		SerialSetActiveSerXmin(snapshot->xmin);
+		SerialSetActiveSerXmin(xmin);
 	}
-	else if (TransactionIdEquals(snapshot->xmin, PredXact->SxactGlobalXmin))
+	else if (TransactionIdEquals(xmin, PredXact->SxactGlobalXmin))
 	{
 		Assert(PredXact->SxactGlobalXminCount > 0);
 		PredXact->SxactGlobalXminCount++;
 	}
 	else
 	{
-		Assert(TransactionIdFollows(snapshot->xmin, PredXact->SxactGlobalXmin));
+		Assert(TransactionIdFollows(xmin, PredXact->SxactGlobalXmin));
 	}
 
 	MySerializableXact = sxact;
@@ -3968,13 +3969,13 @@ XidIsConcurrent(TransactionId xid)
 
 	snap = (MVCCSnapshot) GetTransactionSnapshot();
 
-	if (TransactionIdPrecedes(xid, snap->xmin))
+	if (TransactionIdPrecedes(xid, snap->shared->xmin))
 		return false;
 
-	if (TransactionIdFollowsOrEquals(xid, snap->xmax))
+	if (TransactionIdFollowsOrEquals(xid, snap->shared->xmax))
 		return true;
 
-	return pg_lfind32(xid, snap->xip, snap->xcnt);
+	return pg_lfind32(xid, snap->shared->xip, snap->shared->xcnt);
 }
 
 bool
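The XidIsConcurrent() hunk above shows the pattern callers follow after this change: the XID data (xmin, xmax, xip, xcnt) is read through snap->shared, while per-backend state stays in the outer struct. Below is a minimal standalone sketch of that lookup. The Toy* types and toy_* functions are hypothetical, the comparison is a simplified modulo-2^32 test that ignores PostgreSQL's special XIDs, and a plain loop replaces pg_lfind32().

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;

/* Hypothetical shared part: the XID data that several snapshots can share. */
typedef struct ToyShared
{
	ToyXid		xmin;			/* all XIDs before this are finished */
	ToyXid		xmax;			/* all XIDs from here on are still running */
	ToyXid	   *xip;			/* XIDs in progress within [xmin, xmax) */
	uint32_t	xcnt;
} ToyShared;

/* Hypothetical per-backend part: private state plus a pointer to the data. */
typedef struct ToySnapshot
{
	ToyShared  *shared;
	uint32_t	curcid;
} ToySnapshot;

/* Simplified circular comparison; special XIDs are not handled. */
bool
toy_xid_precedes(ToyXid a, ToyXid b)
{
	return (int32_t) (a - b) < 0;
}

/* Was xid still running when the snapshot was taken? */
bool
toy_xid_is_concurrent(const ToySnapshot *snap, ToyXid xid)
{
	const ToyShared *sh = snap->shared;

	if (toy_xid_precedes(xid, sh->xmin))
		return false;
	if (!toy_xid_precedes(xid, sh->xmax))
		return true;
	for (uint32_t i = 0; i < sh->xcnt; i++)
	{
		if (sh->xip[i] == xid)
			return true;
	}
	return false;
}
```

With xmin = 100, xmax = 110 and xip = {100, 105}, XID 103 is not concurrent (it had finished), while 105 and anything at or beyond 110 are.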
diff --git a/src/backend/utils/adt/xid8funcs.c b/src/backend/utils/adt/xid8funcs.c
index d4aa8ef9e4e..eef632390cb 100644
--- a/src/backend/utils/adt/xid8funcs.c
+++ b/src/backend/utils/adt/xid8funcs.c
@@ -380,7 +380,7 @@ pg_current_snapshot(PG_FUNCTION_ARGS)
 		elog(ERROR, "no active snapshot set");
 
 	/* allocate */
-	nxip = cur->xcnt;
+	nxip = cur->shared->xcnt;
 	snap = palloc(PG_SNAPSHOT_SIZE(nxip));
 
 	/*
@@ -389,12 +389,12 @@ pg_current_snapshot(PG_FUNCTION_ARGS)
 	 * advance past any of these XIDs.  Hence, these XIDs remain allowable
 	 * relative to next_fxid.
 	 */
-	snap->xmin = FullTransactionIdFromAllowableAt(next_fxid, cur->xmin);
-	snap->xmax = FullTransactionIdFromAllowableAt(next_fxid, cur->xmax);
+	snap->xmin = FullTransactionIdFromAllowableAt(next_fxid, cur->shared->xmin);
+	snap->xmax = FullTransactionIdFromAllowableAt(next_fxid, cur->shared->xmax);
 	snap->nxip = nxip;
 	for (i = 0; i < nxip; i++)
 		snap->xip[i] =
-			FullTransactionIdFromAllowableAt(next_fxid, cur->xip[i]);
+			FullTransactionIdFromAllowableAt(next_fxid, cur->shared->xip[i]);
 
 	/*
 	 * We want them guaranteed to be in ascending order.  This also removes
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c39cc11609..5f9f2b9d8b2 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -122,9 +122,6 @@
  * special-purpose code (say, RI checking.)  CatalogSnapshot points to an
  * MVCC snapshot intended to be used for catalog scans; we must invalidate it
  * whenever a system catalog change occurs.
- *
- * These SnapshotData structs are static to simplify memory allocation
- * (see the hack in GetSnapshotData to avoid repeated malloc/free).
  */
 static MVCCSnapshotData CurrentSnapshotData = {SNAPSHOT_MVCC};
 static MVCCSnapshotData SecondarySnapshotData = {SNAPSHOT_MVCC};
@@ -137,7 +134,7 @@ SnapshotData SnapshotToastData = {SNAPSHOT_TOAST};
 static HistoricMVCCSnapshot HistoricSnapshot = NULL;
 
 /*
- * These are updated by GetSnapshotData.  We initialize them this way
+ * These are updated by GetMVCCSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
  */
@@ -150,14 +147,12 @@ static HTAB *tuplecid_data = NULL;
 /*
  * Elements of the active snapshot stack.
  *
- * Each element here accounts for exactly one active_count on SnapshotData.
- *
  * NB: the code assumes that elements in this list are in non-increasing
  * order of as_level; also, the list must be NULL-terminated.
  */
 typedef struct ActiveSnapshotElt
 {
-	MVCCSnapshot as_snap;
+	MVCCSnapshotData as_snap;
 	int			as_level;
 	struct ActiveSnapshotElt *as_next;
 } ActiveSnapshotElt;
@@ -188,19 +183,23 @@ static bool FirstXactSnapshotRegistered = false;
 typedef struct ExportedSnapshot
 {
 	char	   *snapfile;
-	MVCCSnapshot snapshot;
+	MVCCSnapshotShared snapshot;
 } ExportedSnapshot;
 
 /* Current xact's exported snapshots (a list of ExportedSnapshot structs) */
 static List *exportedSnapshots = NIL;
 
+MVCCSnapshotShared latestSnapshotShared = NULL;
+MVCCSnapshotShared spareSnapshotShared = NULL;
+
 /* Prototypes for local functions */
-static MVCCSnapshot CopyMVCCSnapshot(MVCCSnapshot snapshot);
+static void UpdateStaticMVCCSnapshot(MVCCSnapshot snapshot, MVCCSnapshotShared shared);
 static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeMVCCSnapshot(MVCCSnapshot snapshot);
 static void SnapshotResetXmin(void);
-static void valid_snapshots_push_tail(MVCCSnapshot snapshot);
-static void valid_snapshots_push_out_of_order(MVCCSnapshot snapshot);
+static void ReleaseMVCCSnapshotShared(MVCCSnapshotShared shared);
+static void valid_snapshots_push_tail(MVCCSnapshotShared snapshot);
+static void valid_snapshots_push_out_of_order(MVCCSnapshotShared snapshot);
+
 
 /* ResourceOwner callbacks to track snapshot references */
 static void ResOwnerReleaseSnapshot(Datum res);
@@ -266,6 +265,8 @@ GetTransactionSnapshot(void)
 	/* First call in transaction? */
 	if (!FirstSnapshotSet)
 	{
+		MVCCSnapshotShared shared;
+
 		/*
 		 * Don't allow catalog snapshot to be older than xact snapshot.  Must
 		 * do this first to allow the empty-heap Assert to succeed.
@@ -287,23 +288,18 @@ GetTransactionSnapshot(void)
 		 * mode, predicate.c needs to wrap the snapshot fetch in its own
 		 * processing.
 		 */
+		if (IsolationIsSerializable())
+			shared = GetSerializableTransactionSnapshotData();
+		else
+			shared = GetMVCCSnapshotData();
+
+		UpdateStaticMVCCSnapshot(&CurrentSnapshotData, shared);
+
 		if (IsolationUsesXactSnapshot())
 		{
-			/* First, create the snapshot in CurrentSnapshotData */
-			if (IsolationIsSerializable())
-				GetSerializableTransactionSnapshot(&CurrentSnapshotData);
-			else
-				GetSnapshotData(&CurrentSnapshotData);
-
-			/* Mark it as "registered" */
+			/* keep it */
 			FirstXactSnapshotRegistered = true;
 		}
-		else
-		{
-			GetSnapshotData(&CurrentSnapshotData);
-		}
-		valid_snapshots_push_tail(&CurrentSnapshotData);
-
 		FirstSnapshotSet = true;
 		return (Snapshot) &CurrentSnapshotData;
 	}
@@ -318,14 +314,31 @@ GetTransactionSnapshot(void)
 	/* Don't allow catalog snapshot to be older than xact snapshot. */
 	InvalidateCatalogSnapshot();
 
-	if (CurrentSnapshotData.valid)
-		dlist_delete(&CurrentSnapshotData.node);
-	GetSnapshotData(&CurrentSnapshotData);
-	valid_snapshots_push_tail(&CurrentSnapshotData);
-
+	UpdateStaticMVCCSnapshot(&CurrentSnapshotData, GetMVCCSnapshotData());
 	return (Snapshot) &CurrentSnapshotData;
 }
 
+/*
+ * Update a static snapshot with the given shared struct.
+ *
+ * If the static snapshot was previously valid, release its old 'shared'
+ * struct first.
+ */
+static void
+UpdateStaticMVCCSnapshot(MVCCSnapshot snapshot, MVCCSnapshotShared shared)
+{
+	/* Replace the 'shared' struct */
+	if (snapshot->shared)
+		ReleaseMVCCSnapshotShared(snapshot->shared);
+	snapshot->shared = shared;
+	snapshot->shared->refcount++;
+	if (snapshot->shared->refcount == 1)
+		valid_snapshots_push_tail(shared);
+
+	snapshot->curcid = GetCurrentCommandId(false);
+	snapshot->valid = true;
+}
+
 /*
  * GetLatestSnapshot
  *		Get a snapshot that is up-to-date as of the current instant,
@@ -352,10 +365,7 @@ GetLatestSnapshot(void)
 	if (!FirstSnapshotSet)
 		return GetTransactionSnapshot();
 
-	if (SecondarySnapshotData.valid)
-		dlist_delete(&SecondarySnapshotData.node);
-	GetSnapshotData(&SecondarySnapshotData);
-	valid_snapshots_push_tail(&SecondarySnapshotData);
+	UpdateStaticMVCCSnapshot(&SecondarySnapshotData, GetMVCCSnapshotData());
 
 	return (Snapshot) &SecondarySnapshotData;
 }
@@ -405,7 +415,7 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 	if (!CatalogSnapshotData.valid)
 	{
 		/* Get new snapshot. */
-		GetSnapshotData(&CatalogSnapshotData);
+		UpdateStaticMVCCSnapshot(&CatalogSnapshotData, GetMVCCSnapshotData());
 
 		/*
 		 * Make sure the catalog snapshot will be accounted for in decisions
@@ -419,7 +429,6 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 		 * NB: it had better be impossible for this to throw error, since the
 		 * CatalogSnapshot pointer is already valid.
 		 */
-		valid_snapshots_push_tail(&CatalogSnapshotData);
 	}
 
 	return (Snapshot) &CatalogSnapshotData;
@@ -440,17 +449,20 @@ InvalidateCatalogSnapshot(void)
 {
 	if (CatalogSnapshotData.valid)
 	{
-		dlist_delete(&CatalogSnapshotData.node);
+		ReleaseMVCCSnapshotShared(CatalogSnapshotData.shared);
+		CatalogSnapshotData.shared = NULL;
 		CatalogSnapshotData.valid = false;
 	}
 	if (!FirstXactSnapshotRegistered && CurrentSnapshotData.valid)
 	{
-		dlist_delete(&CurrentSnapshotData.node);
+		ReleaseMVCCSnapshotShared(CurrentSnapshotData.shared);
+		CurrentSnapshotData.shared = NULL;
 		CurrentSnapshotData.valid = false;
 	}
 	if (SecondarySnapshotData.valid)
 	{
-		dlist_delete(&SecondarySnapshotData.node);
+		ReleaseMVCCSnapshotShared(SecondarySnapshotData.shared);
+		SecondarySnapshotData.shared = NULL;
 		SecondarySnapshotData.valid = false;
 	}
 
@@ -465,13 +477,14 @@ InvalidateCatalogSnapshot(void)
  * want to continue holding the catalog snapshot if it might mean that the
  * global xmin horizon can't advance.  However, if there are other snapshots
  * still active or registered, the catalog snapshot isn't likely to be the
- * oldest one, so we might as well keep it.
+ * oldest one, so we might as well keep it. XXX
  */
 void
 InvalidateCatalogSnapshotConditionally(void)
 {
 	if (CatalogSnapshotData.valid &&
-		dlist_head_node(&ValidSnapshots) == &CatalogSnapshotData.node)
+		dlist_tail_node(&ValidSnapshots) == &CatalogSnapshotData.shared->node &&
+		CatalogSnapshotData.shared->refcount == 1)
 		InvalidateCatalogSnapshot();
 }
 
@@ -501,7 +514,7 @@ SnapshotSetCommandId(CommandId curcid)
  * in GetTransactionSnapshot.
  */
 static void
-SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid,
+SetTransactionSnapshot(MVCCSnapshotShared sourcesnap, VirtualTransactionId *sourcevxid,
 					   int sourcepid, PGPROC *sourceproc)
 {
 	/* Caller should have checked this already */
@@ -512,38 +525,25 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 
 	Assert(!FirstXactSnapshotRegistered);
 	Assert(!HistoricSnapshotActive());
+	Assert(sourcesnap->refcount > 0);
 
 	/*
 	 * Even though we are not going to use the snapshot it computes, we must
-	 * call GetSnapshotData, for two reasons: (1) to be sure that
-	 * CurrentSnapshotData's XID arrays have been allocated, and (2) to update
-	 * the state for GlobalVis*.
+	 * call GetMVCCSnapshotData to update the state for GlobalVis*.
 	 */
-	GetSnapshotData(&CurrentSnapshotData);
+	UpdateStaticMVCCSnapshot(&CurrentSnapshotData, GetMVCCSnapshotData());
 
 	/*
 	 * Now copy appropriate fields from the source snapshot.
 	 */
-	CurrentSnapshotData.xmin = sourcesnap->xmin;
-	CurrentSnapshotData.xmax = sourcesnap->xmax;
-	CurrentSnapshotData.xcnt = sourcesnap->xcnt;
-	Assert(sourcesnap->xcnt <= GetMaxSnapshotXidCount());
-	if (sourcesnap->xcnt > 0)
-		memcpy(CurrentSnapshotData.xip, sourcesnap->xip,
-			   sourcesnap->xcnt * sizeof(TransactionId));
-	CurrentSnapshotData.subxcnt = sourcesnap->subxcnt;
-	Assert(sourcesnap->subxcnt <= GetMaxSnapshotSubxidCount());
-	if (sourcesnap->subxcnt > 0)
-		memcpy(CurrentSnapshotData.subxip, sourcesnap->subxip,
-			   sourcesnap->subxcnt * sizeof(TransactionId));
-	CurrentSnapshotData.suboverflowed = sourcesnap->suboverflowed;
-	CurrentSnapshotData.takenDuringRecovery = sourcesnap->takenDuringRecovery;
-	/* NB: curcid should NOT be copied, it's a local matter */
+	ReleaseMVCCSnapshotShared(CurrentSnapshotData.shared);
+	CurrentSnapshotData.shared = sourcesnap;
+	CurrentSnapshotData.shared->refcount++;
 
-	CurrentSnapshotData.snapXactCompletionCount = 0;
+	/* NB: curcid should NOT be copied, it's a local matter */
 
 	/*
-	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
+	 * Now we have to fix what GetMVCCSnapshotData did with MyProc->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
 	 * causing the global xmin to go backwards, we have to test that the
 	 * source transaction is still running, and that has to be done
@@ -555,13 +555,13 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 	 */
 	if (sourceproc != NULL)
 	{
-		if (!ProcArrayInstallRestoredXmin(CurrentSnapshotData.xmin, sourceproc))
+		if (!ProcArrayInstallRestoredXmin(CurrentSnapshotData.shared->xmin, sourceproc))
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("could not import the requested snapshot"),
 					 errdetail("The source transaction is not running anymore.")));
 	}
-	else if (!ProcArrayInstallImportedXmin(CurrentSnapshotData.xmin, sourcevxid))
+	else if (!ProcArrayInstallImportedXmin(CurrentSnapshotData.shared->xmin, sourcevxid))
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("could not import the requested snapshot"),
@@ -577,96 +577,22 @@ SetTransactionSnapshot(MVCCSnapshot sourcesnap, VirtualTransactionId *sourcevxid
 	if (IsolationUsesXactSnapshot())
 	{
 		if (IsolationIsSerializable())
-			SetSerializableTransactionSnapshot(&CurrentSnapshotData, sourcevxid,
-											   sourcepid);
-		/* Mark it as "registered" */
+			SetSerializableTransactionSnapshotData(CurrentSnapshotData.shared,
+												   sourcevxid, sourcepid);
+		/* Mark it as "registered" so that it is kept */
 		FirstXactSnapshotRegistered = true;
 	}
-	valid_snapshots_push_tail(&CurrentSnapshotData);
 
 	FirstSnapshotSet = true;
 }
 
-/*
- * CopyMVCCSnapshot
- *		Copy the given snapshot.
- *
- * The copy is palloc'd in TopTransactionContext and has initial refcounts set
- * to 0.  The returned snapshot has the copied flag set.
- */
-static MVCCSnapshot
-CopyMVCCSnapshot(MVCCSnapshot snapshot)
-{
-	MVCCSnapshot newsnap;
-	Size		subxipoff;
-	Size		size;
-
-	/* We allocate any XID arrays needed in the same palloc block. */
-	size = subxipoff = sizeof(MVCCSnapshotData) +
-		snapshot->xcnt * sizeof(TransactionId);
-	if (snapshot->subxcnt > 0)
-		size += snapshot->subxcnt * sizeof(TransactionId);
-
-	newsnap = (MVCCSnapshot) MemoryContextAlloc(TopTransactionContext, size);
-	memcpy(newsnap, snapshot, sizeof(MVCCSnapshotData));
-
-	newsnap->regd_count = 0;
-	newsnap->active_count = 0;
-	newsnap->copied = true;
-	newsnap->valid = true;
-	newsnap->snapXactCompletionCount = 0;
-
-	/* setup XID array */
-	if (snapshot->xcnt > 0)
-	{
-		newsnap->xip = (TransactionId *) (newsnap + 1);
-		memcpy(newsnap->xip, snapshot->xip,
-			   snapshot->xcnt * sizeof(TransactionId));
-	}
-	else
-		newsnap->xip = NULL;
-
-	/*
-	 * Setup subXID array. Don't bother to copy it if it had overflowed,
-	 * though, because it's not used anywhere in that case. Except if it's a
-	 * snapshot taken during recovery; all the top-level XIDs are in subxip as
-	 * well in that case, so we mustn't lose them.
-	 */
-	if (snapshot->subxcnt > 0 &&
-		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
-	{
-		newsnap->subxip = (TransactionId *) ((char *) newsnap + subxipoff);
-		memcpy(newsnap->subxip, snapshot->subxip,
-			   snapshot->subxcnt * sizeof(TransactionId));
-	}
-	else
-		newsnap->subxip = NULL;
-
-	return newsnap;
-}
-
-/*
- * FreeMVCCSnapshot
- *		Free the memory associated with a snapshot.
- */
-static void
-FreeMVCCSnapshot(MVCCSnapshot snapshot)
-{
-	Assert(snapshot->regd_count == 0);
-	Assert(snapshot->active_count == 0);
-	Assert(snapshot->copied);
-	Assert(snapshot->valid);
-
-	pfree(snapshot);
-}
-
 /*
  * PushActiveSnapshot
  *		Set the given snapshot as the current active snapshot
  *
  * If the passed snapshot is a statically-allocated one, or it is possibly
  * subject to a future command counter update, create a new long-lived copy
- * with active refcount=1.  Otherwise, only increment the refcount.
+ * with active refcount=1.  Otherwise, only increment the refcount. XXX
  *
  * Only regular MVCC snaphots can be used as the active snapshot.
  */
@@ -697,24 +623,13 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 	Assert(ActiveSnapshot == NULL || snap_level >= ActiveSnapshot->as_level);
 
 	newactive = MemoryContextAlloc(TopTransactionContext, sizeof(ActiveSnapshotElt));
-
-	/*
-	 * Checking SecondarySnapshot is probably useless here, but it seems
-	 * better to be sure.
-	 */
-	if (!origsnap->copied)
-	{
-		newactive->as_snap = CopyMVCCSnapshot(origsnap);
-		dlist_insert_after(&origsnap->node, &newactive->as_snap->node);
-	}
-	else
-		newactive->as_snap = origsnap;
+	memcpy(&newactive->as_snap, origsnap, sizeof(MVCCSnapshotData));
+	newactive->as_snap.kind = SNAPSHOT_ACTIVE;
+	newactive->as_snap.shared->refcount++;
 
 	newactive->as_next = ActiveSnapshot;
 	newactive->as_level = snap_level;
 
-	newactive->as_snap->active_count++;
-
 	ActiveSnapshot = newactive;
 }
 
@@ -729,20 +644,20 @@ PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
 void
 PushCopiedSnapshot(Snapshot snapshot)
 {
-	MVCCSnapshot copy;
-
 	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
 
-	copy = CopyMVCCSnapshot(&snapshot->mvcc);
-	dlist_insert_after(&snapshot->mvcc.node, &copy->node);
-	PushActiveSnapshot((Snapshot) copy);
+	/*
+	 * This used to be different from PushActiveSnapshot, but these days
+	 * PushActiveSnapshot creates a copy too and there's no difference.
+	 */
+	PushActiveSnapshot(snapshot);
 }
 
 /*
  * UpdateActiveSnapshotCommandId
  *
  * Update the current CID of the active snapshot.  This can only be applied
- * to a snapshot that is not referenced elsewhere.
+ * to a snapshot that is not referenced elsewhere. XXX
  */
 void
 UpdateActiveSnapshotCommandId(void)
@@ -751,8 +666,6 @@ UpdateActiveSnapshotCommandId(void)
 				curcid;
 
 	Assert(ActiveSnapshot != NULL);
-	Assert(ActiveSnapshot->as_snap->active_count == 1);
-	Assert(ActiveSnapshot->as_snap->regd_count == 0);
 
 	/*
 	 * Don't allow modification of the active snapshot during parallel
@@ -762,11 +675,12 @@ UpdateActiveSnapshotCommandId(void)
 	 * CommandCounterIncrement, but there are a few places that call this
 	 * directly, so we put an additional guard here.
 	 */
-	save_curcid = ActiveSnapshot->as_snap->curcid;
+	save_curcid = ActiveSnapshot->as_snap.curcid;
 	curcid = GetCurrentCommandId(false);
 	if (IsInParallelMode() && save_curcid != curcid)
 		elog(ERROR, "cannot modify commandid in active snapshot during a parallel operation");
-	ActiveSnapshot->as_snap->curcid = curcid;
+
+	ActiveSnapshot->as_snap.curcid = curcid;
 }
 
 /*
@@ -782,16 +696,7 @@ PopActiveSnapshot(void)
 
 	newstack = ActiveSnapshot->as_next;
 
-	Assert(ActiveSnapshot->as_snap->active_count > 0);
-
-	ActiveSnapshot->as_snap->active_count--;
-
-	if (ActiveSnapshot->as_snap->active_count == 0 &&
-		ActiveSnapshot->as_snap->regd_count == 0)
-	{
-		dlist_delete(&ActiveSnapshot->as_snap->node);
-		FreeMVCCSnapshot(ActiveSnapshot->as_snap);
-	}
+	ReleaseMVCCSnapshotShared(ActiveSnapshot->as_snap.shared);
 
 	pfree(ActiveSnapshot);
 	ActiveSnapshot = newstack;
@@ -808,7 +713,7 @@ GetActiveSnapshot(void)
 {
 	Assert(ActiveSnapshot != NULL);
 
-	return (Snapshot) ActiveSnapshot->as_snap;
+	return (Snapshot) &ActiveSnapshot->as_snap;
 }
 
 /*
@@ -844,7 +749,7 @@ RegisterSnapshot(Snapshot snapshot)
 Snapshot
 RegisterSnapshotOnOwner(Snapshot orig_snapshot, ResourceOwner owner)
 {
-	MVCCSnapshot snapshot;
+	MVCCSnapshot newsnap;
 
 	if (orig_snapshot == InvalidSnapshot)
 		return InvalidSnapshot;
@@ -861,22 +766,19 @@ RegisterSnapshotOnOwner(Snapshot orig_snapshot, ResourceOwner owner)
 	}
 
 	Assert(orig_snapshot->snapshot_type == SNAPSHOT_MVCC);
-	snapshot = &orig_snapshot->mvcc;
-	Assert(snapshot->valid);
+	Assert(orig_snapshot->mvcc.valid);
 
-	/* Static snapshot?  Create a persistent copy */
-	if (!snapshot->copied)
-	{
-		snapshot = CopyMVCCSnapshot(snapshot);
-		dlist_insert_after(&orig_snapshot->mvcc.node, &snapshot->node);
-	}
+	/* Create a copy */
+	newsnap = MemoryContextAlloc(TopTransactionContext, sizeof(MVCCSnapshotData));
+	memcpy(newsnap, &orig_snapshot->mvcc, sizeof(MVCCSnapshotData));
+	newsnap->kind = SNAPSHOT_REGISTERED;
+	newsnap->shared->refcount++;
 
 	/* and tell resowner.c about it */
 	ResourceOwnerEnlarge(owner);
-	snapshot->regd_count++;
-	ResourceOwnerRememberSnapshot(owner, (Snapshot) snapshot);
+	ResourceOwnerRememberSnapshot(owner, (Snapshot) newsnap);
 
-	return (Snapshot) snapshot;
+	return (Snapshot) newsnap;
 }
 
 /*
@@ -914,18 +816,12 @@ UnregisterSnapshotNoOwner(Snapshot snapshot)
 {
 	if (snapshot->snapshot_type == SNAPSHOT_MVCC)
 	{
-		MVCCSnapshot mvccsnap = &snapshot->mvcc;
-
-		Assert(mvccsnap->regd_count > 0);
+		Assert(snapshot->mvcc.kind == SNAPSHOT_REGISTERED);
 		Assert(!dlist_is_empty(&ValidSnapshots));
 
-		mvccsnap->regd_count--;
-		if (mvccsnap->regd_count == 0 && mvccsnap->active_count == 0)
-		{
-			dlist_delete(&mvccsnap->node);
-			FreeMVCCSnapshot(mvccsnap);
-			SnapshotResetXmin();
-		}
+		ReleaseMVCCSnapshotShared(snapshot->mvcc.shared);
+		pfree(snapshot);
+		SnapshotResetXmin();
 	}
 	else if (snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
 	{
@@ -963,19 +859,21 @@ UnregisterSnapshotNoOwner(Snapshot snapshot)
 static void
 SnapshotResetXmin(void)
 {
-	MVCCSnapshot minSnapshot;
+	MVCCSnapshotShared minSnapshot;
 
 	/*
 	 * Invalidate these static snapshots so that we can advance xmin.
 	 */
 	if (!FirstXactSnapshotRegistered && CurrentSnapshotData.valid)
 	{
-		dlist_delete(&CurrentSnapshotData.node);
+		ReleaseMVCCSnapshotShared(CurrentSnapshotData.shared);
+		CurrentSnapshotData.shared = NULL;
 		CurrentSnapshotData.valid = false;
 	}
 	if (SecondarySnapshotData.valid)
 	{
-		dlist_delete(&SecondarySnapshotData.node);
+		ReleaseMVCCSnapshotShared(SecondarySnapshotData.shared);
+		SecondarySnapshotData.shared = NULL;
 		SecondarySnapshotData.valid = false;
 	}
 
@@ -988,7 +886,7 @@ SnapshotResetXmin(void)
 		return;
 	}
 
-	minSnapshot = dlist_head_element(MVCCSnapshotData, node, &ValidSnapshots);
+	minSnapshot = dlist_head_element(MVCCSnapshotSharedData, node, &ValidSnapshots);
 
 	if (TransactionIdPrecedes(MyProc->xmin, minSnapshot->xmin))
 		MyProc->xmin = TransactionXmin = minSnapshot->xmin;
@@ -1028,21 +926,7 @@ AtSubAbort_Snapshot(int level)
 
 		next = ActiveSnapshot->as_next;
 
-		/*
-		 * Decrement the snapshot's active count.  If it's still registered or
-		 * marked as active by an outer subtransaction, we can't free it yet.
-		 */
-		Assert(ActiveSnapshot->as_snap->active_count >= 1);
-		ActiveSnapshot->as_snap->active_count -= 1;
-
-		if (ActiveSnapshot->as_snap->active_count == 0 &&
-			ActiveSnapshot->as_snap->regd_count == 0)
-		{
-			dlist_delete(&ActiveSnapshot->as_snap->node);
-			FreeMVCCSnapshot(ActiveSnapshot->as_snap);
-		}
-
-		/* and free the stack element */
+		ReleaseMVCCSnapshotShared(ActiveSnapshot->as_snap.shared);
 		pfree(ActiveSnapshot);
 
 		ActiveSnapshot = next;
@@ -1058,6 +942,8 @@ AtSubAbort_Snapshot(int level)
 void
 AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 {
+	dlist_mutable_iter iter;
+
 	/*
 	 * If we exported any snapshots, clean them up.
 	 */
@@ -1084,7 +970,7 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 				elog(WARNING, "could not unlink file \"%s\": %m",
 					 esnap->snapfile);
 
-			dlist_delete(&esnap->snapshot->node);
+			ReleaseMVCCSnapshotShared(esnap->snapshot);
 		}
 
 		exportedSnapshots = NIL;
@@ -1093,17 +979,20 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	/* Drop all static snapshot */
 	if (CatalogSnapshotData.valid)
 	{
-		dlist_delete(&CatalogSnapshotData.node);
+		ReleaseMVCCSnapshotShared(CatalogSnapshotData.shared);
+		CatalogSnapshotData.shared = NULL;
 		CatalogSnapshotData.valid = false;
 	}
 	if (CurrentSnapshotData.valid)
 	{
-		dlist_delete(&CurrentSnapshotData.node);
+		ReleaseMVCCSnapshotShared(CurrentSnapshotData.shared);
+		CurrentSnapshotData.shared = NULL;
 		CurrentSnapshotData.valid = false;
 	}
 	if (SecondarySnapshotData.valid)
 	{
-		dlist_delete(&SecondarySnapshotData.node);
+		ReleaseMVCCSnapshotShared(SecondarySnapshotData.shared);
+		SecondarySnapshotData.shared = NULL;
 		SecondarySnapshotData.valid = false;
 	}
 
@@ -1124,11 +1013,23 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	 * And reset our state.  We don't need to free the memory explicitly --
 	 * it'll go away with TopTransactionContext.
 	 */
-	ActiveSnapshot = NULL;
-	dlist_init(&ValidSnapshots);
+	dlist_foreach_modify(iter, &ValidSnapshots)
+	{
+		MVCCSnapshotShared cur = dlist_container(MVCCSnapshotSharedData, node, iter.cur);
 
-	CurrentSnapshotData.valid = false;
-	SecondarySnapshotData.valid = false;
+		dlist_delete(iter.cur);
+		cur->refcount = 0;
+		if (cur == latestSnapshotShared)
+		{
+			/* keep it */
+		}
+		else if (spareSnapshotShared == NULL)
+			spareSnapshotShared = cur;
+		else
+			pfree(cur);
+	}
+
+	ActiveSnapshot = NULL;
 	FirstSnapshotSet = false;
 	FirstXactSnapshotRegistered = false;
 
@@ -1151,9 +1052,8 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
  *		snapshot.
  */
 char *
-ExportSnapshot(MVCCSnapshot snapshot)
+ExportSnapshot(MVCCSnapshotShared snapshot)
 {
-	MVCCSnapshot orig_snapshot;
 	TransactionId topXid;
 	TransactionId *children;
 	ExportedSnapshot *esnap;
@@ -1214,21 +1114,16 @@ ExportSnapshot(MVCCSnapshot snapshot)
 	 * Copy the snapshot into TopTransactionContext, add it to the
 	 * exportedSnapshots list, and mark it pseudo-registered.  We do this to
 	 * ensure that the snapshot's xmin is honored for the rest of the
-	 * transaction.
+	 * transaction. XXX
 	 */
-	orig_snapshot = snapshot;
-	snapshot = CopyMVCCSnapshot(orig_snapshot);
-
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 	esnap = (ExportedSnapshot *) palloc(sizeof(ExportedSnapshot));
 	esnap->snapfile = pstrdup(path);
 	esnap->snapshot = snapshot;
+	snapshot->refcount++;
 	exportedSnapshots = lappend(exportedSnapshots, esnap);
 	MemoryContextSwitchTo(oldcxt);
 
-	snapshot->regd_count++;
-	dlist_insert_after(&orig_snapshot->node, &snapshot->node);
-
 	/*
 	 * Fill buf with a text serialization of the snapshot, plus identification
 	 * data about this transaction.  The format expected by ImportSnapshot is
@@ -1248,8 +1143,8 @@ ExportSnapshot(MVCCSnapshot snapshot)
 	/*
 	 * We must include our own top transaction ID in the top-xid data, since
 	 * by definition we will still be running when the importing transaction
-	 * adopts the snapshot, but GetSnapshotData never includes our own XID in
-	 * the snapshot.  (There must, therefore, be enough room to add it.)
+	 * adopts the snapshot, but GetMVCCSnapshotData never includes our own XID
+	 * in the snapshot.  (There must, therefore, be enough room to add it.)
 	 *
 	 * However, it could be that our topXid is after the xmax, in which case
 	 * we shouldn't include it because xip[] members are expected to be before
@@ -1334,7 +1229,7 @@ pg_export_snapshot(PG_FUNCTION_ARGS)
 {
 	char	   *snapshotName;
 
-	snapshotName = ExportSnapshot((MVCCSnapshot) GetActiveSnapshot());
+	snapshotName = ExportSnapshot(((MVCCSnapshot) GetActiveSnapshot())->shared);
 	PG_RETURN_TEXT_P(cstring_to_text(snapshotName));
 }
 
@@ -1438,7 +1333,7 @@ ImportSnapshot(const char *idstr)
 	Oid			src_dbid;
 	int			src_isolevel;
 	bool		src_readonly;
-	MVCCSnapshotData snapshot;
+	MVCCSnapshotShared snapshot;
 
 	/*
 	 * Must be at top level of a fresh transaction.  Note in particular that
@@ -1508,8 +1403,6 @@ ImportSnapshot(const char *idstr)
 	/*
 	 * Construct a snapshot struct by parsing the file content.
 	 */
-	memset(&snapshot, 0, sizeof(snapshot));
-
 	parseVxidFromText("vxid:", &filebuf, path, &src_vxid);
 	src_pid = parseIntFromText("pid:", &filebuf, path);
 	/* we abuse parseXidFromText a bit here ... */
@@ -1517,12 +1410,11 @@ ImportSnapshot(const char *idstr)
 	src_isolevel = parseIntFromText("iso:", &filebuf, path);
 	src_readonly = parseIntFromText("ro:", &filebuf, path);
 
-	snapshot.snapshot_type = SNAPSHOT_MVCC;
-
-	snapshot.xmin = parseXidFromText("xmin:", &filebuf, path);
-	snapshot.xmax = parseXidFromText("xmax:", &filebuf, path);
+	snapshot = AllocMVCCSnapshotShared();
+	snapshot->xmin = parseXidFromText("xmin:", &filebuf, path);
+	snapshot->xmax = parseXidFromText("xmax:", &filebuf, path);
 
-	snapshot.xcnt = xcnt = parseIntFromText("xcnt:", &filebuf, path);
+	snapshot->xcnt = xcnt = parseIntFromText("xcnt:", &filebuf, path);
 
 	/* sanity-check the xid count before palloc */
 	if (xcnt < 0 || xcnt > GetMaxSnapshotXidCount())
@@ -1530,15 +1422,15 @@ ImportSnapshot(const char *idstr)
 				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 				 errmsg("invalid snapshot data in file \"%s\"", path)));
 
-	snapshot.xip = (TransactionId *) palloc(xcnt * sizeof(TransactionId));
+	/* xip already points into the AllocMVCCSnapshotShared() allocation */
 	for (i = 0; i < xcnt; i++)
-		snapshot.xip[i] = parseXidFromText("xip:", &filebuf, path);
+		snapshot->xip[i] = parseXidFromText("xip:", &filebuf, path);
 
-	snapshot.suboverflowed = parseIntFromText("sof:", &filebuf, path);
+	snapshot->suboverflowed = parseIntFromText("sof:", &filebuf, path);
 
-	if (!snapshot.suboverflowed)
+	if (!snapshot->suboverflowed)
 	{
-		snapshot.subxcnt = xcnt = parseIntFromText("sxcnt:", &filebuf, path);
+		snapshot->subxcnt = xcnt = parseIntFromText("sxcnt:", &filebuf, path);
 
 		/* sanity-check the xid count before palloc */
 		if (xcnt < 0 || xcnt > GetMaxSnapshotSubxidCount())
@@ -1546,17 +1438,19 @@ ImportSnapshot(const char *idstr)
 					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 					 errmsg("invalid snapshot data in file \"%s\"", path)));
 
-		snapshot.subxip = (TransactionId *) palloc(xcnt * sizeof(TransactionId));
+		/* subxip already points into the AllocMVCCSnapshotShared() block */
 		for (i = 0; i < xcnt; i++)
-			snapshot.subxip[i] = parseXidFromText("sxp:", &filebuf, path);
+			snapshot->subxip[i] = parseXidFromText("sxp:", &filebuf, path);
 	}
 	else
 	{
-		snapshot.subxcnt = 0;
-		snapshot.subxip = NULL;
+		snapshot->subxcnt = 0;
 	}
 
-	snapshot.takenDuringRecovery = parseIntFromText("rec:", &filebuf, path);
+	snapshot->takenDuringRecovery = parseIntFromText("rec:", &filebuf, path);
+
+	snapshot->refcount = 1;
+	valid_snapshots_push_out_of_order(snapshot);
 
 	/*
 	 * Do some additional sanity checking, just to protect ourselves.  We
@@ -1565,8 +1459,8 @@ ImportSnapshot(const char *idstr)
 	 */
 	if (!VirtualTransactionIdIsValid(src_vxid) ||
 		!OidIsValid(src_dbid) ||
-		!TransactionIdIsNormal(snapshot.xmin) ||
-		!TransactionIdIsNormal(snapshot.xmax))
+		!TransactionIdIsNormal(snapshot->xmin) ||
+		!TransactionIdIsNormal(snapshot->xmax))
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 				 errmsg("invalid snapshot data in file \"%s\"", path)));
@@ -1604,7 +1498,7 @@ ImportSnapshot(const char *idstr)
 				 errmsg("cannot import a snapshot from a different database")));
 
 	/* OK, install the snapshot */
-	SetTransactionSnapshot(&snapshot, &src_vxid, src_pid, NULL);
+	SetTransactionSnapshot(snapshot, &src_vxid, src_pid, NULL);
 }
 
 /*
@@ -1670,18 +1564,21 @@ ThereAreNoPriorRegisteredSnapshots(void)
 
 	dlist_foreach(iter, &ValidSnapshots)
 	{
-		MVCCSnapshot cur = dlist_container(MVCCSnapshotData, node, iter.cur);
+		MVCCSnapshotShared cur =
+			dlist_container(MVCCSnapshotSharedData, node, iter.cur);
+		uint32		allowedcount = 0;
 
 		if (FirstXactSnapshotRegistered)
 		{
 			Assert(CurrentSnapshotData.valid);
-			if (cur != &CurrentSnapshotData)
-				continue;
+			if (cur == CurrentSnapshotData.shared)
+				allowedcount++;
 		}
-		if (ActiveSnapshot && cur == ActiveSnapshot->as_snap)
-			continue;
+		if (ActiveSnapshot && cur == ActiveSnapshot->as_snap.shared)
+			allowedcount++;
 
-		return false;
+		if (cur->refcount != allowedcount)
+			return false;
 	}
 
 	return true;
@@ -1707,8 +1604,9 @@ HaveRegisteredOrActiveSnapshot(void)
 	 * registered more than one snapshot has to be in ValidSnapshots.
 	 */
 	if (CatalogSnapshotData.valid &&
-		dlist_head_node(&ValidSnapshots) == &CatalogSnapshotData.node &&
-		dlist_tail_node(&ValidSnapshots) == &CatalogSnapshotData.node)
+		CatalogSnapshotData.shared->refcount == 1 &&
+		dlist_head_node(&ValidSnapshots) == &CatalogSnapshotData.shared->node &&
+		dlist_tail_node(&ValidSnapshots) == &CatalogSnapshotData.shared->node)
 	{
 		return false;
 	}
@@ -1775,11 +1673,11 @@ EstimateSnapshotSpace(MVCCSnapshot snapshot)
 
 	/* We allocate any XID arrays needed in the same palloc block. */
 	size = add_size(sizeof(SerializedSnapshotData),
-					mul_size(snapshot->xcnt, sizeof(TransactionId)));
-	if (snapshot->subxcnt > 0 &&
-		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
+					mul_size(snapshot->shared->xcnt, sizeof(TransactionId)));
+	if (snapshot->shared->subxcnt > 0 &&
+		(!snapshot->shared->suboverflowed || snapshot->shared->takenDuringRecovery))
 		size = add_size(size,
-						mul_size(snapshot->subxcnt, sizeof(TransactionId)));
+						mul_size(snapshot->shared->subxcnt, sizeof(TransactionId)));
 
 	return size;
 }
@@ -1794,15 +1692,15 @@ SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
 {
 	SerializedSnapshotData serialized_snapshot;
 
-	Assert(snapshot->subxcnt >= 0);
+	Assert(snapshot->shared->subxcnt >= 0);
 
 	/* Copy all required fields */
-	serialized_snapshot.xmin = snapshot->xmin;
-	serialized_snapshot.xmax = snapshot->xmax;
-	serialized_snapshot.xcnt = snapshot->xcnt;
-	serialized_snapshot.subxcnt = snapshot->subxcnt;
-	serialized_snapshot.suboverflowed = snapshot->suboverflowed;
-	serialized_snapshot.takenDuringRecovery = snapshot->takenDuringRecovery;
+	serialized_snapshot.xmin = snapshot->shared->xmin;
+	serialized_snapshot.xmax = snapshot->shared->xmax;
+	serialized_snapshot.xcnt = snapshot->shared->xcnt;
+	serialized_snapshot.subxcnt = snapshot->shared->subxcnt;
+	serialized_snapshot.suboverflowed = snapshot->shared->suboverflowed;
+	serialized_snapshot.takenDuringRecovery = snapshot->shared->takenDuringRecovery;
 	serialized_snapshot.curcid = snapshot->curcid;
 
 	/*
@@ -1810,7 +1708,7 @@ SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
 	 * taken during recovery - in that case, top-level XIDs are in subxip as
 	 * well, and we mustn't lose them.
 	 */
-	if (serialized_snapshot.suboverflowed && !snapshot->takenDuringRecovery)
+	if (serialized_snapshot.suboverflowed && !snapshot->shared->takenDuringRecovery)
 		serialized_snapshot.subxcnt = 0;
 
 	/* Copy struct to possibly-unaligned buffer */
@@ -1818,10 +1716,10 @@ SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
 		   &serialized_snapshot, sizeof(SerializedSnapshotData));
 
 	/* Copy XID array */
-	if (snapshot->xcnt > 0)
+	if (snapshot->shared->xcnt > 0)
 		memcpy((TransactionId *) (start_address +
 								  sizeof(SerializedSnapshotData)),
-			   snapshot->xip, snapshot->xcnt * sizeof(TransactionId));
+			   snapshot->shared->xip, snapshot->shared->xcnt * sizeof(TransactionId));
 
 	/*
 	 * Copy SubXID array. Don't bother to copy it if it had overflowed,
@@ -1832,10 +1730,10 @@ SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
 	if (serialized_snapshot.subxcnt > 0)
 	{
 		Size		subxipoff = sizeof(SerializedSnapshotData) +
-			snapshot->xcnt * sizeof(TransactionId);
+			snapshot->shared->xcnt * sizeof(TransactionId);
 
 		memcpy((TransactionId *) (start_address + subxipoff),
-			   snapshot->subxip, snapshot->subxcnt * sizeof(TransactionId));
+			   snapshot->shared->subxip, snapshot->shared->subxcnt * sizeof(TransactionId));
 	}
 }
 
@@ -1863,49 +1761,46 @@ RestoreSnapshot(char *start_address)
 	size = sizeof(MVCCSnapshotData)
 		+ serialized_snapshot.xcnt * sizeof(TransactionId)
 		+ serialized_snapshot.subxcnt * sizeof(TransactionId);
+	Assert(serialized_snapshot.xcnt <= GetMaxSnapshotXidCount());
+	Assert(serialized_snapshot.subxcnt <= GetMaxSnapshotSubxidCount());
 
 	/* Copy all required fields */
 	snapshot = (MVCCSnapshot) MemoryContextAlloc(TopTransactionContext, size);
 	snapshot->snapshot_type = SNAPSHOT_MVCC;
-	snapshot->xmin = serialized_snapshot.xmin;
-	snapshot->xmax = serialized_snapshot.xmax;
-	snapshot->xip = NULL;
-	snapshot->xcnt = serialized_snapshot.xcnt;
-	snapshot->subxip = NULL;
-	snapshot->subxcnt = serialized_snapshot.subxcnt;
-	snapshot->suboverflowed = serialized_snapshot.suboverflowed;
-	snapshot->takenDuringRecovery = serialized_snapshot.takenDuringRecovery;
+	snapshot->kind = SNAPSHOT_REGISTERED;
+	snapshot->shared = AllocMVCCSnapshotShared();
+	snapshot->shared->xmin = serialized_snapshot.xmin;
+	snapshot->shared->xmax = serialized_snapshot.xmax;
+	snapshot->shared->xcnt = serialized_snapshot.xcnt;
+	snapshot->shared->subxcnt = serialized_snapshot.subxcnt;
+	snapshot->shared->suboverflowed = serialized_snapshot.suboverflowed;
+	snapshot->shared->takenDuringRecovery = serialized_snapshot.takenDuringRecovery;
+	snapshot->shared->snapXactCompletionCount = 0;
+
+	snapshot->shared->refcount = 1;
+	valid_snapshots_push_out_of_order(snapshot->shared);
+
 	snapshot->curcid = serialized_snapshot.curcid;
-	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot.xcnt > 0)
 	{
-		snapshot->xip = (TransactionId *) (snapshot + 1);
-		memcpy(snapshot->xip, serialized_xids,
+		memcpy(snapshot->shared->xip, serialized_xids,
 			   serialized_snapshot.xcnt * sizeof(TransactionId));
 	}
 
 	/* Copy SubXIDs, if present. */
 	if (serialized_snapshot.subxcnt > 0)
 	{
-		snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
-			serialized_snapshot.xcnt;
-		memcpy(snapshot->subxip, serialized_xids + serialized_snapshot.xcnt,
+		memcpy(snapshot->shared->subxip, serialized_xids + serialized_snapshot.xcnt,
 			   serialized_snapshot.subxcnt * sizeof(TransactionId));
 	}
 
-	/* Set the copied flag so that the caller will set refcounts correctly. */
-	snapshot->regd_count = 0;
-	snapshot->active_count = 0;
-	snapshot->copied = true;
 	snapshot->valid = true;
 
 	/* and tell resowner.c about it, just like RegisterSnapshot() */
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	snapshot->regd_count++;
 	ResourceOwnerRememberSnapshot(CurrentResourceOwner, (Snapshot) snapshot);
-	valid_snapshots_push_out_of_order(snapshot);
 
 	return snapshot;
 }
@@ -1919,21 +1814,21 @@ RestoreSnapshot(char *start_address)
 void
 RestoreTransactionSnapshot(MVCCSnapshot snapshot, void *source_pgproc)
 {
-	SetTransactionSnapshot(snapshot, NULL, InvalidPid, source_pgproc);
+	SetTransactionSnapshot(snapshot->shared, NULL, InvalidPid, source_pgproc);
 }
 
 /*
  * XidInMVCCSnapshot
  *		Is the given XID still-in-progress according to the snapshot?
  *
- * Note: GetSnapshotData never stores either top xid or subxids of our own
- * backend into a snapshot, so these xids will not be reported as "running"
- * by this function.  This is OK for current uses, because we always check
- * TransactionIdIsCurrentTransactionId first, except when it's known the
- * XID could not be ours anyway.
+ * Note: GetMVCCSnapshotData never stores either top xid or subxids of our own
+ * backend into a snapshot, so these xids will not be reported as "running" by
+ * this function.  This is OK for current uses, because we always check
+ * TransactionIdIsCurrentTransactionId first, except when it's known the XID
+ * could not be ours anyway.
  */
 bool
-XidInMVCCSnapshot(TransactionId xid, MVCCSnapshot snapshot)
+XidInMVCCSnapshot(TransactionId xid, MVCCSnapshotShared snapshot)
 {
 	/*
 	 * Make a quick range check to eliminate most XIDs without looking at the
@@ -2029,6 +1924,84 @@ XidInMVCCSnapshot(TransactionId xid, MVCCSnapshot snapshot)
 	return false;
 }
 
+/*
+ * Allocate an MVCCSnapshotShared struct
+ *
+ * The 'xip' and 'subxip' arrays are allocated so that they can hold the max
+ * number of XIDs. That's usually overkill, but it allows us to do the
+ * allocation while not holding ProcArrayLock.
+ *
+ * MVCCSnapshotShared structs are kept in TopMemoryContext and refcounted.
+ * The refcount is initially zero; the caller is expected to increment it.
+ */
+MVCCSnapshotShared
+AllocMVCCSnapshotShared(void)
+{
+	MemoryContext save_cxt;
+	MVCCSnapshotShared shared;
+	size_t		size;
+	char	   *p;
+
+	/*
+	 * To reduce alloc/free overhead in GetMVCCSnapshotData(), we have a
+	 * single-element pool.
+	 */
+	if (spareSnapshotShared)
+	{
+		shared = spareSnapshotShared;
+		spareSnapshotShared = NULL;
+		return shared;
+	}
+
+	save_cxt = MemoryContextSwitchTo(TopMemoryContext);
+
+	size = sizeof(MVCCSnapshotSharedData) +
+		GetMaxSnapshotXidCount() * sizeof(TransactionId) +
+		GetMaxSnapshotSubxidCount() * sizeof(TransactionId);
+	p = palloc(size);
+
+	shared = (MVCCSnapshotShared) p;
+	p += sizeof(MVCCSnapshotSharedData);
+	shared->xip = (TransactionId *) p;
+	p += GetMaxSnapshotXidCount() * sizeof(TransactionId);
+	shared->subxip = (TransactionId *) p;
+
+	shared->snapXactCompletionCount = 0;
+	shared->refcount = 0;
+
+	MemoryContextSwitchTo(save_cxt);
+
+	return shared;
+}
+
+/*
+ * Decrement the refcount on an MVCCSnapshotShared struct, freeing it if it
+ * reaches zero.
+ */
+static void
+ReleaseMVCCSnapshotShared(MVCCSnapshotShared shared)
+{
+	Assert(shared->refcount > 0);
+	shared->refcount--;
+
+	if (shared->refcount == 0)
+	{
+		dlist_delete(&shared->node);
+		if (shared != latestSnapshotShared)
+			FreeMVCCSnapshotShared(shared);
+	}
+}
+
+void
+FreeMVCCSnapshotShared(MVCCSnapshotShared shared)
+{
+	Assert(shared->refcount == 0);
+	if (spareSnapshotShared == NULL)
+		spareSnapshotShared = shared;
+	else
+		pfree(shared);
+}
+
 /* ResourceOwner callbacks */
 
 static void
@@ -2042,12 +2015,13 @@ ResOwnerReleaseSnapshot(Datum res)
 
 /* dlist_push_tail, with assertion that the list stays ordered by xmin */
 static void
-valid_snapshots_push_tail(MVCCSnapshot snapshot)
+valid_snapshots_push_tail(MVCCSnapshotShared snapshot)
 {
 #ifdef USE_ASSERT_CHECKING
 	if (!dlist_is_empty(&ValidSnapshots))
 	{
-		MVCCSnapshot tail = dlist_tail_element(MVCCSnapshotData, node, &ValidSnapshots);
+		MVCCSnapshotShared tail =
+			dlist_tail_element(MVCCSnapshotSharedData, node, &ValidSnapshots);
 
 		Assert(TransactionIdFollowsOrEquals(snapshot->xmin, tail->xmin));
 	}
@@ -2062,13 +2036,14 @@ valid_snapshots_push_tail(MVCCSnapshot snapshot)
  * the list is small.
  */
 static void
-valid_snapshots_push_out_of_order(MVCCSnapshot snapshot)
+valid_snapshots_push_out_of_order(MVCCSnapshotShared snapshot)
 {
 	dlist_iter	iter;
 
 	dlist_foreach(iter, &ValidSnapshots)
 	{
-		MVCCSnapshot cur = dlist_container(MVCCSnapshotData, node, iter.cur);
+		MVCCSnapshotShared cur =
+			dlist_container(MVCCSnapshotSharedData, node, iter.cur);
 
 		if (TransactionIdFollowsOrEquals(snapshot->xmin, cur->xmin))
 		{
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..e71c660118e 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -242,8 +242,8 @@ typedef struct TransamVariablesData
 	 * Number of top-level transactions with xids (i.e. which may have
 	 * modified the database) that completed in some form since the start of
 	 * the server. This currently is solely used to check whether
-	 * GetSnapshotData() needs to recompute the contents of the snapshot, or
-	 * not. There are likely other users of this.  Always above 1.
+	 * GetMVCCSnapshotData() needs to recompute the contents of the snapshot,
+	 * or not. There are likely other users of this.  Always above 1.
 	 */
 	uint64		xactCompletionCount;
 
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 6a78dfeac96..e68862576ee 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -47,10 +47,10 @@ extern void CheckPointPredicate(void);
 extern bool PageIsPredicateLocked(Relation relation, BlockNumber blkno);
 
 /* predicate lock maintenance */
-extern MVCCSnapshot GetSerializableTransactionSnapshot(MVCCSnapshot snapshot);
-extern void SetSerializableTransactionSnapshot(MVCCSnapshot snapshot,
-											   VirtualTransactionId *sourcevxid,
-											   int sourcepid);
+extern MVCCSnapshotShared GetSerializableTransactionSnapshotData(void);
+extern void SetSerializableTransactionSnapshotData(MVCCSnapshotShared snapshot,
+												   VirtualTransactionId *sourcevxid,
+												   int sourcepid);
 extern void RegisterPredicateLockingXid(TransactionId xid);
 extern void PredicateLockRelation(Relation relation, Snapshot snapshot);
 extern void PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index f51b03d3822..46b58a17489 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -324,7 +324,7 @@ extern PGDLLIMPORT PGPROC *MyProc;
  * Adding/Removing an entry into the procarray requires holding *both*
  * ProcArrayLock and XidGenLock in exclusive mode (in that order). Both are
  * needed because the dense arrays (see below) are accessed from
- * GetNewTransactionId() and GetSnapshotData(), and we don't want to add
+ * GetNewTransactionId() and GetMVCCSnapshotData(), and we don't want to add
  * further contention by both using the same lock. Adding/Removing a procarray
  * entry is much less frequent.
  *
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 7f5727c2586..8eedc2d6b9f 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -44,7 +44,7 @@ extern void KnownAssignedTransactionIdsIdleMaintenance(void);
 extern int	GetMaxSnapshotXidCount(void);
 extern int	GetMaxSnapshotSubxidCount(void);
 
-extern MVCCSnapshot GetSnapshotData(MVCCSnapshot snapshot);
+extern MVCCSnapshotShared GetMVCCSnapshotData(void);
 
 extern bool ProcArrayInstallImportedXmin(TransactionId xmin,
 										 VirtualTransactionId *sourcevxid);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 1f627ff966d..36c6043740f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -56,6 +56,13 @@ extern PGDLLIMPORT SnapshotData SnapshotToastData;
 	((snapshot)->snapshot_type == SNAPSHOT_MVCC || \
 	 (snapshot)->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
 
+/* exported so that GetMVCCSnapshotData() can access these */
+extern MVCCSnapshotShared latestSnapshotShared;
+extern MVCCSnapshotShared spareSnapshotShared;
+
+extern MVCCSnapshotShared AllocMVCCSnapshotShared(void);
+extern void FreeMVCCSnapshotShared(MVCCSnapshotShared shared);
+
 extern Snapshot GetTransactionSnapshot(void);
 extern Snapshot GetLatestSnapshot(void);
 extern void SnapshotSetCommandId(CommandId curcid);
@@ -89,7 +96,7 @@ extern void WaitForOlderSnapshots(TransactionId limitXmin, bool progress);
 extern bool ThereAreNoPriorRegisteredSnapshots(void);
 extern bool HaveRegisteredOrActiveSnapshot(void);
 
-extern char *ExportSnapshot(MVCCSnapshot snapshot);
+extern char *ExportSnapshot(MVCCSnapshotShared snapshot);
 
 /*
  * These live in procarray.c because they're intimately linked to the
@@ -105,7 +112,7 @@ extern bool GlobalVisCheckRemovableFullXid(Relation rel, FullTransactionId fxid)
 /*
  * Utility functions for implementing visibility routines in table AMs.
  */
-extern bool XidInMVCCSnapshot(TransactionId xid, MVCCSnapshot snapshot);
+extern bool XidInMVCCSnapshot(TransactionId xid, MVCCSnapshotShared snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 44b3b20f73c..193366ce052 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -119,17 +119,44 @@ typedef enum SnapshotType
 	SNAPSHOT_NON_VACUUMABLE,
 } SnapshotType;
 
+typedef struct MVCCSnapshotSharedData *MVCCSnapshotShared;
+
+typedef enum MVCCSnapshotKind
+{
+	SNAPSHOT_STATIC,
+	SNAPSHOT_ACTIVE,
+	SNAPSHOT_REGISTERED,
+} MVCCSnapshotKind;
+
 /*
  * Struct representing a normal MVCC snapshot.
  *
  * MVCC snapshots come in two variants: those taken during recovery in hot
  * standby mode, and "normal" MVCC snapshots.  They are distinguished by
- * takenDuringRecovery.
+ * shared->takenDuringRecovery.
  */
 typedef struct MVCCSnapshotData
 {
 	SnapshotType snapshot_type; /* type of snapshot, must be first */
 
+	/*
+	 * Most fields are in this separate struct which can be reused and shared
+	 * between snapshots that only differ in the command ID.  It is reference
+	 * counted separately.
+	 */
+	MVCCSnapshotShared shared;
+
+	CommandId	curcid;			/* in my xact, CID < curcid are visible */
+
+	/*
+	 * Book-keeping information, used by the snapshot manager
+	 */
+	MVCCSnapshotKind kind;
+	bool		valid;
+} MVCCSnapshotData;
+
+typedef struct MVCCSnapshotSharedData
+{
 	/*
 	 * An MVCC snapshot can never see the effects of XIDs >= xmax. It can see
 	 * the effects of all older XIDs except those listed in the snapshot. xmin
@@ -160,25 +187,17 @@ typedef struct MVCCSnapshotData
 	bool		suboverflowed;	/* has the subxip array overflowed? */
 
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
-	bool		copied;			/* false if it's a static snapshot */
-	bool		valid;			/* is this snapshot valid? */
-
-	CommandId	curcid;			/* in my xact, CID < curcid are visible */
-
-	/*
-	 * Book-keeping information, used by the snapshot manager
-	 */
-	uint32		active_count;	/* refcount on ActiveSnapshot stack */
-	uint32		regd_count;		/* refcount of registrations in resowners */
-	dlist_node	node;			/* link in ValidSnapshots */
 
 	/*
-	 * The transaction completion count at the time GetSnapshotData() built
-	 * this snapshot. Allows to avoid re-computing static snapshots when no
-	 * transactions completed since the last GetSnapshotData().
+	 * The transaction completion count at the time GetMVCCSnapshotData()
+	 * built this snapshot. Allows to avoid re-computing static snapshots when
+	 * no transactions completed since the last GetMVCCSnapshotData().
 	 */
 	uint64		snapXactCompletionCount;
-} MVCCSnapshotData;
+
+	uint32		refcount;
+	dlist_node	node;			/* link in ValidSnapshots */
+} MVCCSnapshotSharedData;
 
 typedef struct MVCCSnapshotData *MVCCSnapshot;
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c8ed18cf580..990c83c902a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1636,6 +1636,8 @@ MINIDUMP_TYPE
 MJEvalResult
 MTTargetRelLookup
 MVCCSnapshotData
+MVCCSnapshotKind
+MVCCSnapshotSharedData
 MVDependencies
 MVDependency
 MVNDistinct
-- 
2.39.5
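
The struct comment in the patch above says that most snapshot fields move into a separate, reference-counted struct that can be shared between snapshots differing only in command ID. A minimal sketch of that idea follows. All `Sketch*` and `sketch_*` names are hypothetical stand-ins rather than the patch's types or functions, and the real structs carry many more fields (xip arrays, the ValidSnapshots list link, and so on).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for MVCCSnapshotSharedData: the expensive part. */
typedef struct SketchShared
{
	uint32_t	xmin;
	uint32_t	xmax;
	uint32_t	refcount;		/* number of snapshots pointing at this */
} SketchShared;

/* Hypothetical stand-in for MVCCSnapshotData: the cheap per-command part. */
typedef struct SketchSnapshot
{
	SketchShared *shared;
	uint32_t	curcid;
} SketchSnapshot;

/* Count of live shared structs; exists only to make the sketch checkable. */
static int	sketch_live_shared = 0;

static SketchShared *
sketch_shared_alloc(uint32_t xmin, uint32_t xmax)
{
	SketchShared *s = malloc(sizeof(SketchShared));

	if (s == NULL)
		abort();
	s->xmin = xmin;
	s->xmax = xmax;
	s->refcount = 0;
	sketch_live_shared++;
	return s;
}

/* Build a snapshot on top of an existing shared part, taking a reference. */
static SketchSnapshot
sketch_snapshot_make(SketchShared *shared, uint32_t curcid)
{
	SketchSnapshot snap = {shared, curcid};

	shared->refcount++;
	return snap;
}

/* Drop one reference; the shared part is freed with its last user. */
static void
sketch_snapshot_release(SketchSnapshot *snap)
{
	assert(snap->shared != NULL && snap->shared->refcount > 0);
	if (--snap->shared->refcount == 0)
	{
		free(snap->shared);
		sketch_live_shared--;
	}
	snap->shared = NULL;
}
```

In this sketch, two snapshots made from the same shared part with different `curcid` values cost one allocation between them, and the shared part goes away only when the second one is released.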

v7-0008-XXX-add-perf-test.patch (application/octet-stream)

From 511f67bc9579c5fcec923fa0fcb20370547561f2 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 31 Mar 2025 22:29:44 +0300
Subject: [PATCH v6 08/12] XXX: add perf test

This is not intended to be merged. But it's been useful to have this
in the tree for some quick perf testing during development.

To run it, I've used:

(cd build-release && ninja &&  rm -rf tmp_install && meson test --suite setup --suite test_misc; grep TEST testrun/test_misc/000_csn_perf/log/regress_log_000_csn_perf )

It runs the other test_misc tests concurrently, but they finish a lot
faster so they don't affect the results much.
---
 src/test/modules/test_misc/meson.build       |   1 +
 src/test/modules/test_misc/t/000_csn_perf.pl | 337 +++++++++++++++++++
 2 files changed, 338 insertions(+)
 create mode 100644 src/test/modules/test_misc/t/000_csn_perf.pl

diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 9c50de7efb0..1c385123448 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -9,6 +9,7 @@ tests += {
        'enable_injection_points': get_option('injection_points') ? 'yes' : 'no',
     },
     'tests': [
+      't/000_csn_perf.pl',
       't/001_constraint_validation.pl',
       't/002_tablespace.pl',
       't/003_check_guc.pl',
diff --git a/src/test/modules/test_misc/t/000_csn_perf.pl b/src/test/modules/test_misc/t/000_csn_perf.pl
new file mode 100644
index 00000000000..3915878a407
--- /dev/null
+++ b/src/test/modules/test_misc/t/000_csn_perf.pl
@@ -0,0 +1,337 @@
+
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Quick performance test for snapshot visibility checks on a hot standby
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(time);
+
+my $duration = 15; # seconds
+my $miniterations = 3;
+
+# Initialize a test cluster
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init();
+# Enable streaming replication, and allow enough connections for the tests below
+$primary->append_conf('postgresql.conf', 'max_wal_senders = 5');
+$primary->append_conf('postgresql.conf', 'wal_level=replica');
+$primary->append_conf('postgresql.conf', 'max_connections = 1005');
+$primary->start;
+$primary->backup('bkp');
+
+my $replica = PostgreSQL::Test::Cluster->new('replica');
+$replica->init_from_backup($primary, 'bkp', has_streaming => 1);
+$replica->append_conf('postgresql.conf', "shared_buffers='1 GB'");
+$replica->start;
+
+sub wait_catchup
+{
+	my ($primary, $replica) = @_;
+	
+	my $primary_lsn =
+	  $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+	my $caughtup_query =
+	  "SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()";
+	$replica->poll_query_until('postgres', $caughtup_query)
+	  or die "Timed out while waiting for standby to catch up";
+}
+
+sub repeat_and_time_sql
+{
+  	my ($name, $node, $sql) = @_;
+
+	my $session =  $node->background_psql('postgres', on_error_die => 1);
+	$session->query_safe("SET max_parallel_workers_per_gather=0");
+
+	my $iterations = 0;
+
+	my $now;
+	my $elapsed;
+    my $begin_time = time();
+	while (1) {
+		$session->query_safe($sql);
+		$now = time();
+		$iterations = $iterations + 1;
+
+		$elapsed = $now - $begin_time;
+		if ($elapsed > $duration && $iterations >= $miniterations) {
+			last;
+		}
+	}
+
+	my $periter = $elapsed / $iterations;
+
+	pass ("TEST $name: $elapsed s, $iterations iterations, $periter s / iteration");
+}
+
+
+$primary->safe_psql('postgres', "CREATE TABLE little (i int);");
+$primary->safe_psql('postgres', "INSERT INTO little VALUES (1);");
+
+sub consume_xids
+{
+	my ($node) = @_;
+
+	my $session = $node->background_psql('postgres', on_error_die => 1);
+	for(my $i = 0; $i < 20; $i++) {
+		$session->query_safe(q{do $$
+  begin
+    for i in 1..50 loop
+      begin
+        DELETE from little;
+        perform 1 / 0;
+      exception
+        when division_by_zero then perform 0 /* do nothing */;
+        when others then raise 'fail: %', sqlerrm;
+      end;
+    end loop;
+  end
+$$;});
+	}
+	$session->quit;
+}
+
+# TEST few-xacts
+#
+# Cycle through 4 different top-level XIDs
+#
+# 1001, 1002, 1003, 1004, 1001, 1002, 1003, 1004, ...
+#
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 4;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("few-xacts", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST many-xacts
+#
+# like few-xacts, but we cycle through 100 different XIDs instead of 4.
+#
+# 1001, 1002, 1003, ... 1100, 1001, 1002, 1003, ... 1100  ....
+#
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 100;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-xacts", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST many-xacts-wide-apart
+#
+# like many-xacts, but the XIDs are more spread out, so that they don't fit in the
+# SLRU caches.
+#
+# 1000, 2000, 3000, 4000, ....
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my @primary_sessions = ();
+	my $num_connections = 100;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+
+		consume_xids($primary);
+
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_connections = $i;");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-xacts-wide-apart", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: few-subxacts
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 4;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("few-subxacts", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+
+# TEST: many-subxacts
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-subxacts", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: many-subxacts-wide-apart
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "INSERT INTO tbl SELECT g FROM generate_series(1, 100000) g;");
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		consume_xids($primary);
+		$primary_session->query_safe("savepoint sp$i;");
+		$primary_session->query_safe("DELETE FROM tbl WHERE i % $num_subxacts = $i;");
+		$primary_session->query_safe("release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("many-subxacts-wide-apart", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: insert-all-different-xids
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+
+	my @primary_sessions = ();
+	my $num_connections = 1000;
+	for(my $i = 0; $i < $num_connections; $i++) {
+		my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+		$primary_session->query_safe("BEGIN;");
+		$primary_session->query_safe("INSERT INTO tbl VALUES ($i)");
+		push(@primary_sessions, $primary_session);
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("insert-all-different-xids", $replica, "select count(*) from tbl");
+
+	for(my $i = 0; $i < $num_connections; $i++) {
+		$primary_sessions[$i]->quit;
+	}
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+# TEST: insert-all-different-subxids
+if (1)
+{
+	$primary->safe_psql('postgres', 'CREATE TABLE tbl(i int)');
+	$primary->safe_psql('postgres', "VACUUM FREEZE tbl;");
+
+	my $primary_session =  $primary->background_psql('postgres', on_error_die => 1);
+	$primary_session->query_safe("BEGIN;");
+	my $num_subxacts = 1000;
+	for(my $i = 0; $i < $num_subxacts; $i++) {
+		$primary_session->query_safe("savepoint sp$i; INSERT INTO tbl VALUES($i); release savepoint sp$i;");
+	}
+
+	# Consume one more XID, to bump up "last committed XID"
+	$primary->safe_psql('postgres', "select txid_current()");
+
+	wait_catchup($primary, $replica);
+
+	repeat_and_time_sql("insert-all-different-subxids", $replica, "select count(*) from tbl");
+
+	$primary_session->quit;
+	$primary->safe_psql('postgres', "DROP TABLE tbl");
+}
+
+done_testing();
-- 
2.39.5
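
The next patch in the series stores one commit LSN per XID in an SLRU (pg_csn) and uses that LSN as the transaction's CSN. As a rough sketch of the two ideas involved, the code below shows the XID-to-page/index arithmetic and a CSN-based "still in progress for this snapshot?" check. Everything named `Sketch*` or `sketch_*` is hypothetical: the in-memory array stands in for the SLRU, the 8192-byte block size is the default and assumed here, and the exact comparison rule (an unset CSN or a CSN greater than the snapshot's LSN means "in progress") is an assumption, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t SketchXid;
typedef uint64_t SketchLsn;

#define SKETCH_BLCKSZ			8192	/* assumption: default block size */
#define SKETCH_XACTS_PER_PAGE	(SKETCH_BLCKSZ / sizeof(SketchLsn))
#define SKETCH_NPAGES			4		/* tiny in-memory stand-in for the SLRU */
#define SKETCH_INVALID_LSN		((SketchLsn) 0)

/* Zero-initialized, so every XID starts out with no CSN. */
static SketchLsn sketch_log[SKETCH_NPAGES][SKETCH_XACTS_PER_PAGE];

/* Which page of the log holds this XID's entry? */
static uint32_t
sketch_xid_to_page(SketchXid xid)
{
	return xid / (SketchXid) SKETCH_XACTS_PER_PAGE;
}

/* Which slot within that page? */
static uint32_t
sketch_xid_to_index(SketchXid xid)
{
	return xid % (SketchXid) SKETCH_XACTS_PER_PAGE;
}

/* Record the commit record's LSN as the CSN of 'xid'. */
static void
sketch_set_csn(SketchXid xid, SketchLsn commit_lsn)
{
	uint32_t	page = sketch_xid_to_page(xid);

	assert(page < SKETCH_NPAGES);
	sketch_log[page][sketch_xid_to_index(xid)] = commit_lsn;
}

static SketchLsn
sketch_get_csn(SketchXid xid)
{
	uint32_t	page = sketch_xid_to_page(xid);

	assert(page < SKETCH_NPAGES);
	return sketch_log[page][sketch_xid_to_index(xid)];
}

/*
 * Assumed rule: a transaction is still in progress as far as a snapshot is
 * concerned if it has no CSN yet, or if it finished after the snapshot's LSN.
 */
static bool
sketch_xid_in_progress(SketchXid xid, SketchLsn snapshot_lsn)
{
	SketchLsn	csn = sketch_get_csn(xid);

	return csn == SKETCH_INVALID_LSN || csn > snapshot_lsn;
}
```

With 8-byte entries and 8192-byte pages, each page holds 1024 XIDs, so XID 1500 lands on page 1 at index 476. Unlike a fixed-size array of known-assigned XIDs, a lookup of this shape does not depend on how many transactions or subtransactions are running.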

v7-0009-Use-CSN-snapshots-during-Hot-Standby.patch (application/octet-stream)

From 7fec26347c80d42f0243f0d3328b38c69105a41f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 1 Apr 2025 00:16:17 +0300
Subject: [PATCH v6 09/12] Use CSN snapshots during Hot Standby

Replace the known-assigned-XIDs mechanism with a CSN log. The CSN log
(pg_csn) tracks the commit LSN of each transaction, when replaying the
WAL on a standby. It's only used on the standby, and is initialized
from scratch at server startup like pg_subtrans.

Based on 0001-CSN-base-snapshot.patch from
https://www.postgresql.org/message-id/2020081009525213277261%40highgo.ca.
This patch has a long lineage: various CSN patches have been posted
with parts from Stas Kelvich, Movead Li, Ants Aasma, Heikki
Linnakangas, and Alexander Kuzmenkov.
---
 contrib/pg_visibility/pg_visibility.c         |    1 +
 src/backend/access/rmgrdesc/xactdesc.c        |   26 -
 src/backend/access/transam/Makefile           |    1 +
 src/backend/access/transam/csn_log.c          |  469 +++++
 src/backend/access/transam/meson.build        |    1 +
 src/backend/access/transam/transam.c          |    3 +
 src/backend/access/transam/twophase.c         |   34 +-
 src/backend/access/transam/varsup.c           |    1 +
 src/backend/access/transam/xact.c             |  138 +-
 src/backend/access/transam/xlog.c             |  118 +-
 src/backend/access/transam/xlogrecovery.c     |   13 +-
 src/backend/access/transam/xlogutils.c        |    2 +-
 src/backend/backup/basebackup.c               |    3 +
 src/backend/postmaster/startup.c              |    2 +-
 src/backend/replication/logical/decode.c      |    8 -
 src/backend/replication/logical/snapbuild.c   |    2 +-
 src/backend/storage/ipc/ipci.c                |    3 +
 src/backend/storage/ipc/procarray.c           | 1538 ++---------------
 src/backend/storage/ipc/standby.c             |  102 +-
 src/backend/storage/lmgr/lwlock.c             |    2 +
 .../utils/activity/wait_event_names.txt       |    1 +
 src/backend/utils/probes.d                    |    2 +
 src/backend/utils/time/snapmgr.c              |   34 +-
 src/bin/initdb/initdb.c                       |    3 +-
 src/bin/pg_rewind/filemap.c                   |    3 +
 src/include/access/csn_log.h                  |   30 +
 src/include/access/transam.h                  |    3 +
 src/include/access/twophase.h                 |    3 +-
 src/include/access/xact.h                     |   12 +-
 src/include/access/xlogutils.h                |   33 +-
 src/include/storage/lwlock.h                  |    2 +
 src/include/storage/procarray.h               |   13 +-
 src/include/utils/snapshot.h                  |    8 +
 33 files changed, 821 insertions(+), 1793 deletions(-)
 create mode 100644 src/backend/access/transam/csn_log.c
 create mode 100644 src/include/access/csn_log.h

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index d79ef35006b..c5c7a4dd2c3 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -607,6 +607,7 @@ collect_visibility_data(Oid relid, bool include_pd)
  *    now perform minimal checking on a standby by always using nextXid, this
  *    approach is better than nothing and will at least catch extremely broken
  *    cases where a xid is in the future.
+ *    XXX KnownAssignedXids is gone.
  * 3. Ignore walsender xmin, because it could go backward if some replication
  *    connections don't use replication slots.
  *
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 715cc1f7bad..56f7bd81780 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -422,17 +422,6 @@ xact_desc_prepare(StringInfo buf, uint8 info, xl_xact_prepare *xlrec, RepOriginI
 						 timestamptz_to_str(parsed.origin_timestamp));
 }
 
-static void
-xact_desc_assignment(StringInfo buf, xl_xact_assignment *xlrec)
-{
-	int			i;
-
-	appendStringInfoString(buf, "subxacts:");
-
-	for (i = 0; i < xlrec->nsubxacts; i++)
-		appendStringInfo(buf, " %u", xlrec->xsub[i]);
-}
-
 void
 xact_desc(StringInfo buf, XLogReaderState *record)
 {
@@ -460,18 +449,6 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 		xact_desc_prepare(buf, XLogRecGetInfo(record), xlrec,
 						  XLogRecGetOrigin(record));
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) rec;
-
-		/*
-		 * Note that we ignore the WAL record's xid, since we're more
-		 * interested in the top-level xid that issued the record and which
-		 * xids are being reported here.
-		 */
-		appendStringInfo(buf, "xtop %u: ", xlrec->xtop);
-		xact_desc_assignment(buf, xlrec);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		xl_xact_invals *xlrec = (xl_xact_invals *) rec;
@@ -503,9 +480,6 @@ xact_identify(uint8 info)
 		case XLOG_XACT_ABORT_PREPARED:
 			id = "ABORT_PREPARED";
 			break;
-		case XLOG_XACT_ASSIGNMENT:
-			id = "ASSIGNMENT";
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			id = "INVALIDATION";
 			break;
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..2520d77c7c8 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	clog.o \
 	commit_ts.o \
+	csn_log.o \
 	generic_xlog.o \
 	multixact.o \
 	parallel.o \
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
new file mode 100644
index 00000000000..40673c8579f
--- /dev/null
+++ b/src/backend/access/transam/csn_log.c
@@ -0,0 +1,469 @@
+/*-----------------------------------------------------------------------------
+ *
+ * csn_log.c
+ *		Track commit record LSNs of finished transactions
+ *
+ * This module provides an SLRU to store the LSN of the commit record of each
+ * transaction. CSN stands for Commit Sequence Number, and in principle we
+ * could use a separate counter that is incremented at every commit. For
+ * simplicity, though, we use the commit record's LSN as the sequence number.
+ *
+ * Like pg_subtrans, this mapping needs to be kept only for XIDs greater than
+ * oldestXmin, and doesn't need to be preserved over crashes.  Also, this is
+ * only needed in hot standby mode, and immediately after exiting hot standby
+ * mode, until all old snapshots taken during standby mode are gone.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_log.c
+ *
+ *-----------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/clog.h"
+#include "access/csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+/*
+ * Defines for CSNLog page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE, and CSNLog segment numbering at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCSNLog (see CSNLogPagePrecedes).
+ */
+
+/* We store the commit CSN for each xid */
+#define CSN_LOG_XACTS_PER_PAGE (BLCKSZ / sizeof(XLogRecPtr))
+
+#define TransactionIdToPage(xid)	((xid) / (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+
+#define PgIndexToTransactionId(pageno, idx) (CSN_LOG_XACTS_PER_PAGE * (pageno) + (idx))
+
+
+
+/*
+ * Link to shared-memory data structures for CSNLog control
+ */
+static SlruCtlData CSNLogCtlData;
+#define CsnlogCtl (&CSNLogCtlData)
+
+static int	ZeroCSNLogPage(int pageno);
+static bool CSNLogPagePrecedes(int64 page1, int64 page2);
+static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+								TransactionId *subxids,
+								XLogRecPtr csn, int pageno);
+static void CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn,
+							   int slotno);
+
+
+/*
+ * Record commit LSN of a transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transaction ID for a top level commit.
+ *
+ * subxids is an array of xids of length nsubxids, in logical XID order,
+ * representing subtransactions in the tree of XIDs. In various cases nsubxids
+ * may be zero.
+ *
+ * commitLsn is the LSN of the commit record.  This is currently never called
+ * for aborted transactions.
+ */
+void
+CSNLogSetCSN(TransactionId xid, int nsubxids, TransactionId *subxids,
+			 XLogRecPtr commitLsn)
+{
+	int			pageno;
+	int			i = 0;
+	int			offset = 0;
+
+	Assert(TransactionIdIsValid(xid));
+
+	pageno = TransactionIdToPage(xid);	/* get page of parent */
+	for (;;)
+	{
+		int			num_on_page = 0;
+
+		while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+		{
+			num_on_page++;
+			i++;
+		}
+
+		CSNLogSetPageStatus(xid,
+							num_on_page, subxids + offset,
+							commitLsn, pageno);
+		if (i >= nsubxids)
+			break;
+
+		offset = i;
+		pageno = TransactionIdToPage(subxids[offset]);
+		xid = InvalidTransactionId;
+	}
+}
+
+/*
+ * Record the final state of transaction entries in the CSN log for all
+ * entries on a single page.  Atomic only on this page.
+ *
+ * Otherwise API is same as CSNLogSetCSN()
+ */
+static void
+CSNLogSetPageStatus(TransactionId xid, int nsubxids, TransactionId *subxids,
+					XLogRecPtr commitLsn, int pageno)
+{
+	int			slotno;
+	int			i;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CsnlogCtl, pageno, true, xid);
+
+	/* Subtransactions first, if needed ... */
+	for (i = 0; i < nsubxids; i++)
+	{
+		Assert(CsnlogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+		CSNLogSetCSNInSlot(subxids[i], commitLsn, slotno);
+	}
+
+	/* ... then the main transaction */
+	if (TransactionIdIsValid(xid))
+		CSNLogSetCSNInSlot(xid, commitLsn, slotno);
+
+	CsnlogCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Sets the commit status of a single transaction.
+ */
+static void
+CSNLogSetCSNInSlot(TransactionId xid, XLogRecPtr csn, int slotno)
+{
+	int			entryno = TransactionIdToPgIndex(xid);
+	XLogRecPtr *ptr;
+
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+
+	*ptr = csn;
+}
+
+/*
+ * Interrogate the state of a transaction in the log.
+ *
+ * NB: this is a low-level routine and is NOT the preferred entry point
+ * for most uses; TransactionIdGetXidCSN() in csn_snapshot.c is the
+ * intended caller.
+ */
+XLogRecPtr
+CSNLogGetCSNByXid(TransactionId xid)
+{
+	int			pageno = TransactionIdToPage(xid);
+	int			entryno = TransactionIdToPgIndex(xid);
+	int			slotno;
+	XLogRecPtr *ptr;
+	XLogRecPtr	xid_csn;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Can't ask about stuff that might not be around anymore */
+	Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+
+	slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
+	ptr = (XLogRecPtr *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+	xid_csn = *ptr;
+
+	LWLockRelease(SimpleLruGetBankLock(CsnlogCtl, pageno));
+
+	return xid_csn;
+}
+
+/*
+ * Number of shared CSNLog buffers.
+ */
+static Size
+CSNLogShmemBuffers(void)
+{
+	return Min(32, Max(16, NBuffers / 512));
+}
+
+/*
+ * Reserve shared memory for CsnlogCtl.
+ */
+Size
+CSNLogShmemSize(void)
+{
+	/* FIXME: skip if not InHotStandby? */
+	return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
+}
+
+/*
+ * Initialization of shared memory for CSNLog.
+ */
+void
+CSNLogShmemInit(void)
+{
+	CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
+	SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
+				  "pg_csn", LWTRANCHE_CSN_LOG_BUFFER,
+				  LWTRANCHE_CSN_LOG_SLRU, SYNC_HANDLER_NONE, false);
+	SlruPagePrecedesUnitTests(CsnlogCtl, CSN_LOG_XACTS_PER_PAGE);
+}
+
+/*
+ * This func must be called ONCE on system install.  It creates the initial
+ * CSNLog segment.  The pg_csn directory is assumed to have been
+ * created by initdb, and CSNLogShmemInit must have been called already.
+ *
+ * Note: it's not really necessary to create the initial segment now,
+ * since slru.c would create it on first write anyway.  But we may as well
+ * do it to be sure the directory is set up correctly.
+ */
+void
+BootStrapCSNLog(void)
+{
+	int			slotno;
+	LWLock	   *lock;
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, 0);
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Create and zero the first page of the CSN log */
+	slotno = ZeroCSNLogPage(0);
+
+	/* Make sure it's written out */
+	SimpleLruWritePage(CsnlogCtl, slotno);
+	Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Initialize (or reinitialize) a page of CSNLog to zeroes.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCSNLogPage(int pageno)
+{
+	return SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * Initialize a page of CSNLog based on pg_xact.
+ *
+ * All completed (committed or aborted) transactions are stamped with 'csn'
+ */
+static void
+InitCSNLogPage(int pageno, TransactionId *xid, TransactionId nextXid, XLogRecPtr csn)
+{
+	XLogRecPtr	dummy;
+	int			slotno;
+
+	slotno = ZeroCSNLogPage(pageno);
+
+	while (*xid < nextXid && TransactionIdToPage(*xid) == pageno)
+	{
+		XidStatus	status = TransactionIdGetStatus(*xid, &dummy);
+
+		if (status == TRANSACTION_STATUS_COMMITTED ||
+			status == TRANSACTION_STATUS_ABORTED)
+			CSNLogSetCSNInSlot(*xid, csn, slotno);
+
+		TransactionIdAdvance(*xid);
+	}
+	CsnlogCtl->shared->page_dirty[slotno] = true;
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * after StartupXLOG has initialized TransamVariables->nextXid, and after
+ * initializing the CLOG.
+ *
+ * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
+ * if there are none.
+ *
+ * All transactions that have already completed are marked with 'csn'. ('csn'
+ * is supposed to be "older than anything we'll ever need to compare with".)
+ */
+void
+StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn)
+{
+	TransactionId xid;
+	FullTransactionId nextXid;
+	int			startPage;
+	int			endPage;
+	LWLock	   *prevlock = NULL;
+	LWLock	   *lock;
+
+	/*
+	 * Since we don't expect pg_csn to be valid across crashes, we reinitialize
+	 * the currently-active page(s) from pg_xact during startup. Whenever we
+	 * advance into a new page, ExtendCSNLog will zero the new page
+	 * without regard to whatever was previously on disk.
+	 */
+	startPage = TransactionIdToPage(oldestActiveXID);
+	nextXid = TransamVariables->nextXid;
+	endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid));
+
+	Assert(TransactionIdIsValid(oldestActiveXID));
+	Assert(FullTransactionIdIsValid(nextXid));
+
+	xid = oldestActiveXID;
+	for (;;)
+	{
+		lock = SimpleLruGetBankLock(CsnlogCtl, startPage);
+		if (prevlock != lock)
+		{
+			if (prevlock)
+				LWLockRelease(prevlock);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			prevlock = lock;
+		}
+
+		InitCSNLogPage(startPage, &xid, XidFromFullTransactionId(nextXid), csn);
+		if (startPage == endPage)
+			break;
+
+		startPage++;
+		/* must account for wraparound */
+		if (startPage > TransactionIdToPage(MaxTransactionId))
+			startPage = 0;
+	}
+
+	LWLockRelease(lock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely as a debugging aid.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(false);
+	SimpleLruWriteAll(CsnlogCtl, false);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCSNLog(void)
+{
+	/*
+	 * Flush dirty CSNLog pages to disk.
+	 *
+	 * This is not actually necessary from a correctness point of view. We do
+	 * it merely to improve the odds that writing of dirty pages is done by
+	 * the checkpoint process and not by backends.
+	 */
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(true);
+	SimpleLruWriteAll(CsnlogCtl, true);
+	TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(true);
+}
+
+/*
+ * Make sure that CSNLog has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CSN log page to make room
+ * in shared memory.
+ */
+void
+ExtendCSNLog(TransactionId newestXact)
+{
+	int64		pageno;
+	LWLock	   *lock;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToPgIndex(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToPage(newestXact);
+
+	lock = SimpleLruGetBankLock(CsnlogCtl, pageno);
+
+	LWLockAcquire(lock, LW_EXCLUSIVE);
+
+	/* Zero the page */
+	ZeroCSNLogPage(pageno);
+
+	LWLockRelease(lock);
+}
+
+/*
+ * Remove all CSNLog segments before the one holding the passed
+ * transaction ID.
+ *
+ * This is normally called during checkpoint, with oldestXact being the
+ * oldest TransactionXmin of any running transaction.
+ */
+void
+TruncateCSNLog(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate. We step
+	 * back one transaction to avoid passing a cutoff page that hasn't been
+	 * created yet in the rare case that oldestXact would be the first item on
+	 * a page and oldestXact == next XID.  In that case, if we didn't subtract
+	 * one, we'd trigger SimpleLruTruncate's wraparound detection.
+	 */
+	TransactionIdRetreat(oldestXact);
+	cutoffPage = TransactionIdToPage(oldestXact);
+
+	SimpleLruTruncate(CsnlogCtl, cutoffPage);
+}
+
+/*
+ * Decide which of two CSNLog page numbers is "older" for truncation purposes.
+ * Analogous to CLOGPagePrecedes() and SubTransPagePrecedes().
+ */
+static bool
+CSNLogPagePrecedes(int64 page1, int64 page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * CSN_LOG_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId + 1;
+	xid2 = ((TransactionId) page2) * CSN_LOG_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId + 1;
+
+	return (TransactionIdPrecedes(xid1, xid2) &&
+			TransactionIdPrecedes(xid1, xid2 + CSN_LOG_XACTS_PER_PAGE - 1));
+}
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..e2a3419fc22 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -2,6 +2,7 @@
 
 backend_sources += files(
   'clog.c',
+  'csn_log.c',
   'commit_ts.c',
   'generic_xlog.c',
   'multixact.c',
diff --git a/src/backend/access/transam/transam.c b/src/backend/access/transam/transam.c
index 9a39451a29a..b4c42c0f156 100644
--- a/src/backend/access/transam/transam.c
+++ b/src/backend/access/transam/transam.c
@@ -377,6 +377,9 @@ TransactionIdLatest(TransactionId mainxid,
  * Also, because we group transactions on the same clog page to conserve
  * storage, we might return the LSN of a later transaction that falls into
  * the same group.
+ *
+ * XXX: Now that we have the CSN log, should we use that during recovery? Or
+ * rename this function to reduce confusion?
  */
 XLogRecPtr
 TransactionIdGetCommitLSN(TransactionId xid)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 73a80559194..2330632e569 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1943,20 +1944,13 @@ restoreTwoPhaseData(void)
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return TransamVariables->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
- *
- * If xids_p and nxids_p are not NULL, pointer to a palloc'd array of all
- * top-level xids is stored in *xids_p. The number of entries in the array
- * is returned in *nxids_p.
  */
 TransactionId
-PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
+PrescanPreparedTransactions(void)
 {
 	FullTransactionId nextXid = TransamVariables->nextXid;
 	TransactionId origNextXid = XidFromFullTransactionId(nextXid);
 	TransactionId result = origNextXid;
-	TransactionId *xids = NULL;
-	int			nxids = 0;
-	int			allocsize = 0;
 	int			i;
 
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
@@ -1984,34 +1978,10 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		if (TransactionIdPrecedes(xid, result))
 			result = xid;
 
-		if (xids_p)
-		{
-			if (nxids == allocsize)
-			{
-				if (nxids == 0)
-				{
-					allocsize = 10;
-					xids = palloc(allocsize * sizeof(TransactionId));
-				}
-				else
-				{
-					allocsize = allocsize * 2;
-					xids = repalloc(xids, allocsize * sizeof(TransactionId));
-				}
-			}
-			xids[nxids++] = xid;
-		}
-
 		pfree(buf);
 	}
 	LWLockRelease(TwoPhaseStateLock);
 
-	if (xids_p)
-	{
-		*xids_p = xids;
-		*nxids_p = nxids;
-	}
-
 	return result;
 }
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index fe895787cb7..a495f1d7899 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -15,6 +15,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..5250a158145 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -210,7 +211,6 @@ typedef struct TransactionStateData
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;	/* entry-time xact r/o state */
 	bool		startedInRecovery;	/* did we start in recovery? */
-	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
 	bool		parallelChildXact;	/* is any parent transaction parallel? */
 	bool		chain;			/* start a new block after this one */
@@ -250,13 +250,6 @@ static TransactionStateData TopTransactionStateData = {
 	.topXidLogged = false,
 };
 
-/*
- * unreportedXids holds XIDs of all subtransactions that have not yet been
- * reported in an XLOG_XACT_ASSIGNMENT record.
- */
-static int	nUnreportedXids;
-static TransactionId unreportedXids[PGPROC_MAX_CACHED_SUBXIDS];
-
 static TransactionState CurrentTransactionState = &TopTransactionStateData;
 
 /*
@@ -532,18 +525,6 @@ GetCurrentFullTransactionIdIfAny(void)
 	return CurrentTransactionState->fullTransactionId;
 }
 
-/*
- *	MarkCurrentTransactionIdLoggedIfAny
- *
- * Remember that the current xid - if it is assigned - now has been wal logged.
- */
-void
-MarkCurrentTransactionIdLoggedIfAny(void)
-{
-	if (FullTransactionIdIsValid(CurrentTransactionState->fullTransactionId))
-		CurrentTransactionState->didLogXid = true;
-}
-
 /*
  * IsSubxactTopXidLogPending
  *
@@ -636,7 +617,6 @@ AssignTransactionId(TransactionState s)
 {
 	bool		isSubXact = (s->parent != NULL);
 	ResourceOwner currentOwner;
-	bool		log_unknown_top = false;
 
 	/* Assert that caller didn't screw up */
 	Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -680,20 +660,6 @@ AssignTransactionId(TransactionState s)
 		pfree(parents);
 	}
 
-	/*
-	 * When wal_level=logical, guarantee that a subtransaction's xid can only
-	 * be seen in the WAL stream if its toplevel xid has been logged before.
-	 * If necessary we log an xact_assignment record with fewer than
-	 * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
-	 * for a transaction even though it appears in a WAL record, we just might
-	 * superfluously log something. That can happen when an xid is included
-	 * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
-	 * xl_standby_locks.
-	 */
-	if (isSubXact && XLogLogicalInfoActive() &&
-		!TopTransactionStateData.didLogXid)
-		log_unknown_top = true;
-
 	/*
 	 * Generate a new FullTransactionId and record its xid in PGPROC and
 	 * pg_subtrans.
@@ -729,59 +695,6 @@ AssignTransactionId(TransactionState s)
 	XactLockTableInsert(XidFromFullTransactionId(s->fullTransactionId));
 
 	CurrentResourceOwner = currentOwner;
-
-	/*
-	 * Every PGPROC_MAX_CACHED_SUBXIDS assigned transaction ids within each
-	 * top-level transaction we issue a WAL record for the assignment. We
-	 * include the top-level xid and all the subxids that have not yet been
-	 * reported using XLOG_XACT_ASSIGNMENT records.
-	 *
-	 * This is required to limit the amount of shared memory required in a hot
-	 * standby server to keep track of in-progress XIDs. See notes for
-	 * RecordKnownAssignedTransactionIds().
-	 *
-	 * We don't keep track of the immediate parent of each subxid, only the
-	 * top-level transaction that each subxact belongs to. This is correct in
-	 * recovery only because aborted subtransactions are separately WAL
-	 * logged.
-	 *
-	 * This is correct even for the case where several levels above us didn't
-	 * have an xid assigned as we recursed up to them beforehand.
-	 */
-	if (isSubXact && XLogStandbyInfoActive())
-	{
-		unreportedXids[nUnreportedXids] = XidFromFullTransactionId(s->fullTransactionId);
-		nUnreportedXids++;
-
-		/*
-		 * ensure this test matches similar one in
-		 * RecoverPreparedTransactions()
-		 */
-		if (nUnreportedXids >= PGPROC_MAX_CACHED_SUBXIDS ||
-			log_unknown_top)
-		{
-			xl_xact_assignment xlrec;
-
-			/*
-			 * xtop is always set by now because we recurse up transaction
-			 * stack to the highest unassigned xid and then come back down
-			 */
-			xlrec.xtop = GetTopTransactionId();
-			Assert(TransactionIdIsValid(xlrec.xtop));
-			xlrec.nsubxacts = nUnreportedXids;
-
-			XLogBeginInsert();
-			XLogRegisterData(&xlrec, MinSizeOfXactAssignment);
-			XLogRegisterData(unreportedXids,
-							 nUnreportedXids * sizeof(TransactionId));
-
-			(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT);
-
-			nUnreportedXids = 0;
-			/* mark top, not current xact as having been logged */
-			TopTransactionStateData.didLogXid = true;
-		}
-	}
 }
 
 /*
@@ -1481,11 +1394,11 @@ RecordTransactionCommit(void)
 	 * temp tables will be lost anyway, unlogged tables will be truncated and
 	 * HOT pruning will be done again later. (Given the foregoing, you might
 	 * think that it would be unnecessary to emit the XLOG record at all in
-	 * this case, but we don't currently try to do that.  It would certainly
-	 * cause problems at least in Hot Standby mode, where the
-	 * KnownAssignedXids machinery requires tracking every XID assignment.  It
-	 * might be OK to skip it only when wal_level < replica, but for now we
-	 * don't.)
+	 * this case, but we don't currently try to do that.  It might cause
+	 * inefficiencies in Hot Standby mode, if nothing else, where the
+	 * commit/abort records allow advancing the xmin horizon for new
+	 * snapshots. It might be OK to skip it only when wal_level < replica, but
+	 * for now we don't.)
 	 *
 	 * However, if we're doing cleanup of any non-temp rels or committing any
 	 * command that wanted to force sync commit, then we must flush XLOG
@@ -1953,13 +1866,6 @@ AtSubAbort_childXids(void)
 	s->childXids = NULL;
 	s->nChildXids = 0;
 	s->maxChildXids = 0;
-
-	/*
-	 * We could prune the unreportedXids array here. But we don't bother. That
-	 * would potentially reduce number of XLOG_XACT_ASSIGNMENT records but it
-	 * would likely introduce more CPU time into the more common paths, so we
-	 * choose not to do that.
-	 */
 }
 
 /* ----------------------------------------------------------------
@@ -2142,12 +2048,6 @@ StartTransaction(void)
 	currentCommandId = FirstCommandId;
 	currentCommandIdUsed = false;
 
-	/*
-	 * initialize reported xid accounting
-	 */
-	nUnreportedXids = 0;
-	s->didLogXid = false;
-
 	/*
 	 * must initialize resource-management stuff first
 	 */
@@ -6154,7 +6054,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 	TransactionTreeSetCommitTsData(xid, parsed->nsubxacts, parsed->subxacts,
 								   commit_time, origin_id);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/*
 		 * Mark the transaction committed in pg_xact.
@@ -6174,6 +6074,12 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		 */
 		RecordKnownAssignedTransactionIds(max_xid);
 
+		/*
+		 * Mark the CSNLOG first.  The transaction won't become visible to new
+		 * snapshots until the call to ProcArrayRecoveryEndTransaction().
+		 */
+		CSNLogSetCSN(xid, parsed->nsubxacts, parsed->subxacts, lsn);
+
 		/*
 		 * Mark the transaction committed in pg_xact. We use async commit
 		 * protocol during recovery to provide information on database
@@ -6186,9 +6092,9 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 		TransactionIdAsyncCommitTree(xid, parsed->nsubxacts, parsed->subxacts, lsn);
 
 		/*
-		 * We must mark clog before we update the ProcArray.
+		 * Make the commit visible to new snapshots in the ProcArray.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ProcArrayRecoveryEndTransaction(max_xid, lsn);
 
 		/*
 		 * Send any cache invalidations attached to the commit. We must
@@ -6294,7 +6200,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 								  parsed->subxacts);
 	AdvanceNextFullTransactionIdPastXid(max_xid);
 
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 	{
 		/* Mark the transaction aborted in pg_xact, no need for async stuff */
 		TransactionIdAbortTree(xid, parsed->nsubxacts, parsed->subxacts);
@@ -6312,13 +6218,15 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 		 */
 		RecordKnownAssignedTransactionIds(max_xid);
 
+		/* Note: we don't need to update the CSN log on abort. */
+
 		/* Mark the transaction aborted in pg_xact, no need for async stuff */
 		TransactionIdAbortTree(xid, parsed->nsubxacts, parsed->subxacts);
 
 		/*
 		 * We must update the ProcArray after we have marked clog.
 		 */
-		ExpireTreeKnownAssignedTransactionIds(xid, parsed->nsubxacts, parsed->subxacts, max_xid);
+		ProcArrayRecoveryEndTransaction(max_xid, lsn);
 
 		/*
 		 * There are no invalidation messages to send or undo.
@@ -6426,14 +6334,6 @@ xact_redo(XLogReaderState *record)
 					   XLogRecGetOrigin(record));
 		LWLockRelease(TwoPhaseStateLock);
 	}
-	else if (info == XLOG_XACT_ASSIGNMENT)
-	{
-		xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record);
-
-		if (standbyState >= STANDBY_INITIALIZED)
-			ProcArrayApplyXidAssignment(xlrec->xtop,
-										xlrec->nsubxacts, xlrec->xsub);
-	}
 	else if (info == XLOG_XACT_INVALIDATIONS)
 	{
 		/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fc30a52d496..cbeac223e1c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -48,6 +48,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
@@ -951,8 +952,6 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	END_CRIT_SECTION();
 
-	MarkCurrentTransactionIdLoggedIfAny();
-
 	/*
 	 * Mark top transaction id is logged (if needed) so that we should not try
 	 * to log it again with the next WAL record in the current subtransaction.
@@ -5230,6 +5229,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCSNLog();
 	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
@@ -5831,16 +5831,16 @@ StartupXLOG(void)
 		 */
 		if (ArchiveRecoveryRequested && EnableHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
+			FullTransactionId latestCompletedXid;
 
 			ereport(DEBUG1,
 					(errmsg_internal("initializing for hot standby")));
+			InHotStandby = true;
 
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
-				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanPreparedTransactions();
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -5855,39 +5855,17 @@ StartupXLOG(void)
 			 */
 			StartupSUBTRANS(oldestActiveXID);
 
-			/*
-			 * If we're beginning at a shutdown checkpoint, we know that
-			 * nothing was running on the primary at this point. So fake-up an
-			 * empty running-xacts record and use that here and now. Recover
-			 * additional standby state for prepared transactions.
-			 */
-			if (wasShutdown)
-			{
-				RunningTransactionsData running;
-				TransactionId latestCompletedXid;
+			latestCompletedXid = checkPoint.nextXid;
+			FullTransactionIdRetreat(&latestCompletedXid);
+			TransamVariables->latestCompletedXid = latestCompletedXid;
 
-				/* Update pg_subtrans entries for any prepared transactions */
-				StandbyRecoverPreparedTransactions();
+			StartupCSNLog(oldestActiveXID, RedoRecPtr);
 
-				/*
-				 * Construct a RunningTransactions snapshot representing a
-				 * shut down server, with only prepared transactions still
-				 * alive. We're never overflowed at this point because all
-				 * subxids are listed with their parent prepared transactions.
-				 */
-				running.xcnt = nxids;
-				running.subxcnt = 0;
-				running.subxid_status = SUBXIDS_IN_SUBTRANS;
-				running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-				running.oldestRunningXid = oldestActiveXID;
-				latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-				TransactionIdRetreat(latestCompletedXid);
-				Assert(TransactionIdIsNormal(latestCompletedXid));
-				running.latestCompletedXid = latestCompletedXid;
-				running.xids = xids;
-
-				ProcArrayApplyRecoveryInfo(&running);
-			}
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
+
+			/* Update pg_subtrans entries for any prepared transactions */
+			if (wasShutdown)
+				StandbyRecoverPreparedTransactions();
 		}
 
 		/*
@@ -5971,7 +5949,7 @@ StartupXLOG(void)
 	 * This information is not quite needed yet, but it is positioned here so
 	 * as potential problems are detected before any on-disk change is done.
 	 */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanPreparedTransactions();
 
 	/*
 	 * Allow ordinary WAL segment creation before possibly switching to a new
@@ -6137,9 +6115,18 @@ StartupXLOG(void)
 	 * Start up subtrans, if not already done for hot standby.  (commit
 	 * timestamps are started below, if necessary.)
 	 */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
+	{
 		StartupSUBTRANS(oldestActiveXID);
 
+		/*
+		 * TODO: we don't need to update CSN log from now on, but it's still
+		 * required by snapshots that were taken before recovery ended.  We
+		 * just let it be, but it would be nice to truncate it to 0 after all
+		 * the snapshots are gone.
+		 */
+	}
+
 	/*
 	 * Perform end of recovery actions for any SLRUs that need it.
 	 */
@@ -6225,12 +6212,12 @@ StartupXLOG(void)
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
 	 * and after switching SharedRecoveryState to RECOVERY_STATE_DONE so as
-	 * any session building a snapshot will not rely on KnownAssignedXids as
+	 * any session building a snapshot will not rely on the CSN log as
 	 * RecoveryInProgress() would return false at this stage.  This is
 	 * particularly critical for prepared 2PC transactions, that would still
 	 * need to be included in snapshots once recovery has ended.
 	 */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/*
@@ -7002,7 +6989,7 @@ CreateCheckPoint(int flags)
 	 * starting snapshot of locks and transactions.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		checkPoint.oldestActiveXid = GetOldestActiveTransactionId();
+		checkPoint.oldestActiveXid = GetOldestActiveTransactionId(true);
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -7396,6 +7383,9 @@ CreateCheckPoint(int flags)
 	if (!RecoveryInProgress())
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
 
+	if (shutdown)
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(false);
 
@@ -7567,6 +7557,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
 	CheckPointCLOG();
+	CheckPointCSNLog();
 	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
@@ -7863,7 +7854,10 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
+	{
 		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
+		TruncateCSNLog(GetOldestTransactionIdConsideredRunning());
+	}
 
 	/* Real work is done; log and update stats. */
 	LogCheckpointEnd(true);
@@ -8348,41 +8342,17 @@ xlog_redo(XLogReaderState *record)
 
 		/*
 		 * If we see a shutdown checkpoint, we know that nothing was running
-		 * on the primary at this point. So fake-up an empty running-xacts
-		 * record and use that here and now. Recover additional standby state
-		 * for prepared transactions.
+		 * on the primary at this point, except for prepared transactions.
 		 */
-		if (standbyState >= STANDBY_INITIALIZED)
+		if (InHotStandby)
 		{
-			TransactionId *xids;
-			int			nxids;
 			TransactionId oldestActiveXID;
-			TransactionId latestCompletedXid;
-			RunningTransactionsData running;
 
-			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanPreparedTransactions();
+			ProcArrayUpdateOldestRunningXid(oldestActiveXID);
 
 			/* Update pg_subtrans entries for any prepared transactions */
 			StandbyRecoverPreparedTransactions();
-
-			/*
-			 * Construct a RunningTransactions snapshot representing a shut
-			 * down server, with only prepared transactions still alive. We're
-			 * never overflowed at this point because all subxids are listed
-			 * with their parent prepared transactions.
-			 */
-			running.xcnt = nxids;
-			running.subxcnt = 0;
-			running.subxid_status = SUBXIDS_IN_SUBTRANS;
-			running.nextXid = XidFromFullTransactionId(checkPoint.nextXid);
-			running.oldestRunningXid = oldestActiveXID;
-			latestCompletedXid = XidFromFullTransactionId(checkPoint.nextXid);
-			TransactionIdRetreat(latestCompletedXid);
-			Assert(TransactionIdIsNormal(latestCompletedXid));
-			running.latestCompletedXid = latestCompletedXid;
-			running.xids = xids;
-
-			ProcArrayApplyRecoveryInfo(&running);
 		}
 
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
@@ -8446,6 +8416,16 @@ xlog_redo(XLogReaderState *record)
 								  checkPoint.oldestXid))
 			SetTransactionIdLimit(checkPoint.oldestXid,
 								  checkPoint.oldestXidDB);
+
+		/*
+		 * Remember the oldest XID that was running at the time.  Normally,
+		 * all transaction aborts and commits are WAL-logged, so our
+		 * oldestRunningXid value should be up-to-date, but if not, this
+		 * allows us to resynchronize.
+		 */
+		if (InHotStandby)
+			ProcArrayUpdateOldestRunningXid(checkPoint.oldestActiveXid);
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0aa3ab59085..b213b8a74dc 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1978,10 +1978,9 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	SpinLockRelease(&XLogRecoveryCtl->info_lck);
 
 	/*
-	 * If we are attempting to enter Hot Standby mode, process XIDs we see
+	 * In Hot Standby mode, process XIDs we see
 	 */
-	if (standbyState >= STANDBY_INITIALIZED &&
-		TransactionIdIsValid(record->xl_xid))
+	if (InHotStandby && TransactionIdIsValid(record->xl_xid))
 		RecordKnownAssignedTransactionIds(record->xl_xid);
 
 	/*
@@ -2258,7 +2257,7 @@ CheckRecoveryConsistency(void)
 	 * run? If so, we can tell postmaster that the database is consistent now,
 	 * enabling connections.
 	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY &&
+	if (InHotStandby &&
 		!LocalHotStandbyActive &&
 		reachedConsistency &&
 		IsUnderPostmaster)
@@ -3715,9 +3714,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						elog(LOG, "waiting for WAL to become available at %X/%X",
 							 LSN_FORMAT_ARGS(RecPtr));
 
-						/* Do background tasks that might benefit us later. */
-						KnownAssignedTransactionIdsIdleMaintenance();
-
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,
@@ -3983,9 +3979,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						streaming_reply_sent = true;
 					}
 
-					/* Do any background tasks that might benefit us later. */
-					KnownAssignedTransactionIdsIdleMaintenance();
-
 					/* Update pg_stat_recovery_prefetch before sleeping. */
 					XLogPrefetcherComputeStats(xlogprefetcher);
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index c389b27f77d..775e1a926d8 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -50,7 +50,7 @@ bool		ignore_invalid_pages = false;
 bool		InRecovery = false;
 
 /* Are we in Hot Standby mode? Only valid in startup process, see xlogutils.h */
-HotStandbyState standbyState = STANDBY_DISABLED;
+bool		InHotStandby = false;
 
 /*
  * During XLOG replay, we may see XLOG records for incremental updates of
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 891637e3a44..f1307ed714c 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -181,6 +181,9 @@ static const char *const excludeDirContents[] =
 	/* Contents zeroed on startup, see StartupSUBTRANS(). */
 	"pg_subtrans",
 
+	/* Contents zeroed on startup, see StartupCSNLog(). */
+	"pg_csn",
+
 	/* end of list */
 	NULL
 };
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 27e86cf393f..d04286ab270 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -203,7 +203,7 @@ static void
 StartupProcExit(int code, Datum arg)
 {
 	/* Shutdown the recovery environment */
-	if (standbyState != STANDBY_DISABLED)
+	if (InHotStandby)
 		ShutdownRecoveryTransactionEnvironment();
 }
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6a428e9720e..808b1d85379 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -270,14 +270,6 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
-		case XLOG_XACT_ASSIGNMENT:
-
-			/*
-			 * We assign subxact to the toplevel xact while processing each
-			 * record if required.  So, we don't need to do anything here. See
-			 * LogicalDecodingProcessRecord.
-			 */
-			break;
 		case XLOG_XACT_INVALIDATIONS:
 			{
 				TransactionId xid;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 3c94a62cdf6..97d278052df 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -27,7 +27,7 @@
  * removed. This is achieved by using the replication slot mechanism.
  *
  * As the percentage of transactions modifying the catalog normally is fairly
- * small in comparisons to ones only manipulating user data, we keep track of
+ * small in comparison to ones only manipulating user data, we keep track of
  * the committed catalog modifying ones inside [xmin, xmax) instead of keeping
  * track of all running transactions like it's done in a normal snapshot. Note
  * that we're generally only looking at transactions that have acquired an
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..fc9804b2eab 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/csn_log.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/subtrans.h"
@@ -122,6 +123,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, XLogRecoveryShmemSize());
 	size = add_size(size, CLOGShmemSize());
+	size = add_size(size, CSNLogShmemSize());
 	size = add_size(size, CommitTsShmemSize());
 	size = add_size(size, SUBTRANSShmemSize());
 	size = add_size(size, TwoPhaseShmemSize());
@@ -287,6 +289,7 @@ CreateOrAttachShmemStructs(void)
 	XLogPrefetchShmemInit();
 	XLogRecoveryShmemInit();
 	CLOGShmemInit();
+	CSNLogShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 819649741f6..3418ddf5304 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -19,20 +19,10 @@
  * myProcLocks lists.  They can be distinguished from regular backend PGPROCs
  * at need by checking for pid == 0.
  *
- * During hot standby, we also keep a list of XIDs representing transactions
- * that are known to be running on the primary (or more precisely, were running
- * as of the current point in the WAL stream).  This list is kept in the
- * KnownAssignedXids array, and is updated by watching the sequence of
- * arriving XIDs.  This is necessary because if we leave those XIDs out of
- * snapshots taken for standby queries, then they will appear to be already
- * complete, leading to MVCC failures.  Note that in hot standby, the PGPROC
- * array represents standby processes, which by definition are not running
- * transactions that have XIDs.
- *
- * It is perhaps possible for a backend on the primary to terminate without
- * writing an abort record for its transaction.  While that shouldn't really
- * happen, it would tie up KnownAssignedXids indefinitely, so we protect
- * ourselves by pruning the array when a valid list of running XIDs arrives.
+ * During hot standby, we don't have PGPROC entries representing transactions
+ * running in the primary.  Instead, each snapshot taken during recovery
+ * carries a Commit-Sequence Number (CSN), which is used to determine which
+ * XIDs the snapshot still considers to be running.
  *
  * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -47,6 +37,7 @@
 
 #include <signal.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -74,22 +65,8 @@ typedef struct ProcArrayStruct
 	int			numProcs;		/* number of valid procs entries */
 	int			maxProcs;		/* allocated size of procs array */
 
-	/*
-	 * Known assigned XIDs handling
-	 */
-	int			maxKnownAssignedXids;	/* allocated size of array */
-	int			numKnownAssignedXids;	/* current # of valid entries */
-	int			tailKnownAssignedXids;	/* index of oldest valid element */
-	int			headKnownAssignedXids;	/* index of newest element, + 1 */
-
-	/*
-	 * Highest subxid that has been removed from KnownAssignedXids array to
-	 * prevent overflow; or InvalidTransactionId if none.  We track this for
-	 * similar reasons to tracking overflowing cached subxids in PGPROC
-	 * entries.  Must hold exclusive ProcArrayLock to change this, and shared
-	 * lock to read it.
-	 */
-	TransactionId lastOverflowedXid;
+	/* In recovery, oldest XID that could still be running in the primary */
+	TransactionId oldest_running_primary_xid;
 
 	/* oldest xmin of any replication slot */
 	TransactionId replication_slot_xmin;
@@ -100,6 +77,21 @@ typedef struct ProcArrayStruct
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
+#define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
+
+/*
+ * TOTAL_MAX_CACHED_SUBXIDS is the total number of XIDs that fits in the proc
+ * array, as top XIDs and in the subxids caches.
+ *
+ * Local data structures are also created in various backends during
+ * GetSnapshotData(), TransactionIdIsInProgress() and
+ * GetRunningTransactionData(). All of the main structures created in those
+ * functions must be identically sized, since we may at times copy the whole
+ * of the data structures around.
+ */
+#define TOTAL_MAX_CACHED_SUBXIDS \
+	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
+
 /*
  * State for the GlobalVisTest* family of functions. Those functions can
  * e.g. be used to decide if a deleted row can be removed without violating
@@ -255,17 +247,6 @@ typedef enum GlobalVisHorizonKind
 	VISHORIZON_TEMP,
 } GlobalVisHorizonKind;
 
-/*
- * Reason codes for KnownAssignedXidsCompress().
- */
-typedef enum KAXCompressReason
-{
-	KAX_NO_SPACE,				/* need to free up space at array end */
-	KAX_PRUNE,					/* we just pruned old entries */
-	KAX_TRANSACTION_END,		/* we just committed/removed some XIDs */
-	KAX_STARTUP_PROCESS_IDLE,	/* startup process is about to sleep */
-} KAXCompressReason;
-
 
 static ProcArrayStruct *procArray;
 
@@ -277,19 +258,10 @@ static PGPROC *allProcs;
 static TransactionId cachedXidIsNotInProgress = InvalidTransactionId;
 
 /*
- * Bookkeeping for tracking emulated transactions in recovery
+ * Bookkeeping for tracking transactions seen during recovery
  */
-static TransactionId *KnownAssignedXids;
-static bool *KnownAssignedXidsValid;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
-/*
- * If we're in STANDBY_SNAPSHOT_PENDING state, standbySnapshotPendingXmin is
- * the highest xid that might still be running that we don't have in
- * KnownAssignedXids.
- */
-static TransactionId standbySnapshotPendingXmin;
-
 /*
  * State for visibility checks on different types of relations. See struct
  * GlobalVisState for details. As shared, catalog, normal and temporary
@@ -316,7 +288,7 @@ static long xc_by_my_xact = 0;
 static long xc_by_latest_xid = 0;
 static long xc_by_main_xid = 0;
 static long xc_by_child_xid = 0;
-static long xc_by_known_assigned = 0;
+static long xc_during_recovery = 0;
 static long xc_no_overflow = 0;
 static long xc_slow_answer = 0;
 
@@ -326,7 +298,7 @@ static long xc_slow_answer = 0;
 #define xc_by_latest_xid_inc()		(xc_by_latest_xid++)
 #define xc_by_main_xid_inc()		(xc_by_main_xid++)
 #define xc_by_child_xid_inc()		(xc_by_child_xid++)
-#define xc_by_known_assigned_inc()	(xc_by_known_assigned++)
+#define xc_during_recovery_inc()	(xc_during_recovery++)
 #define xc_no_overflow_inc()		(xc_no_overflow++)
 #define xc_slow_answer_inc()		(xc_slow_answer++)
 
@@ -339,28 +311,12 @@ static void DisplayXidCache(void);
 #define xc_by_latest_xid_inc()		((void) 0)
 #define xc_by_main_xid_inc()		((void) 0)
 #define xc_by_child_xid_inc()		((void) 0)
-#define xc_by_known_assigned_inc()	((void) 0)
+#define xc_during_recovery_inc()	((void) 0)
 #define xc_no_overflow_inc()		((void) 0)
 #define xc_slow_answer_inc()		((void) 0)
 #endif							/* XIDCACHE_DEBUG */
 
-/* Primitives for KnownAssignedXids array handling for standby */
-static void KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock);
-static void KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-								 bool exclusive_lock);
-static bool KnownAssignedXidsSearch(TransactionId xid, bool remove);
-static bool KnownAssignedXidExists(TransactionId xid);
-static void KnownAssignedXidsRemove(TransactionId xid);
-static void KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-										TransactionId *subxids);
-static void KnownAssignedXidsRemovePreceding(TransactionId removeXid);
-static int	KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax);
-static int	KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
-										   TransactionId *xmin,
-										   TransactionId xmax);
-static TransactionId KnownAssignedXidsGetOldestXmin(void);
-static void KnownAssignedXidsDisplay(int trace_level);
-static void KnownAssignedXidsReset(void);
+
 static inline void ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 static void MaintainLatestCompletedXid(TransactionId latestXid);
@@ -384,31 +340,6 @@ ProcArrayShmemSize(void)
 	size = offsetof(ProcArrayStruct, pgprocnos);
 	size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS));
 
-	/*
-	 * During Hot Standby processing we have a data structure called
-	 * KnownAssignedXids, created in shared memory. Local data structures are
-	 * also created in various backends during GetMVCCSnapshotData(),
-	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
-	 * main structures created in those functions must be identically sized,
-	 * since we may at times copy the whole of the data structures around. We
-	 * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
-	 *
-	 * Ideally we'd only create this structure if we were actually doing hot
-	 * standby in the current run, but we don't know that yet at the time
-	 * shared memory is being set up.
-	 */
-#define TOTAL_MAX_CACHED_SUBXIDS \
-	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
-
-	if (EnableHotStandby)
-	{
-		size = add_size(size,
-						mul_size(sizeof(TransactionId),
-								 TOTAL_MAX_CACHED_SUBXIDS));
-		size = add_size(size,
-						mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS));
-	}
-
 	return size;
 }
 
@@ -435,31 +366,12 @@ ProcArrayShmemInit(void)
 		 */
 		procArray->numProcs = 0;
 		procArray->maxProcs = PROCARRAY_MAXPROCS;
-		procArray->maxKnownAssignedXids = TOTAL_MAX_CACHED_SUBXIDS;
-		procArray->numKnownAssignedXids = 0;
-		procArray->tailKnownAssignedXids = 0;
-		procArray->headKnownAssignedXids = 0;
-		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
 		TransamVariables->xactCompletionCount = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
-
-	/* Create or attach to the KnownAssignedXids arrays too, if needed */
-	if (EnableHotStandby)
-	{
-		KnownAssignedXids = (TransactionId *)
-			ShmemInitStruct("KnownAssignedXids",
-							mul_size(sizeof(TransactionId),
-									 TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-		KnownAssignedXidsValid = (bool *)
-			ShmemInitStruct("KnownAssignedXidsValid",
-							mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
-							&found);
-	}
 }
 
 /*
@@ -1023,355 +935,35 @@ MaintainLatestCompletedXidRecovery(TransactionId latestXid)
 void
 ProcArrayInitRecovery(TransactionId initializedUptoXID)
 {
-	Assert(standbyState == STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsNormal(initializedUptoXID));
 
 	/*
-	 * we set latestObservedXid to the xid SUBTRANS has been initialized up
-	 * to, so we can extend it from that point onwards in
-	 * RecordKnownAssignedTransactionIds, and when we get consistent in
-	 * ProcArrayApplyRecoveryInfo().
+	 * we set latestObservedXid to the xid SUBTRANS and CSN log have been
+	 * initialized up to, so we can extend it from that point onwards whenever
+	 * we observe new XIDs.
 	 */
 	latestObservedXid = initializedUptoXID;
 	TransactionIdRetreat(latestObservedXid);
 }
 
 /*
- * ProcArrayApplyRecoveryInfo -- apply recovery info about xids
- *
- * Takes us through 3 states: Initialized, Pending and Ready.
- * Normal case is to go all the way to Ready straight away, though there
- * are atypical cases where we need to take it in steps.
- *
- * Use the data about running transactions on the primary to create the initial
- * state of KnownAssignedXids. We also use these records to regularly prune
- * KnownAssignedXids because we know it is possible that some transactions
- * with FATAL errors fail to write abort records, which could cause eventual
- * overflow.
- *
- * See comments for LogStandbySnapshot().
+ * Update oldest running XID from a checkpoint record.  This allows truncating
+ * SUBTRANS and the CSN log.
  */
 void
-ProcArrayApplyRecoveryInfo(RunningTransactions running)
+ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID)
 {
-	TransactionId *xids;
-	TransactionId advanceNextXid;
-	int			nxids;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-	Assert(TransactionIdIsValid(running->nextXid));
-	Assert(TransactionIdIsValid(running->oldestRunningXid));
-	Assert(TransactionIdIsNormal(running->latestCompletedXid));
-
-	/*
-	 * Remove stale transactions, if any.
-	 */
-	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
-
-	/*
-	 * Adjust TransamVariables->nextXid before StandbyReleaseOldLocks(),
-	 * because we will need it up to date for accessing two-phase transactions
-	 * in StandbyReleaseOldLocks().
-	 */
-	advanceNextXid = running->nextXid;
-	TransactionIdRetreat(advanceNextXid);
-	AdvanceNextFullTransactionIdPastXid(advanceNextXid);
-	Assert(FullTransactionIdIsValid(TransamVariables->nextXid));
-
 	/*
 	 * Remove stale locks, if any.
 	 */
-	StandbyReleaseOldLocks(running->oldestRunningXid);
-
-	/*
-	 * If our snapshot is already valid, nothing else to do...
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		return;
-
-	/*
-	 * If our initial RunningTransactionsData had an overflowed snapshot then
-	 * we knew we were missing some subxids from our snapshot. If we continue
-	 * to see overflowed snapshots then we might never be able to start up, so
-	 * we make another test to see if our snapshot is now valid. We know that
-	 * the missing subxids are equal to or earlier than nextXid. After we
-	 * initialise we continue to apply changes during recovery, so once the
-	 * oldestRunningXid is later than the nextXid from the initial snapshot we
-	 * know that we no longer have missing information and can mark the
-	 * snapshot as valid.
-	 */
-	if (standbyState == STANDBY_SNAPSHOT_PENDING)
-	{
-		/*
-		 * If the snapshot isn't overflowed or if its empty we can reset our
-		 * pending state and use this snapshot instead.
-		 */
-		if (running->subxid_status != SUBXIDS_MISSING || running->xcnt == 0)
-		{
-			/*
-			 * If we have already collected known assigned xids, we need to
-			 * throw them away before we apply the recovery snapshot.
-			 */
-			KnownAssignedXidsReset();
-			standbyState = STANDBY_INITIALIZED;
-		}
-		else
-		{
-			if (TransactionIdPrecedes(standbySnapshotPendingXmin,
-									  running->oldestRunningXid))
-			{
-				standbyState = STANDBY_SNAPSHOT_READY;
-				elog(DEBUG1,
-					 "recovery snapshots are now enabled");
-			}
-			else
-				elog(DEBUG1,
-					 "recovery snapshot waiting for non-overflowed snapshot or "
-					 "until oldest active xid on standby is at least %u (now %u)",
-					 standbySnapshotPendingXmin,
-					 running->oldestRunningXid);
-			return;
-		}
-	}
-
-	Assert(standbyState == STANDBY_INITIALIZED);
-
-	/*
-	 * NB: this can be reached at least twice, so make sure new code can deal
-	 * with that.
-	 */
+	StandbyReleaseOldLocks(oldestRunningXID);
 
-	/*
-	 * Nobody else is running yet, but take locks anyhow
-	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * KnownAssignedXids is sorted so we cannot just add the xids, we have to
-	 * sort them first.
-	 *
-	 * Some of the new xids are top-level xids and some are subtransactions.
-	 * We don't call SubTransSetParent because it doesn't matter yet. If we
-	 * aren't overflowed then all xids will fit in snapshot and so we don't
-	 * need subtrans. If we later overflow, an xid assignment record will add
-	 * xids to subtrans. If RunningTransactionsData is overflowed then we
-	 * don't have enough information to correctly update subtrans anyway.
-	 */
-
-	/*
-	 * Allocate a temporary array to avoid modifying the array passed as
-	 * argument.
-	 */
-	xids = palloc(sizeof(TransactionId) * (running->xcnt + running->subxcnt));
-
-	/*
-	 * Add to the temp array any xids which have not already completed.
-	 */
-	nxids = 0;
-	for (i = 0; i < running->xcnt + running->subxcnt; i++)
-	{
-		TransactionId xid = running->xids[i];
-
-		/*
-		 * The running-xacts snapshot can contain xids that were still visible
-		 * in the procarray when the snapshot was taken, but were already
-		 * WAL-logged as completed. They're not running anymore, so ignore
-		 * them.
-		 */
-		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
-			continue;
-
-		xids[nxids++] = xid;
-	}
-
-	if (nxids > 0)
-	{
-		if (procArray->numKnownAssignedXids != 0)
-		{
-			LWLockRelease(ProcArrayLock);
-			elog(ERROR, "KnownAssignedXids is not empty");
-		}
-
-		/*
-		 * Sort the array so that we can add them safely into
-		 * KnownAssignedXids.
-		 *
-		 * We have to sort them logically, because in KnownAssignedXidsAdd we
-		 * call TransactionIdFollowsOrEquals and so on. But we know these XIDs
-		 * come from RUNNING_XACTS, which means there are only normal XIDs
-		 * from the same epoch, so this is safe.
-		 */
-		qsort(xids, nxids, sizeof(TransactionId), xidLogicalComparator);
-
-		/*
-		 * Add the sorted snapshot into KnownAssignedXids.  The running-xacts
-		 * snapshot may include duplicated xids because of prepared
-		 * transactions, so ignore them.
-		 */
-		for (i = 0; i < nxids; i++)
-		{
-			if (i > 0 && TransactionIdEquals(xids[i - 1], xids[i]))
-			{
-				elog(DEBUG1,
-					 "found duplicated transaction %u for KnownAssignedXids insertion",
-					 xids[i]);
-				continue;
-			}
-			KnownAssignedXidsAdd(xids[i], xids[i], true);
-		}
-
-		KnownAssignedXidsDisplay(DEBUG3);
-	}
-
-	pfree(xids);
-
-	/*
-	 * latestObservedXid is at least set to the point where SUBTRANS was
-	 * started up to (cf. ProcArrayInitRecovery()) or to the biggest xid
-	 * RecordKnownAssignedTransactionIds() was called for.  Initialize
-	 * subtrans from thereon, up to nextXid - 1.
-	 *
-	 * We need to duplicate parts of RecordKnownAssignedTransactionId() here,
-	 * because we've just added xids to the known assigned xids machinery that
-	 * haven't gone through RecordKnownAssignedTransactionId().
-	 */
-	Assert(TransactionIdIsNormal(latestObservedXid));
-	TransactionIdAdvance(latestObservedXid);
-	while (TransactionIdPrecedes(latestObservedXid, running->nextXid))
-	{
-		ExtendSUBTRANS(latestObservedXid);
-		TransactionIdAdvance(latestObservedXid);
-	}
-	TransactionIdRetreat(latestObservedXid);	/* = running->nextXid - 1 */
-
-	/* ----------
-	 * Now we've got the running xids we need to set the global values that
-	 * are used to track snapshots as they evolve further.
-	 *
-	 * - latestCompletedXid which will be the xmax for snapshots
-	 * - lastOverflowedXid which shows whether snapshots overflow
-	 * - nextXid
-	 *
-	 * If the snapshot overflowed, then we still initialise with what we know,
-	 * but the recovery snapshot isn't fully valid yet because we know there
-	 * are some subxids missing. We don't know the specific subxids that are
-	 * missing, so conservatively assume the last one is latestObservedXid.
-	 * ----------
-	 */
-	if (running->subxid_status == SUBXIDS_MISSING)
-	{
-		standbyState = STANDBY_SNAPSHOT_PENDING;
-
-		standbySnapshotPendingXmin = latestObservedXid;
-		procArray->lastOverflowedXid = latestObservedXid;
-	}
-	else
-	{
-		standbyState = STANDBY_SNAPSHOT_READY;
-
-		standbySnapshotPendingXmin = InvalidTransactionId;
-
-		/*
-		 * If the 'xids' array didn't include all subtransactions, we have to
-		 * mark any snapshots taken as overflowed.
-		 */
-		if (running->subxid_status == SUBXIDS_IN_SUBTRANS)
-			procArray->lastOverflowedXid = latestObservedXid;
-		else
-		{
-			Assert(running->subxid_status == SUBXIDS_IN_ARRAY);
-			procArray->lastOverflowedXid = InvalidTransactionId;
-		}
-	}
-
-	/*
-	 * If a transaction wrote a commit record in the gap between taking and
-	 * logging the snapshot then latestCompletedXid may already be higher than
-	 * the value from the snapshot, so check before we use the incoming value.
-	 * It also might not yet be set at all.
-	 */
-	MaintainLatestCompletedXidRecovery(running->latestCompletedXid);
-
-	/*
-	 * NB: No need to increment TransamVariables->xactCompletionCount here,
-	 * nobody can see it yet.
-	 */
-
+	procArray->oldest_running_primary_xid = oldestRunningXID;
 	LWLockRelease(ProcArrayLock);
-
-	KnownAssignedXidsDisplay(DEBUG3);
-	if (standbyState == STANDBY_SNAPSHOT_READY)
-		elog(DEBUG1, "recovery snapshots are now enabled");
-	else
-		elog(DEBUG1,
-			 "recovery snapshot waiting for non-overflowed snapshot or "
-			 "until oldest active xid on standby is at least %u (now %u)",
-			 standbySnapshotPendingXmin,
-			 running->oldestRunningXid);
 }
 
-/*
- * ProcArrayApplyXidAssignment
- *		Process an XLOG_XACT_ASSIGNMENT WAL record
- */
-void
-ProcArrayApplyXidAssignment(TransactionId topxid,
-							int nsubxids, TransactionId *subxids)
-{
-	TransactionId max_xid;
-	int			i;
-
-	Assert(standbyState >= STANDBY_INITIALIZED);
-
-	max_xid = TransactionIdLatest(topxid, nsubxids, subxids);
-
-	/*
-	 * Mark all the subtransactions as observed.
-	 *
-	 * NOTE: This will fail if the subxid contains too many previously
-	 * unobserved xids to fit into known-assigned-xids. That shouldn't happen
-	 * as the code stands, because xid-assignment records should never contain
-	 * more than PGPROC_MAX_CACHED_SUBXIDS entries.
-	 */
-	RecordKnownAssignedTransactionIds(max_xid);
-
-	/*
-	 * Notice that we update pg_subtrans with the top-level xid, rather than
-	 * the parent xid. This is a difference between normal processing and
-	 * recovery, yet is still correct in all cases. The reason is that
-	 * subtransaction commit is not marked in clog until commit processing, so
-	 * all aborted subtransactions have already been clearly marked in clog.
-	 * As a result we are able to refer directly to the top-level
-	 * transaction's state rather than skipping through all the intermediate
-	 * states in the subtransaction tree. This should be the first time we
-	 * have attempted to SubTransSetParent().
-	 */
-	for (i = 0; i < nsubxids; i++)
-		SubTransSetParent(subxids[i], topxid);
-
-	/* KnownAssignedXids isn't maintained yet, so we're done for now */
-	if (standbyState == STANDBY_INITIALIZED)
-		return;
-
-	/*
-	 * Uses same locking as transaction commit
-	 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * Remove subxids from known-assigned-xacts.
-	 */
-	KnownAssignedXidsRemoveTree(InvalidTransactionId, nsubxids, subxids);
-
-	/*
-	 * Advance lastOverflowedXid to be at least the last of these subxids.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, max_xid))
-		procArray->lastOverflowedXid = max_xid;
-
-	LWLockRelease(ProcArrayLock);
-}
 
 /*
  * TransactionIdIsInProgress -- is given transaction running in some backend
@@ -1379,23 +971,24 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * Aside from some shortcuts such as checking RecentXmin and our own Xid,
  * there are four possibilities for finding a running transaction:
  *
- * 1. The given Xid is a main transaction Id.  We will find this out cheaply
+ * 1. In Hot Standby mode, there are no transactions with XIDs active in the
+ * standby. Check pg_xact to see if the transaction is known to have committed
+ * or aborted; otherwise it is considered to be running.
+ *
+ * 2. The given Xid is a main transaction Id.  We will find this out cheaply
  * by looking at ProcGlobal->xids.
  *
- * 2. The given Xid is one of the cached subxact Xids in the PGPROC array.
+ * 3. The given Xid is one of the cached subxact Xids in the PGPROC array.
  * We can find this out cheaply too.
  *
- * 3. In Hot Standby mode, we must search the KnownAssignedXids list to see
- * if the Xid is running on the primary.
- *
  * 4. Search the SubTrans tree to find the Xid's topmost parent, and then see
- * if that is running according to ProcGlobal->xids[] or KnownAssignedXids.
+ * if that is running according to ProcGlobal->xids[].
  * This is the slowest way, but sadly it has to be done always if the others
  * failed, unless we see that the cached subxact sets are complete (none have
  * overflowed).
  *
- * ProcArrayLock has to be held while we do 1, 2, 3.  If we save the top Xids
- * while doing 1 and 3, we can release the ProcArrayLock while we do 4.
+ * ProcArrayLock has to be held while we do 2 and 3.  If we save the top Xids
+ * while doing 2 and 3, we can release the ProcArrayLock while we do 4.
  * This buys back some concurrency (and we can't retrieve the main Xids from
  * ProcGlobal->xids[] again anyway; see GetNewTransactionId).
  */
@@ -1436,6 +1029,28 @@ TransactionIdIsInProgress(TransactionId xid)
 		return false;
 	}
 
+	/*
+	 * In hot standby mode, check pg_xact.
+	 *
+	 * With normal non-CSN snapshots, you must be careful to check
+	 * TransactionIdIsInProgress() before checking pg_xact, because a
+	 * transaction is marked as committed before it's removed from PGPROC. But
+	 * during recovery, we now use CSN snapshots so I think that's OK. See the
+	 * "NOTE" at the top of heapam_visibility.c.
+	 *
+	 * During recovery, the XID cannot be our own transaction, and the CSN
+	 * check handles subtransactions too, so we can skip the rest of the
+	 * function.
+	 */
+	if (RecoveryInProgress())
+	{
+		xc_during_recovery_inc();
+		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+			return false;
+		else
+			return true;
+	}
+
 	/*
 	 * Also, we can handle our own transaction (and subtransactions) without
 	 * any access to shared memory.
@@ -1452,12 +1067,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	 */
 	if (xids == NULL)
 	{
-		/*
-		 * In hot standby mode, reserve enough space to hold all xids in the
-		 * known-assigned list. If we later finish recovery, we no longer need
-		 * the bigger array, but we don't bother to shrink it.
-		 */
-		int			maxxids = RecoveryInProgress() ? TOTAL_MAX_CACHED_SUBXIDS : arrayP->maxProcs;
+		int			maxxids = arrayP->maxProcs;
 
 		xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId));
 		if (xids == NULL)
@@ -1552,33 +1162,6 @@ TransactionIdIsInProgress(TransactionId xid)
 			xids[nxids++] = pxid;
 	}
 
-	/*
-	 * Step 3: in hot standby mode, check the known-assigned-xids list.  XIDs
-	 * in the list must be treated as running.
-	 */
-	if (RecoveryInProgress())
-	{
-		/* none of the PGPROC entries should have XIDs in hot standby mode */
-		Assert(nxids == 0);
-
-		if (KnownAssignedXidExists(xid))
-		{
-			LWLockRelease(ProcArrayLock);
-			xc_by_known_assigned_inc();
-			return true;
-		}
-
-		/*
-		 * If the KnownAssignedXids overflowed, we have to check pg_subtrans
-		 * too.  Fetch all xids from KnownAssignedXids that are lower than
-		 * xid, since if xid is a subtransaction its parent will always have a
-		 * lower value.  Note we will collect both main and subXIDs here, but
-		 * there's no help for it.
-		 */
-		if (TransactionIdPrecedesOrEquals(xid, procArray->lastOverflowedXid))
-			nxids = KnownAssignedXidsGet(xids, xid);
-	}
-
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -1852,8 +1435,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 * can't be tied to a specific database.)
 		 *
 		 * Also, while in recovery we cannot compute an accurate per-database
-		 * horizon, as all xids are managed via the KnownAssignedXids
-		 * machinery.
+		 * horizon, as all xids are managed via the CSN log machinery.
 		 */
 		if (proc->databaseId == MyDatabaseId ||
 			MyDatabaseId == InvalidOid ||
@@ -1866,11 +1448,14 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	}
 
 	/*
-	 * If in recovery fetch oldest xid in KnownAssignedXids, will be applied
-	 * after lock is released.
+	 * If in recovery, fetch the oldest xid from the last checkpoint.
+	 *
+	 * XXX: that can be much older than what we had previously with the
+	 * known-assigned-xids machinery. I think that's OK, given what this
+	 * function is used for during recovery?
 	 */
 	if (in_recovery)
-		kaxmin = KnownAssignedXidsGetOldestXmin();
+		kaxmin = procArray->oldest_running_primary_xid;
 
 	/*
 	 * No other information from shared state is needed, release the lock
@@ -2181,7 +1766,7 @@ GetMVCCSnapshotData(void)
 	TransactionId myxid;
 	uint64		curXactCompletionCount;
 	MVCCSnapshotShared snapshot;
-
+	XLogRecPtr	csn = InvalidXLogRecPtr;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -2355,27 +1940,8 @@ GetMVCCSnapshotData(void)
 	else
 	{
 		/*
-		 * We're in hot standby, so get XIDs from KnownAssignedXids.
-		 *
-		 * We store all xids directly into subxip[]. Here's why:
-		 *
-		 * In recovery we don't know which xids are top-level and which are
-		 * subxacts, a design choice that greatly simplifies xid processing.
-		 *
-		 * It seems like we would want to try to put xids into xip[] only, but
-		 * that is fairly small. We would either need to make that bigger or
-		 * to increase the rate at which we WAL-log xid assignment; neither is
-		 * an appealing choice.
-		 *
-		 * We could try to store xids into xip[] first and then into subxip[]
-		 * if there are too many xids. That only works if the snapshot doesn't
-		 * overflow because we do not search subxip[] in that case. A simpler
-		 * way is to just store all xids in the subxip array because this is
-		 * by far the bigger array. We just leave the xip array empty.
-		 *
-		 * Either way we need to change the way XidInMVCCSnapshot() works
-		 * depending upon when the snapshot was taken, or change normal
-		 * snapshot processing so it matches.
+		 * We're in hot standby, so get the current CSN. That's used to
+		 * determine which transactions committed before this snapshot.
 		 *
 		 * Note: It is possible for recovery to end before we finish taking
 		 * the snapshot, and for newly assigned transaction ids to be added to
@@ -2383,14 +1949,17 @@ GetMVCCSnapshotData(void)
 		 * those newly added transaction ids would be filtered away, so we
 		 * need not be concerned about them.
 		 */
-		subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
-												  xmax);
+		xmin = procArray->oldest_running_primary_xid;
 
-		if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
-			suboverflowed = true;
+		/*
+		 * Take CSN under ProcArrayLock so the snapshot stays synchronized.
+		 * (XXX: not sure that's strictly required.) This is what determines
+		 * which transactions we consider finished and which are still in
+		 * progress.
+		 */
+		csn = TransamVariables->latestCommitLSN;
 	}
 
-
 	/*
 	 * Fetch into local variable while ProcArrayLock is held - the
 	 * LWLockRelease below is a barrier, ensuring this happens inside the
@@ -2507,6 +2076,8 @@ GetMVCCSnapshotData(void)
 		latestSnapshotShared = snapshot;
 	}
 
+	snapshot->snapshotCsn = csn;
+
 	return snapshot;
 }
 
@@ -2662,9 +2233,6 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * The returned data structure is statically allocated; caller should not
  * modify it, and must not assume it is valid past the next call.
  *
- * This is never executed during recovery so there is no need to look at
- * KnownAssignedXids.
- *
  * Dummy PGPROCs from prepared transaction are included, meaning that this
  * may return entries with duplicated TransactionId values coming from
  * transaction finishing to prepare.  Nothing is done about duplicated
@@ -2695,6 +2263,7 @@ GetRunningTransactionData(void)
 	int			subcount;
 	bool		suboverflowed;
 
+	/* This is never executed during recovery */
 	Assert(!RecoveryInProgress());
 
 	/*
@@ -2861,15 +2430,16 @@ GetRunningTransactionData(void)
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
- * This is never executed during recovery so there is no need to look at
- * KnownAssignedXids.
+ * If allDbs is false, skip processes attached to other databases.
+ *
+ * This is never executed during recovery.
  *
  * We don't worry about updating other counters, we want to keep this as
  * simple as possible and leave GetMVCCSnapshotData() as the primary code for
  * that bookkeeping.
  */
 TransactionId
-GetOldestActiveTransactionId(void)
+GetOldestActiveTransactionId(bool allDbs)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId *other_xids = ProcGlobal->xids;
@@ -2890,11 +2460,13 @@ GetOldestActiveTransactionId(void)
 	LWLockRelease(XidGenLock);
 
 	/*
-	 * Spin over procArray collecting all xids and subxids.
+	 * Spin over procArray checking each xid.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
+		int			pgprocno = arrayP->pgprocnos[index];
+		PGPROC	   *proc = &allProcs[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -2903,6 +2475,9 @@ GetOldestActiveTransactionId(void)
 		if (!TransactionIdIsNormal(xid))
 			continue;
 
+		if (!allDbs && proc->databaseId != MyDatabaseId)
+			continue;
+
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
@@ -2981,8 +2556,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 *
 	 * In recovery we can't lower the safe value besides what we've computed
 	 * above, so we'll have to wait a bit longer there. We unfortunately can
-	 * *not* use KnownAssignedXidsGetOldestXmin() since the KnownAssignedXids
-	 * machinery can miss values and return an older value than is safe.
+	 * *not* use oldest_running_primary_xid since the XID tracking machinery
+	 * can miss values and return an older value than is safe.
 	 */
 	if (!recovery_in_progress)
 	{
@@ -3400,6 +2975,9 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
  * but that would not be true in the case of FATAL errors lagging in array,
  * but we already know those are bogus anyway, so we skip that test.
  *
+ * XXX: KnownAssignedXids is gone so the above comment needs updating. Is
+ * the code still correct? I think so but need to double-check.
+ *
  * If dbOid is valid we skip backends attached to other databases.
  *
  * Be careful to *not* pfree the result from this function. We reuse
@@ -4071,14 +3649,14 @@ static void
 DisplayXidCache(void)
 {
 	fprintf(stderr,
-			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, knownassigned: %ld, nooflo: %ld, slow: %ld\n",
+			"XidCache: xmin: %ld, known: %ld, myxact: %ld, latest: %ld, mainxid: %ld, childxid: %ld, during_recovery: %ld, nooflo: %ld, slow: %ld\n",
 			xc_by_recent_xmin,
 			xc_by_known_xact,
 			xc_by_my_xact,
 			xc_by_latest_xid,
 			xc_by_main_xid,
 			xc_by_child_xid,
-			xc_by_known_assigned,
+			xc_during_recovery,
 			xc_no_overflow,
 			xc_slow_answer);
 }
@@ -4325,61 +3903,6 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 }
 
 
-/* ----------------------------------------------
- *		KnownAssignedTransactionIds sub-module
- * ----------------------------------------------
- */
-
-/*
- * In Hot Standby mode, we maintain a list of transactions that are (or were)
- * running on the primary at the current point in WAL.  These XIDs must be
- * treated as running by standby transactions, even though they are not in
- * the standby server's PGPROC array.
- *
- * We record all XIDs that we know have been assigned.  That includes all the
- * XIDs seen in WAL records, plus all unobserved XIDs that we can deduce have
- * been assigned.  We can deduce the existence of unobserved XIDs because we
- * know XIDs are assigned in sequence, with no gaps.  The KnownAssignedXids
- * list expands as new XIDs are observed or inferred, and contracts when
- * transaction completion records arrive.
- *
- * During hot standby we do not fret too much about the distinction between
- * top-level XIDs and subtransaction XIDs. We store both together in the
- * KnownAssignedXids list.  In backends, this is copied into snapshots in
- * GetMVCCSnapshotData(), taking advantage of the fact that XidInMVCCSnapshot()
- * doesn't care about the distinction either.  Subtransaction XIDs are
- * effectively treated as top-level XIDs and in the typical case pg_subtrans
- * links are *not* maintained (which does not affect visibility).
- *
- * We have room in KnownAssignedXids and in snapshots to hold maxProcs *
- * (1 + PGPROC_MAX_CACHED_SUBXIDS) XIDs, so every primary transaction must
- * report its subtransaction XIDs in a WAL XLOG_XACT_ASSIGNMENT record at
- * least every PGPROC_MAX_CACHED_SUBXIDS.  When we receive one of these
- * records, we mark the subXIDs as children of the top XID in pg_subtrans,
- * and then remove them from KnownAssignedXids.  This prevents overflow of
- * KnownAssignedXids and snapshots, at the cost that status checks for these
- * subXIDs will take a slower path through TransactionIdIsInProgress().
- * This means that KnownAssignedXids is not necessarily complete for subXIDs,
- * though it should be complete for top-level XIDs; this is the same situation
- * that holds with respect to the PGPROC entries in normal running.
- *
- * When we throw away subXIDs from KnownAssignedXids, we need to keep track of
- * that, similarly to tracking overflow of a PGPROC's subxids array.  We do
- * that by remembering the lastOverflowedXid, ie the last thrown-away subXID.
- * As long as that is within the range of interesting XIDs, we have to assume
- * that subXIDs are missing from snapshots.  (Note that subXID overflow occurs
- * on primary when 65th subXID arrives, whereas on standby it occurs when 64th
- * subXID arrives - that is not an error.)
- *
- * Should a backend on primary somehow disappear before it can write an abort
- * record, then we just leave those XIDs in KnownAssignedXids. They actually
- * aborted but we think they were running; the distinction is irrelevant
- * because either way any changes done by the transaction are not visible to
- * backends in the standby.  We prune KnownAssignedXids when
- * XLOG_RUNNING_XACTS arrives, to forestall possible overflow of the
- * array due to such dead XIDs.
- */
-
 /*
  * RecordKnownAssignedTransactionIds
  *		Record the given XID in KnownAssignedXids, as well as any preceding
@@ -4394,7 +3917,7 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
 void
 RecordKnownAssignedTransactionIds(TransactionId xid)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	Assert(InHotStandby);
 	Assert(TransactionIdIsValid(xid));
 	Assert(TransactionIdIsValid(latestObservedXid));
 
@@ -4412,38 +3935,19 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 
 		/*
 		 * Extend subtrans like we do in GetNewTransactionId() during normal
-		 * operation using individual extend steps. Note that we do not need
-		 * to extend clog since its extensions are WAL logged.
-		 *
-		 * This part has to be done regardless of standbyState since we
-		 * immediately start assigning subtransactions to their toplevel
-		 * transactions.
+		 * operation using individual extend steps. The CSN log is extended
+		 * the same way. Note that we do not need to extend clog since its
+		 * extensions are WAL logged.
 		 */
 		next_expected_xid = latestObservedXid;
 		while (TransactionIdPrecedes(next_expected_xid, xid))
 		{
 			TransactionIdAdvance(next_expected_xid);
 			ExtendSUBTRANS(next_expected_xid);
+			ExtendCSNLog(next_expected_xid);
 		}
 		Assert(next_expected_xid == xid);
 
-		/*
-		 * If the KnownAssignedXids machinery isn't up yet, there's nothing
-		 * more to do since we don't track assigned xids yet.
-		 */
-		if (standbyState <= STANDBY_INITIALIZED)
-		{
-			latestObservedXid = xid;
-			return;
-		}
-
-		/*
-		 * Add (latestObservedXid, xid] onto the KnownAssignedXids array.
-		 */
-		next_expected_xid = latestObservedXid;
-		TransactionIdAdvance(next_expected_xid);
-		KnownAssignedXidsAdd(next_expected_xid, xid, false);
-
 		/*
 		 * Now we can advance latestObservedXid
 		 */
@@ -4455,805 +3959,61 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 }
 
 /*
- * ExpireTreeKnownAssignedTransactionIds
- *		Remove the given XIDs from KnownAssignedXids.
+ * ProcArrayRecoveryEndTransaction
+ *
+ * Called during recovery in analogy with and in place of
+ * ProcArrayEndTransaction(). The transaction becomes visible to any new
+ * snapshots taken after this. 'max_xid' is the highest (sub)XID of the
+ * committed transaction, and 'lsn' is the LSN of the commit record.
  *
- * Called during recovery in analogy with and in place of ProcArrayEndTransaction()
+ * The transaction and all its subtransactions have already been marked as
+ * committed in the CLOG and in the CSNLOG.
  */
 void
-ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
-									  TransactionId *subxids, TransactionId max_xid)
+ProcArrayRecoveryEndTransaction(TransactionId max_xid, XLogRecPtr lsn)
 {
-	Assert(standbyState >= STANDBY_INITIALIZED);
+	TransactionId oldest_running_primary_xid;
+
+	Assert(InHotStandby);
+
+	/*
+	 * If this was the oldest XID that was still running, advance it. This is
+	 * important for advancing the global xmin, which avoids unnecessary
+	 * recovery conflicts.
+	 *
+	 * No locking required because this runs in the startup process.
+	 *
+	 * XXX: the caller actually has a list of XIDs that just committed. We
+	 * could save some clog lookups by taking advantage of that list.
+	 */
+	oldest_running_primary_xid = procArray->oldest_running_primary_xid;
+	while (TransactionIdPrecedes(oldest_running_primary_xid, max_xid))
+	{
+		if (!TransactionIdDidCommit(oldest_running_primary_xid) &&
+			!TransactionIdDidAbort(oldest_running_primary_xid))
+		{
+			break;
+		}
+		TransactionIdAdvance(oldest_running_primary_xid);
+	}
+	if (max_xid == oldest_running_primary_xid)
+		TransactionIdAdvance(oldest_running_primary_xid);
 
 	/*
 	 * Uses same locking as transaction commit
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
-	KnownAssignedXidsRemoveTree(xid, nsubxids, subxids);
-
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
 	/* ... and xactCompletionCount */
 	TransamVariables->xactCompletionCount++;
 
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireAllKnownAssignedTransactionIds
- *		Remove all entries in KnownAssignedXids and reset lastOverflowedXid.
- */
-void
-ExpireAllKnownAssignedTransactionIds(void)
-{
-	FullTransactionId latestXid;
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	KnownAssignedXidsRemovePreceding(InvalidTransactionId);
-
-	/* Reset latestCompletedXid to nextXid - 1 */
-	Assert(FullTransactionIdIsValid(TransamVariables->nextXid));
-	latestXid = TransamVariables->nextXid;
-	FullTransactionIdRetreat(&latestXid);
-	TransamVariables->latestCompletedXid = latestXid;
-
-	/*
-	 * Any transactions that were in-progress were effectively aborted, so
-	 * advance xactCompletionCount.
-	 */
-	TransamVariables->xactCompletionCount++;
-
-	/*
-	 * Reset lastOverflowedXid.  Currently, lastOverflowedXid has no use after
-	 * the call of this function.  But do this for unification with what
-	 * ExpireOldKnownAssignedTransactionIds() do.
-	 */
-	procArray->lastOverflowedXid = InvalidTransactionId;
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * ExpireOldKnownAssignedTransactionIds
- *		Remove KnownAssignedXids entries preceding the given XID and
- *		potentially reset lastOverflowedXid.
- */
-void
-ExpireOldKnownAssignedTransactionIds(TransactionId xid)
-{
-	TransactionId latestXid;
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
-	latestXid = xid;
-	TransactionIdRetreat(latestXid);
-	MaintainLatestCompletedXidRecovery(latestXid);
-
-	/* ... and xactCompletionCount */
-	TransamVariables->xactCompletionCount++;
-
-	/*
-	 * Reset lastOverflowedXid if we know all transactions that have been
-	 * possibly running are being gone.  Not doing so could cause an incorrect
-	 * lastOverflowedXid value, which makes extra snapshots be marked as
-	 * suboverflowed.
-	 */
-	if (TransactionIdPrecedes(procArray->lastOverflowedXid, xid))
-		procArray->lastOverflowedXid = InvalidTransactionId;
-	KnownAssignedXidsRemovePreceding(xid);
-	LWLockRelease(ProcArrayLock);
-}
-
-/*
- * KnownAssignedTransactionIdsIdleMaintenance
- *		Opportunistically do maintenance work when the startup process
- *		is about to go idle.
- */
-void
-KnownAssignedTransactionIdsIdleMaintenance(void)
-{
-	KnownAssignedXidsCompress(KAX_STARTUP_PROCESS_IDLE, false);
-}
-
-
-/*
- * Private module functions to manipulate KnownAssignedXids
- *
- * There are 5 main uses of the KnownAssignedXids data structure:
- *
- *	* backends taking snapshots - all valid XIDs need to be copied out
- *	* backends seeking to determine presence of a specific XID
- *	* startup process adding new known-assigned XIDs
- *	* startup process removing specific XIDs as transactions end
- *	* startup process pruning array when special WAL records arrive
- *
- * This data structure is known to be a hot spot during Hot Standby, so we
- * go to some lengths to make these operations as efficient and as concurrent
- * as possible.
- *
- * The XIDs are stored in an array in sorted order --- TransactionIdPrecedes
- * order, to be exact --- to allow binary search for specific XIDs.  Note:
- * in general TransactionIdPrecedes would not provide a total order, but
- * we know that the entries present at any instant should not extend across
- * a large enough fraction of XID space to wrap around (the primary would
- * shut down for fear of XID wrap long before that happens).  So it's OK to
- * use TransactionIdPrecedes as a binary-search comparator.
- *
- * It's cheap to maintain the sortedness during insertions, since new known
- * XIDs are always reported in XID order; we just append them at the right.
- *
- * To keep individual deletions cheap, we need to allow gaps in the array.
- * This is implemented by marking array elements as valid or invalid using
- * the parallel boolean array KnownAssignedXidsValid[].  A deletion is done
- * by setting KnownAssignedXidsValid[i] to false, *without* clearing the
- * XID entry itself.  This preserves the property that the XID entries are
- * sorted, so we can do binary searches easily.  Periodically we compress
- * out the unused entries; that's much cheaper than having to compress the
- * array immediately on every deletion.
- *
- * The actually valid items in KnownAssignedXids[] and KnownAssignedXidsValid[]
- * are those with indexes tail <= i < head; items outside this subscript range
- * have unspecified contents.  When head reaches the end of the array, we
- * force compression of unused entries rather than wrapping around, since
- * allowing wraparound would greatly complicate the search logic.  We maintain
- * an explicit tail pointer so that pruning of old XIDs can be done without
- * immediately moving the array contents.  In most cases only a small fraction
- * of the array contains valid entries at any instant.
- *
- * Although only the startup process can ever change the KnownAssignedXids
- * data structure, we still need interlocking so that standby backends will
- * not observe invalid intermediate states.  The convention is that backends
- * must hold shared ProcArrayLock to examine the array.  To remove XIDs from
- * the array, the startup process must hold ProcArrayLock exclusively, for
- * the usual transactional reasons (compare commit/abort of a transaction
- * during normal running).  Compressing unused entries out of the array
- * likewise requires exclusive lock.  To add XIDs to the array, we just insert
- * them into slots to the right of the head pointer and then advance the head
- * pointer.  This doesn't require any lock at all, but on machines with weak
- * memory ordering, we need to be careful that other processors see the array
- * element changes before they see the head pointer change.  We handle this by
- * using memory barriers when reading or writing the head/tail pointers (unless
- * the caller holds ProcArrayLock exclusively).
- *
- * Algorithmic analysis:
- *
- * If we have a maximum of M slots, with N XIDs currently spread across
- * S elements then we have N <= S <= M always.
- *
- *	* Adding a new XID is O(1) and needs no lock (unless compression must
- *		happen)
- *	* Compressing the array is O(S) and requires exclusive lock
- *	* Removing an XID is O(logS) and requires exclusive lock
- *	* Taking a snapshot is O(S) and requires shared lock
- *	* Checking for an XID is O(logS) and requires shared lock
- *
- * In comparison, using a hash table for KnownAssignedXids would mean that
- * taking snapshots would be O(M). If we can maintain S << M then the
- * sorted array technique will deliver significantly faster snapshots.
- * If we try to keep S too small then we will spend too much time compressing,
- * so there is an optimal point for any workload mix. We use a heuristic to
- * decide when to compress the array, though trimming also helps reduce
- * frequency of compressing. The heuristic requires us to track the number of
- * currently valid XIDs in the array (N).  Except in special cases, we'll
- * compress when S >= 2N.  Bounding S at 2N in turn bounds the time for
- * taking a snapshot to be O(N), which it would have to be anyway.
- */
-
-
-/*
- * Compress KnownAssignedXids by shifting valid data down to the start of the
- * array, removing any gaps.
- *
- * A compression step is forced if "reason" is KAX_NO_SPACE, otherwise
- * we do it only if a heuristic indicates it's a good time to do it.
- *
- * Compression requires holding ProcArrayLock in exclusive mode.
- * Caller must pass haveLock = true if it already holds the lock.
- */
-static void
-KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			head,
-				tail,
-				nelements;
-	int			compress_index;
-	int			i;
-
-	/* Counters for compression heuristics */
-	static unsigned int transactionEndsCounter;
-	static TimestampTz lastCompressTs;
-
-	/* Tuning constants */
-#define KAX_COMPRESS_FREQUENCY 128	/* in transactions */
-#define KAX_COMPRESS_IDLE_INTERVAL 1000 /* in ms */
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-	nelements = head - tail;
-
-	/*
-	 * If we can choose whether to compress, use a heuristic to avoid
-	 * compressing too often or not often enough.  "Compress" here simply
-	 * means moving the values to the beginning of the array, so it is not as
-	 * complex or costly as typical data compression algorithms.
-	 */
-	if (nelements == pArray->numKnownAssignedXids)
-	{
-		/*
-		 * When there are no gaps between head and tail, don't bother to
-		 * compress, except in the KAX_NO_SPACE case where we must compress to
-		 * create some space after the head.
-		 */
-		if (reason != KAX_NO_SPACE)
-			return;
-	}
-	else if (reason == KAX_TRANSACTION_END)
-	{
-		/*
-		 * Consider compressing only once every so many commits.  Frequency
-		 * determined by benchmarks.
-		 */
-		if ((transactionEndsCounter++) % KAX_COMPRESS_FREQUENCY != 0)
-			return;
-
-		/*
-		 * Furthermore, compress only if the used part of the array is less
-		 * than 50% full (see comments above).
-		 */
-		if (nelements < 2 * pArray->numKnownAssignedXids)
-			return;
-	}
-	else if (reason == KAX_STARTUP_PROCESS_IDLE)
-	{
-		/*
-		 * We're about to go idle for lack of new WAL, so we might as well
-		 * compress.  But not too often, to avoid ProcArray lock contention
-		 * with readers.
-		 */
-		if (lastCompressTs != 0)
-		{
-			TimestampTz compress_after;
-
-			compress_after = TimestampTzPlusMilliseconds(lastCompressTs,
-														 KAX_COMPRESS_IDLE_INTERVAL);
-			if (GetCurrentTimestamp() < compress_after)
-				return;
-		}
-	}
-
-	/* Need to compress, so get the lock if we don't have it. */
-	if (!haveLock)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-	/*
-	 * We compress the array by reading the valid values from tail to head,
-	 * re-aligning data to 0th element.
-	 */
-	compress_index = 0;
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			KnownAssignedXids[compress_index] = KnownAssignedXids[i];
-			KnownAssignedXidsValid[compress_index] = true;
-			compress_index++;
-		}
-	}
-	Assert(compress_index == pArray->numKnownAssignedXids);
-
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = compress_index;
-
-	if (!haveLock)
-		LWLockRelease(ProcArrayLock);
-
-	/* Update timestamp for maintenance.  No need to hold lock for this. */
-	lastCompressTs = GetCurrentTimestamp();
-}
-
-/*
- * Add xids into KnownAssignedXids at the head of the array.
- *
- * xids from from_xid to to_xid, inclusive, are added to the array.
- *
- * If exclusive_lock is true then caller already holds ProcArrayLock in
- * exclusive mode, so we need no extra locking here.  Else caller holds no
- * lock, so we need to be sure we maintain sufficient interlocks against
- * concurrent readers.  (Only the startup process ever calls this, so no need
- * to worry about concurrent writers.)
- */
-static void
-KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
-					 bool exclusive_lock)
-{
-	ProcArrayStruct *pArray = procArray;
-	TransactionId next_xid;
-	int			head,
-				tail;
-	int			nxids;
-	int			i;
-
-	Assert(TransactionIdPrecedesOrEquals(from_xid, to_xid));
-
-	/*
-	 * Calculate how many array slots we'll need.  Normally this is cheap; in
-	 * the unusual case where the XIDs cross the wrap point, we do it the hard
-	 * way.
-	 */
-	if (to_xid >= from_xid)
-		nxids = to_xid - from_xid + 1;
-	else
-	{
-		nxids = 1;
-		next_xid = from_xid;
-		while (TransactionIdPrecedes(next_xid, to_xid))
-		{
-			nxids++;
-			TransactionIdAdvance(next_xid);
-		}
-	}
-
-	/*
-	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
-	 */
-	head = pArray->headKnownAssignedXids;
-	tail = pArray->tailKnownAssignedXids;
-
-	Assert(head >= 0 && head <= pArray->maxKnownAssignedXids);
-	Assert(tail >= 0 && tail < pArray->maxKnownAssignedXids);
-
-	/*
-	 * Verify that insertions occur in TransactionId sequence.  Note that even
-	 * if the last existing element is marked invalid, it must still have a
-	 * correctly sequenced XID value.
-	 */
-	if (head > tail &&
-		TransactionIdFollowsOrEquals(KnownAssignedXids[head - 1], from_xid))
-	{
-		KnownAssignedXidsDisplay(LOG);
-		elog(ERROR, "out-of-order XID insertion in KnownAssignedXids");
-	}
-
-	/*
-	 * If our xids won't fit in the remaining space, compress out free space
-	 */
-	if (head + nxids > pArray->maxKnownAssignedXids)
-	{
-		KnownAssignedXidsCompress(KAX_NO_SPACE, exclusive_lock);
-
-		head = pArray->headKnownAssignedXids;
-		/* note: we no longer care about the tail pointer */
-
-		/*
-		 * If it still won't fit then we're out of memory
-		 */
-		if (head + nxids > pArray->maxKnownAssignedXids)
-			elog(ERROR, "too many KnownAssignedXids");
-	}
-
-	/* Now we can insert the xids into the space starting at head */
-	next_xid = from_xid;
-	for (i = 0; i < nxids; i++)
-	{
-		KnownAssignedXids[head] = next_xid;
-		KnownAssignedXidsValid[head] = true;
-		TransactionIdAdvance(next_xid);
-		head++;
-	}
-
-	/* Adjust count of number of valid entries */
-	pArray->numKnownAssignedXids += nxids;
-
-	/*
-	 * Now update the head pointer.  We use a write barrier to ensure that
-	 * other processors see the above array updates before they see the head
-	 * pointer change.  The barrier isn't required if we're holding
-	 * ProcArrayLock exclusively.
-	 */
-	if (!exclusive_lock)
-		pg_write_barrier();
-
-	pArray->headKnownAssignedXids = head;
-}
-
-/*
- * KnownAssignedXidsSearch
- *
- * Searches KnownAssignedXids for a specific xid and optionally removes it.
- * Returns true if it was found, false if not.
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- * Exclusive lock must be held for remove = true.
- */
-static bool
-KnownAssignedXidsSearch(TransactionId xid, bool remove)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			first,
-				last;
-	int			head;
-	int			tail;
-	int			result_index = -1;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	/*
-	 * Only the startup process removes entries, so we don't need the read
-	 * barrier in that case.
-	 */
-	if (!remove)
-		pg_read_barrier();		/* pairs with KnownAssignedXidsAdd */
-
-	/*
-	 * Standard binary search.  Note we can ignore the KnownAssignedXidsValid
-	 * array here, since even invalid entries will contain sorted XIDs.
-	 */
-	first = tail;
-	last = head - 1;
-	while (first <= last)
-	{
-		int			mid_index;
-		TransactionId mid_xid;
-
-		mid_index = (first + last) / 2;
-		mid_xid = KnownAssignedXids[mid_index];
-
-		if (xid == mid_xid)
-		{
-			result_index = mid_index;
-			break;
-		}
-		else if (TransactionIdPrecedes(xid, mid_xid))
-			last = mid_index - 1;
-		else
-			first = mid_index + 1;
-	}
-
-	if (result_index < 0)
-		return false;			/* not in array */
-
-	if (!KnownAssignedXidsValid[result_index])
-		return false;			/* in array, but invalid */
-
-	if (remove)
-	{
-		KnownAssignedXidsValid[result_index] = false;
-
-		pArray->numKnownAssignedXids--;
-		Assert(pArray->numKnownAssignedXids >= 0);
-
-		/*
-		 * If we're removing the tail element then advance tail pointer over
-		 * any invalid elements.  This will speed future searches.
-		 */
-		if (result_index == tail)
-		{
-			tail++;
-			while (tail < head && !KnownAssignedXidsValid[tail])
-				tail++;
-			if (tail >= head)
-			{
-				/* Array is empty, so we can reset both pointers */
-				pArray->headKnownAssignedXids = 0;
-				pArray->tailKnownAssignedXids = 0;
-			}
-			else
-			{
-				pArray->tailKnownAssignedXids = tail;
-			}
-		}
-	}
-
-	return true;
-}
-
-/*
- * Is the specified XID present in KnownAssignedXids[]?
- *
- * Caller must hold ProcArrayLock in shared or exclusive mode.
- */
-static bool
-KnownAssignedXidExists(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	return KnownAssignedXidsSearch(xid, false);
-}
-
-/*
- * Remove the specified XID from KnownAssignedXids[].
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemove(TransactionId xid)
-{
-	Assert(TransactionIdIsValid(xid));
-
-	elog(DEBUG4, "remove KnownAssignedXid %u", xid);
-
-	/*
-	 * Note: we cannot consider it an error to remove an XID that's not
-	 * present.  We intentionally remove subxact IDs while processing
-	 * XLOG_XACT_ASSIGNMENT, to avoid array overflow.  Then those XIDs will be
-	 * removed again when the top-level xact commits or aborts.
-	 *
-	 * It might be possible to track such XIDs to distinguish this case from
-	 * actual errors, but it would be complicated and probably not worth it.
-	 * So, just ignore the search result.
-	 */
-	(void) KnownAssignedXidsSearch(xid, true);
-}
-
-/*
- * KnownAssignedXidsRemoveTree
- *		Remove xid (if it's not InvalidTransactionId) and all the subxids.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemoveTree(TransactionId xid, int nsubxids,
-							TransactionId *subxids)
-{
-	int			i;
-
-	if (TransactionIdIsValid(xid))
-		KnownAssignedXidsRemove(xid);
-
-	for (i = 0; i < nsubxids; i++)
-		KnownAssignedXidsRemove(subxids[i]);
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_TRANSACTION_END, true);
-}
-
-/*
- * Prune KnownAssignedXids up to, but *not* including xid. If xid is invalid
- * then clear the whole table.
- *
- * Caller must hold ProcArrayLock in exclusive mode.
- */
-static void
-KnownAssignedXidsRemovePreceding(TransactionId removeXid)
-{
-	ProcArrayStruct *pArray = procArray;
-	int			count = 0;
-	int			head,
-				tail,
-				i;
-
-	if (!TransactionIdIsValid(removeXid))
-	{
-		elog(DEBUG4, "removing all KnownAssignedXids");
-		pArray->numKnownAssignedXids = 0;
-		pArray->headKnownAssignedXids = pArray->tailKnownAssignedXids = 0;
-		return;
-	}
-
-	elog(DEBUG4, "prune KnownAssignedXids to %u", removeXid);
-
-	/*
-	 * Mark entries invalid starting at the tail.  Since array is sorted, we
-	 * can stop as soon as we reach an entry >= removeXid.
-	 */
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			if (TransactionIdFollowsOrEquals(knownXid, removeXid))
-				break;
-
-			if (!StandbyTransactionIdIsPrepared(knownXid))
-			{
-				KnownAssignedXidsValid[i] = false;
-				count++;
-			}
-		}
-	}
-
-	pArray->numKnownAssignedXids -= count;
-	Assert(pArray->numKnownAssignedXids >= 0);
-
-	/*
-	 * Advance the tail pointer if we've marked the tail item invalid.
-	 */
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-			break;
-	}
-	if (i >= head)
-	{
-		/* Array is empty, so we can reset both pointers */
-		pArray->headKnownAssignedXids = 0;
-		pArray->tailKnownAssignedXids = 0;
-	}
-	else
-	{
-		pArray->tailKnownAssignedXids = i;
-	}
-
-	/* Opportunistically compress the array */
-	KnownAssignedXidsCompress(KAX_PRUNE, true);
-}
-
-/*
- * KnownAssignedXidsGet - Get an array of xids by scanning KnownAssignedXids.
- * We filter out anything >= xmax.
- *
- * Returns the number of XIDs stored into xarray[].  Caller is responsible
- * that array is large enough.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax)
-{
-	TransactionId xtmp = InvalidTransactionId;
-
-	return KnownAssignedXidsGetAndSetXmin(xarray, &xtmp, xmax);
-}
-
-/*
- * KnownAssignedXidsGetAndSetXmin - as KnownAssignedXidsGet, plus
- * we reduce *xmin to the lowest xid value seen if not already lower.
- *
- * Caller must hold ProcArrayLock in (at least) shared mode.
- */
-static int
-KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
-							   TransactionId xmax)
-{
-	int			count = 0;
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop. We can stop
-	 * once we reach the initially seen head, since we are certain that an xid
-	 * cannot enter and then leave the array while we hold ProcArrayLock.  We
-	 * might miss newly-added xids, but they should be >= xmax so irrelevant
-	 * anyway.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-		{
-			TransactionId knownXid = KnownAssignedXids[i];
-
-			/*
-			 * Update xmin if required.  Only the first XID need be checked,
-			 * since the array is sorted.
-			 */
-			if (count == 0 &&
-				TransactionIdPrecedes(knownXid, *xmin))
-				*xmin = knownXid;
-
-			/*
-			 * Filter out anything >= xmax, again relying on sorted property
-			 * of array.
-			 */
-			if (TransactionIdIsValid(xmax) &&
-				TransactionIdFollowsOrEquals(knownXid, xmax))
-				break;
-
-			/* Add knownXid into output array */
-			xarray[count++] = knownXid;
-		}
-	}
-
-	return count;
-}
-
-/*
- * Get oldest XID in the KnownAssignedXids array, or InvalidTransactionId
- * if nothing there.
- */
-static TransactionId
-KnownAssignedXidsGetOldestXmin(void)
-{
-	int			head,
-				tail;
-	int			i;
-
-	/*
-	 * Fetch head just once, since it may change while we loop.
-	 */
-	tail = procArray->tailKnownAssignedXids;
-	head = procArray->headKnownAssignedXids;
-
-	pg_read_barrier();			/* pairs with KnownAssignedXidsAdd */
-
-	for (i = tail; i < head; i++)
-	{
-		/* Skip any gaps in the array */
-		if (KnownAssignedXidsValid[i])
-			return KnownAssignedXids[i];
-	}
-
-	return InvalidTransactionId;
-}
-
-/*
- * Display KnownAssignedXids to provide debug trail
- *
- * Currently this is only called within startup process, so we need no
- * special locking.
- *
- * Note this is pretty expensive, and much of the expense will be incurred
- * even if the elog message will get discarded.  It's not currently called
- * in any performance-critical places, however, so no need to be tenser.
- */
-static void
-KnownAssignedXidsDisplay(int trace_level)
-{
-	ProcArrayStruct *pArray = procArray;
-	StringInfoData buf;
-	int			head,
-				tail,
-				i;
-	int			nxids = 0;
-
-	tail = pArray->tailKnownAssignedXids;
-	head = pArray->headKnownAssignedXids;
-
-	initStringInfo(&buf);
-
-	for (i = tail; i < head; i++)
-	{
-		if (KnownAssignedXidsValid[i])
-		{
-			nxids++;
-			appendStringInfo(&buf, "[%d]=%u ", i, KnownAssignedXids[i]);
-		}
-	}
-
-	elog(trace_level, "%d KnownAssignedXids (num=%d tail=%d head=%d) %s",
-		 nxids,
-		 pArray->numKnownAssignedXids,
-		 pArray->tailKnownAssignedXids,
-		 pArray->headKnownAssignedXids,
-		 buf.data);
-
-	pfree(buf.data);
-}
-
-/*
- * KnownAssignedXidsReset
- *		Resets KnownAssignedXids to be empty
- */
-static void
-KnownAssignedXidsReset(void)
-{
-	ProcArrayStruct *pArray = procArray;
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(lsn > TransamVariables->latestCommitLSN);
+	TransamVariables->latestCommitLSN = lsn;
 
-	pArray->numKnownAssignedXids = 0;
-	pArray->tailKnownAssignedXids = 0;
-	pArray->headKnownAssignedXids = 0;
+	procArray->oldest_running_primary_xid = oldest_running_primary_xid;
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 5acb4508f85..217b1670f5b 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -139,8 +139,6 @@ InitRecoveryTransactionEnvironment(void)
 	vxid.procNumber = MyProcNumber;
 	vxid.localTransactionId = GetNextLocalTransactionId();
 	VirtualXactLockTableInsert(vxid);
-
-	standbyState = STANDBY_INITIALIZED;
 }
 
 /*
@@ -168,9 +166,6 @@ ShutdownRecoveryTransactionEnvironment(void)
 	if (RecoveryLockHash == NULL)
 		return;
 
-	/* Mark all tracked in-progress transactions as finished. */
-	ExpireAllKnownAssignedTransactionIds();
-
 	/* Release all locks the tracked transactions were holding */
 	StandbyReleaseAllLocks();
 
@@ -1167,7 +1162,7 @@ standby_redo(XLogReaderState *record)
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
 	/* Do nothing if we're not in hot standby mode */
-	if (standbyState == STANDBY_DISABLED)
+	if (!InHotStandby)
 		return;
 
 	if (info == XLOG_STANDBY_LOCK)
@@ -1182,18 +1177,21 @@ standby_redo(XLogReaderState *record)
 	}
 	else if (info == XLOG_RUNNING_XACTS)
 	{
+		/*
+		 * XXX: running-xacts records were previously used to update
+		 * known-assigned xids, but now we only need them for the logical
+		 * replication snapbuilder stuff, and for the
+		 * pgstat_report_stat(true) call below.
+		 */
 		xl_running_xacts *xlrec = (xl_running_xacts *) XLogRecGetData(record);
-		RunningTransactionsData running;
 
-		running.xcnt = xlrec->xcnt;
-		running.subxcnt = xlrec->subxcnt;
-		running.subxid_status = xlrec->subxid_overflow ? SUBXIDS_MISSING : SUBXIDS_IN_ARRAY;
-		running.nextXid = xlrec->nextXid;
-		running.latestCompletedXid = xlrec->latestCompletedXid;
-		running.oldestRunningXid = xlrec->oldestRunningXid;
-		running.xids = xlrec->xids;
-
-		ProcArrayApplyRecoveryInfo(&running);
+		/*
+		 * Remember the oldest XID that was running at the time. Normally, all
+		 * transaction aborts and commits are WAL-logged, so our
+		 * oldestRunningXid value should be up-to-date, but if not, this
+		 * allows us to resynchronize.
+		 */
+		ProcArrayUpdateOldestRunningXid(xlrec->oldestRunningXid);
 
 		/*
 		 * The startup process currently has no convenient way to schedule
@@ -1224,50 +1222,46 @@ standby_redo(XLogReaderState *record)
  *
  * This is used for Hot Standby as follows:
  *
- * We can move directly to STANDBY_SNAPSHOT_READY at startup if we
- * start from a shutdown checkpoint because we know nothing was running
- * at that time and our recovery snapshot is known empty. In the more
- * typical case of an online checkpoint we need to jump through a few
- * hoops to get a correct recovery snapshot and this requires a two or
- * sometimes a three stage process.
+ * We can enter hot standby mode and start accepting read-only queries
+ * immediately at startup if we start from a shutdown checkpoint, because we
+ * know nothing was running at that time and our recovery snapshot is known
+ * empty. In the more typical case of an online checkpoint, the checkpoint
+ * record doesn't contain all the necessary information about running
+ * transaction state, and we need to jump through a few hoops to get a correct
+ * recovery snapshot.
  *
- * The initial snapshot must contain all running xids and all current
- * AccessExclusiveLocks at a point in time on the standby. Assembling
- * that information while the server is running requires many and
- * various LWLocks, so we choose to derive that information piece by
- * piece and then re-assemble that info on the standby. When that
- * information is fully assembled we move to STANDBY_SNAPSHOT_READY.
+ * The initial snapshot must contain all current AccessExclusiveLocks at a
+ * point in time on the standby. Assembling that information while the server
+ * is running requires many and various LWLocks, so we choose to derive that
+ * information piece by piece and then re-assemble that info on the standby.
  *
- * Since locking on the primary when we derive the information is not
- * strict, we note that there is a time window between the derivation and
- * writing to WAL of the derived information. That allows race conditions
- * that we must resolve, since xids and locks may enter or leave the
- * snapshot during that window. This creates the issue that an xid or
- * lock may start *after* the snapshot has been derived yet *before* the
- * snapshot is logged in the running xacts WAL record. We resolve this by
- * starting to accumulate changes at a point just prior to when we derive
- * the snapshot on the primary, then ignore duplicates when we later apply
- * the snapshot from the running xacts record. This is implemented during
- * CreateCheckPoint() where we use the logical checkpoint location as
- * our starting point and then write the running xacts record immediately
- * before writing the main checkpoint WAL record. Since we always start
- * up from a checkpoint and are immediately at our starting point, we
- * unconditionally move to STANDBY_INITIALIZED. After this point we
- * must do 4 things:
+ * Since locking on the primary when we derive the information is not strict,
+ * there is a time window between the derivation and writing to WAL of the
+ * derived information. That allows race conditions that we must resolve,
+ * since xids and locks may enter or leave the snapshot during that
+ * window. This creates the issue that an xid or lock may start *after* the
+ * snapshot has been derived yet *before* the snapshot is logged in the
+ * running xacts WAL record. We resolve this by starting to accumulate changes
+ * at a point just prior to when we collect the lock information on the
+ * primary, then ignore duplicates when we later apply the snapshot from the
+ * running xacts record. This is implemented during CreateCheckPoint() where
+ * we use the logical checkpoint location as our starting point and then write
+ * the running xacts record immediately before writing the main checkpoint WAL
+ * record. Since we always start up from a checkpoint's redo pointer, we will
+ * always see a running-xacts record before reaching the checkpoint
+ * record, and can immediately enter hot standby mode. After this point we
+ * must do 3 things:
  *	* move shared nextXid forwards as we see new xids
  *	* extend the clog and subtrans with each new xid
- *	* keep track of uncommitted known assigned xids
  *	* keep track of uncommitted AccessExclusiveLocks
  *
- * When we see a commit/abort we must remove known assigned xids and locks
- * from the completing transaction. Attempted removals that cannot locate
- * an entry are expected and must not cause an error when we are in state
- * STANDBY_INITIALIZED. This is implemented in StandbyReleaseLocks() and
- * KnownAssignedXidsRemove().
- *
- * Later, when we apply the running xact data we must be careful to ignore
- * transactions already committed, since those commits raced ahead when
- * making WAL entries.
+ * When we see a commit/abort we must advance oldest_running_primary_xid and
+ * remove locks from the completing transaction. Attempted removals that
+ * cannot locate an entry are expected and must not cause an error until we
+ * have seen the running-xacts record. (We don't throw an error even after
+ * that, because whatever the reason was, after the transaction has completed
+ * the issue has already been resolved anyway.) This is implemented in
+ * StandbyReleaseLocks().
  *
  * For logical decoding only the running xacts information is needed;
  * there's no need to look at the locking information, but it's logged anyway,
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 3df29658f18..aadec36dc15 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -140,6 +140,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_XACT_BUFFER] = "XactBuffer",
 	[LWTRANCHE_COMMITTS_BUFFER] = "CommitTsBuffer",
 	[LWTRANCHE_SUBTRANS_BUFFER] = "SubtransBuffer",
+	[LWTRANCHE_CSN_LOG_BUFFER] = "CsnLogBuffer",
 	[LWTRANCHE_MULTIXACTOFFSET_BUFFER] = "MultiXactOffsetBuffer",
 	[LWTRANCHE_MULTIXACTMEMBER_BUFFER] = "MultiXactMemberBuffer",
 	[LWTRANCHE_NOTIFY_BUFFER] = "NotifyBuffer",
@@ -178,6 +179,7 @@ static const char *const BuiltinTrancheNames[] = {
 	[LWTRANCHE_XACT_SLRU] = "XactSLRU",
 	[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
 	[LWTRANCHE_AIO_URING_COMPLETION] = "AioUringCompletion",
+	[LWTRANCHE_CSN_LOG_SLRU] = "CsnLogSLRU",
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4f44648aca8..95e248b2c88 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -363,6 +363,7 @@ AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
 XactBuffer	"Waiting for I/O on a transaction status SLRU buffer."
 CommitTsBuffer	"Waiting for I/O on a commit timestamp SLRU buffer."
 SubtransBuffer	"Waiting for I/O on a sub-transaction SLRU buffer."
+CsnLogBuffer	"Waiting for I/O on a CSN log SLRU buffer."
 MultiXactOffsetBuffer	"Waiting for I/O on a multixact offset SLRU buffer."
 MultiXactMemberBuffer	"Waiting for I/O on a multixact member SLRU buffer."
 NotifyBuffer	"Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e9e413477ba..d8ff9cfdb36 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -77,6 +77,8 @@ provider postgresql {
 	probe clog__checkpoint__done(bool);
 	probe subtrans__checkpoint__start(bool);
 	probe subtrans__checkpoint__done(bool);
+	probe csnlog__checkpoint__start(bool);
+	probe csnlog__checkpoint__done(bool);
 	probe multixact__checkpoint__start(bool);
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5f9f2b9d8b2..049c706f2cf 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -97,6 +97,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -1888,36 +1889,11 @@ XidInMVCCSnapshot(TransactionId xid, MVCCSnapshotShared snapshot)
 	}
 	else
 	{
-		/*
-		 * In recovery we store all xids in the subxip array because it is by
-		 * far the bigger array, and we mostly don't know which xids are
-		 * top-level and which are subxacts. The xip array is empty.
-		 *
-		 * We start by searching subtrans, if we overflowed.
-		 */
-		if (snapshot->suboverflowed)
-		{
-			/*
-			 * Snapshot overflowed, so convert xid to top-level.  This is safe
-			 * because we eliminated too-old XIDs above.
-			 */
-			xid = SubTransGetTopmostTransaction(xid);
-
-			/*
-			 * If xid was indeed a subxact, we might now have an xid < xmin,
-			 * so recheck to avoid an array scan.  No point in rechecking
-			 * xmax.
-			 */
-			if (TransactionIdPrecedes(xid, snapshot->xmin))
-				return false;
-		}
+		XLogRecPtr	csn = CSNLogGetCSNByXid(xid);
 
-		/*
-		 * We now have either a top-level xid higher than xmin or an
-		 * indeterminate xid. We don't know whether it's top level or subxact
-		 * but it doesn't matter. If it's present, the xid is visible.
-		 */
-		if (pg_lfind32(xid, snapshot->subxip, snapshot->subxcnt))
+		if (csn != InvalidXLogRecPtr && csn <= snapshot->snapshotCsn)
+			return false;
+		else
 			return true;
 	}
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index c17fda2bc81..f52817e218f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -251,7 +251,8 @@ static const char *const subdirs[] = {
 	"pg_xact",
 	"pg_logical",
 	"pg_logical/snapshots",
-	"pg_logical/mappings"
+	"pg_logical/mappings",
+	"pg_csn"
 };
 
 
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index a28d1667d4c..64fdd139173 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -146,6 +146,9 @@ static const char *const excludeDirContents[] =
 	/* Contents zeroed on startup, see StartupSUBTRANS(). */
 	"pg_subtrans",
 
+	/* Contents zeroed on startup, see StartupCSNLog(). */
+	"pg_csn",
+
 	/* end of list */
 	NULL
 };
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
new file mode 100644
index 00000000000..f8cdf573aef
--- /dev/null
+++ b/src/include/access/csn_log.h
@@ -0,0 +1,30 @@
+/*
+ * csn_log.h
+ *
+ * Mapping from XID to commit record's LSN (Commit Sequence Number).
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_log.h
+ */
+#ifndef CSNLOG_H
+#define CSNLOG_H
+
+#include "access/xlog.h"
+#include "utils/snapshot.h"
+
+extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
+						 TransactionId *subxids, XLogRecPtr csn);
+extern XLogRecPtr CSNLogGetCSNByXid(TransactionId xid);
+
+extern Size CSNLogShmemSize(void);
+extern void CSNLogShmemInit(void);
+extern void BootStrapCSNLog(void);
+extern void StartupCSNLog(TransactionId oldestActiveXID, XLogRecPtr csn);
+extern void ShutdownCSNLog(void);
+extern void CheckPointCSNLog(void);
+extern void ExtendCSNLog(TransactionId newestXact);
+extern void TruncateCSNLog(TransactionId oldestXact);
+
+#endif							/* CSNLOG_H */
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index e71c660118e..76411cca178 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -238,6 +238,9 @@ typedef struct TransamVariablesData
 	FullTransactionId latestCompletedXid;	/* newest full XID that has
 											 * committed or aborted */
 
+	/* During recovery, LSN of latest replayed commit record */
+	XLogRecPtr	latestCommitLSN;
+
 	/*
 	 * Number of top-level transactions with xids (i.e. which may have
 	 * modified the database) that completed in some form since the start of
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 9fa82355033..9527695886f 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -47,8 +47,7 @@ extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
 
-extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
-												 int *nxids_p);
+extern TransactionId PrescanPreparedTransactions(void);
 extern void StandbyRecoverPreparedTransactions(void);
 extern void RecoverPreparedTransactions(void);
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index b2bc10ee041..b31944d0e6c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -171,7 +171,7 @@ typedef struct SavedTransactionCharacteristics
 #define XLOG_XACT_ABORT				0x20
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
-#define XLOG_XACT_ASSIGNMENT		0x50
+/* 0x50 is unused, was XLOG_XACT_ASSIGNMENT */
 #define XLOG_XACT_INVALIDATIONS		0x60
 /* free opcode 0x70 */
 
@@ -215,15 +215,6 @@ typedef struct SavedTransactionCharacteristics
 #define XactCompletionForceSyncCommit(xinfo) \
 	((xinfo & XACT_COMPLETION_FORCE_SYNC_COMMIT) != 0)
 
-typedef struct xl_xact_assignment
-{
-	TransactionId xtop;			/* assigned XID's top-level XID */
-	int			nsubxacts;		/* number of subtransaction XIDs */
-	TransactionId xsub[FLEXIBLE_ARRAY_MEMBER];	/* assigned subxids */
-} xl_xact_assignment;
-
-#define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)
-
 /*
  * Commit and abort records can contain a lot of information. But a large
  * portion of the records won't need all possible pieces of information. So we
@@ -448,7 +439,6 @@ extern FullTransactionId GetTopFullTransactionId(void);
 extern FullTransactionId GetTopFullTransactionIdIfAny(void);
 extern FullTransactionId GetCurrentFullTransactionId(void);
 extern FullTransactionId GetCurrentFullTransactionIdIfAny(void);
-extern void MarkCurrentTransactionIdLoggedIfAny(void);
 extern bool SubTransactionIsActive(SubTransactionId subxid);
 extern CommandId GetCurrentCommandId(bool used);
 extern void SetParallelStartTimestamps(TimestampTz xact_ts, TimestampTz stmt_ts);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index a1870d8e5aa..2ab20fcae2f 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -27,37 +27,10 @@ extern PGDLLIMPORT bool ignore_invalid_pages;
 extern PGDLLIMPORT bool InRecovery;
 
 /*
- * Like InRecovery, standbyState is only valid in the startup process.
- * In all other processes it will have the value STANDBY_DISABLED (so
- * InHotStandby will read as false).
- *
- * In DISABLED state, we're performing crash recovery or hot standby was
- * disabled in postgresql.conf.
- *
- * In INITIALIZED state, we've run InitRecoveryTransactionEnvironment, but
- * we haven't yet processed a RUNNING_XACTS or shutdown-checkpoint WAL record
- * to initialize our primary-transaction tracking system.
- *
- * When the transaction tracking is initialized, we enter the SNAPSHOT_PENDING
- * state. The tracked information might still be incomplete, so we can't allow
- * connections yet, but redo functions must update the in-memory state when
- * appropriate.
- *
- * In SNAPSHOT_READY mode, we have full knowledge of transactions that are
- * (or were) running on the primary at the current WAL location. Snapshots
- * can be taken, and read-only queries can be run.
+ * Like InRecovery, InHotStandby is only valid in the startup process.
+ * In all other processes it will be false.
  */
-typedef enum
-{
-	STANDBY_DISABLED,
-	STANDBY_INITIALIZED,
-	STANDBY_SNAPSHOT_PENDING,
-	STANDBY_SNAPSHOT_READY,
-} HotStandbyState;
-
-extern PGDLLIMPORT HotStandbyState standbyState;
-
-#define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)
+extern PGDLLIMPORT bool InHotStandby;
 
 
 extern bool XLogHaveInvalidPages(void);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 4df1d25c045..457c5511c5e 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -181,6 +181,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_XACT_BUFFER = NUM_INDIVIDUAL_LWLOCKS,
 	LWTRANCHE_COMMITTS_BUFFER,
 	LWTRANCHE_SUBTRANS_BUFFER,
+	LWTRANCHE_CSN_LOG_BUFFER,
 	LWTRANCHE_MULTIXACTOFFSET_BUFFER,
 	LWTRANCHE_MULTIXACTMEMBER_BUFFER,
 	LWTRANCHE_NOTIFY_BUFFER,
@@ -219,6 +220,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_XACT_SLRU,
 	LWTRANCHE_PARALLEL_VACUUM_DSA,
 	LWTRANCHE_AIO_URING_COMPLETION,
+	LWTRANCHE_CSN_LOG_SLRU,
 	LWTRANCHE_FIRST_USER_DEFINED,
 }			BuiltinTrancheIds;
 
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 8eedc2d6b9f..57071d1e0f4 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -28,18 +28,11 @@ extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayClearTransaction(PGPROC *proc);
 
+extern void ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID);
 extern void ProcArrayInitRecovery(TransactionId initializedUptoXID);
-extern void ProcArrayApplyRecoveryInfo(RunningTransactions running);
-extern void ProcArrayApplyXidAssignment(TransactionId topxid,
-										int nsubxids, TransactionId *subxids);
 
 extern void RecordKnownAssignedTransactionIds(TransactionId xid);
-extern void ExpireTreeKnownAssignedTransactionIds(TransactionId xid,
-												  int nsubxids, TransactionId *subxids,
-												  TransactionId max_xid);
-extern void ExpireAllKnownAssignedTransactionIds(void);
-extern void ExpireOldKnownAssignedTransactionIds(TransactionId xid);
-extern void KnownAssignedTransactionIdsIdleMaintenance(void);
+extern void ProcArrayRecoveryEndTransaction(TransactionId max_xid, XLogRecPtr lsn);
 
 extern int	GetMaxSnapshotXidCount(void);
 extern int	GetMaxSnapshotSubxidCount(void);
@@ -56,7 +49,7 @@ extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
 extern TransactionId GetOldestNonRemovableTransactionId(Relation rel);
 extern TransactionId GetOldestTransactionIdConsideredRunning(void);
-extern TransactionId GetOldestActiveTransactionId(void);
+extern TransactionId GetOldestActiveTransactionId(bool allDbs);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *xmin, TransactionId *catalog_xmin);
 
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 193366ce052..14ff80904c8 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -13,6 +13,7 @@
 #ifndef SNAPSHOT_H
 #define SNAPSHOT_H
 
+#include "access/xlogdefs.h"
 #include "lib/ilist.h"
 
 
@@ -186,6 +187,13 @@ typedef struct MVCCSnapshotSharedData
 	int32		subxcnt;		/* # of xact ids in subxip[] */
 	bool		suboverflowed;	/* has the subxip array overflowed? */
 
+	/*
+	 * MVCC snapshots taken during recovery use this CSN instead of the xip
+	 * and subxip arrays. Any transactions that committed at or before this
+	 * LSN are considered as visible.
+	 */
+	XLogRecPtr	snapshotCsn;
+
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 
 	/*
-- 
2.39.5
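
For anyone reading the snapmgr.c hunk in the patch above, the recovery-time visibility test can be tried out in isolation. What follows is only a minimal standalone sketch, not code from the patch: the `toy_` names and the fixed-size in-memory array are hypothetical stand-ins for the SLRU-backed CSNLogGetCSNByXid(), and only the comparison against the snapshot's CSN mirrors the hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;		/* stand-in for TransactionId */
typedef uint64_t ToyLsn;		/* stand-in for XLogRecPtr */

#define TOY_INVALID_LSN ((ToyLsn) 0)	/* plays the role of InvalidXLogRecPtr */
#define TOY_NXIDS 16

/* Hypothetical CSN log: maps xid to the LSN of its commit record, 0 if none. */
static ToyLsn toy_csn_log[TOY_NXIDS];

/* Record the commit LSN for an xid, as replay of a commit record would. */
static void
toy_set_csn(ToyXid xid, ToyLsn commit_lsn)
{
	assert(xid < TOY_NXIDS);
	toy_csn_log[xid] = commit_lsn;
}

/*
 * Same shape as the new branch in XidInMVCCSnapshot(): returns true if the
 * xid must still be treated as in progress by a snapshot taken at
 * snapshot_csn, false if it committed at or before that point.
 */
static bool
toy_xid_in_snapshot(ToyXid xid, ToyLsn snapshot_csn)
{
	ToyLsn		csn;

	assert(xid < TOY_NXIDS);
	csn = toy_csn_log[xid];

	if (csn != TOY_INVALID_LSN && csn <= snapshot_csn)
		return false;
	return true;
}
```

With commit LSNs 100 and 300 recorded for XIDs 3 and 5, a snapshot taken at CSN 200 treats XID 3 as visible, while XID 5 and an XID with no CSN yet (say 7) are still treated as in progress.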

v7-0010-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch (application/octet-stream)

From 2565b8554e321e8ca9a87f36a48f9ab7f86ab247 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 20:01:07 +0300
Subject: [PATCH v6 10/12] Make SnapBuildWaitSnapshot work without
 xl_running_xacts.xids array

SnapBuildWaitSnapshot looped through all the XIDs in the
xl_running_xacts, waiting for them to finish. Change it to grab the
list of running XIDs from the proc array instead. This removes the
last usage of the XIDs array in the xl_running_xacts record, allowing
it to be removed in the next commit.

When SnapBuildWaitSnapshot() is called with running->nextXid as the
'cutoff' point, the new code should wait for exactly the same set of
transactions as before. But when called with initial_xmin_horizon as
the 'cutoff', this might wait for more transactions than before: those
between running->nextXid and initial_xmin_horizon. For example,
imagine that we see a running-xacts record with nextXid 100, and
initial_xmin_horizon is 200. Before, we would wait for all XIDs < 100
to complete, and then log the standby snapshot and proceed, but now we
will wait for all XIDs < 200. I believe that's a good thing, because
we won't actually be able to move to the next state in the snapshot
building until all transactions < 200 have completed. The
running-xacts snapshot that we logged after waiting up to XID 100
would not be useful to us either, if there are still XIDs between 100
and 200 running.

SnapBuildWaitSnapshot() used to do useless work when called in a
standby, because in a standby, there are no XID locks and the
XactLockTableWait() calls returned immediately, even if the XIDs were
in fact still running in the primary. But as the comment says, the
waiting isn't necessary for correctness, so that was harmless. In any
case, stop doing the futile work on a standby.
---
 src/backend/replication/logical/snapbuild.c | 50 ++++++++++++++-------
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 97d278052df..252526ecf91 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -164,7 +164,7 @@ static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, Transaction
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
 static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
-static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
+static void SnapBuildWaitSnapshot(TransactionId cutoff);
 
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
@@ -1192,14 +1192,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		NormalTransactionIdPrecedes(running->oldestRunningXid,
 									builder->initial_xmin_horizon))
 	{
+		TransactionId cutoff;
+
 		ereport(DEBUG1,
 				(errmsg_internal("skipping snapshot at %X/%X while building logical decoding snapshot, xmin horizon too low",
 								 LSN_FORMAT_ARGS(lsn)),
 				 errdetail_internal("initial xmin horizon of %u vs the snapshot's %u",
 									builder->initial_xmin_horizon, running->oldestRunningXid)));
 
-
-		SnapBuildWaitSnapshot(running, builder->initial_xmin_horizon);
+		cutoff = builder->initial_xmin_horizon;
+		TransactionIdRetreat(cutoff);
+		SnapBuildWaitSnapshot(cutoff);
 
 		return true;
 	}
@@ -1286,7 +1289,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
 						   running->xcnt, running->nextXid)));
 
-		SnapBuildWaitSnapshot(running, running->nextXid);
+		SnapBuildWaitSnapshot(running->nextXid);
 	}
 
 	/*
@@ -1310,7 +1313,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
 						   running->xcnt, running->nextXid)));
 
-		SnapBuildWaitSnapshot(running, running->nextXid);
+		SnapBuildWaitSnapshot(running->nextXid);
 	}
 
 	/*
@@ -1343,8 +1346,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 }
 
 /* ---
- * Iterate through xids in record, wait for all older than the cutoff to
- * finish.  Then, if possible, log a new xl_running_xacts record.
+ * Wait for all transactions older than or equal to the cutoff to finish.
+ * Then, if possible, log a new xl_running_xacts record.
  *
  * This isn't required for the correctness of decoding, but to:
  * a) allow isolationtester to notice that we're currently waiting for
@@ -1354,13 +1357,31 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
  * ---
  */
 static void
-SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
+SnapBuildWaitSnapshot(TransactionId cutoff)
 {
-	int			off;
+	RunningTransactions running;
+
+	if (RecoveryInProgress())
+	{
+		/*
+		 * During recovery, we have no mechanism for waiting for an XID to
+		 * finish, and we cannot create new running-xacts records either.
+		 */
+		return;
+	}
+
+	running = GetRunningTransactionData();
+
+	/*
+	 * GetRunningTransactionData returns with XidGenLock and ProcArrayLock
+	 * held, but we don't need them.
+	 */
+	LWLockRelease(XidGenLock);
+	LWLockRelease(ProcArrayLock);
 
-	for (off = 0; off < running->xcnt; off++)
+	for (int i = 0; i < running->xcnt; i++)
 	{
-		TransactionId xid = running->xids[off];
+		TransactionId xid = running->xids[i];
 
 		/*
 		 * Upper layers should prevent that we ever need to wait on ourselves.
@@ -1370,7 +1391,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
 		if (TransactionIdIsCurrentTransactionId(xid))
 			elog(ERROR, "waiting for ourselves");
 
-		if (TransactionIdFollows(xid, cutoff))
+		if (TransactionIdFollowsOrEquals(xid, cutoff))
 			continue;
 
 		XactLockTableWait(xid, NULL, NULL, XLTW_None);
@@ -1382,10 +1403,7 @@ SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff)
 	 * wait for bgwriter or checkpointer to log one.  During recovery we can't
 	 * enforce that, so we'll have to wait.
 	 */
-	if (!RecoveryInProgress())
-	{
-		LogStandbySnapshot();
-	}
+	LogStandbySnapshot();
 }
 
 #define SnapBuildOnDiskConstantSize \
-- 
2.39.5
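
To make the cutoff filtering in the SnapBuildWaitSnapshot() loop above easier to reason about, here is a small standalone sketch. It is not part of the patch: the `toy_` names are hypothetical, and `toy_xid_follows_or_equals()` is a simplified rendering of the modulo-2^32 comparison that TransactionIdFollowsOrEquals() performs. It only shows which XIDs the loop, as written in the hunk, would select for waiting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;		/* stand-in for TransactionId */

#define TOY_FIRST_NORMAL_XID ((ToyXid) 3)	/* XIDs below this are special */

/* Wraparound-aware "a >= b" for normal XIDs; plain compare for special ones. */
static bool
toy_xid_follows_or_equals(ToyXid a, ToyXid b)
{
	if (a < TOY_FIRST_NORMAL_XID || b < TOY_FIRST_NORMAL_XID)
		return a >= b;
	return (int32_t) (a - b) >= 0;
}

/*
 * Copy into 'out' the running XIDs that the loop would wait for, i.e. those
 * that do not follow or equal the cutoff. Returns how many were selected.
 */
static size_t
toy_select_waits(const ToyXid *running, size_t n, ToyXid cutoff, ToyXid *out)
{
	size_t		nout = 0;

	assert(running != NULL && out != NULL);

	for (size_t i = 0; i < n; i++)
	{
		if (toy_xid_follows_or_equals(running[i], cutoff))
			continue;			/* at or past the cutoff: skipped */
		out[nout++] = running[i];
	}
	return nout;
}
```

For running XIDs {95, 100, 150, 199, 200, 250} and a cutoff of 200, this selects 95, 100, 150 and 199. The comparison also behaves sensibly across wraparound: XID 5 is considered to follow 4294967290.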

v7-0011-Remove-the-now-unused-xids-array-from-xl_running_.patch (application/octet-stream)

From 51212a4f053edb5e4ceef65e3ce5e722fbc3844b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 13 Aug 2024 16:40:57 +0300
Subject: [PATCH v6 11/12] Remove the now-unused xids array from
 xl_running_xacts

We still generate running-xacts records, because they are still needed
to initialize the snapshot in logical decoding.
---
 src/backend/access/rmgrdesc/standbydesc.c   | 18 ------------
 src/backend/replication/logical/snapbuild.c |  8 +++---
 src/backend/storage/ipc/standby.c           | 32 +++++----------------
 src/include/storage/standby.h               |  2 --
 src/include/storage/standbydefs.h           | 16 +++++++----
 5 files changed, 21 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index 81eff5f31c4..5e6812396de 100644
--- a/src/backend/access/rmgrdesc/standbydesc.c
+++ b/src/backend/access/rmgrdesc/standbydesc.c
@@ -19,28 +19,10 @@
 static void
 standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
 {
-	int			i;
-
 	appendStringInfo(buf, "nextXid %u latestCompletedXid %u oldestRunningXid %u",
 					 xlrec->nextXid,
 					 xlrec->latestCompletedXid,
 					 xlrec->oldestRunningXid);
-	if (xlrec->xcnt > 0)
-	{
-		appendStringInfo(buf, "; %d xacts:", xlrec->xcnt);
-		for (i = 0; i < xlrec->xcnt; i++)
-			appendStringInfo(buf, " %u", xlrec->xids[i]);
-	}
-
-	if (xlrec->subxid_overflow)
-		appendStringInfoString(buf, "; subxid overflowed");
-
-	if (xlrec->subxcnt > 0)
-	{
-		appendStringInfo(buf, "; %d subxacts:", xlrec->subxcnt);
-		for (i = 0; i < xlrec->subxcnt; i++)
-			appendStringInfo(buf, " %u", xlrec->xids[xlrec->xcnt + i]);
-	}
 }
 
 void
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 252526ecf91..eada641d2a4 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1286,8 +1286,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		ereport(LOG,
 				(errmsg("logical decoding found initial starting point at %X/%X",
 						LSN_FORMAT_ARGS(lsn)),
-				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
-						   running->xcnt, running->nextXid)));
+				 errdetail("Waiting for transactions older than %u to end.",
+						   running->nextXid)));
 
 		SnapBuildWaitSnapshot(running->nextXid);
 	}
@@ -1310,8 +1310,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		ereport(LOG,
 				(errmsg("logical decoding found initial consistent point at %X/%X",
 						LSN_FORMAT_ARGS(lsn)),
-				 errdetail("Waiting for transactions (approximately %d) older than %u to end.",
-						   running->xcnt, running->nextXid)));
+				 errdetail("Waiting for transactions older than %u to end.",
+						   running->nextXid)));
 
 		SnapBuildWaitSnapshot(running->nextXid);
 	}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 217b1670f5b..0f8a9aa0fea 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1337,9 +1337,6 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	xl_running_xacts xlrec;
 	XLogRecPtr	recptr;
 
-	xlrec.xcnt = CurrRunningXacts->xcnt;
-	xlrec.subxcnt = CurrRunningXacts->subxcnt;
-	xlrec.subxid_overflow = (CurrRunningXacts->subxid_status != SUBXIDS_IN_ARRAY);
 	xlrec.nextXid = CurrRunningXacts->nextXid;
 	xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid;
 	xlrec.latestCompletedXid = CurrRunningXacts->latestCompletedXid;
@@ -1347,31 +1344,16 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 	/* Header */
 	XLogBeginInsert();
 	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
-	XLogRegisterData(&xlrec, MinSizeOfXactRunningXacts);
-
-	/* array of TransactionIds */
-	if (xlrec.xcnt > 0)
-		XLogRegisterData(CurrRunningXacts->xids,
-						 (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
+	XLogRegisterData(&xlrec, SizeOfXactRunningXacts);
 
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
-	if (xlrec.subxid_overflow)
-		elog(DEBUG2,
-			 "snapshot of %d running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
-			 CurrRunningXacts->xcnt,
-			 LSN_FORMAT_ARGS(recptr),
-			 CurrRunningXacts->oldestRunningXid,
-			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
-	else
-		elog(DEBUG2,
-			 "snapshot of %d+%d running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
-			 CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
-			 LSN_FORMAT_ARGS(recptr),
-			 CurrRunningXacts->oldestRunningXid,
-			 CurrRunningXacts->latestCompletedXid,
-			 CurrRunningXacts->nextXid);
+	elog(DEBUG2,
+		 "logging running transaction bounds (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
+		 LSN_FORMAT_ARGS(recptr),
+		 CurrRunningXacts->oldestRunningXid,
+		 CurrRunningXacts->latestCompletedXid,
+		 CurrRunningXacts->nextXid);
 
 	/*
 	 * Ensure running_xacts information is synced to disk not too far in the
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 24e2f5082bc..d73a8f58a73 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -60,8 +60,6 @@ extern void StandbyReleaseLockTree(TransactionId xid,
 extern void StandbyReleaseAllLocks(void);
 extern void StandbyReleaseOldLocks(TransactionId oldxid);
 
-#define MinSizeOfXactRunningXacts offsetof(xl_running_xacts, xids)
-
 
 /*
  * Declarations for GetRunningTransactionData(). Similar to Snapshots, but
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index 71e5ae878b5..3d182b66e74 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -42,20 +42,24 @@ typedef struct xl_standby_locks
 } xl_standby_locks;
 
 /*
- * When we write running xact data to WAL, we use this structure.
+ * Data included in an XLOG_RUNNING_XACTS record.
+ *
+ * This used to include a list of running XIDs, hence the name, but nowadays
+ * this only contains the min and max bounds of the transactions that were
+ * running when the record was written.  They are needed to initialize logical
+ * decoding.  They are also used in hot standby to prune information about old
+ * running transactions, in case the primary didn't write a COMMIT/ABORT
+ * record for some reason.
  */
 typedef struct xl_running_xacts
 {
-	int			xcnt;			/* # of xact ids in xids[] */
-	int			subxcnt;		/* # of subxact ids in xids[] */
-	bool		subxid_overflow;	/* snapshot overflowed, subxids missing */
 	TransactionId nextXid;		/* xid from TransamVariables->nextXid */
 	TransactionId oldestRunningXid; /* *not* oldestXmin */
 	TransactionId latestCompletedXid;	/* so we can set xmax */
-
-	TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
 } xl_running_xacts;
 
+#define SizeOfXactRunningXacts sizeof(xl_running_xacts)
+
 /*
  * Invalidations for standby, currently only when transactions without an
  * assigned xid commit.
-- 
2.39.5

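For readers skimming the patch above, here is a minimal standalone sketch of what the record-size change amounts to. It is not code from the patch: the struct names `running_xacts_old` and `running_xacts_bounds` and the helper `old_record_size` are made up for illustration, and `TransactionId` is a local stand-in typedef. The old record had a fixed header followed by a variable-length XID array; the new record is just the three bounds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

/* Illustration of the old layout: header plus a variable-length XID array. */
typedef struct running_xacts_old
{
	int			xcnt;
	int			subxcnt;
	bool		subxid_overflow;
	TransactionId nextXid;
	TransactionId oldestRunningXid;
	TransactionId latestCompletedXid;
	TransactionId xids[];		/* xcnt + subxcnt entries */
} running_xacts_old;

/* Illustration of the new layout: only the bounds remain. */
typedef struct running_xacts_bounds
{
	TransactionId nextXid;
	TransactionId oldestRunningXid;
	TransactionId latestCompletedXid;
} running_xacts_bounds;

#define SizeOfRunningXactsBounds sizeof(running_xacts_bounds)

/* Hypothetical helper: payload size of an old-style record. */
static size_t
old_record_size(int xcnt, int subxcnt)
{
	return offsetof(running_xacts_old, xids) +
		(size_t) (xcnt + subxcnt) * sizeof(TransactionId);
}
```

With this layout the old payload grows by four bytes per running XID or sub-XID, while the new payload stays at three XIDs no matter how many transactions are running.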
v7-0012-Add-a-cache-to-Snapshot-to-avoid-repeated-CSN-loo.patch (application/octet-stream)

From 6b8e856c15750f89f9d559ae9f9fbd7f3f2db125 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 1 Apr 2025 00:18:14 +0300
Subject: [PATCH v6 12/12] Add a cache to Snapshot to avoid repeated CSN
 lookups

Cache the status of all XIDs that have been looked up in the CSN log
in the SnapshotData. This avoids having to go to the CSN log in the
common case that the same XIDs are looked up over and over again.
---
 src/backend/utils/time/snapmgr.c | 111 +++++++++++++++++++++++++++++--
 src/include/utils/snapshot.h     |   9 +++
 2 files changed, 116 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 049c706f2cf..250ba1650e4 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -114,6 +114,35 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Define a radix tree implementation to cache CSN lookups in a snapshot.
+ *
+ * We need only one bit of information for each XID stored in the cache: was
+ * the XID running or not.  However, the radix tree implementation uses 8
+ * bytes for each entry (on 64-bit machines) even if the value type is smaller
+ * than that.  To reduce memory usage, we use uint64 as the value type, and
+ * store multiple XIDs in each value.
+ *
+ * The 64-bit value word holds two bits for each XID: whether the XID is
+ * present in the cache or not, and if it's present, whether it's considered
+ * as in-progress by the snapshot or not.  So each entry in the radix tree
+ * holds the status for 32 XIDs.
+ */
+#define RT_PREFIX inprogress_cache
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define INPROGRESS_CACHE_BITS 2
+#define INPROGRESS_CACHE_XIDS_PER_WORD 32
+
+#define INPROGRESS_CACHE_XID_IS_CACHED(word, slotno) \
+	((((word) & (UINT64CONST(1) << (slotno)))) != 0)
+
+#define INPROGRESS_CACHE_XID_IS_IN_PROGRESS(word, slotno) \
+	((((word) & (UINT64CONST(1) << ((slotno) + 1)))) != 0)
 
 /*
  * CurrentSnapshot points to the only snapshot taken in transaction-snapshot
@@ -240,6 +269,7 @@ typedef struct SerializedSnapshotData
 	int32		subxcnt;
 	bool		suboverflowed;
 	bool		takenDuringRecovery;
+	XLogRecPtr	snapshotCsn;
 	CommandId	curcid;
 } SerializedSnapshotData;
 
@@ -1177,6 +1207,7 @@ ExportSnapshot(MVCCSnapshotShared snapshot)
 			appendStringInfo(&buf, "sxp:%u\n", children[i]);
 	}
 	appendStringInfo(&buf, "rec:%u\n", snapshot->takenDuringRecovery);
+	appendStringInfo(&buf, "snapshotcsn:%X/%X\n", LSN_FORMAT_ARGS(snapshot->snapshotCsn));
 
 	/*
 	 * Now write the text representation into a file.  We first write to a
@@ -1449,6 +1480,7 @@ ImportSnapshot(const char *idstr)
 	}
 
 	snapshot->takenDuringRecovery = parseIntFromText("rec:", &filebuf, path);
+	snapshot->snapshotCsn = parseIntFromText("snapshotcsn:", &filebuf, path);
 
 	snapshot->refcount = 1;
 	valid_snapshots_push_out_of_order(snapshot);
@@ -1702,6 +1734,7 @@ SerializeSnapshot(MVCCSnapshot snapshot, char *start_address)
 	serialized_snapshot.subxcnt = snapshot->shared->subxcnt;
 	serialized_snapshot.suboverflowed = snapshot->shared->suboverflowed;
 	serialized_snapshot.takenDuringRecovery = snapshot->shared->takenDuringRecovery;
+	serialized_snapshot.snapshotCsn = snapshot->shared->snapshotCsn;
 	serialized_snapshot.curcid = snapshot->curcid;
 
 	/*
@@ -1776,6 +1809,9 @@ RestoreSnapshot(char *start_address)
 	snapshot->shared->subxcnt = serialized_snapshot.subxcnt;
 	snapshot->shared->suboverflowed = serialized_snapshot.suboverflowed;
 	snapshot->shared->takenDuringRecovery = serialized_snapshot.takenDuringRecovery;
+	snapshot->shared->snapshotCsn = serialized_snapshot.snapshotCsn;
+	snapshot->shared->inprogress_cache = NULL;
+	snapshot->shared->inprogress_cache_cxt = NULL;
 	snapshot->shared->snapXactCompletionCount = 0;
 
 	snapshot->shared->refcount = 1;
@@ -1889,12 +1925,62 @@ XidInMVCCSnapshot(TransactionId xid, MVCCSnapshotShared snapshot)
 	}
 	else
 	{
-		XLogRecPtr	csn = CSNLogGetCSNByXid(xid);
+		XLogRecPtr	csn;
+		bool		inprogress;
+		uint64	   *cache_entry;
+		uint64		cache_word = 0;
 
-		if (csn != InvalidXLogRecPtr && csn <= snapshot->snapshotCsn)
-			return false;
+		/*
+		 * Calculate the word and bit slot for the XID in the cache. We use an
+		 * offset from xmax as the key instead of the XID directly, because
+		 * the radix tree can compact away leading zeros and is thus more
+		 * efficient with keys closer to 0.
+		 */
+		uint32		cache_idx = snapshot->xmax - xid;
+		uint64		wordno = cache_idx / INPROGRESS_CACHE_XIDS_PER_WORD;
+		uint64		slotno = (cache_idx % INPROGRESS_CACHE_XIDS_PER_WORD) * INPROGRESS_CACHE_BITS;
+
+		if (snapshot->inprogress_cache)
+		{
+			cache_entry = inprogress_cache_find(snapshot->inprogress_cache, wordno);
+			if (cache_entry)
+			{
+				cache_word = *cache_entry;
+				if (INPROGRESS_CACHE_XID_IS_CACHED(cache_word, slotno))
+					return INPROGRESS_CACHE_XID_IS_IN_PROGRESS(cache_word, slotno);
+			}
+		}
 		else
-			return true;
+		{
+			MemoryContext save_cxt;
+
+			save_cxt = MemoryContextSwitchTo(TopMemoryContext);
+
+			if (snapshot->inprogress_cache_cxt == NULL)
+				snapshot->inprogress_cache_cxt =
+					AllocSetContextCreate(TopMemoryContext,
+										  "snapshot inprogress cache context",
+										  ALLOCSET_SMALL_SIZES);
+			snapshot->inprogress_cache = inprogress_cache_create(snapshot->inprogress_cache_cxt);
+			cache_entry = NULL;
+			MemoryContextSwitchTo(save_cxt);
+		}
+
+		/* Not found in cache, look up the CSN */
+		csn = CSNLogGetCSNByXid(xid);
+		inprogress = (csn == InvalidXLogRecPtr || csn > snapshot->snapshotCsn);
+
+		/* Update the cache word, and store it back to the radix tree */
+		cache_word |= UINT64CONST(1) << slotno; /* cached */
+		if (inprogress)
+			cache_word |= UINT64CONST(1) << (slotno + 1);	/* in-progress */
+
+		if (cache_entry)
+			*cache_entry = cache_word;
+		else
+			inprogress_cache_set(snapshot->inprogress_cache, wordno, &cache_word);
+
+		return inprogress;
 	}
 
 	return false;
@@ -1944,6 +2030,9 @@ AllocMVCCSnapshotShared(void)
 
 	shared->snapXactCompletionCount = 0;
 	shared->refcount = 0;
+	shared->snapshotCsn = InvalidXLogRecPtr;
+	shared->inprogress_cache = NULL;
+	shared->inprogress_cache_cxt = NULL;
 
 	MemoryContextSwitchTo(save_cxt);
 
@@ -1972,8 +2061,22 @@ void
 FreeMVCCSnapshotShared(MVCCSnapshotShared shared)
 {
 	Assert(shared->refcount == 0);
+
+	if (shared->inprogress_cache)
+	{
+		inprogress_cache_free(shared->inprogress_cache);
+		shared->inprogress_cache = NULL;
+	}
+	if (shared->inprogress_cache_cxt)
+	{
+		MemoryContextDelete(shared->inprogress_cache_cxt);
+		shared->inprogress_cache_cxt = NULL;
+	}
+
 	if (spareSnapshotShared == NULL)
+	{
 		spareSnapshotShared = shared;
+	}
 	else
 		pfree(shared);
 }
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 14ff80904c8..edf5bf1ba0a 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -129,6 +129,8 @@ typedef enum MVCCSnapshotKind
 	SNAPSHOT_REGISTERED,
 } MVCCSnapshotKind;
 
+struct inprogress_cache_radix_tree; /* private to snapmgr.c */
+
 /*
  * Struct representing a normal MVCC snapshot.
  *
@@ -194,6 +196,13 @@ typedef struct MVCCSnapshotSharedData
 	 */
 	XLogRecPtr	snapshotCsn;
 
+	/*
+	 * Cache of XIDs known to be running or not according to the snapshot.
+	 * Used in snapshots taken during recovery.
+	 */
+	struct inprogress_cache_radix_tree *inprogress_cache;
+	MemoryContext inprogress_cache_cxt;
+
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 
 	/*
-- 
2.39.5
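
The two-bits-per-XID packing described in the patch comment can be tried out in isolation. The following is a minimal sketch, not the patch's code: it copies the constants and the xmax-relative key calculation from the diff, but the `CacheSlot` struct and the `cache_*` function names are hypothetical, and a plain `uint64_t` word stands in for a radix tree entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same layout as in the patch: 2 bits per XID, 32 XIDs per 64-bit word. */
#define CACHE_BITS 2
#define CACHE_XIDS_PER_WORD 32

/* Hypothetical helper type: where an XID's status lives in the cache. */
typedef struct CacheSlot
{
	uint64_t	wordno;			/* key of the word in the radix tree */
	unsigned	slotno;			/* bit offset of the "cached" bit */
} CacheSlot;

static CacheSlot
cache_slot_for(uint32_t xmax, uint32_t xid)
{
	/* Offset from xmax, as in the patch, so that keys stay close to zero. */
	uint32_t	idx = xmax - xid;
	CacheSlot	slot;

	slot.wordno = idx / CACHE_XIDS_PER_WORD;
	slot.slotno = (idx % CACHE_XIDS_PER_WORD) * CACHE_BITS;
	return slot;
}

/* Record a lookup result: set "cached", and "in progress" when applicable. */
static uint64_t
cache_word_store(uint64_t word, unsigned slotno, bool inprogress)
{
	word |= UINT64_C(1) << slotno;
	if (inprogress)
		word |= UINT64_C(1) << (slotno + 1);
	return word;
}

static bool
cache_word_is_cached(uint64_t word, unsigned slotno)
{
	return (word & (UINT64_C(1) << slotno)) != 0;
}

static bool
cache_word_is_in_progress(uint64_t word, unsigned slotno)
{
	return (word & (UINT64_C(1) << (slotno + 1))) != 0;
}
```

For example, with xmax = 1000 the XID 995 has offset 5, so it lands in word 0 at bit offset 10, while XID 900 has offset 100 and lands in word 3 at bit offset 8. The highest bit offset is 62, so the "in progress" bit never goes past bit 63.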

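A side note on the exported snapshot file: the export hunk writes snapshotCsn in the usual `%X/%X` form, so reading that text back as a plain decimal integer would not recover the full 64-bit value. A minimal sketch of a round trip through that text form follows. It is an illustration only; `LsnValue`, `format_lsn` and `parse_lsn` are hypothetical names, not PostgreSQL functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t LsnValue;		/* stand-in for XLogRecPtr */

/* Write an LSN as "high/low" in hex, the same shape LSN_FORMAT_ARGS feeds to %X/%X. */
static int
format_lsn(char *buf, size_t buflen, LsnValue lsn)
{
	return snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
					(uint32_t) (lsn >> 32), (uint32_t) lsn);
}

/* Parse "high/low" hex text back into a 64-bit value; false on malformed input. */
static bool
parse_lsn(const char *text, LsnValue *lsn)
{
	uint32_t	hi;
	uint32_t	lo;

	if (sscanf(text, "%" SCNx32 "/%" SCNx32, &hi, &lo) != 2)
		return false;
	*lsn = ((LsnValue) hi << 32) | lo;
	return true;
}
```

Whether the import side should get a dedicated parsing helper or reuse an existing LSN input routine is a design choice for the patch author; the sketch only shows that both halves need to be read as hex.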
v7-0013-use-clog-in-XidInMVCCSnapshot.patch (application/octet-stream)

From 5cbb8e5806a444df04ed2f54d1c45f4aa1ea8f7b Mon Sep 17 00:00:00 2001
From: Mingwei Jia <wei19860922@163.com>
Date: Sat, 12 Apr 2025 21:10:44 +0800
Subject: [PATCH] If the checkpoint's oldestActiveXid at standby startup is
 less than nextXid, then the visibility of transactions within that range
 should be determined using the CLOG. This is because there may be
 transactions in that range that have already committed on the primary, and
 these committed transactions will not be replayed again on the standby.

---
 src/backend/access/transam/xlog.c             |  2 ++
 src/backend/storage/ipc/procarray.c           |  4 +++
 src/backend/utils/time/snapmgr.c              | 12 ++++++++
 src/include/access/transam.h                  |  2 ++
 src/include/storage/procarray.h               |  2 ++
 .../modules/test_misc/t/008_csnstandby.pl     | 30 +++++++++++++++++++
 6 files changed, 52 insertions(+)
 create mode 100644 src/test/modules/test_misc/t/008_csnstandby.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index deb2cd1883c..2796458491d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5565,6 +5565,8 @@ StartupXLOG(void)
 	TransamVariables->nextXid = checkPoint.nextXid;
 	TransamVariables->nextOid = checkPoint.nextOid;
 	TransamVariables->oidCount = 0;
+	TransamVariables->nextXidStandbyStart =
+					XidFromFullTransactionId(checkPoint.nextXid);
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	AdvanceOldestClogXid(checkPoint.oldestXid);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index c82e8d8c438..d4c67512c56 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -962,6 +962,10 @@ ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID)
 	procArray->oldest_running_primary_xid = oldestRunningXID;
 	LWLockRelease(ProcArrayLock);
 }
+TransactionId GetOldestRunningXid(void)
+{
+	return 	procArray->oldest_running_primary_xid;
+}
 
 
 /*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index df9e8ba37f4..0ebcb0af835 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1969,6 +1969,18 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 		uint32		cache_idx = snapshot->xmax - xid;
 		uint64		wordno = cache_idx / INPROGRESS_CACHE_XIDS_PER_WORD;
 		uint64		slotno = (cache_idx % INPROGRESS_CACHE_XIDS_PER_WORD) * INPROGRESS_CACHE_BITS;
+		TransactionId nextXidStart = TransamVariables->nextXidStandbyStart;
+		TransactionId oldestRunning = GetOldestRunningXid();
+
+		if (TransactionIdPrecedes(oldestRunning, nextXidStart)
+			&& TransactionIdPrecedes(xid, nextXidStart))
+		{
+			if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+			{
+				return false;
+			}
+			return true;
+		}
 
 		if (snapshot->inprogress_cache)
 		{
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index a7054fe11cd..a75d66f4d40 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -240,6 +240,8 @@ typedef struct TransamVariablesData
 
 	/* During recovery, LSN of latest replayed commit record */
 	XLogRecPtr	latestCommitLSN;
+	/* checkpoint's nextXid at the time hot standby started */
+	TransactionId nextXidStandbyStart;
 
 	/*
 	 * Number of top-level transactions with xids (i.e. which may have
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index de74fce24e4..a4129de8101 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -29,6 +29,8 @@ extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayClearTransaction(PGPROC *proc);
 
 extern void ProcArrayUpdateOldestRunningXid(TransactionId oldestRunningXID);
+
+extern TransactionId GetOldestRunningXid(void);
 extern void ProcArrayInitRecovery(TransactionId initializedUptoXID);
 
 extern void RecordKnownAssignedTransactionIds(TransactionId xid);
diff --git a/src/test/modules/test_misc/t/008_csnstandby.pl b/src/test/modules/test_misc/t/008_csnstandby.pl
new file mode 100644
index 00000000000..98708490d91
--- /dev/null
+++ b/src/test/modules/test_misc/t/008_csnstandby.pl
@@ -0,0 +1,30 @@
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init();
+$primary->append_conf('postgresql.conf', 'max_wal_senders = 5');
+$primary->append_conf('postgresql.conf', 'wal_level=replica');
+$primary->start;
+my $count = '';
+$primary->safe_psql('postgres', 'create table t1(i int, j int)');
+
+my $primary_a =  $primary->background_psql('postgres', on_error_die => 1);
+$primary_a->query_safe("begin");
+$primary_a->query_safe("insert into t1 values(1,1)");
+$primary->safe_psql('postgres', 'insert into t1 values(2,1)');
+$primary->backup('bkp');
+
+my $replica = PostgreSQL::Test::Cluster->new('replica');
+$replica->init_from_backup($primary, 'bkp', has_streaming => 1);
+$replica->start;
+
+$count = $replica->safe_psql('postgres', "select count(*) from t1");
+is($count, '1', "correct visibility before primary checkpoint in hot standby");
+
+done_testing();
+
--
2.45.0

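To make the range check in the snapmgr.c hunk of this patch easier to follow, here is a minimal standalone sketch of the decision it makes. It assumes that `xid_precedes` behaves like PostgreSQL's wraparound-aware TransactionIdPrecedes; `LookupSource` and `choose_lookup` are hypothetical names used only for illustration, and the actual CLOG and CSN log lookups are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

#define FirstNormalTransactionId ((TransactionId) 3)

/* Wraparound-aware "a is older than b", modeled on TransactionIdPrecedes. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	/* Special (non-normal) XIDs are compared as plain unsigned values. */
	if (a < FirstNormalTransactionId || b < FirstNormalTransactionId)
		return a < b;
	return (int32_t) (a - b) < 0;
}

/* Hypothetical: which log decides whether an XID is still in progress. */
typedef enum LookupSource
{
	LOOKUP_CLOG,
	LOOKUP_CSN
} LookupSource;

static LookupSource
choose_lookup(TransactionId xid,
			  TransactionId oldest_running,
			  TransactionId next_xid_at_start)
{
	/*
	 * XIDs older than the checkpoint's nextXid may have committed before the
	 * standby started replaying, so no commit record will assign them a CSN.
	 * While such XIDs can still matter, consult the CLOG for them.
	 */
	if (xid_precedes(oldest_running, next_xid_at_start) &&
		xid_precedes(xid, next_xid_at_start))
		return LOOKUP_CLOG;
	return LOOKUP_CSN;
}
```

Once the oldest running XID reported by the primary has advanced to the checkpoint's nextXid or beyond, the first condition is false and every lookup goes to the CSN log again.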
#15

贾明伟

wei19860922@163.com

9 months ago

In reply to: 贾明伟 (#14)

1 attachment(s)

Re: CSN snapshots in hot standby

Hi all,

Apologies: the patch I sent earlier did not appear as expected in the mailing list archives because of the wrong attachment style.

I'll resend it properly as an inline patch shortly.

Thanks for your understanding!

Best regards,
Mingwei Jia

Attachments:

v7-0013-use-clog-in-XidInMVCCSnapshot.patch (application/octet-stream)
