POC and rebased patch for CSN based snapshots
Hello hackers,
I have read the community mail from 'postgrespro' linked below ①.
To summarize that patch: it generates a CSN from a timestamp when a
transaction commits, assigns a special CSN value to an aborted
transaction, and records both in the CSN SLRU file. We can then judge
whether an xid is visible in a snapshot by its CSN value instead of by
xmin, xmax and the xip array, so a snapshot held as a CSN can be
exported and imported.
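
To make the summary above concrete, here is a minimal standalone sketch of
the two ideas: a timestamp-based CSN that is forced to increase, and a
visibility test that compares an xid's CSN with the snapshot's CSN. It is
not the patch's code. The special CSN values and all sketch_* names are
hypothetical, it uses C11 timespec_get() rather than the patch's
INSTR_TIME macros, and it is single-threaded, whereas the patch protects
last_max_csn with a spinlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t csn_t;

/* Hypothetical special values; the patch defines its own constants. */
#define SKETCH_CSN_IN_PROGRESS  ((csn_t) 0)
#define SKETCH_CSN_ABORTED      ((csn_t) 1)
#define SKETCH_CSN_FROZEN       ((csn_t) 2)
#define SKETCH_CSN_FIRST_NORMAL ((csn_t) 3)

/* Last CSN handed out; starting here keeps generated CSNs "normal". */
static csn_t last_max_csn = SKETCH_CSN_FIRST_NORMAL;

/* Wall-clock nanoseconds, bumped if the clock did not move forward. */
csn_t
sketch_generate_csn(void)
{
    struct timespec ts;
    csn_t       csn = 0;

    if (timespec_get(&ts, TIME_UTC) == TIME_UTC)
        csn = (csn_t) ts.tv_sec * 1000000000ULL + (csn_t) ts.tv_nsec;

    if (csn <= last_max_csn)
        csn = ++last_max_csn;
    else
        last_max_csn = csn;
    return csn;
}

/* True if a transaction with xid_csn is NOT visible to snapshot_csn. */
bool
sketch_xid_invisible(csn_t xid_csn, csn_t snapshot_csn)
{
    if (xid_csn >= SKETCH_CSN_FIRST_NORMAL)
        return xid_csn >= snapshot_csn;     /* committed at or after snapshot */
    if (xid_csn == SKETCH_CSN_FROZEN)
        return false;                       /* frozen: always visible */
    return true;                            /* aborted or still in progress */
}

/* Usage: a commit stamped before the snapshot was taken is visible. */
void
sketch_csn_selfcheck(void)
{
    csn_t       commit_csn = sketch_generate_csn();
    csn_t       snapshot_csn = sketch_generate_csn();

    assert(snapshot_csn > commit_csn);
    assert(!sketch_xid_invisible(commit_csn, snapshot_csn));
    assert(sketch_xid_invisible(SKETCH_CSN_ABORTED, snapshot_csn));
}
```

The point of the comparison is that a snapshot is just one 64-bit number,
which is what makes it cheap to export and import.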
CSN may be the right direction and an important building block for a
distributed PostgreSQL, because a snapshot then requires very little data
to be passed between nodes, so this patch is meant as research in that direction.
We want to implement Clock-SI based on the patch. However, the patch
is too old, so I rebased its infrastructure part onto a recent
commit (7dc37ccea85).
The original patch does not keep CSNs alive across a database restart,
because it cleans the csnlog every time the database restarts. That works
well until a prepared transaction occurs, because the CSN of the prepared
transaction is cleaned by the restart. So I added WAL support for the
csnlog so that CSNs survive restarts, and moved the csnlog cleanup work
to autovacuum.
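
For readers who have not opened the attachment, this is roughly how the
csnlog is addressed: one fixed-size CSN slot per xid, located by page and
index within the page. The sketch below mirrors the TransactionIdToPage /
TransactionIdToPgIndex macros in csn_log.c, under the assumptions that the
page size is 8192 bytes and that a CSN is a 64-bit integer; the sketch_*
names are mine, not the patch's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192              /* assumed page size */

typedef uint64_t csn_t;                 /* assumed CSN width */
typedef uint32_t xid_t;

#define SKETCH_CSNS_PER_PAGE (SKETCH_BLCKSZ / sizeof(csn_t))

/* Page of the csnlog that holds the slot for this xid. */
uint32_t
sketch_xid_to_page(xid_t xid)
{
    return xid / (xid_t) SKETCH_CSNS_PER_PAGE;
}

/* Slot index of this xid within its page. */
uint32_t
sketch_xid_to_index(xid_t xid)
{
    return xid % (xid_t) SKETCH_CSNS_PER_PAGE;
}

/* Byte offset of the slot within its page. */
size_t
sketch_xid_to_byte_offset(xid_t xid)
{
    return (size_t) sketch_xid_to_index(xid) * sizeof(csn_t);
}

/* Usage: the first xid of the second page sits at offset 0 of page 1. */
void
sketch_csnlog_selfcheck(void)
{
    assert(sketch_xid_to_page(1024) == 1);
    assert(sketch_xid_to_byte_offset(1024) == 0);
}
```

Because the slot for a prepared transaction lives in this file, zeroing the
pages at startup loses its CSN, which is the problem described above.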
This leads to another issue: we cannot switch from an xid-based snapshot
to a CSN-based snapshot if a prepared transaction exists, because no CSN
can be found for a prepared transaction created while xid-based snapshots
were in use. To solve it, if the database restarts with the snapshot mode
changed to CSN-based, I record an 'xmin_for_csn' from which checking with
the CSN snapshot starts.
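
One way to read the 'xmin_for_csn' idea is sketched below: xids that precede
the recorded boundary have no CSN and fall back to the xid-based check,
while xids at or after it are checked by CSN. This is only my reading of the
description above, not code from the patch. The enum, the function names and
the exact boundary handling are hypothetical. The comparison is
wraparound-aware in the style of TransactionIdPrecedes for normal xids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xid_t;

typedef enum
{
    SKETCH_CHECK_BY_XID,        /* fall back to xmin/xmax/xip */
    SKETCH_CHECK_BY_CSN         /* compare CSN with the snapshot CSN */
} sketch_check_method;

/* Modulo-2^32 "a precedes b", valid for normal xids only. */
bool
sketch_xid_precedes(xid_t a, xid_t b)
{
    return (int32_t) (a - b) < 0;
}

/* Pick the visibility check for xid given the recorded boundary. */
sketch_check_method
sketch_choose_check(xid_t xid, xid_t xmin_for_csn)
{
    if (sketch_xid_precedes(xid, xmin_for_csn))
        return SKETCH_CHECK_BY_XID;
    return SKETCH_CHECK_BY_CSN;
}

/* Usage: a transaction prepared before the switch has no CSN. */
void
sketch_switch_selfcheck(void)
{
    assert(sketch_choose_check(100, 200) == SKETCH_CHECK_BY_XID);
    assert(sketch_choose_check(200, 200) == SKETCH_CHECK_BY_CSN);
}
```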
Known issues with the current patch:
1. The CSN snapshot supports only the repeatable read isolation level; we
should try to support other isolation levels.
2. We cannot switch smoothly from xid-based to CSN-based snapshots if there
are prepared transactions in the database.
What do you think about it? I want to test and improve the patch step
by step.
① /messages/by-id/21BC916B-80A1-43BF-8650-3363CCDAE09C@postgrespro.ru
-----------
Regards,
Highgo Software (Canada/China/Pakistan)
URL : http://www.highgo.ca/
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
Attachments:
0001-CSN-base-snapshot.patch (application/octet-stream)
From:7dc37ccea8599f460ec95b8a0208e2047a6fb4bf
src/backend/access/transam/Makefile | 2
src/backend/access/transam/csn_log.c | 438 +++++++++++++++++++++++++++++
src/backend/access/transam/csn_snapshot.c | 340 +++++++++++++++++++++++
src/backend/access/transam/twophase.c | 28 ++
src/backend/access/transam/varsup.c | 2
src/backend/access/transam/xact.c | 29 ++
src/backend/access/transam/xlog.c | 12 +
src/backend/storage/ipc/ipci.c | 6
src/backend/storage/ipc/procarray.c | 32 ++
src/backend/storage/lmgr/lwlocknames.txt | 1
src/backend/storage/lmgr/proc.c | 3
src/backend/utils/misc/guc.c | 10 +
src/backend/utils/probes.d | 2
src/backend/utils/time/snapmgr.c | 49 +++
src/bin/initdb/initdb.c | 3
src/include/access/csn_log.h | 30 ++
src/include/access/csn_snapshot.h | 58 ++++
src/include/datatype/timestamp.h | 3
src/include/fmgr.h | 1
src/include/portability/instr_time.h | 10 +
src/include/storage/lwlock.h | 1
src/include/storage/proc.h | 12 +
src/include/storage/procarray.h | 2
src/include/utils/snapshot.h | 9 +
src/test/regress/expected/sysviews.out | 3
25 files changed, 1081 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..fc0321ee6b 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,8 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
clog.o \
commit_ts.o \
+ csn_log.o \
+ csn_snapshot.o \
generic_xlog.o \
multixact.o \
parallel.o \
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
new file mode 100644
index 0000000000..4e0b8d64e4
--- /dev/null
+++ b/src/backend/access/transam/csn_log.c
@@ -0,0 +1,438 @@
+/*-----------------------------------------------------------------------------
+ *
+ * csn_log.c
+ * Track commit sequence numbers of finished transactions
+ *
+ * This module provides SLRU to store CSN for each transaction. This
+ * mapping needs to be kept only for xids greater than oldestXid, but
+ * that can require arbitrarily large amounts of memory in case of long-lived
+ * transactions. Because of the same lifetime and persistency requirements
+ * this module is quite similar to subtrans.c
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_log.c
+ *
+ *-----------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+bool enable_csn_snapshot;
+
+/*
+ * Defines for CSNLog page sizes. A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE, and CSNLog segment numbering at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT. We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCSNLog (see CSNLogPagePrecedes).
+ */
+
+/* We store the commit CSN for each xid */
+#define CSN_LOG_XACTS_PER_PAGE (BLCKSZ / sizeof(XidCSN))
+
+#define TransactionIdToPage(xid) ((xid) / (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CSNLog control
+ */
+static SlruCtlData CSNLogCtlData;
+#define CsnlogCtl (&CSNLogCtlData)
+
+static int ZeroCSNLogPage(int pageno);
+static bool CSNLogPagePrecedes(int page1, int page2);
+static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+ TransactionId *subxids,
+ XidCSN csn, int pageno);
+static void CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn,
+ int slotno);
+
+/*
+ * CSNLogSetCSN
+ *
+ * Record XidCSN of transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transactionid for a top level commit or abort. It can also be a
+ * subtransaction when we record transaction aborts.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ *
+ * csn is the commit sequence number of the transaction. It should be
+ * AbortedCSN for abort cases.
+ */
+void
+CSNLogSetCSN(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn)
+{
+ int pageno;
+ int i = 0;
+ int offset = 0;
+
+ /* Callers of CSNLogSetCSN() must check GUC params */
+ Assert(enable_csn_snapshot);
+
+ Assert(TransactionIdIsValid(xid));
+
+ pageno = TransactionIdToPage(xid); /* get page of parent */
+ for (;;)
+ {
+ int num_on_page = 0;
+
+ while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+ {
+ num_on_page++;
+ i++;
+ }
+
+ CSNLogSetPageStatus(xid,
+ num_on_page, subxids + offset,
+ csn, pageno);
+ if (i >= nsubxids)
+ break;
+
+ offset = i;
+ pageno = TransactionIdToPage(subxids[offset]);
+ xid = InvalidTransactionId;
+ }
+}
+
+/*
+ * Record the final state of transaction entries in the csn log for
+ * all entries on a single page. Atomic only on this page.
+ *
+ * Otherwise API is same as TransactionIdSetTreeStatus()
+ */
+static void
+CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+ TransactionId *subxids,
+ XidCSN csn, int pageno)
+{
+ int slotno;
+ int i;
+
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ slotno = SimpleLruReadPage(CsnlogCtl, pageno, true, xid);
+
+ /* Subtransactions first, if needed ... */
+ for (i = 0; i < nsubxids; i++)
+ {
+ Assert(CsnlogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+ CSNLogSetCSNInSlot(subxids[i], csn, slotno);
+ }
+
+ /* ... then the main transaction */
+ if (TransactionIdIsValid(xid))
+ CSNLogSetCSNInSlot(xid, csn, slotno);
+
+ CsnlogCtl->shared->page_dirty[slotno] = true;
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * Sets the commit status of a single transaction.
+ */
+static void
+CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn, int slotno)
+{
+ int entryno = TransactionIdToPgIndex(xid);
+ XidCSN *ptr;
+
+ Assert(LWLockHeldByMe(CSNLogControlLock));
+
+ ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XidCSN));
+
+ *ptr = csn;
+}
+
+/*
+ * Interrogate the state of a transaction in the log.
+ *
+ * NB: this is a low-level routine and is NOT the preferred entry point
+ * for most uses; TransactionIdGetXidCSN() in csn_snapshot.c is the
+ * intended caller.
+ */
+XidCSN
+CSNLogGetCSNByXid(TransactionId xid)
+{
+ int pageno = TransactionIdToPage(xid);
+ int entryno = TransactionIdToPgIndex(xid);
+ int slotno;
+ XidCSN *ptr;
+ XidCSN xid_csn;
+
+ /* Callers of CSNLogGetCSNByXid() must check GUC params */
+ Assert(enable_csn_snapshot);
+
+ /* Can't ask about stuff that might not be around anymore */
+ Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+
+ /* lock is acquired by SimpleLruReadPage_ReadOnly */
+
+ slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
+ ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XidCSN));
+ xid_csn = *ptr;
+
+ LWLockRelease(CSNLogControlLock);
+
+ return xid_csn;
+}
+
+/*
+ * Number of shared CSNLog buffers.
+ */
+static Size
+CSNLogShmemBuffers(void)
+{
+ return Min(32, Max(4, NBuffers / 512));
+}
+
+/*
+ * Reserve shared memory for CsnlogCtl.
+ */
+Size
+CSNLogShmemSize(void)
+{
+ if (!enable_csn_snapshot)
+ return 0;
+
+ return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
+}
+
+/*
+ * Initialization of shared memory for CSNLog.
+ */
+void
+CSNLogShmemInit(void)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
+ SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
+ CSNLogControlLock, "pg_csn", LWTRANCHE_CSN_LOG_BUFFERS);
+}
+
+/*
+ * This func must be called ONCE on system install. It creates the initial
+ * CSNLog segment. The pg_csn directory is assumed to have been
+ * created by initdb, and CSNLogShmemInit must have been called already.
+ */
+void
+BootStrapCSNLog(void)
+{
+ int slotno;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ /* Create and zero the first page of the CSN log */
+ slotno = ZeroCSNLogPage(0);
+
+ /* Make sure it's written out */
+ SimpleLruWritePage(CsnlogCtl, slotno);
+ Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * Initialize (or reinitialize) a page of CSNLog to zeroes.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCSNLogPage(int pageno)
+{
+ Assert(LWLockHeldByMe(CSNLogControlLock));
+ return SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * after StartupXLOG has initialized ShmemVariableCache->nextXid.
+ *
+ * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
+ * if there are none.
+ */
+void
+StartupCSNLog(TransactionId oldestActiveXID)
+{
+ int startPage;
+ int endPage;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * Since we don't expect pg_csn to be valid across crashes, we
+ * initialize the currently-active page(s) to zeroes during startup.
+ * Whenever we advance into a new page, ExtendCSNLog will likewise
+ * zero the new page without regard to whatever was previously on disk.
+ */
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ startPage = TransactionIdToPage(oldestActiveXID);
+ endPage = TransactionIdToPage(XidFromFullTransactionId(ShmemVariableCache->nextFullXid));
+
+ while (startPage != endPage)
+ {
+ (void) ZeroCSNLogPage(startPage);
+ startPage++;
+ /* must account for wraparound */
+ if (startPage > TransactionIdToPage(MaxTransactionId))
+ startPage = 0;
+ }
+ (void) ZeroCSNLogPage(startPage);
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCSNLog(void)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * Flush dirty CSNLog pages to disk.
+ *
+ * This is not actually necessary from a correctness point of view. We do
+ * it merely as a debugging aid.
+ */
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(false);
+ SimpleLruFlush(CsnlogCtl, false);
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCSNLog(void)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * Flush dirty CSNLog pages to disk.
+ *
+ * This is not actually necessary from a correctness point of view. We do
+ * it merely to improve the odds that writing of dirty pages is done by
+ * the checkpoint process and not by backends.
+ */
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(true);
+ SimpleLruFlush(CsnlogCtl, true);
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(true);
+}
+
+/*
+ * Make sure that CSNLog has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock. We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty clog or xlog page to make room
+ * in shared memory.
+ */
+void
+ExtendCSNLog(TransactionId newestXact)
+{
+ int pageno;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * No work except at first XID of a page. But beware: just after
+ * wraparound, the first XID of page zero is FirstNormalTransactionId.
+ */
+ if (TransactionIdToPgIndex(newestXact) != 0 &&
+ !TransactionIdEquals(newestXact, FirstNormalTransactionId))
+ return;
+
+ pageno = TransactionIdToPage(newestXact);
+
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ /* Zero the page */
+ ZeroCSNLogPage(pageno);
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * Remove all CSNLog segments before the one holding the passed
+ * transaction ID.
+ *
+ * This is normally called during checkpoint, with oldestXact being the
+ * oldest TransactionXmin of any running transaction.
+ */
+void
+TruncateCSNLog(TransactionId oldestXact)
+{
+ int cutoffPage;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * The cutoff point is the start of the segment containing oldestXact. We
+ * pass the *page* containing oldestXact to SimpleLruTruncate. We step
+ * back one transaction to avoid passing a cutoff page that hasn't been
+ * created yet in the rare case that oldestXact would be the first item on
+ * a page and oldestXact == next XID. In that case, if we didn't subtract
+ * one, we'd trigger SimpleLruTruncate's wraparound detection.
+ */
+ TransactionIdRetreat(oldestXact);
+ cutoffPage = TransactionIdToPage(oldestXact);
+
+ SimpleLruTruncate(CsnlogCtl, cutoffPage);
+}
+
+/*
+ * Decide which of two CSNLog page numbers is "older" for truncation
+ * purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic. However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs. So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CSNLogPagePrecedes(int page1, int page2)
+{
+ TransactionId xid1;
+ TransactionId xid2;
+
+ xid1 = ((TransactionId) page1) * CSN_LOG_XACTS_PER_PAGE;
+ xid1 += FirstNormalTransactionId;
+ xid2 = ((TransactionId) page2) * CSN_LOG_XACTS_PER_PAGE;
+ xid2 += FirstNormalTransactionId;
+
+ return TransactionIdPrecedes(xid1, xid2);
+}
diff --git a/src/backend/access/transam/csn_snapshot.c b/src/backend/access/transam/csn_snapshot.c
new file mode 100644
index 0000000000..e2d4d2649e
--- /dev/null
+++ b/src/backend/access/transam/csn_snapshot.c
@@ -0,0 +1,340 @@
+/*-------------------------------------------------------------------------
+ *
+ * csn_snapshot.c
+ * Support for cross-node snapshot isolation.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_snapshot.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/csn_log.h"
+#include "access/csn_snapshot.h"
+#include "access/transam.h"
+#include "access/twophase.h"
+#include "access/xact.h"
+#include "portability/instr_time.h"
+#include "storage/lmgr.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+#include "miscadmin.h"
+
+/* Raise a warning if imported snapshot_csn exceeds ours by this value. */
+#define SNAP_DESYNC_COMPLAIN (1*NSECS_PER_SEC) /* 1 second */
+
+/*
+ * CSNSnapshotState
+ *
+ * Do not trust local clocks to be strictly monotonic; save the last acquired
+ * value so that the next timestamp can be compared with it. Accessed through
+ * GenerateCSN().
+ */
+typedef struct
+{
+ SnapshotCSN last_max_csn;
+ volatile slock_t lock;
+} CSNSnapshotState;
+
+static CSNSnapshotState *csnState;
+
+/*
+ * Enables this module.
+ */
+extern bool enable_csn_snapshot;
+
+
+/* Estimate shared memory space needed */
+Size
+CSNSnapshotShmemSize(void)
+{
+ Size size = 0;
+
+ if (enable_csn_snapshot)
+ {
+ size += MAXALIGN(sizeof(CSNSnapshotState));
+ }
+
+ return size;
+}
+
+/* Init shared memory structures */
+void
+CSNSnapshotShmemInit()
+{
+ bool found;
+
+ if (enable_csn_snapshot)
+ {
+ csnState = ShmemInitStruct("csnState",
+ sizeof(CSNSnapshotState),
+ &found);
+ if (!found)
+ {
+ csnState->last_max_csn = 0;
+ SpinLockInit(&csnState->lock);
+ }
+ }
+}
+
+/*
+ * GenerateCSN
+ *
+ * Generate SnapshotCSN which is actually a local time. Also we are forcing
+ * this time to be always increasing. Since now it is not uncommon to have
+ * millions of read transactions per second we are trying to use nanoseconds
+ * if such time resolution is available.
+ */
+SnapshotCSN
+GenerateCSN(bool locked)
+{
+ instr_time current_time;
+ SnapshotCSN csn;
+
+ Assert(enable_csn_snapshot || csn_snapshot_defer_time > 0);
+
+ /*
+ * TODO: create some macro that adds a small random shift to the current time.
+ */
+ INSTR_TIME_SET_CURRENT(current_time);
+ csn = (SnapshotCSN) INSTR_TIME_GET_NANOSEC(current_time);
+
+ /* TODO: change to atomics? */
+ if (!locked)
+ SpinLockAcquire(&csnState->lock);
+
+ if (csn <= csnState->last_max_csn)
+ csn = ++csnState->last_max_csn;
+ else
+ csnState->last_max_csn = csn;
+
+ if (!locked)
+ SpinLockRelease(&csnState->lock);
+
+ return csn;
+}
+
+/*
+ * TransactionIdGetXidCSN
+ *
+ * Get XidCSN for specified TransactionId taking care about special xids,
+ * xids beyond TransactionXmin and InDoubt states.
+ */
+XidCSN
+TransactionIdGetXidCSN(TransactionId xid)
+{
+ XidCSN xid_csn;
+
+ Assert(enable_csn_snapshot);
+
+ /* Handle permanent TransactionId's for which we don't have mapping */
+ if (!TransactionIdIsNormal(xid))
+ {
+ if (xid == InvalidTransactionId)
+ return AbortedXidCSN;
+ if (xid == FrozenTransactionId || xid == BootstrapTransactionId)
+ return FrozenXidCSN;
+ Assert(false); /* Should not happen */
+ }
+
+ /*
+ * For xids less than TransactionXmin the CSNLog may already be
+ * trimmed, but we know that such a transaction is definitely not concurrently
+ * running according to any snapshot including timetravel ones. Callers
+ * should check TransactionDidCommit after.
+ */
+ if (TransactionIdPrecedes(xid, TransactionXmin))
+ return FrozenXidCSN;
+
+ /* Read XidCSN from SLRU */
+ xid_csn = CSNLogGetCSNByXid(xid);
+
+ /*
+ * If we see the InDoubt state then the transaction is being committed and we
+ * should wait until its XidCSN is assigned so that the visibility check
+ * could decide whether tuple is in snapshot. See also comments in
+ * CSNSnapshotPrecommit().
+ */
+ if (XidCSNIsInDoubt(xid_csn))
+ {
+ XactLockTableWait(xid, NULL, NULL, XLTW_None);
+ xid_csn = CSNLogGetCSNByXid(xid);
+ Assert(XidCSNIsNormal(xid_csn) ||
+ XidCSNIsAborted(xid_csn));
+ }
+
+ Assert(XidCSNIsNormal(xid_csn) ||
+ XidCSNIsInProgress(xid_csn) ||
+ XidCSNIsAborted(xid_csn));
+
+ return xid_csn;
+}
+
+/*
+ * XidInvisibleInCSNSnapshot
+ *
+ * Version of XidInMVCCSnapshot for transactions. For non-imported
+ * csn snapshots this should give the same results as XidInLocalMVCCSnapshot
+ * (except that aborts will be shown as invisible without going to clog), and to
+ * ensure such behaviour XidInMVCCSnapshot is wrapped with asserts that check
+ * that XidInvisibleInCSNSnapshot and XidInLocalMVCCSnapshot agree in the
+ * case of an ordinary snapshot.
+ */
+bool
+XidInvisibleInCSNSnapshot(TransactionId xid, Snapshot snapshot)
+{
+ XidCSN csn;
+
+ Assert(enable_csn_snapshot);
+
+ csn = TransactionIdGetXidCSN(xid);
+
+ if (XidCSNIsNormal(csn))
+ {
+ if (csn < snapshot->snapshot_csn)
+ return false;
+ else
+ return true;
+ }
+ else if (XidCSNIsFrozen(csn))
+ {
+ /* It is bootstrap or frozen transaction */
+ return false;
+ }
+ else
+ {
+ /* It is aborted or in-progress */
+ Assert(XidCSNIsAborted(csn) || XidCSNIsInProgress(csn));
+ if (XidCSNIsAborted(csn))
+ Assert(TransactionIdDidAbort(xid));
+ return true;
+ }
+}
+
+
+/*****************************************************************************
+ * Functions to handle transactions commit.
+ *
+ * For local transactions CSNSnapshotPrecommit sets InDoubt state before
+ * ProcArrayEndTransaction is called and transaction data potentially becomes
+ * visible to other backends. ProcArrayEndTransaction (or ProcArrayRemove in
+ * twophase case) then acquires xid_csn under ProcArray lock and stores it
+ * in proc->assignedXidCsn. It's important that xid_csn for commit is
+ * generated under ProcArray lock, otherwise snapshots won't
+ * be equivalent. A subsequent call to CSNSnapshotCommit will write
+ * proc->assignedXidCsn to CSNLog.
+ *
+ *
+ * CSNSnapshotAbort is slightly different compared to commit because abort
+ * can skip InDoubt phase and can be called for transaction subtree.
+ *****************************************************************************/
+
+
+/*
+ * CSNSnapshotAbort
+ *
+ * Abort transaction in CsnLog. We can skip InDoubt state for aborts
+ * since no concurrent transactions are allowed to see aborted data anyway.
+ */
+void
+CSNSnapshotAbort(PGPROC *proc, TransactionId xid,
+ int nsubxids, TransactionId *subxids)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN);
+
+ /*
+ * Clean assignedXidCsn anyway, as it was possibly set in
+ * XidSnapshotAssignCsnCurrent.
+ */
+ pg_atomic_write_u64(&proc->assignedXidCsn, InProgressXidCSN);
+}
+
+/*
+ * CSNSnapshotPrecommit
+ *
+ * Set InDoubt status for local transaction that we are going to commit.
+ * This step is needed to achieve consistency between local snapshots and
+ * csn-based snapshots. We don't hold ProcArray lock while writing
+ * csn for transaction in SLRU but instead we set InDoubt status before
+ * transaction is deleted from ProcArray so the readers who will read csn
+ * in the gap between ProcArray removal and XidCSN assignment can wait
+ * until XidCSN is finally assigned. See also TransactionIdGetXidCSN().
+ *
+ * This should be called only from parallel group leader before backend is
+ * deleted from ProcArray.
+ */
+void
+CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid,
+ int nsubxids, TransactionId *subxids)
+{
+ XidCSN oldassignedXidCsn = InProgressXidCSN;
+ bool in_progress;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /* Set InDoubt status if it is local transaction */
+ in_progress = pg_atomic_compare_exchange_u64(&proc->assignedXidCsn,
+ &oldassignedXidCsn,
+ InDoubtXidCSN);
+ if (in_progress)
+ {
+ Assert(XidCSNIsInProgress(oldassignedXidCsn));
+ CSNLogSetCSN(xid, nsubxids,
+ subxids, InDoubtXidCSN);
+ }
+ else
+ {
+ /* Otherwise we should have valid XidCSN by this time */
+ Assert(XidCSNIsNormal(oldassignedXidCsn));
+ Assert(XidCSNIsInDoubt(CSNLogGetCSNByXid(xid)));
+ }
+}
+
+/*
+ * CSNSnapshotCommit
+ *
+ * Write XidCSN that were acquired earlier to CsnLog. Should be
+ * preceded by CSNSnapshotPrecommit() so readers can wait until we finally
+ * finished writing to SLRU.
+ *
+ * Should be called after ProcArrayEndTransaction, but before releasing
+ * transaction locks, so that TransactionIdGetXidCSN can wait on this
+ * lock for XidCSN.
+ */
+void
+CSNSnapshotCommit(PGPROC *proc, TransactionId xid,
+ int nsubxids, TransactionId *subxids)
+{
+ volatile XidCSN assigned_xid_csn;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ if (!TransactionIdIsValid(xid))
+ {
+ assigned_xid_csn = pg_atomic_read_u64(&proc->assignedXidCsn);
+ Assert(XidCSNIsInProgress(assigned_xid_csn));
+ return;
+ }
+
+ /* Finally write resulting XidCSN in SLRU */
+ assigned_xid_csn = pg_atomic_read_u64(&proc->assignedXidCsn);
+ Assert(XidCSNIsNormal(assigned_xid_csn));
+ CSNLogSetCSN(xid, nsubxids,
+ subxids, assigned_xid_csn);
+
+ /* Reset for next transaction */
+ pg_atomic_write_u64(&proc->assignedXidCsn, InProgressXidCSN);
+}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 2f7d4ed59a..537c4ea991 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,8 @@
#include <unistd.h>
#include "access/commit_ts.h"
+#include "access/csn_snapshot.h"
+#include "access/csn_log.h"
#include "access/htup_details.h"
#include "access/subtrans.h"
#include "access/transam.h"
@@ -1476,8 +1478,34 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nabortrels, abortrels,
gid);
+ /*
+ * CSNSnapshot callbacks that should be called right before we are
+ * going to become visible. See the comments on these functions for details.
+ */
+ if (isCommit)
+ CSNSnapshotPrecommit(proc, xid, hdr->nsubxacts, children);
+ else
+ CSNSnapshotAbort(proc, xid, hdr->nsubxacts, children);
+
+
ProcArrayRemove(proc, latestXid);
+ /*
+ * Stamp our transaction with XidCSN in CSNLog.
+ * Should be called after ProcArrayRemove, but before releasing
+ * transaction locks, since TransactionIdGetXidCSN relies on
+ * XactLockTableWait to await xid_csn.
+ */
+ if (isCommit)
+ {
+ CSNSnapshotCommit(proc, xid, hdr->nsubxacts, children);
+ }
+ else
+ {
+ Assert(XidCSNIsInProgress(
+ pg_atomic_read_u64(&proc->assignedXidCsn)));
+ }
+
/*
* In case we fail while running the callbacks, mark the gxact invalid so
* no one else will try to commit/rollback, and so it will be recycled if
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 2570e7086a..d24b612f1c 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -15,6 +15,7 @@
#include "access/clog.h"
#include "access/commit_ts.h"
+#include "access/csn_log.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
@@ -173,6 +174,7 @@ GetNewTransactionId(bool isSubXact)
* Extend pg_subtrans and pg_commit_ts too.
*/
ExtendCLOG(xid);
+ ExtendCSNLog(xid);
ExtendCommitTs(xid);
ExtendSUBTRANS(xid);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3984dd3e1a..d58127e15b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
#include <unistd.h>
#include "access/commit_ts.h"
+#include "access/csn_snapshot.h"
#include "access/multixact.h"
#include "access/parallel.h"
#include "access/subtrans.h"
@@ -1433,6 +1434,14 @@ RecordTransactionCommit(void)
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
+
+ /*
+ * Mark our transaction as InDoubt in CsnLog and get ready for
+ * commit.
+ */
+ if (markXidCommitted)
+ CSNSnapshotPrecommit(MyProc, xid, nchildren, children);
+
cleanup:
/* Clean up local data */
if (rels)
@@ -1694,6 +1703,11 @@ RecordTransactionAbort(bool isSubXact)
*/
TransactionIdAbortTree(xid, nchildren, children);
+ /*
+ * Mark our transaction as Aborted in CsnLog.
+ */
+ CSNSnapshotAbort(MyProc, xid, nchildren, children);
+
END_CRIT_SECTION();
/* Compute latestXid while we have the child XIDs handy */
@@ -2183,6 +2197,21 @@ CommitTransaction(void)
*/
ProcArrayEndTransaction(MyProc, latestXid);
+ /*
+ * Stamp our transaction with XidCSN in CsnLog.
+ * Should be called after ProcArrayEndTransaction, but before releasing
+ * transaction locks.
+ */
+ if (!is_parallel_worker)
+ {
+ TransactionId xid = GetTopTransactionIdIfAny();
+ TransactionId *subxids;
+ int nsubxids;
+
+ nsubxids = xactGetCommittedChildren(&subxids);
+ CSNSnapshotCommit(MyProc, xid, nsubxids, subxids);
+ }
+
/*
* This is all post-commit cleanup. Note that if an error is raised here,
* it's too late to abort the transaction. This should be just
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d3d670928..b7350249da 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
#include "access/clog.h"
#include "access/commit_ts.h"
+#include "access/csn_log.h"
#include "access/heaptoast.h"
#include "access/multixact.h"
#include "access/rewriteheap.h"
@@ -5345,6 +5346,7 @@ BootStrapXLOG(void)
/* Bootstrap the commit log, too */
BootStrapCLOG();
+ BootStrapCSNLog();
BootStrapCommitTs();
BootStrapSUBTRANS();
BootStrapMultiXact();
@@ -7054,6 +7056,7 @@ StartupXLOG(void)
* maintained during recovery and need not be started yet.
*/
StartupCLOG();
+ StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
/*
@@ -7871,6 +7874,7 @@ StartupXLOG(void)
if (standbyState == STANDBY_DISABLED)
{
StartupCLOG();
+ StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
}
@@ -8518,6 +8522,7 @@ ShutdownXLOG(int code, Datum arg)
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
}
ShutdownCLOG();
+ ShutdownCSNLog();
ShutdownCommitTs();
ShutdownSUBTRANS();
ShutdownMultiXact();
@@ -9090,7 +9095,10 @@ CreateCheckPoint(int flags)
* StartupSUBTRANS hasn't been called yet.
*/
if (!RecoveryInProgress())
+ {
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ TruncateCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ }
/* Real work is done, but log and update stats before releasing lock. */
LogCheckpointEnd(false);
@@ -9166,6 +9174,7 @@ static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointCLOG();
+ CheckPointCSNLog();
CheckPointCommitTs();
CheckPointSUBTRANS();
CheckPointMultiXact();
@@ -9450,7 +9459,10 @@ CreateRestartPoint(int flags)
* this because StartupSUBTRANS hasn't been called yet.
*/
if (EnableHotStandby)
+ {
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ TruncateCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ }
/* Real work is done, but log and update before releasing lock. */
LogCheckpointEnd(true);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..7122babfd6 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,11 +16,13 @@
#include "access/clog.h"
#include "access/commit_ts.h"
+#include "access/csn_log.h"
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/subtrans.h"
#include "access/twophase.h"
+#include "access/csn_snapshot.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -125,6 +127,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, ProcGlobalShmemSize());
size = add_size(size, XLOGShmemSize());
size = add_size(size, CLOGShmemSize());
+ size = add_size(size, CSNLogShmemSize());
size = add_size(size, CommitTsShmemSize());
size = add_size(size, SUBTRANSShmemSize());
size = add_size(size, TwoPhaseShmemSize());
@@ -143,6 +146,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
+ size = add_size(size, CSNSnapshotShmemSize());
size = add_size(size, SnapMgrShmemSize());
size = add_size(size, BTreeShmemSize());
size = add_size(size, SyncScanShmemSize());
@@ -213,6 +217,7 @@ CreateSharedMemoryAndSemaphores(void)
*/
XLOGShmemInit();
CLOGShmemInit();
+ CSNLogShmemInit();
CommitTsShmemInit();
SUBTRANSShmemInit();
MultiXactShmemInit();
@@ -264,6 +269,7 @@ CreateSharedMemoryAndSemaphores(void)
SyncScanShmemInit();
AsyncShmemInit();
+ CSNSnapshotShmemInit();
#ifdef EXEC_BACKEND
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 363000670b..037f0b78c5 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -46,6 +46,8 @@
#include <signal.h>
#include "access/clog.h"
+#include "access/csn_log.h"
+#include "access/csn_snapshot.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/twophase.h"
@@ -354,6 +356,14 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
latestXid))
ShmemVariableCache->latestCompletedXid = latestXid;
+
+ /*
+ * Assign xid csn while holding ProcArrayLock for
+ * COMMIT PREPARED. After the lock is released, the subsequent
+ * CSNSnapshotCommit() will write this value to CsnLog.
+ */
+ if (XidCSNIsInDoubt(pg_atomic_read_u64(&proc->assignedXidCsn)))
+ pg_atomic_write_u64(&proc->assignedXidCsn, GenerateCSN(false));
}
else
{
@@ -469,6 +479,16 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
latestXid))
ShmemVariableCache->latestCompletedXid = latestXid;
+
+ /*
+ * Assign xid csn while holding ProcArrayLock for
+ * COMMIT.
+ *
+ * TODO: in case of group commit we can generate one CSNSnapshot for the
+ * whole group to save time on timestamp acquisition.
+ */
+ if (XidCSNIsInDoubt(pg_atomic_read_u64(&proc->assignedXidCsn)))
+ pg_atomic_write_u64(&proc->assignedXidCsn, GenerateCSN(false));
}
/*
@@ -835,6 +855,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
while (TransactionIdPrecedes(latestObservedXid, running->nextXid))
{
ExtendSUBTRANS(latestObservedXid);
+ ExtendCSNLog(latestObservedXid);
TransactionIdAdvance(latestObservedXid);
}
TransactionIdRetreat(latestObservedXid); /* = running->nextXid - 1 */
@@ -1513,6 +1534,7 @@ GetSnapshotData(Snapshot snapshot)
int count = 0;
int subcount = 0;
bool suboverflowed = false;
+ XidCSN xid_csn = FrozenXidCSN;
TransactionId replication_slot_xmin = InvalidTransactionId;
TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
@@ -1710,6 +1732,13 @@ GetSnapshotData(Snapshot snapshot)
if (!TransactionIdIsValid(MyPgXact->xmin))
MyPgXact->xmin = TransactionXmin = xmin;
+ /*
+ * Take XidCSN under ProcArrayLock so the snapshot stays
+ * synchronized.
+ */
+ if (enable_csn_snapshot)
+ xid_csn = GenerateCSN(false);
+
LWLockRelease(ProcArrayLock);
/*
@@ -1780,6 +1809,8 @@ GetSnapshotData(Snapshot snapshot)
MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
}
+ snapshot->snapshot_csn = xid_csn;
+
return snapshot;
}
@@ -3337,6 +3368,7 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
while (TransactionIdPrecedes(next_expected_xid, xid))
{
TransactionIdAdvance(next_expected_xid);
+ ExtendCSNLog(next_expected_xid);
ExtendSUBTRANS(next_expected_xid);
}
Assert(next_expected_xid == xid);
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..1a4226ee67 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock 41
OldSnapshotTimeMapLock 42
LogicalRepWorkerLock 43
CLogTruncationLock 44
+CSNLogControlLock 45
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5aa19d3f78..813e51ba45 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -37,6 +37,7 @@
#include "access/transam.h"
#include "access/twophase.h"
+#include "access/csn_snapshot.h"
#include "access/xact.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -441,6 +442,8 @@ InitProcess(void)
MyProc->clogGroupMemberLsn = InvalidXLogRecPtr;
Assert(pg_atomic_read_u32(&MyProc->clogGroupNext) == INVALID_PGPROCNO);
+ pg_atomic_init_u64(&MyProc->assignedXidCsn, InProgressXidCSN);
+
/*
* Acquire ownership of the PGPROC's latch, so that we can use WaitLatch
* on it. That allows us to repoint the process latch, which so far
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5bdc02fce2..287177a896 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/csn_snapshot.h"
#include "access/rmgr.h"
#include "access/tableam.h"
#include "access/transam.h"
@@ -1172,6 +1173,15 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+ {
+ {"enable_csn_snapshot", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Enable CSN-based snapshots."),
+ gettext_noop("Used to achieve REPEATABLE READ isolation level for postgres_fdw transactions.")
+ },
+ &enable_csn_snapshot,
+ true, /* XXX: set true to simplify testing. XXX2: Seems that RESOURCES_MEM isn't the best category */
+ NULL, NULL, NULL
+ },
{
{"ssl", PGC_SIGHUP, CONN_AUTH_SSL,
gettext_noop("Enables SSL connections."),
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index a0b0458108..679c531622 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -77,6 +77,8 @@ provider postgresql {
probe clog__checkpoint__done(bool);
probe subtrans__checkpoint__start(bool);
probe subtrans__checkpoint__done(bool);
+ probe csnlog__checkpoint__start(bool);
+ probe csnlog__checkpoint__done(bool);
probe multixact__checkpoint__start(bool);
probe multixact__checkpoint__done(bool);
probe twophase__checkpoint__start();
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..6fbf64da21 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -229,6 +229,7 @@ static TimestampTz AlignTimestampToMinuteBoundary(TimestampTz ts);
static Snapshot CopySnapshot(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
+static bool XidInLocalMVCCSnapshot(TransactionId xid, Snapshot snapshot);
/*
* Snapshot fields to be serialized.
@@ -247,6 +248,7 @@ typedef struct SerializedSnapshotData
CommandId curcid;
TimestampTz whenTaken;
XLogRecPtr lsn;
+ XidCSN xid_csn;
} SerializedSnapshotData;
Size
@@ -2115,6 +2117,7 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
serialized_snapshot.curcid = snapshot->curcid;
serialized_snapshot.whenTaken = snapshot->whenTaken;
serialized_snapshot.lsn = snapshot->lsn;
+ serialized_snapshot.xid_csn = snapshot->snapshot_csn;
/*
* Ignore the SubXID array if it has overflowed, unless the snapshot was
@@ -2189,6 +2192,7 @@ RestoreSnapshot(char *start_address)
snapshot->curcid = serialized_snapshot.curcid;
snapshot->whenTaken = serialized_snapshot.whenTaken;
snapshot->lsn = serialized_snapshot.lsn;
+ snapshot->snapshot_csn = serialized_snapshot.xid_csn;
/* Copy XIDs, if present. */
if (serialized_snapshot.xcnt > 0)
@@ -2229,6 +2233,47 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *master_pgproc)
/*
* XidInMVCCSnapshot
+ *
+ * Check whether this xid is in snapshot. When enable_csn_snapshot is
+ * switched off just call XidInLocalMVCCSnapshot().
+ */
+bool
+XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
+{
+ bool in_snapshot;
+
+ in_snapshot = XidInLocalMVCCSnapshot(xid, snapshot);
+
+ if (!enable_csn_snapshot)
+ {
+ Assert(XidCSNIsFrozen(snapshot->snapshot_csn));
+ return in_snapshot;
+ }
+
+ if (in_snapshot)
+ {
+ /*
+ * This xid may be already in unknown state and in that case
+ * we must wait and recheck.
+ */
+ return XidInvisibleInCSNSnapshot(xid, snapshot);
+ }
+ else
+ {
+#ifdef USE_ASSERT_CHECKING
+ /* Check that csn snapshot gives the same results as local one */
+ if (XidInvisibleInCSNSnapshot(xid, snapshot))
+ {
+ XidCSN gcsn = TransactionIdGetXidCSN(xid);
+ Assert(XidCSNIsAborted(gcsn));
+ }
+#endif
+ return false;
+ }
+}
+
+/*
+ * XidInLocalMVCCSnapshot
* Is the given XID still-in-progress according to the snapshot?
*
* Note: GetSnapshotData never stores either top xid or subxids of our own
@@ -2237,8 +2282,8 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *master_pgproc)
* TransactionIdIsCurrentTransactionId first, except when it's known the
* XID could not be ours anyway.
*/
-bool
-XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
+static bool
+XidInLocalMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
uint32 i;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index a6577486ce..6902d1e140 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -220,7 +220,8 @@ static const char *const subdirs[] = {
"pg_xact",
"pg_logical",
"pg_logical/snapshots",
- "pg_logical/mappings"
+ "pg_logical/mappings",
+ "pg_csn"
};
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
new file mode 100644
index 0000000000..9b9611127d
--- /dev/null
+++ b/src/include/access/csn_log.h
@@ -0,0 +1,30 @@
+/*
+ * csn_log.h
+ *
+ * Commit-Sequence-Number log.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_log.h
+ */
+#ifndef CSNLOG_H
+#define CSNLOG_H
+
+#include "access/xlog.h"
+#include "utils/snapshot.h"
+
+extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn);
+extern XidCSN CSNLogGetCSNByXid(TransactionId xid);
+
+extern Size CSNLogShmemSize(void);
+extern void CSNLogShmemInit(void);
+extern void BootStrapCSNLog(void);
+extern void StartupCSNLog(TransactionId oldestActiveXID);
+extern void ShutdownCSNLog(void);
+extern void CheckPointCSNLog(void);
+extern void ExtendCSNLog(TransactionId newestXact);
+extern void TruncateCSNLog(TransactionId oldestXact);
+
+#endif /* CSNLOG_H */
\ No newline at end of file
diff --git a/src/include/access/csn_snapshot.h b/src/include/access/csn_snapshot.h
new file mode 100644
index 0000000000..1894586204
--- /dev/null
+++ b/src/include/access/csn_snapshot.h
@@ -0,0 +1,58 @@
+/*-------------------------------------------------------------------------
+ *
+ * csn_snapshot.h
+ * Support for cross-node snapshot isolation.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_snapshot.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CSN_SNAPSHOT_H
+#define CSN_SNAPSHOT_H
+
+#include "port/atomics.h"
+#include "storage/lock.h"
+#include "utils/snapshot.h"
+#include "utils/guc.h"
+
+/*
+ * snapshot.h is used in frontend code so atomic variant of SnapshotCSN type
+ * is defined here.
+ */
+typedef pg_atomic_uint64 CSN_atomic;
+
+#define InProgressXidCSN UINT64CONST(0x0)
+#define AbortedXidCSN UINT64CONST(0x1)
+#define FrozenXidCSN UINT64CONST(0x2)
+#define InDoubtXidCSN UINT64CONST(0x3)
+#define FirstNormalXidCSN UINT64CONST(0x4)
+
+#define XidCSNIsInProgress(csn) ((csn) == InProgressXidCSN)
+#define XidCSNIsAborted(csn) ((csn) == AbortedXidCSN)
+#define XidCSNIsFrozen(csn) ((csn) == FrozenXidCSN)
+#define XidCSNIsInDoubt(csn) ((csn) == InDoubtXidCSN)
+#define XidCSNIsNormal(csn) ((csn) >= FirstNormalXidCSN)
+
+
+
+
+extern Size CSNSnapshotShmemSize(void);
+extern void CSNSnapshotShmemInit(void);
+
+extern SnapshotCSN GenerateCSN(bool locked);
+
+extern bool XidInvisibleInCSNSnapshot(TransactionId xid, Snapshot snapshot);
+
+extern XidCSN TransactionIdGetXidCSN(TransactionId xid);
+
+extern void CSNSnapshotAbort(PGPROC *proc, TransactionId xid, int nsubxids,
+ TransactionId *subxids);
+extern void CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid, int nsubxids,
+ TransactionId *subxids);
+extern void CSNSnapshotCommit(PGPROC *proc, TransactionId xid, int nsubxids,
+ TransactionId *subxids);
+
+#endif /* CSN_SNAPSHOT_H */
diff --git a/src/include/datatype/timestamp.h b/src/include/datatype/timestamp.h
index 6be6d35d1e..583b1beea5 100644
--- a/src/include/datatype/timestamp.h
+++ b/src/include/datatype/timestamp.h
@@ -93,6 +93,9 @@ typedef struct
#define USECS_PER_MINUTE INT64CONST(60000000)
#define USECS_PER_SEC INT64CONST(1000000)
+#define NSECS_PER_SEC INT64CONST(1000000000)
+#define NSECS_PER_USEC INT64CONST(1000)
+
/*
* We allow numeric timezone offsets up to 15:59:59 either way from Greenwich.
* Currently, the record holders for wackiest offsets in actual use are zones
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index d349510b7c..5cdf2e17cb 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -280,6 +280,7 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
#define PG_GETARG_FLOAT4(n) DatumGetFloat4(PG_GETARG_DATUM(n))
#define PG_GETARG_FLOAT8(n) DatumGetFloat8(PG_GETARG_DATUM(n))
#define PG_GETARG_INT64(n) DatumGetInt64(PG_GETARG_DATUM(n))
+#define PG_GETARG_UINT64(n) DatumGetUInt64(PG_GETARG_DATUM(n))
/* use this if you want the raw, possibly-toasted input datum: */
#define PG_GETARG_RAW_VARLENA_P(n) ((struct varlena *) PG_GETARG_POINTER(n))
/* use this if you want the input datum de-toasted: */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327cc..4ac23da654 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -141,6 +141,9 @@ typedef struct timespec instr_time;
#define INSTR_TIME_GET_MICROSEC(t) \
(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (((uint64) (t).tv_sec * (uint64) 1000000000) + (uint64) ((t).tv_nsec))
+
#else /* !HAVE_CLOCK_GETTIME */
/* Use gettimeofday() */
@@ -205,6 +208,10 @@ typedef struct timeval instr_time;
#define INSTR_TIME_GET_MICROSEC(t) \
(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) (t).tv_usec)
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (((uint64) (t).tv_sec * (uint64) 1000000000) + \
+ (uint64) (t).tv_usec * (uint64) 1000)
+
#endif /* HAVE_CLOCK_GETTIME */
#else /* WIN32 */
@@ -237,6 +244,9 @@ typedef LARGE_INTEGER instr_time;
#define INSTR_TIME_GET_MICROSEC(t) \
((uint64) (((double) (t).QuadPart * 1000000.0) / GetTimerFrequency()))
+#define INSTR_TIME_GET_NANOSEC(t) \
+ ((uint64) (((double) (t).QuadPart * 1000000000.0) / GetTimerFrequency()))
+
static inline double
GetTimerFrequency(void)
{
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..af39b89877 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -198,6 +198,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_CLOG_BUFFERS = NUM_INDIVIDUAL_LWLOCKS,
LWTRANCHE_COMMITTS_BUFFERS,
LWTRANCHE_SUBTRANS_BUFFERS,
+ LWTRANCHE_CSN_LOG_BUFFERS,
LWTRANCHE_MXACTOFFSET_BUFFERS,
LWTRANCHE_MXACTMEMBER_BUFFERS,
LWTRANCHE_ASYNC_BUFFERS,
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ae4f573ab4..0873e8f240 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -15,8 +15,10 @@
#define _PROC_H_
#include "access/clog.h"
+#include "access/csn_snapshot.h"
#include "access/xlogdefs.h"
#include "lib/ilist.h"
+#include "utils/snapshot.h"
#include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
@@ -205,6 +207,16 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * assignedXidCsn holds XidCSN for this transaction. It is generated
+ * under a ProcArray lock and later is written to a CSNLog. This
+ * variable is defined as atomic only for the case of group commit; in all
+ * other scenarios only the backend responsible for this proc entry works
+ * with this variable.
+ */
+ CSN_atomic assignedXidCsn;
+
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a5c7d0c064..dd6445205d 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -35,7 +35,7 @@
* decoding outside xact */
#define PROCARRAY_SLOTS_XMIN 0x20 /* replication slot xmin,
- * catalog_xmin */
+ * catalog_xmin */
/*
* Only flags in PROCARRAY_PROC_FLAGS_MASK are considered when matching
* PGXACT->vacuumFlags. Other flags are used for different purposes and
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 4796edb63a..9f622c76a7 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -121,6 +121,9 @@ typedef enum SnapshotType
typedef struct SnapshotData *Snapshot;
#define InvalidSnapshot ((Snapshot) NULL)
+typedef uint64 XidCSN;
+typedef uint64 SnapshotCSN;
+extern bool enable_csn_snapshot;
/*
* Struct representing all kind of possible snapshots.
@@ -201,6 +204,12 @@ typedef struct SnapshotData
TimestampTz whenTaken; /* timestamp when snapshot was taken */
XLogRecPtr lsn; /* position in the WAL stream when taken */
+
+ /*
+ * SnapshotCSN for snapshot isolation support.
+ * Will be used only if enable_csn_snapshot is enabled.
+ */
+ SnapshotCSN snapshot_csn;
} SnapshotData;
#endif /* SNAPSHOT_H */
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a126f0ad61..86a5df0cba 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -73,6 +73,7 @@ select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
enable_bitmapscan | on
+ enable_csn_snapshot | on
enable_gathermerge | on
enable_groupingsets_hash_disk | off
enable_hashagg | on
@@ -92,7 +93,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(20 rows)
+(21 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
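A minimal, standalone sketch of how the reserved CSN values from csn_snapshot.h could drive a visibility decision. The five constants are copied from the patch; the comparison rule (a normal CSN is visible only when it is strictly less than the snapshot CSN) and the names CsnVisibility and classify_xid_csn are assumptions made for illustration, since XidInvisibleInCSNSnapshot() itself is not shown in this part of the patch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XidCSN;

/* Reserved values, copied from csn_snapshot.h in the patch. */
#define InProgressXidCSN  UINT64_C(0x0)
#define AbortedXidCSN     UINT64_C(0x1)
#define FrozenXidCSN      UINT64_C(0x2)
#define InDoubtXidCSN     UINT64_C(0x3)
#define FirstNormalXidCSN UINT64_C(0x4)

/* Hypothetical result type, not part of the patch. */
typedef enum
{
	CSN_VISIBLE,
	CSN_INVISIBLE,
	CSN_MUST_WAIT
} CsnVisibility;

/*
 * Hypothetical helper: classify a transaction's CSN against a snapshot CSN.
 * Assumed rule: a normal CSN is visible only if it is strictly older than
 * the snapshot CSN.
 */
static inline CsnVisibility
classify_xid_csn(XidCSN xid_csn, XidCSN snapshot_csn)
{
	if (xid_csn == InDoubtXidCSN)
		return CSN_MUST_WAIT;	/* commit in flight: wait, then recheck */
	if (xid_csn == FrozenXidCSN)
		return CSN_VISIBLE;		/* older than any snapshot of interest */
	if (xid_csn >= FirstNormalXidCSN)
		return (xid_csn < snapshot_csn) ? CSN_VISIBLE : CSN_INVISIBLE;
	return CSN_INVISIBLE;		/* in progress or aborted */
}

/* Small self-check covering each reserved value and both normal cases. */
static inline void
classify_xid_csn_examples(void)
{
	const XidCSN snap = 1000;

	assert(classify_xid_csn(InProgressXidCSN, snap) == CSN_INVISIBLE);
	assert(classify_xid_csn(AbortedXidCSN, snap) == CSN_INVISIBLE);
	assert(classify_xid_csn(FrozenXidCSN, snap) == CSN_VISIBLE);
	assert(classify_xid_csn(InDoubtXidCSN, snap) == CSN_MUST_WAIT);
	assert(classify_xid_csn(999, snap) == CSN_VISIBLE);
	assert(classify_xid_csn(1000, snap) == CSN_INVISIBLE);
}
```

Calling classify_xid_csn_examples() from a test main() exercises every branch. In the patch itself, the frozen case additionally relies on callers checking TransactionDidCommit afterwards, which this sketch leaves out.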
0002-Wal-for-csn.patch (application/octet-stream)
src/backend/access/rmgrdesc/Makefile | 1
src/backend/access/rmgrdesc/csnlogdesc.c | 95 +++++++++++++++
src/backend/access/rmgrdesc/xlogdesc.c | 6 +
src/backend/access/transam/csn_log.c | 187 ++++++++++++++++++++++-------
src/backend/access/transam/csn_snapshot.c | 72 ++++++++++-
src/backend/access/transam/rmgr.c | 1
src/backend/access/transam/xlog.c | 12 +-
src/backend/commands/vacuum.c | 3
src/backend/storage/ipc/procarray.c | 2
src/backend/utils/time/snapmgr.c | 2
src/bin/pg_controldata/pg_controldata.c | 2
src/bin/pg_upgrade/pg_upgrade.c | 5 +
src/bin/pg_upgrade/pg_upgrade.h | 2
src/bin/pg_waldump/rmgrdesc.c | 1
src/include/access/csn_log.h | 29 ++++
src/include/access/rmgrlist.h | 1
src/include/access/xlog_internal.h | 1
src/include/catalog/pg_control.h | 1
18 files changed, 359 insertions(+), 64 deletions(-)
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index f88d72fd86..15fc36f7b4 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -11,6 +11,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
brindesc.o \
clogdesc.o \
+ csnlogdesc.o \
committsdesc.o \
dbasedesc.o \
genericdesc.o \
diff --git a/src/backend/access/rmgrdesc/csnlogdesc.c b/src/backend/access/rmgrdesc/csnlogdesc.c
new file mode 100644
index 0000000000..e96b056325
--- /dev/null
+++ b/src/backend/access/rmgrdesc/csnlogdesc.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * csnlogdesc.c
+ * rmgr descriptor routines for access/transam/csn_log.c
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/csnlogdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/csn_log.h"
+
+
+void
+csnlog_desc(StringInfo buf, XLogReaderState *record)
+{
+ char *rec = XLogRecGetData(record);
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CSN_ZEROPAGE)
+ {
+ int pageno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ appendStringInfo(buf, "pageno %d", pageno);
+ }
+ else if (info == XLOG_CSN_TRUNCATE)
+ {
+ int pageno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ appendStringInfo(buf, "pageno %d", pageno);
+ }
+ else if (info == XLOG_CSN_ASSIGNMENT)
+ {
+ XidCSN csn;
+
+ memcpy(&csn, XLogRecGetData(record), sizeof(XidCSN));
+ appendStringInfo(buf, "assign " UINT64_FORMAT, csn);
+ }
+ else if (info == XLOG_CSN_SETXIDCSN)
+ {
+ xl_xidcsn_set *xlrec = (xl_xidcsn_set *) rec;
+ int nsubxids;
+
+ appendStringInfo(buf, "set " UINT64_FORMAT " for: %u",
+ xlrec->xidcsn,
+ xlrec->xtop);
+ nsubxids = ((XLogRecGetDataLen(record) - MinSizeOfXidCSNSet) /
+ sizeof(TransactionId));
+ if (nsubxids > 0)
+ {
+ int i;
+ TransactionId *subxids;
+
+ subxids = palloc(sizeof(TransactionId) * nsubxids);
+ memcpy(subxids,
+ XLogRecGetData(record) + MinSizeOfXidCSNSet,
+ sizeof(TransactionId) * nsubxids);
+ for (i = 0; i < nsubxids; i++)
+ appendStringInfo(buf, ", %u", subxids[i]);
+ pfree(subxids);
+ }
+ }
+}
+
+const char *
+csnlog_identify(uint8 info)
+{
+ const char *id = NULL;
+
+ switch (info & ~XLR_INFO_MASK)
+ {
+ case XLOG_CSN_ASSIGNMENT:
+ id = "ASSIGNMENT";
+ break;
+ case XLOG_CSN_SETXIDCSN:
+ id = "SETXIDCSN";
+ break;
+ case XLOG_CSN_ZEROPAGE:
+ id = "ZEROPAGE";
+ break;
+ case XLOG_CSN_TRUNCATE:
+ id = "TRUNCATE";
+ break;
+ }
+
+ return id;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 1cd97852e8..44e2e8ecec 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -114,7 +114,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
"max_wal_senders=%d max_prepared_xacts=%d "
"max_locks_per_xact=%d wal_level=%s "
- "wal_log_hints=%s track_commit_timestamp=%s",
+ "wal_log_hints=%s track_commit_timestamp=%s "
+ "enable_csn_snapshot=%s",
xlrec.MaxConnections,
xlrec.max_worker_processes,
xlrec.max_wal_senders,
@@ -122,7 +123,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
xlrec.max_locks_per_xact,
wal_level_str,
xlrec.wal_log_hints ? "on" : "off",
- xlrec.track_commit_timestamp ? "on" : "off");
+ xlrec.track_commit_timestamp ? "on" : "off",
+ xlrec.enable_csn_snapshot ? "on" : "off");
}
else if (info == XLOG_FPW_CHANGE)
{
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
index 4e0b8d64e4..4577e61fc3 100644
--- a/src/backend/access/transam/csn_log.c
+++ b/src/backend/access/transam/csn_log.c
@@ -9,6 +9,11 @@
* transactions. Because of same lifetime and persistancy requirements
* this module is quite similar to subtrans.c
*
+ * Switching the database from CSN-based snapshots to xid-based snapshots
+ * needs no special handling. Switching from xid-based to CSN-based
+ * snapshots, however, requires choosing the xid from which CSN-based checks
+ * begin. It cannot be oldestActiveXID because of prepared transactions.
+ *
* Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -52,7 +57,8 @@ bool enable_csn_snapshot;
static SlruCtlData CSNLogCtlData;
#define CsnlogCtl (&CSNLogCtlData)
-static int ZeroCSNLogPage(int pageno);
+static int ZeroCSNLogPage(int pageno, bool write_xlog);
+static void ZeroTruncateCSNLogPage(int pageno, bool write_xlog);
static bool CSNLogPagePrecedes(int page1, int page2);
static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
TransactionId *subxids,
@@ -60,6 +66,11 @@ static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
static void CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn,
int slotno);
+static void WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn);
+static void WriteZeroCSNPageXlogRec(int pageno);
+static void WriteTruncateCSNXlogRec(int pageno);
+
/*
* CSNLogSetCSN
*
@@ -77,7 +88,7 @@ static void CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn,
*/
void
CSNLogSetCSN(TransactionId xid, int nsubxids,
- TransactionId *subxids, XidCSN csn)
+ TransactionId *subxids, XidCSN csn, bool write_xlog)
{
int pageno;
int i = 0;
@@ -89,6 +100,10 @@ CSNLogSetCSN(TransactionId xid, int nsubxids,
Assert(TransactionIdIsValid(xid));
pageno = TransactionIdToPage(xid); /* get page of parent */
+
+ if(write_xlog)
+ WriteXidCsnXlogRec(xid, nsubxids, subxids, csn);
+
for (;;)
{
int num_on_page = 0;
@@ -180,11 +195,7 @@ CSNLogGetCSNByXid(TransactionId xid)
/* Callers of CSNLogGetCSNByXid() must check GUC params */
Assert(enable_csn_snapshot);
- /* Can't ask about stuff that might not be around anymore */
- Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
-
/* lock is acquired by SimpleLruReadPage_ReadOnly */
-
slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
xid_csn = *ptr;
@@ -245,7 +256,7 @@ BootStrapCSNLog(void)
LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
/* Create and zero the first page of the commit log */
- slotno = ZeroCSNLogPage(0);
+ slotno = ZeroCSNLogPage(0, false);
/* Make sure it's written out */
SimpleLruWritePage(CsnlogCtl, slotno);
@@ -263,50 +274,20 @@ BootStrapCSNLog(void)
* Control lock must be held at entry, and will be held at exit.
*/
static int
-ZeroCSNLogPage(int pageno)
+ZeroCSNLogPage(int pageno, bool write_xlog)
{
Assert(LWLockHeldByMe(CSNLogControlLock));
+ if(write_xlog)
+ WriteZeroCSNPageXlogRec(pageno);
return SimpleLruZeroPage(CsnlogCtl, pageno);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend startup,
- * after StartupXLOG has initialized ShmemVariableCache->nextXid.
- *
- * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
- * if there are none.
- */
-void
-StartupCSNLog(TransactionId oldestActiveXID)
+static void
+ZeroTruncateCSNLogPage(int pageno, bool write_xlog)
{
- int startPage;
- int endPage;
-
- if (!enable_csn_snapshot)
- return;
-
- /*
- * Since we don't expect pg_csn to be valid across crashes, we
- * initialize the currently-active page(s) to zeroes during startup.
- * Whenever we advance into a new page, ExtendCSNLog will likewise
- * zero the new page without regard to whatever was previously on disk.
- */
- LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
-
- startPage = TransactionIdToPage(oldestActiveXID);
- endPage = TransactionIdToPage(XidFromFullTransactionId(ShmemVariableCache->nextFullXid));
-
- while (startPage != endPage)
- {
- (void) ZeroCSNLogPage(startPage);
- startPage++;
- /* must account for wraparound */
- if (startPage > TransactionIdToPage(MaxTransactionId))
- startPage = 0;
- }
- (void) ZeroCSNLogPage(startPage);
-
- LWLockRelease(CSNLogControlLock);
+ if(write_xlog)
+ WriteTruncateCSNXlogRec(pageno);
+ SimpleLruTruncate(CsnlogCtl, pageno);
}
/*
@@ -379,7 +360,7 @@ ExtendCSNLog(TransactionId newestXact)
LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
- ZeroCSNLogPage(pageno);
+ ZeroCSNLogPage(pageno, !InRecovery);
LWLockRelease(CSNLogControlLock);
}
@@ -410,7 +391,7 @@ TruncateCSNLog(TransactionId oldestXact)
TransactionIdRetreat(oldestXact);
cutoffPage = TransactionIdToPage(oldestXact);
- SimpleLruTruncate(CsnlogCtl, cutoffPage);
+ ZeroTruncateCSNLogPage(cutoffPage, true);
}
/*
@@ -436,3 +417,115 @@ CSNLogPagePrecedes(int page1, int page2)
return TransactionIdPrecedes(xid1, xid2);
}
+
+void
+WriteAssignCSNXlogRec(XidCSN xidcsn)
+{
+ XidCSN log_csn = 0;
+
+ if(xidcsn > get_last_log_wal_csn())
+ {
+ log_csn = CSNAddByNanosec(xidcsn, 20);
+ set_last_log_wal_csn(log_csn);
+ }
+ else
+ {
+ return;
+ }
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&log_csn), sizeof(XidCSN));
+ XLogInsert(RM_CSNLOG_ID, XLOG_CSN_ASSIGNMENT);
+}
+
+static void
+WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn)
+{
+ xl_xidcsn_set xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.xtop = xid;
+ xlrec.nsubxacts = nsubxids;
+ xlrec.xidcsn = csn;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, MinSizeOfXidCSNSet);
+ XLogRegisterData((char *) subxids, nsubxids * sizeof(TransactionId));
+ recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_SETXIDCSN);
+ XLogFlush(recptr);
+}
+
+/*
+ * Write a ZEROPAGE xlog record
+ */
+static void
+WriteZeroCSNPageXlogRec(int pageno)
+{
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&pageno), sizeof(int));
+ (void) XLogInsert(RM_CSNLOG_ID, XLOG_CSN_ZEROPAGE);
+}
+
+/*
+ * Write a TRUNCATE xlog record
+ */
+static void
+WriteTruncateCSNXlogRec(int pageno)
+{
+ XLogRecPtr recptr;
+ return; /* XXX: early return leaves the TRUNCATE record code below unreachable */
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&pageno), sizeof(int));
+ recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_TRUNCATE);
+ XLogFlush(recptr);
+}
+
+
+void
+csnlog_redo(XLogReaderState *record)
+{
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ /* Backup blocks are not used in csnlog records */
+ Assert(!XLogRecHasAnyBlockRefs(record));
+
+ if (info == XLOG_CSN_ASSIGNMENT)
+ {
+ XidCSN csn;
+
+ memcpy(&csn, XLogRecGetData(record), sizeof(XidCSN));
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ set_last_max_csn(csn);
+ LWLockRelease(CSNLogControlLock);
+
+ }
+ else if (info == XLOG_CSN_SETXIDCSN)
+ {
+ xl_xidcsn_set *xlrec = (xl_xidcsn_set *) XLogRecGetData(record);
+ CSNLogSetCSN(xlrec->xtop, xlrec->nsubxacts, xlrec->xsub, xlrec->xidcsn, false);
+ }
+ else if (info == XLOG_CSN_ZEROPAGE)
+ {
+ int pageno;
+ int slotno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ slotno = ZeroCSNLogPage(pageno, false);
+ SimpleLruWritePage(CsnlogCtl, slotno);
+ LWLockRelease(CSNLogControlLock);
+ Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+ }
+ else if (info == XLOG_CSN_TRUNCATE)
+ {
+ int pageno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ CsnlogCtl->shared->latest_page_number = pageno;
+ ZeroTruncateCSNLogPage(pageno, false);
+ }
+ else
+ elog(PANIC, "csnlog_redo: unknown op code %u", info);
+}
diff --git a/src/backend/access/transam/csn_snapshot.c b/src/backend/access/transam/csn_snapshot.c
index e2d4d2649e..a3d164d77e 100644
--- a/src/backend/access/transam/csn_snapshot.c
+++ b/src/backend/access/transam/csn_snapshot.c
@@ -31,6 +31,8 @@
/* Raise a warning if imported snapshot_csn exceeds ours by this value. */
#define SNAP_DESYNC_COMPLAIN (1*NSECS_PER_SEC) /* 1 second */
+TransactionId xmin_for_csn = InvalidTransactionId;
+
/*
* CSNSnapshotState
*
@@ -40,7 +42,9 @@
*/
typedef struct
{
- SnapshotCSN last_max_csn;
+ SnapshotCSN last_max_csn; /* largest CSN generated so far */
+ XidCSN last_csn_log_wal; /* last CSN written to WAL by an ASSIGNMENT record; CSNs are logged at intervals */
+ TransactionId xmin_for_csn; /* xid from which CSN checks start after switching from xid-based to CSN-based snapshots */
volatile slock_t lock;
} CSNSnapshotState;
@@ -80,6 +84,7 @@ CSNSnapshotShmemInit()
if (!found)
{
csnState->last_max_csn = 0;
+ csnState->last_csn_log_wal = 0;
SpinLockInit(&csnState->lock);
}
}
@@ -116,6 +121,8 @@ GenerateCSN(bool locked)
else
csnState->last_max_csn = csn;
+ WriteAssignCSNXlogRec(csn);
+
if (!locked)
SpinLockRelease(&csnState->lock);
@@ -131,7 +138,7 @@ GenerateCSN(bool locked)
XidCSN
TransactionIdGetXidCSN(TransactionId xid)
{
- XidCSN xid_csn;
+ XidCSN xid_csn;
Assert(enable_csn_snapshot);
@@ -145,13 +152,35 @@ TransactionIdGetXidCSN(TransactionId xid)
Assert(false); /* Should not happend */
}
+ /*
+ * If we have just switched from an xid-based snapshot to a CSN-based snapshot,
+ * we need a starting xid for CSN-based checks, in case a prepared transaction
+ * holds back TransactionXmin but has no CSN.
+ */
+ if(InvalidTransactionId == xmin_for_csn)
+ {
+ SpinLockAcquire(&csnState->lock);
+ if(InvalidTransactionId != csnState->xmin_for_csn)
+ xmin_for_csn = csnState->xmin_for_csn;
+ else
+ xmin_for_csn = FrozenTransactionId;
+
+ SpinLockRelease(&csnState->lock);
+ }
+
+ if ( FrozenTransactionId == xmin_for_csn ||
+ TransactionIdPrecedes(xmin_for_csn, TransactionXmin))
+ {
+ xmin_for_csn = TransactionXmin;
+ }
+
/*
* For xids which less then TransactionXmin CSNLog can be already
* trimmed but we know that such transaction is definetly not concurrently
* running according to any snapshot including timetravel ones. Callers
* should check TransactionDidCommit after.
*/
- if (TransactionIdPrecedes(xid, TransactionXmin))
+ if (TransactionIdPrecedes(xid, xmin_for_csn))
return FrozenXidCSN;
/* Read XidCSN from SLRU */
@@ -251,7 +280,7 @@ CSNSnapshotAbort(PGPROC *proc, TransactionId xid,
if (!enable_csn_snapshot)
return;
- CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN);
+ CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN, true);
/*
* Clean assignedXidCsn anyway, as it was possibly set in
@@ -292,7 +321,7 @@ CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid,
{
Assert(XidCSNIsInProgress(oldassignedXidCsn));
CSNLogSetCSN(xid, nsubxids,
- subxids, InDoubtXidCSN);
+ subxids, InDoubtXidCSN, true);
}
else
{
@@ -333,8 +362,39 @@ CSNSnapshotCommit(PGPROC *proc, TransactionId xid,
assigned_xid_csn = pg_atomic_read_u64(&proc->assignedXidCsn);
Assert(XidCSNIsNormal(assigned_xid_csn));
CSNLogSetCSN(xid, nsubxids,
- subxids, assigned_xid_csn);
+ subxids, assigned_xid_csn, true);
/* Reset for next transaction */
pg_atomic_write_u64(&proc->assignedXidCsn, InProgressXidCSN);
}
+
+void
+set_last_max_csn(XidCSN xidcsn)
+{
+ csnState->last_max_csn = xidcsn;
+}
+
+void
+set_last_log_wal_csn(XidCSN xidcsn)
+{
+ csnState->last_csn_log_wal = xidcsn;
+}
+
+XidCSN
+get_last_log_wal_csn(void)
+{
+ XidCSN last_csn_log_wal;
+
+ last_csn_log_wal = csnState->last_csn_log_wal;
+
+ return last_csn_log_wal;
+}
+
+/*
+ * Record 'xmin_for_csn' when switching from xid-based to CSN-based snapshots.
+ */
+void
+set_xmin_for_csn(void)
+{
+ csnState->xmin_for_csn = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 58091f6b52..b1e5ec350e 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -28,6 +28,7 @@
#include "replication/origin.h"
#include "storage/standby.h"
#include "utils/relmapper.h"
+#include "access/csn_log.h"
/* must be kept in sync with RmgrData definition in xlog_internal.h */
#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b7350249da..7187bb0be3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4604,6 +4604,7 @@ InitControlFile(uint64 sysidentifier)
ControlFile->wal_level = wal_level;
ControlFile->wal_log_hints = wal_log_hints;
ControlFile->track_commit_timestamp = track_commit_timestamp;
+ ControlFile->enable_csn_snapshot = enable_csn_snapshot;
ControlFile->data_checksum_version = bootstrap_data_checksum_version;
}
@@ -7056,7 +7057,6 @@ StartupXLOG(void)
* maintained during recovery and need not be started yet.
*/
StartupCLOG();
- StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
/*
@@ -7874,7 +7874,6 @@ StartupXLOG(void)
if (standbyState == STANDBY_DISABLED)
{
StartupCLOG();
- StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
}
@@ -9097,7 +9096,6 @@ CreateCheckPoint(int flags)
if (!RecoveryInProgress())
{
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
- TruncateCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
}
/* Real work is done, but log and update stats before releasing lock. */
@@ -9720,7 +9718,8 @@ XLogReportParameters(void)
max_wal_senders != ControlFile->max_wal_senders ||
max_prepared_xacts != ControlFile->max_prepared_xacts ||
max_locks_per_xact != ControlFile->max_locks_per_xact ||
- track_commit_timestamp != ControlFile->track_commit_timestamp)
+ track_commit_timestamp != ControlFile->track_commit_timestamp ||
+ enable_csn_snapshot != ControlFile->enable_csn_snapshot)
{
/*
* The change in number of backend slots doesn't need to be WAL-logged
@@ -9742,6 +9741,7 @@ XLogReportParameters(void)
xlrec.wal_level = wal_level;
xlrec.wal_log_hints = wal_log_hints;
xlrec.track_commit_timestamp = track_commit_timestamp;
+ xlrec.enable_csn_snapshot = enable_csn_snapshot;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -9750,6 +9750,8 @@ XLogReportParameters(void)
XLogFlush(recptr);
}
+ if (enable_csn_snapshot != ControlFile->enable_csn_snapshot)
+ set_xmin_for_csn();
ControlFile->MaxConnections = MaxConnections;
ControlFile->max_worker_processes = max_worker_processes;
ControlFile->max_wal_senders = max_wal_senders;
@@ -9758,6 +9760,7 @@ XLogReportParameters(void)
ControlFile->wal_level = wal_level;
ControlFile->wal_log_hints = wal_log_hints;
ControlFile->track_commit_timestamp = track_commit_timestamp;
+ ControlFile->enable_csn_snapshot = enable_csn_snapshot;
UpdateControlFile();
}
}
@@ -10184,6 +10187,7 @@ xlog_redo(XLogReaderState *record)
CommitTsParameterChange(xlrec.track_commit_timestamp,
ControlFile->track_commit_timestamp);
ControlFile->track_commit_timestamp = xlrec.track_commit_timestamp;
+ ControlFile->enable_csn_snapshot = xlrec.enable_csn_snapshot;
UpdateControlFile();
LWLockRelease(ControlFileLock);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5a110edb07..0f301b1db0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -53,7 +53,7 @@
#include "utils/memutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
-
+#include "access/csn_log.h"
/*
* GUC parameters
@@ -1632,6 +1632,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
TruncateCLOG(frozenXID, oldestxid_datoid);
TruncateCommitTs(frozenXID);
+ TruncateCSNLog(frozenXID);
TruncateMultiXact(minMulti, minmulti_datoid);
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 158fb9d31f..c671d92ead 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1736,7 +1736,7 @@ GetSnapshotData(Snapshot snapshot)
* Take XidCSN under ProcArrayLock so the snapshot stays
* synchronized.
*/
- if (enable_csn_snapshot)
+ if (!snapshot->takenDuringRecovery && enable_csn_snapshot)
xid_csn = GenerateCSN(false);
LWLockRelease(ProcArrayLock);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index e2baeb9222..218f32e8ec 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -2265,7 +2265,7 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
if (XidInvisibleInCSNSnapshot(xid, snapshot))
{
XidCSN gcsn = TransactionIdGetXidCSN(xid);
- Assert(XidCSNIsAborted(gcsn));
+ Assert(XidCSNIsAborted(gcsn) || XidCSNIsInProgress(gcsn));
}
#endif
return false;
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df74..e7194124c7 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -306,6 +306,8 @@ main(int argc, char *argv[])
ControlFile->max_locks_per_xact);
printf(_("track_commit_timestamp setting: %s\n"),
ControlFile->track_commit_timestamp ? _("on") : _("off"));
+ printf(_("enable_csn_snapshot setting: %s\n"),
+ ControlFile->enable_csn_snapshot ? _("on") : _("off"));
printf(_("Maximum data alignment: %u\n"),
ControlFile->maxAlign);
/* we don't print floatFormat since can't say much useful about it */
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 70194eb096..863ee73d24 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -545,6 +545,11 @@ copy_xact_xlog_xid(void)
check_ok();
}
+ if(old_cluster.controldata.cat_ver > CSN_BASE_SNAPSHOT_ADD_VER)
+ {
+ copy_subdir_files("pg_csn", "pg_csn");
+ }
+
/* now reset the wal archives in the new cluster */
prep_status("Resetting WAL archives");
exec_prog(UTILITY_LOG_FILE, NULL, true, true,
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 8b90cefbe0..f35860dfc5 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -123,6 +123,8 @@ extern char *output_files[];
*/
#define JSONB_FORMAT_CHANGE_CAT_VER 201409291
+#define CSN_BASE_SNAPSHOT_ADD_VER 202002010
+
/*
* Each relation is represented by a relinfo structure.
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca4b1..282bae882a 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -31,6 +31,7 @@
#include "rmgrdesc.h"
#include "storage/standbydefs.h"
#include "utils/relmapper.h"
+#include "access/csn_log.h"
#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
{ name, desc, identify},
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
index 9b9611127d..b973e0c2ce 100644
--- a/src/include/access/csn_log.h
+++ b/src/include/access/csn_log.h
@@ -14,17 +14,42 @@
#include "access/xlog.h"
#include "utils/snapshot.h"
+/* XLOG stuff */
+#define XLOG_CSN_ASSIGNMENT 0x00
+#define XLOG_CSN_SETXIDCSN 0x10
+#define XLOG_CSN_ZEROPAGE 0x20
+#define XLOG_CSN_TRUNCATE 0x30
+
+typedef struct xl_xidcsn_set
+{
+ XidCSN xidcsn;
+ TransactionId xtop; /* XID's top-level XID */
+ int nsubxacts; /* number of subtransaction XIDs */
+ TransactionId xsub[FLEXIBLE_ARRAY_MEMBER]; /* assigned subxids */
+} xl_xidcsn_set;
+
+#define MinSizeOfXidCSNSet offsetof(xl_xidcsn_set, xsub)
+#define CSNAddByNanosec(csn,second) (csn + second * 1000000000L)
+
extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
- TransactionId *subxids, XidCSN csn);
+ TransactionId *subxids, XidCSN csn, bool write_xlog);
extern XidCSN CSNLogGetCSNByXid(TransactionId xid);
extern Size CSNLogShmemSize(void);
extern void CSNLogShmemInit(void);
extern void BootStrapCSNLog(void);
-extern void StartupCSNLog(TransactionId oldestActiveXID);
extern void ShutdownCSNLog(void);
extern void CheckPointCSNLog(void);
extern void ExtendCSNLog(TransactionId newestXact);
extern void TruncateCSNLog(TransactionId oldestXact);
+extern void csnlog_redo(XLogReaderState *record);
+extern void csnlog_desc(StringInfo buf, XLogReaderState *record);
+extern const char *csnlog_identify(uint8 info);
+extern void WriteAssignCSNXlogRec(XidCSN xidcsn);
+extern void set_last_max_csn(XidCSN xidcsn);
+extern void set_last_log_wal_csn(XidCSN xidcsn);
+extern XidCSN get_last_log_wal_csn(void);
+extern void set_xmin_for_csn(void);
+
#endif /* CSNLOG_H */
\ No newline at end of file
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 6c15df7e70..b2d12bfb27 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CSNLOG_ID, "CSN", csnlog_redo, csnlog_desc, csnlog_identify, NULL, NULL, NULL)
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index c8869d5226..729cf5bc56 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -236,6 +236,7 @@ typedef struct xl_parameter_change
int wal_level;
bool wal_log_hints;
bool track_commit_timestamp;
+ bool enable_csn_snapshot;
} xl_parameter_change;
/* logs restore point */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..9e5d4b0fc0 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -181,6 +181,7 @@ typedef struct ControlFileData
int max_prepared_xacts;
int max_locks_per_xact;
bool track_commit_timestamp;
+ bool enable_csn_snapshot;
/*
* This data is used to check for hardware-architecture compatibility of
Hello hackers,
I have made some changes on top of the last version:
1. Rebased onto the current commit (c2bd1fec32ab54).
2. Added regression tests and documentation.
3. Added support for switching from xid-based snapshots to CSN-based snapshots,
and the same on the standby side.
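As a rough illustration of item 3, here is a standalone sketch (not code taken from the patch) of the cutoff idea: xids older than the point recorded at switch time are treated as frozen for CSN purposes and are resolved through the commit log instead. The type and helper names (`Xid`, `xid_precedes`, `csn_lookup_cutoff`, `treat_as_frozen`) are simplified stand-ins invented for this example, and the selection rule shown is an assumption about the intended behaviour.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Xid;           /* hypothetical stand-in for TransactionId */

#define FROZEN_XID 2u           /* mirrors FrozenTransactionId */

/* Wraparound-aware comparison for normal xids, in the spirit of
 * TransactionIdPrecedes. */
static int
xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Pick the xid below which the CSN log is not consulted.
 *
 * switch_xid is the xid recorded when the cluster switched to CSN-based
 * snapshots, or FROZEN_XID if no switch was recorded (assumption).
 */
static Xid
csn_lookup_cutoff(Xid switch_xid, Xid transaction_xmin)
{
    assert(transaction_xmin > FROZEN_XID);  /* must be a normal xid */

    if (switch_xid == FROZEN_XID ||
        xid_precedes(switch_xid, transaction_xmin))
        return transaction_xmin;
    return switch_xid;
}

/* Returns 1 if xid should be resolved via clog rather than the CSN log. */
static int
treat_as_frozen(Xid xid, Xid switch_xid, Xid transaction_xmin)
{
    return xid_precedes(xid, csn_lookup_cutoff(switch_xid, transaction_xmin));
}
```

With a switch point of 150 and a TransactionXmin of 100, xid 120 would be treated as frozen (it predates the switch and may have no CSN), while xid 160 would be looked up in the CSN log.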
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
Attachments:
0001-CSN-base-snapshot.patch (application/octet-stream)
Author: movead
Date: Fri Jun 12 17:13:26 2020 +0800
src/backend/access/transam/Makefile | 2 +
src/backend/access/transam/csn_log.c | 438 ++++++++++++++++++++++++++++++++
src/backend/access/transam/csn_snapshot.c | 340 +++++++++++++++++++++++++
src/backend/access/transam/twophase.c | 28 ++++++++++++++
src/backend/access/transam/varsup.c | 2 +
src/backend/access/transam/xact.c | 29 +++++++++++++++
src/backend/access/transam/xlog.c | 12 ++++++
src/backend/storage/ipc/ipci.c | 6 +++
src/backend/storage/ipc/procarray.c | 32 ++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/storage/lmgr/proc.c | 3 ++
src/backend/utils/misc/guc.c | 10 +++++
src/backend/utils/probes.d | 2 +
src/backend/utils/time/snapmgr.c | 49 ++++++++++++++++++++++++-
src/bin/initdb/initdb.c | 3 +-
src/include/access/csn_log.h | 30 +++++++++++++++
src/include/access/csn_snapshot.h | 58 +++++++++++++++++++++++++++++
src/include/datatype/timestamp.h | 3 ++
src/include/fmgr.h | 1 +
src/include/portability/instr_time.h | 10 +++++
src/include/storage/lwlock.h | 1 +
src/include/storage/proc.h | 12 ++++++
src/include/utils/snapshot.h | 9 +++++
src/test/regress/expected/sysviews.out | 3 +-
25 files changed, 1082 insertions(+), 4 deletions(-)
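For readers skimming the file list: the new csn_snapshot.c derives CSNs from the local clock while forcing them to increase, and a committed transaction is visible to a snapshot only if its CSN is smaller than the snapshot's CSN. The following is a minimal standalone sketch of those two rules, not the patch code itself; `Csn`, `next_csn` and `csn_visible` are hypothetical names, and the real code keeps the last value in shared memory under a spinlock rather than in a plain static variable.

```c
#include <stdint.h>

typedef uint64_t Csn;           /* hypothetical stand-in for SnapshotCSN/XidCSN */

/* Last CSN handed out (process-local here, for illustration only). */
static Csn last_max_csn = 0;

/*
 * Return a CSN strictly greater than every earlier one, given a clock
 * reading in nanoseconds. If the clock did not advance (or went backwards),
 * fall back to last value + 1.
 */
static Csn
next_csn(Csn clock_now_ns)
{
    if (clock_now_ns <= last_max_csn)
        return ++last_max_csn;

    last_max_csn = clock_now_ns;
    return clock_now_ns;
}

/* A committed transaction is visible iff it committed before the snapshot. */
static int
csn_visible(Csn commit_csn, Csn snapshot_csn)
{
    return commit_csn < snapshot_csn;
}
```

The fallback branch is what keeps CSNs unique when many transactions commit within one clock tick or when the clock steps backwards.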
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..fc0321ee6b 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,8 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
clog.o \
commit_ts.o \
+ csn_log.o \
+ csn_snapshot.o \
generic_xlog.o \
multixact.o \
parallel.o \
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
new file mode 100644
index 0000000000..4e0b8d64e4
--- /dev/null
+++ b/src/backend/access/transam/csn_log.c
@@ -0,0 +1,438 @@
+/*-----------------------------------------------------------------------------
+ *
+ * csn_log.c
+ * Track commit sequence numbers of finished transactions
+ *
+ * This module provides an SLRU to store the CSN for each transaction. This
+ * mapping needs to be kept only for xids greater than oldestXid, but
+ * that can require arbitrarily large amounts of memory in case of long-lived
+ * transactions. Because of the same lifetime and persistence requirements,
+ * this module is quite similar to subtrans.c
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_log.c
+ *
+ *-----------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+bool enable_csn_snapshot;
+
+/*
+ * Defines for CSNLog page sizes. A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE, and CSNLog segment numbering at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT. We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCSNLog (see CSNLogPagePrecedes).
+ */
+
+/* We store the commit CSN for each xid */
+#define CSN_LOG_XACTS_PER_PAGE (BLCKSZ / sizeof(XidCSN))
+
+#define TransactionIdToPage(xid) ((xid) / (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CSNLog control
+ */
+static SlruCtlData CSNLogCtlData;
+#define CsnlogCtl (&CSNLogCtlData)
+
+static int ZeroCSNLogPage(int pageno);
+static bool CSNLogPagePrecedes(int page1, int page2);
+static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+ TransactionId *subxids,
+ XidCSN csn, int pageno);
+static void CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn,
+ int slotno);
+
+/*
+ * CSNLogSetCSN
+ *
+ * Record XidCSN of transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transactionid for a top level commit or abort. It can also be a
+ * subtransaction when we record transaction aborts.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ *
+ * csn is the commit sequence number of the transaction. It should be
+ * AbortedXidCSN for abort cases.
+ */
+void
+CSNLogSetCSN(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn)
+{
+ int pageno;
+ int i = 0;
+ int offset = 0;
+
+ /* Callers of CSNLogSetCSN() must check GUC params */
+ Assert(enable_csn_snapshot);
+
+ Assert(TransactionIdIsValid(xid));
+
+ pageno = TransactionIdToPage(xid); /* get page of parent */
+ for (;;)
+ {
+ int num_on_page = 0;
+
+ while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+ {
+ num_on_page++;
+ i++;
+ }
+
+ CSNLogSetPageStatus(xid,
+ num_on_page, subxids + offset,
+ csn, pageno);
+ if (i >= nsubxids)
+ break;
+
+ offset = i;
+ pageno = TransactionIdToPage(subxids[offset]);
+ xid = InvalidTransactionId;
+ }
+}
+
+/*
+ * Record the final state of transaction entries in the csn log for
+ * all entries on a single page. Atomic only on this page.
+ *
+ * Otherwise API is same as TransactionIdSetTreeStatus()
+ */
+static void
+CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+ TransactionId *subxids,
+ XidCSN csn, int pageno)
+{
+ int slotno;
+ int i;
+
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ slotno = SimpleLruReadPage(CsnlogCtl, pageno, true, xid);
+
+ /* Subtransactions first, if needed ... */
+ for (i = 0; i < nsubxids; i++)
+ {
+ Assert(CsnlogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+ CSNLogSetCSNInSlot(subxids[i], csn, slotno);
+ }
+
+ /* ... then the main transaction */
+ if (TransactionIdIsValid(xid))
+ CSNLogSetCSNInSlot(xid, csn, slotno);
+
+ CsnlogCtl->shared->page_dirty[slotno] = true;
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * Sets the commit status of a single transaction.
+ */
+static void
+CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn, int slotno)
+{
+ int entryno = TransactionIdToPgIndex(xid);
+ XidCSN *ptr;
+
+ Assert(LWLockHeldByMe(CSNLogControlLock));
+
+ ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XidCSN));
+
+ *ptr = csn;
+}
+
+/*
+ * Interrogate the state of a transaction in the log.
+ *
+ * NB: this is a low-level routine and is NOT the preferred entry point
+ * for most uses; TransactionIdGetXidCSN() in csn_snapshot.c is the
+ * intended caller.
+ */
+XidCSN
+CSNLogGetCSNByXid(TransactionId xid)
+{
+ int pageno = TransactionIdToPage(xid);
+ int entryno = TransactionIdToPgIndex(xid);
+ int slotno;
+ XidCSN *ptr;
+ XidCSN xid_csn;
+
+ /* Callers of CSNLogGetCSNByXid() must check GUC params */
+ Assert(enable_csn_snapshot);
+
+ /* Can't ask about stuff that might not be around anymore */
+ Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+
+ /* lock is acquired by SimpleLruReadPage_ReadOnly */
+
+ slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
+ ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XidCSN));
+ xid_csn = *ptr;
+
+ LWLockRelease(CSNLogControlLock);
+
+ return xid_csn;
+}
+
+/*
+ * Number of shared CSNLog buffers.
+ */
+static Size
+CSNLogShmemBuffers(void)
+{
+ return Min(32, Max(4, NBuffers / 512));
+}
+
+/*
+ * Reserve shared memory for CsnlogCtl.
+ */
+Size
+CSNLogShmemSize(void)
+{
+ if (!enable_csn_snapshot)
+ return 0;
+
+ return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
+}
+
+/*
+ * Initialization of shared memory for CSNLog.
+ */
+void
+CSNLogShmemInit(void)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
+ SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
+ CSNLogControlLock, "pg_csn", LWTRANCHE_CSN_LOG_BUFFERS);
+}
+
+/*
+ * This func must be called ONCE on system install. It creates the initial
+ * CSNLog segment. The pg_csn directory is assumed to have been
+ * created by initdb, and CSNLogShmemInit must have been called already.
+ */
+void
+BootStrapCSNLog(void)
+{
+ int slotno;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ /* Create and zero the first page of the CSN log */
+ slotno = ZeroCSNLogPage(0);
+
+ /* Make sure it's written out */
+ SimpleLruWritePage(CsnlogCtl, slotno);
+ Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * Initialize (or reinitialize) a page of CSNLog to zeroes.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCSNLogPage(int pageno)
+{
+ Assert(LWLockHeldByMe(CSNLogControlLock));
+ return SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * after StartupXLOG has initialized ShmemVariableCache->nextFullXid.
+ *
+ * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
+ * if there are none.
+ */
+void
+StartupCSNLog(TransactionId oldestActiveXID)
+{
+ int startPage;
+ int endPage;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * Since we don't expect pg_csn to be valid across crashes, we
+ * initialize the currently-active page(s) to zeroes during startup.
+ * Whenever we advance into a new page, ExtendCSNLog will likewise
+ * zero the new page without regard to whatever was previously on disk.
+ */
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ startPage = TransactionIdToPage(oldestActiveXID);
+ endPage = TransactionIdToPage(XidFromFullTransactionId(ShmemVariableCache->nextFullXid));
+
+ while (startPage != endPage)
+ {
+ (void) ZeroCSNLogPage(startPage);
+ startPage++;
+ /* must account for wraparound */
+ if (startPage > TransactionIdToPage(MaxTransactionId))
+ startPage = 0;
+ }
+ (void) ZeroCSNLogPage(startPage);
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCSNLog(void)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * Flush dirty CSNLog pages to disk.
+ *
+ * This is not actually necessary from a correctness point of view. We do
+ * it merely as a debugging aid.
+ */
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(false);
+ SimpleLruFlush(CsnlogCtl, false);
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCSNLog(void)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * Flush dirty CSNLog pages to disk.
+ *
+ * This is not actually necessary from a correctness point of view. We do
+ * it merely to improve the odds that writing of dirty pages is done by
+ * the checkpoint process and not by backends.
+ */
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(true);
+ SimpleLruFlush(CsnlogCtl, true);
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(true);
+}
+
+/*
+ * Make sure that CSNLog has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock. We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CSNLog or xlog page to make room
+ * in shared memory.
+ */
+void
+ExtendCSNLog(TransactionId newestXact)
+{
+ int pageno;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * No work except at first XID of a page. But beware: just after
+ * wraparound, the first XID of page zero is FirstNormalTransactionId.
+ */
+ if (TransactionIdToPgIndex(newestXact) != 0 &&
+ !TransactionIdEquals(newestXact, FirstNormalTransactionId))
+ return;
+
+ pageno = TransactionIdToPage(newestXact);
+
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ /* Zero the page */
+ ZeroCSNLogPage(pageno);
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * Remove all CSNLog segments before the one holding the passed
+ * transaction ID.
+ *
+ * This is normally called during checkpoint, with oldestXact being the
+ * oldest TransactionXmin of any running transaction.
+ */
+void
+TruncateCSNLog(TransactionId oldestXact)
+{
+ int cutoffPage;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * The cutoff point is the start of the segment containing oldestXact. We
+ * pass the *page* containing oldestXact to SimpleLruTruncate. We step
+ * back one transaction to avoid passing a cutoff page that hasn't been
+ * created yet in the rare case that oldestXact would be the first item on
+ * a page and oldestXact == next XID. In that case, if we didn't subtract
+ * one, we'd trigger SimpleLruTruncate's wraparound detection.
+ */
+ TransactionIdRetreat(oldestXact);
+ cutoffPage = TransactionIdToPage(oldestXact);
+
+ SimpleLruTruncate(CsnlogCtl, cutoffPage);
+}
+
+/*
+ * Decide which of two CSNLog page numbers is "older" for truncation
+ * purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic. However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs. So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CSNLogPagePrecedes(int page1, int page2)
+{
+ TransactionId xid1;
+ TransactionId xid2;
+
+ xid1 = ((TransactionId) page1) * CSN_LOG_XACTS_PER_PAGE;
+ xid1 += FirstNormalTransactionId;
+ xid2 = ((TransactionId) page2) * CSN_LOG_XACTS_PER_PAGE;
+ xid2 += FirstNormalTransactionId;
+
+ return TransactionIdPrecedes(xid1, xid2);
+}
diff --git a/src/backend/access/transam/csn_snapshot.c b/src/backend/access/transam/csn_snapshot.c
new file mode 100644
index 0000000000..bcc5bac757
--- /dev/null
+++ b/src/backend/access/transam/csn_snapshot.c
@@ -0,0 +1,340 @@
+/*-------------------------------------------------------------------------
+ *
+ * csn_snapshot.c
+ * Support for cross-node snapshot isolation.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_snapshot.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/csn_log.h"
+#include "access/csn_snapshot.h"
+#include "access/transam.h"
+#include "access/twophase.h"
+#include "access/xact.h"
+#include "portability/instr_time.h"
+#include "storage/lmgr.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+#include "miscadmin.h"
+
+/* Raise a warning if imported snapshot_csn exceeds ours by this value. */
+#define SNAP_DESYNC_COMPLAIN (1*NSECS_PER_SEC) /* 1 second */
+
+/*
+ * CSNSnapshotState
+ *
+ * Do not trust local clocks to be strictly monotonic; save the last acquired
+ * value so that the next timestamp can be compared with it. Accessed through
+ * GenerateCSN().
+ */
+typedef struct
+{
+ SnapshotCSN last_max_csn;
+ volatile slock_t lock;
+} CSNSnapshotState;
+
+static CSNSnapshotState *csnState;
+
+/*
+ * Enables this module.
+ */
+extern bool enable_csn_snapshot;
+
+
+/* Estimate shared memory space needed */
+Size
+CSNSnapshotShmemSize(void)
+{
+ Size size = 0;
+
+ if (enable_csn_snapshot)
+ {
+ size += MAXALIGN(sizeof(CSNSnapshotState));
+ }
+
+ return size;
+}
+
+/* Init shared memory structures */
+void
+CSNSnapshotShmemInit()
+{
+ bool found;
+
+ if (enable_csn_snapshot)
+ {
+ csnState = ShmemInitStruct("csnState",
+ sizeof(CSNSnapshotState),
+ &found);
+ if (!found)
+ {
+ csnState->last_max_csn = 0;
+ SpinLockInit(&csnState->lock);
+ }
+ }
+}
+
+/*
+ * GenerateCSN
+ *
+ * Generate a SnapshotCSN, which is actually the local time. We also force
+ * this time to be always increasing. Since it is now not uncommon to have
+ * millions of read transactions per second, we try to use nanoseconds
+ * if such time resolution is available.
+ */
+SnapshotCSN
+GenerateCSN(bool locked)
+{
+ instr_time current_time;
+ SnapshotCSN csn;
+
+ Assert(enable_csn_snapshot);
+
+ /*
+ * TODO: create a macro that adds a small random shift to the current time.
+ */
+ INSTR_TIME_SET_CURRENT(current_time);
+ csn = (SnapshotCSN) INSTR_TIME_GET_NANOSEC(current_time);
+
+ /* TODO: change to atomics? */
+ if (!locked)
+ SpinLockAcquire(&csnState->lock);
+
+ if (csn <= csnState->last_max_csn)
+ csn = ++csnState->last_max_csn;
+ else
+ csnState->last_max_csn = csn;
+
+ if (!locked)
+ SpinLockRelease(&csnState->lock);
+
+ return csn;
+}
+
+/*
+ * TransactionIdGetXidCSN
+ *
+ * Get the XidCSN for the specified TransactionId, taking care of special xids,
+ * xids older than TransactionXmin, and InDoubt states.
+ */
+XidCSN
+TransactionIdGetXidCSN(TransactionId xid)
+{
+ XidCSN xid_csn;
+
+ Assert(enable_csn_snapshot);
+
+ /* Handle permanent TransactionId's for which we don't have mapping */
+ if (!TransactionIdIsNormal(xid))
+ {
+ if (xid == InvalidTransactionId)
+ return AbortedXidCSN;
+ if (xid == FrozenTransactionId || xid == BootstrapTransactionId)
+ return FrozenXidCSN;
+ Assert(false); /* Should not happen */
+ }
+
+ /*
+ * For xids that are less than TransactionXmin, the CSNLog may already be
+ * trimmed, but we know that such a transaction is definitely not concurrently
+ * running according to any snapshot, including timetravel ones. Callers
+ * should check TransactionDidCommit afterwards.
+ */
+ if (TransactionIdPrecedes(xid, TransactionXmin))
+ return FrozenXidCSN;
+
+ /* Read XidCSN from SLRU */
+ xid_csn = CSNLogGetCSNByXid(xid);
+
+ /*
+ * If we see the InDoubt state, the transaction is being committed and we
+ * should wait until its XidCSN is assigned so that the visibility check
+ * can decide whether the tuple is in the snapshot. See also the comments in
+ * CSNSnapshotPrecommit().
+ */
+ if (XidCSNIsInDoubt(xid_csn))
+ {
+ XactLockTableWait(xid, NULL, NULL, XLTW_None);
+ xid_csn = CSNLogGetCSNByXid(xid);
+ Assert(XidCSNIsNormal(xid_csn) ||
+ XidCSNIsAborted(xid_csn));
+ }
+
+ Assert(XidCSNIsNormal(xid_csn) ||
+ XidCSNIsInProgress(xid_csn) ||
+ XidCSNIsAborted(xid_csn));
+
+ return xid_csn;
+}
+
+/*
+ * XidInvisibleInCSNSnapshot
+ *
+ * Version of XidInMVCCSnapshot for transactions. For non-imported
+ * csn snapshots this should give the same results as XidInLocalMVCCSnapshot
+ * (except that aborts are shown as invisible without going to clog). To
+ * ensure this, XidInMVCCSnapshot carries asserts that check that
+ * XidInvisibleInCSNSnapshot and XidInLocalMVCCSnapshot agree in the
+ * case of an ordinary snapshot.
+ */
+bool
+XidInvisibleInCSNSnapshot(TransactionId xid, Snapshot snapshot)
+{
+ XidCSN csn;
+
+ Assert(enable_csn_snapshot);
+
+ csn = TransactionIdGetXidCSN(xid);
+
+ if (XidCSNIsNormal(csn))
+ {
+ if (csn < snapshot->snapshot_csn)
+ return false;
+ else
+ return true;
+ }
+ else if (XidCSNIsFrozen(csn))
+ {
+ /* It is bootstrap or frozen transaction */
+ return false;
+ }
+ else
+ {
+ /* It is aborted or in-progress */
+ Assert(XidCSNIsAborted(csn) || XidCSNIsInProgress(csn));
+ if (XidCSNIsAborted(csn))
+ Assert(TransactionIdDidAbort(xid));
+ return true;
+ }
+}
+
+
+/*****************************************************************************
+ * Functions to handle transactions commit.
+ *
+ * For local transactions CSNSnapshotPrecommit sets the InDoubt state before
+ * ProcArrayEndTransaction is called and transaction data potentially becomes
+ * visible to other backends. ProcArrayEndTransaction (or ProcArrayRemove in
+ * the twophase case) then acquires xid_csn under the ProcArray lock and stores
+ * it in proc->assignedXidCsn. It's important that xid_csn for commit is
+ * generated under the ProcArray lock, otherwise csn-based and local snapshots
+ * won't be equivalent. A subsequent call to CSNSnapshotCommit will write
+ * proc->assignedXidCsn to CSNLog.
+ *
+ *
+ * CSNSnapshotAbort is slightly different compared to commit because abort
+ * can skip the InDoubt phase and can be called for a transaction subtree.
+ *****************************************************************************/
+
+
+/*
+ * CSNSnapshotAbort
+ *
+ * Mark the transaction as aborted in CsnLog. We can skip the InDoubt state
+ * for aborts since no concurrent transaction is allowed to see aborted data.
+ */
+void
+CSNSnapshotAbort(PGPROC *proc, TransactionId xid,
+ int nsubxids, TransactionId *subxids)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN);
+
+ /*
+ * Clean assignedXidCsn anyway, as it was possibly set in
+ * XidSnapshotAssignCsnCurrent.
+ */
+ pg_atomic_write_u64(&proc->assignedXidCsn, InProgressXidCSN);
+}
+
+/*
+ * CSNSnapshotPrecommit
+ *
+ * Set InDoubt status for a local transaction that we are about to commit.
+ * This step is needed to achieve consistency between local snapshots and
+ * csn-based snapshots. We don't hold the ProcArray lock while writing the
+ * csn for a transaction to the SLRU; instead we set InDoubt status before
+ * the transaction is removed from ProcArray, so that readers who read the csn
+ * in the gap between ProcArray removal and XidCSN assignment can wait
+ * until the XidCSN is finally assigned. See also TransactionIdGetXidCSN().
+ *
+ * This should be called only from parallel group leader before backend is
+ * deleted from ProcArray.
+ */
+void
+CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid,
+ int nsubxids, TransactionId *subxids)
+{
+ XidCSN oldassignedXidCsn = InProgressXidCSN;
+ bool in_progress;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /* Set InDoubt status if it is local transaction */
+ in_progress = pg_atomic_compare_exchange_u64(&proc->assignedXidCsn,
+ &oldassignedXidCsn,
+ InDoubtXidCSN);
+ if (in_progress)
+ {
+ Assert(XidCSNIsInProgress(oldassignedXidCsn));
+ CSNLogSetCSN(xid, nsubxids,
+ subxids, InDoubtXidCSN);
+ }
+ else
+ {
+ /* Otherwise we should have valid XidCSN by this time */
+ Assert(XidCSNIsNormal(oldassignedXidCsn));
+ Assert(XidCSNIsInDoubt(CSNLogGetCSNByXid(xid)));
+ }
+}
+
+/*
+ * CSNSnapshotCommit
+ *
+ * Write the XidCSN that was acquired earlier to CsnLog. Should be
+ * preceded by CSNSnapshotPrecommit() so that readers can wait until we have
+ * finished writing to the SLRU.
+ *
+ * Should be called after ProcArrayEndTransaction, but before releasing
+ * transaction locks, so that TransactionIdGetXidCSN can wait on the
+ * transaction lock for the XidCSN.
+ */
+void
+CSNSnapshotCommit(PGPROC *proc, TransactionId xid,
+ int nsubxids, TransactionId *subxids)
+{
+ volatile XidCSN assigned_xid_csn;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ if (!TransactionIdIsValid(xid))
+ {
+ assigned_xid_csn = pg_atomic_read_u64(&proc->assignedXidCsn);
+ Assert(XidCSNIsInProgress(assigned_xid_csn));
+ return;
+ }
+
+ /* Finally write resulting XidCSN in SLRU */
+ assigned_xid_csn = pg_atomic_read_u64(&proc->assignedXidCsn);
+ Assert(XidCSNIsNormal(assigned_xid_csn));
+ CSNLogSetCSN(xid, nsubxids,
+ subxids, assigned_xid_csn);
+
+ /* Reset for next transaction */
+ pg_atomic_write_u64(&proc->assignedXidCsn, InProgressXidCSN);
+}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e1904877fa..af5f388c12 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,8 @@
#include <unistd.h>
#include "access/commit_ts.h"
+#include "access/csn_snapshot.h"
+#include "access/csn_log.h"
#include "access/htup_details.h"
#include "access/subtrans.h"
#include "access/transam.h"
@@ -1479,8 +1481,34 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nabortrels, abortrels,
gid);
+ /*
+ * CSNSnapshot callbacks that should be called right before we are
+ * going to become visible. See the comments on these functions for details.
+ */
+ if (isCommit)
+ CSNSnapshotPrecommit(proc, xid, hdr->nsubxacts, children);
+ else
+ CSNSnapshotAbort(proc, xid, hdr->nsubxacts, children);
+
+
ProcArrayRemove(proc, latestXid);
+ /*
+ * Stamp our transaction with XidCSN in CSNLog.
+ * Should be called after ProcArrayRemove, but before releasing
+ * transaction locks, since TransactionIdGetXidCSN relies on
+ * XactLockTableWait to await xid_csn.
+ */
+ if (isCommit)
+ {
+ CSNSnapshotCommit(proc, xid, hdr->nsubxacts, children);
+ }
+ else
+ {
+ Assert(XidCSNIsInProgress(
+ pg_atomic_read_u64(&proc->assignedXidCsn)));
+ }
+
/*
* In case we fail while running the callbacks, mark the gxact invalid so
* no one else will try to commit/rollback, and so it will be recycled if
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index e14b53bf9e..b045ed09f3 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -15,6 +15,7 @@
#include "access/clog.h"
#include "access/commit_ts.h"
+#include "access/csn_log.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
@@ -173,6 +174,7 @@ GetNewTransactionId(bool isSubXact)
* Extend pg_subtrans and pg_commit_ts too.
*/
ExtendCLOG(xid);
+ ExtendCSNLog(xid);
ExtendCommitTs(xid);
ExtendSUBTRANS(xid);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd30b62d36..8dcf951954 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
#include <unistd.h>
#include "access/commit_ts.h"
+#include "access/csn_snapshot.h"
#include "access/multixact.h"
#include "access/parallel.h"
#include "access/subtrans.h"
@@ -1433,6 +1434,14 @@ RecordTransactionCommit(void)
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
+
+ /*
+ * Mark our transaction as InDoubt in CsnLog and get ready for
+ * commit.
+ */
+ if (markXidCommitted)
+ CSNSnapshotPrecommit(MyProc, xid, nchildren, children);
+
cleanup:
/* Clean up local data */
if (rels)
@@ -1694,6 +1703,11 @@ RecordTransactionAbort(bool isSubXact)
*/
TransactionIdAbortTree(xid, nchildren, children);
+ /*
+ * Mark our transaction as Aborted in CsnLog.
+ */
+ CSNSnapshotAbort(MyProc, xid, nchildren, children);
+
END_CRIT_SECTION();
/* Compute latestXid while we have the child XIDs handy */
@@ -2183,6 +2197,21 @@ CommitTransaction(void)
*/
ProcArrayEndTransaction(MyProc, latestXid);
+ /*
+ * Stamp our transaction with XidCSN in CsnLog.
+ * Should be called after ProcArrayEndTransaction, but before releasing
+ * transaction locks.
+ */
+ if (!is_parallel_worker)
+ {
+ TransactionId xid = GetTopTransactionIdIfAny();
+ TransactionId *subxids;
+ int nsubxids;
+
+ nsubxids = xactGetCommittedChildren(&subxids);
+ CSNSnapshotCommit(MyProc, xid, nsubxids, subxids);
+ }
+
/*
* This is all post-commit cleanup. Note that if an error is raised here,
* it's too late to abort the transaction. This should be just
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 55cac186dc..5b41aa58c3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
#include "access/clog.h"
#include "access/commit_ts.h"
+#include "access/csn_log.h"
#include "access/heaptoast.h"
#include "access/multixact.h"
#include "access/rewriteheap.h"
@@ -5342,6 +5343,7 @@ BootStrapXLOG(void)
/* Bootstrap the commit log, too */
BootStrapCLOG();
+ BootStrapCSNLog();
BootStrapCommitTs();
BootStrapSUBTRANS();
BootStrapMultiXact();
@@ -7059,6 +7061,7 @@ StartupXLOG(void)
* maintained during recovery and need not be started yet.
*/
StartupCLOG();
+ StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
/*
@@ -7876,6 +7879,7 @@ StartupXLOG(void)
if (standbyState == STANDBY_DISABLED)
{
StartupCLOG();
+ StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
}
@@ -8523,6 +8527,7 @@ ShutdownXLOG(int code, Datum arg)
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
}
ShutdownCLOG();
+ ShutdownCSNLog();
ShutdownCommitTs();
ShutdownSUBTRANS();
ShutdownMultiXact();
@@ -9095,7 +9100,10 @@ CreateCheckPoint(int flags)
* StartupSUBTRANS hasn't been called yet.
*/
if (!RecoveryInProgress())
+ {
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ TruncateCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ }
/* Real work is done, but log and update stats before releasing lock. */
LogCheckpointEnd(false);
@@ -9171,6 +9179,7 @@ static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointCLOG();
+ CheckPointCSNLog();
CheckPointCommitTs();
CheckPointSUBTRANS();
CheckPointMultiXact();
@@ -9455,7 +9464,10 @@ CreateRestartPoint(int flags)
* this because StartupSUBTRANS hasn't been called yet.
*/
if (EnableHotStandby)
+ {
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ TruncateCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ }
/* Real work is done, but log and update before releasing lock. */
LogCheckpointEnd(true);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..7122babfd6 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,11 +16,13 @@
#include "access/clog.h"
#include "access/commit_ts.h"
+#include "access/csn_log.h"
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/subtrans.h"
#include "access/twophase.h"
+#include "access/csn_snapshot.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -125,6 +127,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, ProcGlobalShmemSize());
size = add_size(size, XLOGShmemSize());
size = add_size(size, CLOGShmemSize());
+ size = add_size(size, CSNLogShmemSize());
size = add_size(size, CommitTsShmemSize());
size = add_size(size, SUBTRANSShmemSize());
size = add_size(size, TwoPhaseShmemSize());
@@ -143,6 +146,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
+ size = add_size(size, CSNSnapshotShmemSize());
size = add_size(size, SnapMgrShmemSize());
size = add_size(size, BTreeShmemSize());
size = add_size(size, SyncScanShmemSize());
@@ -213,6 +217,7 @@ CreateSharedMemoryAndSemaphores(void)
*/
XLOGShmemInit();
CLOGShmemInit();
+ CSNLogShmemInit();
CommitTsShmemInit();
SUBTRANSShmemInit();
MultiXactShmemInit();
@@ -264,6 +269,7 @@ CreateSharedMemoryAndSemaphores(void)
SyncScanShmemInit();
AsyncShmemInit();
+ CSNSnapshotShmemInit();
#ifdef EXEC_BACKEND
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 3c2b369615..5f491cf6e9 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -46,6 +46,8 @@
#include <signal.h>
#include "access/clog.h"
+#include "access/csn_log.h"
+#include "access/csn_snapshot.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/twophase.h"
@@ -352,6 +354,14 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
latestXid))
ShmemVariableCache->latestCompletedXid = latestXid;
+
+ /*
+ * Assign xid csn while holding ProcArrayLock for
+ * COMMIT PREPARED. After the lock is released, a subsequent
+ * CSNSnapshotCommit() will write this value to CsnLog.
+ */
+ if (XidCSNIsInDoubt(pg_atomic_read_u64(&proc->assignedXidCsn)))
+ pg_atomic_write_u64(&proc->assignedXidCsn, GenerateCSN(false));
}
else
{
@@ -467,6 +477,16 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
latestXid))
ShmemVariableCache->latestCompletedXid = latestXid;
+
+ /*
+ * Assign xid csn while holding ProcArrayLock for
+ * COMMIT.
+ *
+ * TODO: in case of group commit we can generate one CSNSnapshot for the
+ * whole group to save time on timestamp acquisition.
+ */
+ if (XidCSNIsInDoubt(pg_atomic_read_u64(&proc->assignedXidCsn)))
+ pg_atomic_write_u64(&proc->assignedXidCsn, GenerateCSN(false));
}
/*
@@ -833,6 +853,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
while (TransactionIdPrecedes(latestObservedXid, running->nextXid))
{
ExtendSUBTRANS(latestObservedXid);
+ ExtendCSNLog(latestObservedXid);
TransactionIdAdvance(latestObservedXid);
}
TransactionIdRetreat(latestObservedXid); /* = running->nextXid - 1 */
@@ -1511,6 +1532,7 @@ GetSnapshotData(Snapshot snapshot)
int count = 0;
int subcount = 0;
bool suboverflowed = false;
+ XidCSN xid_csn = FrozenXidCSN;
TransactionId replication_slot_xmin = InvalidTransactionId;
TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
@@ -1708,6 +1730,13 @@ GetSnapshotData(Snapshot snapshot)
if (!TransactionIdIsValid(MyPgXact->xmin))
MyPgXact->xmin = TransactionXmin = xmin;
+ /*
+ * Take XidCSN under ProcArrayLock so the snapshot stays
+ * synchronized.
+ */
+ if (enable_csn_snapshot)
+ xid_csn = GenerateCSN(false);
+
LWLockRelease(ProcArrayLock);
/*
@@ -1778,6 +1807,8 @@ GetSnapshotData(Snapshot snapshot)
MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
}
+ snapshot->snapshot_csn = xid_csn;
+
return snapshot;
}
@@ -3335,6 +3366,7 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
while (TransactionIdPrecedes(next_expected_xid, xid))
{
TransactionIdAdvance(next_expected_xid);
+ ExtendCSNLog(next_expected_xid);
ExtendSUBTRANS(next_expected_xid);
}
Assert(next_expected_xid == xid);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..77b8426e71 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -134,6 +134,8 @@ static const char *const BuiltinTrancheNames[] = {
"CommitTSBuffer",
/* LWTRANCHE_SUBTRANS_BUFFER: */
"SubtransBuffer",
+ /* LWTRANCHE_CSN_LOG_BUFFERS: */
+ "CsnLogBuffer",
/* LWTRANCHE_MULTIXACTOFFSET_BUFFER: */
"MultiXactOffsetBuffer",
/* LWTRANCHE_MULTIXACTMEMBER_BUFFER: */
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6985e8eed..3c95ce4aac 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ MultiXactTruncationLock 41
OldSnapshotTimeMapLock 42
LogicalRepWorkerLock 43
XactTruncationLock 44
+CSNLogControlLock 45
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f5eef6fa4e..da2868dd6f 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -37,6 +37,7 @@
#include "access/transam.h"
#include "access/twophase.h"
+#include "access/csn_snapshot.h"
#include "access/xact.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -441,6 +442,8 @@ InitProcess(void)
MyProc->clogGroupMemberLsn = InvalidXLogRecPtr;
Assert(pg_atomic_read_u32(&MyProc->clogGroupNext) == INVALID_PGPROCNO);
+ pg_atomic_init_u64(&MyProc->assignedXidCsn, InProgressXidCSN);
+
/*
* Acquire ownership of the PGPROC's latch, so that we can use WaitLatch
* on it. That allows us to repoint the process latch, which so far
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 28b2fc72d6..6634804de6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/csn_snapshot.h"
#include "access/rmgr.h"
#include "access/tableam.h"
#include "access/transam.h"
@@ -1163,6 +1164,15 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+ {
+ {"enable_csn_snapshot", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Enable CSN-based snapshots."),
+ gettext_noop("Used to achieve REPEATABLE READ isolation level for postgres_fdw transactions.")
+ },
+ &enable_csn_snapshot,
+ true, /* XXX: set true to simplify testing. XXX2: Seems that RESOURCES_MEM isn't the best category */
+ NULL, NULL, NULL
+ },
{
{"ssl", PGC_SIGHUP, CONN_AUTH_SSL,
gettext_noop("Enables SSL connections."),
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index a0b0458108..679c531622 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -77,6 +77,8 @@ provider postgresql {
probe clog__checkpoint__done(bool);
probe subtrans__checkpoint__start(bool);
probe subtrans__checkpoint__done(bool);
+ probe csnlog__checkpoint__start(bool);
+ probe csnlog__checkpoint__done(bool);
probe multixact__checkpoint__start(bool);
probe multixact__checkpoint__done(bool);
probe twophase__checkpoint__start();
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..e2baeb9222 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -229,6 +229,7 @@ static TimestampTz AlignTimestampToMinuteBoundary(TimestampTz ts);
static Snapshot CopySnapshot(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
+static bool XidInLocalMVCCSnapshot(TransactionId xid, Snapshot snapshot);
/*
* Snapshot fields to be serialized.
@@ -247,6 +248,7 @@ typedef struct SerializedSnapshotData
CommandId curcid;
TimestampTz whenTaken;
XLogRecPtr lsn;
+ XidCSN xid_csn;
} SerializedSnapshotData;
Size
@@ -2115,6 +2117,7 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
serialized_snapshot.curcid = snapshot->curcid;
serialized_snapshot.whenTaken = snapshot->whenTaken;
serialized_snapshot.lsn = snapshot->lsn;
+ serialized_snapshot.xid_csn = snapshot->snapshot_csn;
/*
* Ignore the SubXID array if it has overflowed, unless the snapshot was
@@ -2189,6 +2192,7 @@ RestoreSnapshot(char *start_address)
snapshot->curcid = serialized_snapshot.curcid;
snapshot->whenTaken = serialized_snapshot.whenTaken;
snapshot->lsn = serialized_snapshot.lsn;
+ snapshot->snapshot_csn = serialized_snapshot.xid_csn;
/* Copy XIDs, if present. */
if (serialized_snapshot.xcnt > 0)
@@ -2229,6 +2233,47 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *master_pgproc)
/*
* XidInMVCCSnapshot
+ *
+ * Check whether this xid is in snapshot. When enable_csn_snapshot is
+ * switched off just call XidInLocalMVCCSnapshot().
+ */
+bool
+XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
+{
+ bool in_snapshot;
+
+ in_snapshot = XidInLocalMVCCSnapshot(xid, snapshot);
+
+ if (!enable_csn_snapshot)
+ {
+ Assert(XidCSNIsFrozen(snapshot->snapshot_csn));
+ return in_snapshot;
+ }
+
+ if (in_snapshot)
+ {
+ /*
+ * This xid may be already in unknown state and in that case
+ * we must wait and recheck.
+ */
+ return XidInvisibleInCSNSnapshot(xid, snapshot);
+ }
+ else
+ {
+#ifdef USE_ASSERT_CHECKING
+ /* Check that csn snapshot gives the same results as local one */
+ if (XidInvisibleInCSNSnapshot(xid, snapshot))
+ {
+ XidCSN gcsn = TransactionIdGetXidCSN(xid);
+ Assert(XidCSNIsAborted(gcsn));
+ }
+#endif
+ return false;
+ }
+}
+
+/*
+ * XidInLocalMVCCSnapshot
* Is the given XID still-in-progress according to the snapshot?
*
* Note: GetSnapshotData never stores either top xid or subxids of our own
@@ -2237,8 +2282,8 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *master_pgproc)
* TransactionIdIsCurrentTransactionId first, except when it's known the
* XID could not be ours anyway.
*/
-bool
-XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
+static bool
+XidInLocalMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
uint32 i;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 786672b1b6..a52c01889d 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -220,7 +220,8 @@ static const char *const subdirs[] = {
"pg_xact",
"pg_logical",
"pg_logical/snapshots",
- "pg_logical/mappings"
+ "pg_logical/mappings",
+ "pg_csn"
};
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
new file mode 100644
index 0000000000..9b9611127d
--- /dev/null
+++ b/src/include/access/csn_log.h
@@ -0,0 +1,30 @@
+/*
+ * csn_log.h
+ *
+ * Commit-Sequence-Number log.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_log.h
+ */
+#ifndef CSNLOG_H
+#define CSNLOG_H
+
+#include "access/xlog.h"
+#include "utils/snapshot.h"
+
+extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn);
+extern XidCSN CSNLogGetCSNByXid(TransactionId xid);
+
+extern Size CSNLogShmemSize(void);
+extern void CSNLogShmemInit(void);
+extern void BootStrapCSNLog(void);
+extern void StartupCSNLog(TransactionId oldestActiveXID);
+extern void ShutdownCSNLog(void);
+extern void CheckPointCSNLog(void);
+extern void ExtendCSNLog(TransactionId newestXact);
+extern void TruncateCSNLog(TransactionId oldestXact);
+
+#endif /* CSNLOG_H */
\ No newline at end of file
diff --git a/src/include/access/csn_snapshot.h b/src/include/access/csn_snapshot.h
new file mode 100644
index 0000000000..1894586204
--- /dev/null
+++ b/src/include/access/csn_snapshot.h
@@ -0,0 +1,58 @@
+/*-------------------------------------------------------------------------
+ *
+ * csn_snapshot.h
+ * Support for cross-node snapshot isolation.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_snapshot.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CSN_SNAPSHOT_H
+#define CSN_SNAPSHOT_H
+
+#include "port/atomics.h"
+#include "storage/lock.h"
+#include "utils/snapshot.h"
+#include "utils/guc.h"
+
+/*
+ * snapshot.h is used in frontend code so atomic variant of SnapshotCSN type
+ * is defined here.
+ */
+typedef pg_atomic_uint64 CSN_atomic;
+
+#define InProgressXidCSN UINT64CONST(0x0)
+#define AbortedXidCSN UINT64CONST(0x1)
+#define FrozenXidCSN UINT64CONST(0x2)
+#define InDoubtXidCSN UINT64CONST(0x3)
+#define FirstNormalXidCSN UINT64CONST(0x4)
+
+#define XidCSNIsInProgress(csn) ((csn) == InProgressXidCSN)
+#define XidCSNIsAborted(csn) ((csn) == AbortedXidCSN)
+#define XidCSNIsFrozen(csn) ((csn) == FrozenXidCSN)
+#define XidCSNIsInDoubt(csn) ((csn) == InDoubtXidCSN)
+#define XidCSNIsNormal(csn) ((csn) >= FirstNormalXidCSN)
+
+
+
+
+extern Size CSNSnapshotShmemSize(void);
+extern void CSNSnapshotShmemInit(void);
+
+extern SnapshotCSN GenerateCSN(bool locked);
+
+extern bool XidInvisibleInCSNSnapshot(TransactionId xid, Snapshot snapshot);
+
+extern XidCSN TransactionIdGetXidCSN(TransactionId xid);
+
+extern void CSNSnapshotAbort(PGPROC *proc, TransactionId xid, int nsubxids,
+ TransactionId *subxids);
+extern void CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid, int nsubxids,
+ TransactionId *subxids);
+extern void CSNSnapshotCommit(PGPROC *proc, TransactionId xid, int nsubxids,
+ TransactionId *subxids);
+
+#endif /* CSN_SNAPSHOT_H */
diff --git a/src/include/datatype/timestamp.h b/src/include/datatype/timestamp.h
index 6be6d35d1e..583b1beea5 100644
--- a/src/include/datatype/timestamp.h
+++ b/src/include/datatype/timestamp.h
@@ -93,6 +93,9 @@ typedef struct
#define USECS_PER_MINUTE INT64CONST(60000000)
#define USECS_PER_SEC INT64CONST(1000000)
+#define NSECS_PER_SEC INT64CONST(1000000000)
+#define NSECS_PER_USEC INT64CONST(1000)
+
/*
* We allow numeric timezone offsets up to 15:59:59 either way from Greenwich.
* Currently, the record holders for wackiest offsets in actual use are zones
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index d349510b7c..5cdf2e17cb 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -280,6 +280,7 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
#define PG_GETARG_FLOAT4(n) DatumGetFloat4(PG_GETARG_DATUM(n))
#define PG_GETARG_FLOAT8(n) DatumGetFloat8(PG_GETARG_DATUM(n))
#define PG_GETARG_INT64(n) DatumGetInt64(PG_GETARG_DATUM(n))
+#define PG_GETARG_UINT64(n) DatumGetUInt64(PG_GETARG_DATUM(n))
/* use this if you want the raw, possibly-toasted input datum: */
#define PG_GETARG_RAW_VARLENA_P(n) ((struct varlena *) PG_GETARG_POINTER(n))
/* use this if you want the input datum de-toasted: */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327cc..4ac23da654 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -141,6 +141,9 @@ typedef struct timespec instr_time;
#define INSTR_TIME_GET_MICROSEC(t) \
(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (((uint64) (t).tv_sec * (uint64) 1000000000) + (uint64) ((t).tv_nsec))
+
#else /* !HAVE_CLOCK_GETTIME */
/* Use gettimeofday() */
@@ -205,6 +208,10 @@ typedef struct timeval instr_time;
#define INSTR_TIME_GET_MICROSEC(t) \
(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) (t).tv_usec)
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (((uint64) (t).tv_sec * (uint64) 1000000000) + \
+ (uint64) (t).tv_usec * (uint64) 1000)
+
#endif /* HAVE_CLOCK_GETTIME */
#else /* WIN32 */
@@ -237,6 +244,9 @@ typedef LARGE_INTEGER instr_time;
#define INSTR_TIME_GET_MICROSEC(t) \
((uint64) (((double) (t).QuadPart * 1000000.0) / GetTimerFrequency()))
+#define INSTR_TIME_GET_NANOSEC(t) \
+ ((uint64) (((double) (t).QuadPart * 1000000000.0) / GetTimerFrequency()))
+
static inline double
GetTimerFrequency(void)
{
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c04ae97148..3b9d248913 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -197,6 +197,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_XACT_BUFFER = NUM_INDIVIDUAL_LWLOCKS,
LWTRANCHE_COMMITTS_BUFFER,
LWTRANCHE_SUBTRANS_BUFFER,
+ LWTRANCHE_CSN_LOG_BUFFERS,
LWTRANCHE_MULTIXACTOFFSET_BUFFER,
LWTRANCHE_MULTIXACTMEMBER_BUFFER,
LWTRANCHE_NOTIFY_BUFFER,
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 1ee9000b2b..8c8df6049e 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -15,8 +15,10 @@
#define _PROC_H_
#include "access/clog.h"
+#include "access/csn_snapshot.h"
#include "access/xlogdefs.h"
#include "lib/ilist.h"
+#include "utils/snapshot.h"
#include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
@@ -203,6 +205,16 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * assignedXidCsn holds the XidCSN for this transaction. It is generated
+ * under the ProcArray lock and is later written to the CSNLog. This
+ * variable is defined as atomic only for the group commit case; in all other
+ * scenarios only the backend responsible for this proc entry works with
+ * this variable.
+ */
+ CSN_atomic assignedXidCsn;
+
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 4796edb63a..9f622c76a7 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -121,6 +121,9 @@ typedef enum SnapshotType
typedef struct SnapshotData *Snapshot;
#define InvalidSnapshot ((Snapshot) NULL)
+typedef uint64 XidCSN;
+typedef uint64 SnapshotCSN;
+extern bool enable_csn_snapshot;
/*
* Struct representing all kind of possible snapshots.
@@ -201,6 +204,12 @@ typedef struct SnapshotData
TimestampTz whenTaken; /* timestamp when snapshot was taken */
XLogRecPtr lsn; /* position in the WAL stream when taken */
+
+ /*
+ * SnapshotCSN for snapshot isolation support.
+ * Will be used only if enable_csn_snapshot is enabled.
+ */
+ SnapshotCSN snapshot_csn;
} SnapshotData;
#endif /* SNAPSHOT_H */
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a126f0ad61..86a5df0cba 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -73,6 +73,7 @@ select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
enable_bitmapscan | on
+ enable_csn_snapshot | on
enable_gathermerge | on
enable_groupingsets_hash_disk | off
enable_hashagg | on
@@ -92,7 +93,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(20 rows)
+(21 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
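
To make the two central rules of 0001 easier to review, here is a minimal standalone sketch; it is not part of either patch. It assumes a single-threaded caller, so the spinlock around csnState->last_max_csn is left out, and it assumes any InDoubt value has already been resolved by waiting, as TransactionIdGetXidCSN() does. The names sketch_generate_csn, sketch_xid_invisible and the file-level last_max_csn are hypothetical; the special CSN values are copied from csn_snapshot.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Special values as defined in the patch's csn_snapshot.h. */
typedef uint64_t XidCSN;

#define InProgressXidCSN  UINT64_C(0x0)
#define AbortedXidCSN     UINT64_C(0x1)
#define FrozenXidCSN      UINT64_C(0x2)
#define InDoubtXidCSN     UINT64_C(0x3)
#define FirstNormalXidCSN UINT64_C(0x4)

/* Hypothetical stand-in for csnState->last_max_csn (no locking here). */
static XidCSN last_max_csn = 0;

/*
 * Turn a raw clock reading (nanoseconds) into a strictly increasing CSN.
 * If the clock did not advance past the last value handed out, bump the
 * last value by one instead of reusing the clock reading.
 */
XidCSN
sketch_generate_csn(uint64_t clock_nsec)
{
    XidCSN csn = clock_nsec;

    if (csn <= last_max_csn)
        csn = ++last_max_csn;
    else
        last_max_csn = csn;

    return csn;
}

/*
 * Visibility rule in the spirit of XidInvisibleInCSNSnapshot():
 * a committed xid is visible only if its CSN is below the snapshot's CSN;
 * frozen xids are always visible; aborted and in-progress xids never are.
 */
bool
sketch_xid_invisible(XidCSN xid_csn, XidCSN snapshot_csn)
{
    /* Assumption: the caller already waited out the InDoubt state. */
    assert(xid_csn != InDoubtXidCSN);

    if (xid_csn >= FirstNormalXidCSN)
        return xid_csn >= snapshot_csn;
    if (xid_csn == FrozenXidCSN)
        return false;
    return true;                /* aborted or still in progress */
}
```

Feeding the same clock reading twice yields two distinct, increasing CSNs, and a transaction whose CSN equals the snapshot's CSN is treated as invisible, which matches the "csn < snapshot->snapshot_csn" test in XidInvisibleInCSNSnapshot().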
0002-Wal-for-csn.patch
Author: movead
Date: Fri Jun 12 17:13:36 2020 +0800
src/backend/access/rmgrdesc/Makefile | 1 +
src/backend/access/rmgrdesc/csnlogdesc.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
src/backend/access/rmgrdesc/xlogdesc.c | 6 ++++--
src/backend/access/transam/csn_log.c | 187 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------
src/backend/access/transam/csn_snapshot.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
src/backend/access/transam/rmgr.c | 1 +
src/backend/access/transam/xlog.c | 13 +++++++++----
src/backend/commands/vacuum.c | 3 ++-
src/backend/storage/ipc/procarray.c | 2 +-
src/backend/utils/time/snapmgr.c | 2 +-
src/bin/pg_controldata/pg_controldata.c | 2 ++
src/bin/pg_upgrade/pg_upgrade.c | 5 +++++
src/bin/pg_upgrade/pg_upgrade.h | 2 ++
src/bin/pg_waldump/rmgrdesc.c | 1 +
src/include/access/csn_log.h | 29 +++++++++++++++++++++++++++--
src/include/access/rmgrlist.h | 1 +
src/include/access/xlog_internal.h | 1 +
src/include/catalog/pg_control.h | 1 +
18 files changed, 360 insertions(+), 64 deletions(-)
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index f88d72fd86..15fc36f7b4 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -11,6 +11,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
brindesc.o \
clogdesc.o \
+ csnlogdesc.o \
committsdesc.o \
dbasedesc.o \
genericdesc.o \
diff --git a/src/backend/access/rmgrdesc/csnlogdesc.c b/src/backend/access/rmgrdesc/csnlogdesc.c
new file mode 100644
index 0000000000..e96b056325
--- /dev/null
+++ b/src/backend/access/rmgrdesc/csnlogdesc.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * csnlogdesc.c
+ * rmgr descriptor routines for access/transam/csn_log.c
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/csnlogdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/csn_log.h"
+
+
+void
+csnlog_desc(StringInfo buf, XLogReaderState *record)
+{
+ char *rec = XLogRecGetData(record);
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CSN_ZEROPAGE)
+ {
+ int pageno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ appendStringInfo(buf, "pageno %d", pageno);
+ }
+ else if (info == XLOG_CSN_TRUNCATE)
+ {
+ int pageno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ appendStringInfo(buf, "pageno %d", pageno);
+ }
+ else if (info == XLOG_CSN_ASSIGNMENT)
+ {
+ XidCSN csn;
+
+ memcpy(&csn, XLogRecGetData(record), sizeof(XidCSN));
+ appendStringInfo(buf, "assign "INT64_FORMAT"", csn);
+ }
+ else if (info == XLOG_CSN_SETXIDCSN)
+ {
+ xl_xidcsn_set *xlrec = (xl_xidcsn_set *) rec;
+ int nsubxids;
+
+ appendStringInfo(buf, "set "INT64_FORMAT" for: %u",
+ xlrec->xidcsn,
+ xlrec->xtop);
+ nsubxids = ((XLogRecGetDataLen(record) - MinSizeOfXidCSNSet) /
+ sizeof(TransactionId));
+ if (nsubxids > 0)
+ {
+ int i;
+ TransactionId *subxids;
+
+ subxids = palloc(sizeof(TransactionId) * nsubxids);
+ memcpy(subxids,
+ XLogRecGetData(record) + MinSizeOfXidCSNSet,
+ sizeof(TransactionId) * nsubxids);
+ for (i = 0; i < nsubxids; i++)
+ appendStringInfo(buf, ", %u", subxids[i]);
+ pfree(subxids);
+ }
+ }
+}
+
+const char *
+csnlog_identify(uint8 info)
+{
+ const char *id = NULL;
+
+ switch (info & ~XLR_INFO_MASK)
+ {
+ case XLOG_CSN_ASSIGNMENT:
+ id = "ASSIGNMENT";
+ break;
+ case XLOG_CSN_SETXIDCSN:
+ id = "SETXIDCSN";
+ break;
+ case XLOG_CSN_ZEROPAGE:
+ id = "ZEROPAGE";
+ break;
+ case XLOG_CSN_TRUNCATE:
+ id = "TRUNCATE";
+ break;
+ }
+
+ return id;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 1cd97852e8..44e2e8ecec 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -114,7 +114,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
"max_wal_senders=%d max_prepared_xacts=%d "
"max_locks_per_xact=%d wal_level=%s "
- "wal_log_hints=%s track_commit_timestamp=%s",
+ "wal_log_hints=%s track_commit_timestamp=%s "
+ "enable_csn_snapshot=%s",
xlrec.MaxConnections,
xlrec.max_worker_processes,
xlrec.max_wal_senders,
@@ -122,7 +123,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
xlrec.max_locks_per_xact,
wal_level_str,
xlrec.wal_log_hints ? "on" : "off",
- xlrec.track_commit_timestamp ? "on" : "off");
+ xlrec.track_commit_timestamp ? "on" : "off",
+ xlrec.enable_csn_snapshot ? "on" : "off");
}
else if (info == XLOG_FPW_CHANGE)
{
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
index 4e0b8d64e4..4577e61fc3 100644
--- a/src/backend/access/transam/csn_log.c
+++ b/src/backend/access/transam/csn_log.c
@@ -9,6 +9,11 @@
* transactions. Because of same lifetime and persistancy requirements
* this module is quite similar to subtrans.c
*
+ * Switching the database from CSN-based snapshots to xid-based snapshots
+ * needs no special handling. Switching from xid-based to CSN-based snapshots
+ * requires choosing a new xid from which CSN-based checks begin. It cannot
+ * be oldestActiveXID because of prepared transactions.
+ *
* Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -52,7 +57,8 @@ bool enable_csn_snapshot;
static SlruCtlData CSNLogCtlData;
#define CsnlogCtl (&CSNLogCtlData)
-static int ZeroCSNLogPage(int pageno);
+static int ZeroCSNLogPage(int pageno, bool write_xlog);
+static void ZeroTruncateCSNLogPage(int pageno, bool write_xlog);
static bool CSNLogPagePrecedes(int page1, int page2);
static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
TransactionId *subxids,
@@ -60,6 +66,11 @@ static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
static void CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn,
int slotno);
+static void WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn);
+static void WriteZeroCSNPageXlogRec(int pageno);
+static void WriteTruncateCSNXlogRec(int pageno);
+
/*
* CSNLogSetCSN
*
@@ -77,7 +88,7 @@ static void CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn,
*/
void
CSNLogSetCSN(TransactionId xid, int nsubxids,
- TransactionId *subxids, XidCSN csn)
+ TransactionId *subxids, XidCSN csn, bool write_xlog)
{
int pageno;
int i = 0;
@@ -89,6 +100,10 @@ CSNLogSetCSN(TransactionId xid, int nsubxids,
Assert(TransactionIdIsValid(xid));
pageno = TransactionIdToPage(xid); /* get page of parent */
+
+ if(write_xlog)
+ WriteXidCsnXlogRec(xid, nsubxids, subxids, csn);
+
for (;;)
{
int num_on_page = 0;
@@ -180,11 +195,7 @@ CSNLogGetCSNByXid(TransactionId xid)
/* Callers of CSNLogGetCSNByXid() must check GUC params */
Assert(enable_csn_snapshot);
- /* Can't ask about stuff that might not be around anymore */
- Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
-
/* lock is acquired by SimpleLruReadPage_ReadOnly */
-
slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
xid_csn = *ptr;
@@ -245,7 +256,7 @@ BootStrapCSNLog(void)
LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
/* Create and zero the first page of the commit log */
- slotno = ZeroCSNLogPage(0);
+ slotno = ZeroCSNLogPage(0, false);
/* Make sure it's written out */
SimpleLruWritePage(CsnlogCtl, slotno);
@@ -263,50 +274,20 @@ BootStrapCSNLog(void)
* Control lock must be held at entry, and will be held at exit.
*/
static int
-ZeroCSNLogPage(int pageno)
+ZeroCSNLogPage(int pageno, bool write_xlog)
{
Assert(LWLockHeldByMe(CSNLogControlLock));
+ if(write_xlog)
+ WriteZeroCSNPageXlogRec(pageno);
return SimpleLruZeroPage(CsnlogCtl, pageno);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend startup,
- * after StartupXLOG has initialized ShmemVariableCache->nextXid.
- *
- * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
- * if there are none.
- */
-void
-StartupCSNLog(TransactionId oldestActiveXID)
+static void
+ZeroTruncateCSNLogPage(int pageno, bool write_xlog)
{
- int startPage;
- int endPage;
-
- if (!enable_csn_snapshot)
- return;
-
- /*
- * Since we don't expect pg_csn to be valid across crashes, we
- * initialize the currently-active page(s) to zeroes during startup.
- * Whenever we advance into a new page, ExtendCSNLog will likewise
- * zero the new page without regard to whatever was previously on disk.
- */
- LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
-
- startPage = TransactionIdToPage(oldestActiveXID);
- endPage = TransactionIdToPage(XidFromFullTransactionId(ShmemVariableCache->nextFullXid));
-
- while (startPage != endPage)
- {
- (void) ZeroCSNLogPage(startPage);
- startPage++;
- /* must account for wraparound */
- if (startPage > TransactionIdToPage(MaxTransactionId))
- startPage = 0;
- }
- (void) ZeroCSNLogPage(startPage);
-
- LWLockRelease(CSNLogControlLock);
+ if(write_xlog)
+ WriteTruncateCSNXlogRec(pageno);
+ SimpleLruTruncate(CsnlogCtl, pageno);
}
/*
@@ -379,7 +360,7 @@ ExtendCSNLog(TransactionId newestXact)
LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
- ZeroCSNLogPage(pageno);
+ ZeroCSNLogPage(pageno, !InRecovery);
LWLockRelease(CSNLogControlLock);
}
@@ -410,7 +391,7 @@ TruncateCSNLog(TransactionId oldestXact)
TransactionIdRetreat(oldestXact);
cutoffPage = TransactionIdToPage(oldestXact);
- SimpleLruTruncate(CsnlogCtl, cutoffPage);
+ ZeroTruncateCSNLogPage(cutoffPage, true);
}
/*
@@ -436,3 +417,115 @@ CSNLogPagePrecedes(int page1, int page2)
return TransactionIdPrecedes(xid1, xid2);
}
+
+void
+WriteAssignCSNXlogRec(XidCSN xidcsn)
+{
+ XidCSN log_csn = 0;
+
+ if(xidcsn > get_last_log_wal_csn())
+ {
+ log_csn = CSNAddByNanosec(xidcsn, 20);
+ set_last_log_wal_csn(log_csn);
+ }
+ else
+ {
+ return;
+ }
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&log_csn), sizeof(XidCSN));
+ XLogInsert(RM_CSNLOG_ID, XLOG_CSN_ASSIGNMENT);
+}
+
+static void
+WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn)
+{
+ xl_xidcsn_set xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.xtop = xid;
+ xlrec.nsubxacts = nsubxids;
+ xlrec.xidcsn = csn;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, MinSizeOfXidCSNSet);
+ XLogRegisterData((char *) subxids, nsubxids * sizeof(TransactionId));
+ recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_SETXIDCSN);
+ XLogFlush(recptr);
+}
+
+/*
+ * Write a ZEROPAGE xlog record
+ */
+static void
+WriteZeroCSNPageXlogRec(int pageno)
+{
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&pageno), sizeof(int));
+ (void) XLogInsert(RM_CSNLOG_ID, XLOG_CSN_ZEROPAGE);
+}
+
+/*
+ * Write a TRUNCATE xlog record
+ */
+static void
+WriteTruncateCSNXlogRec(int pageno)
+{
+ XLogRecPtr recptr;
+ return;
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&pageno), sizeof(int));
+ recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_TRUNCATE);
+ XLogFlush(recptr);
+}
+
+
+void
+csnlog_redo(XLogReaderState *record)
+{
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ /* Backup blocks are not used in csnlog records */
+ Assert(!XLogRecHasAnyBlockRefs(record));
+
+ if (info == XLOG_CSN_ASSIGNMENT)
+ {
+ XidCSN csn;
+
+ memcpy(&csn, XLogRecGetData(record), sizeof(XidCSN));
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ set_last_max_csn(csn);
+ LWLockRelease(CSNLogControlLock);
+
+ }
+ else if (info == XLOG_CSN_SETXIDCSN)
+ {
+ xl_xidcsn_set *xlrec = (xl_xidcsn_set *) XLogRecGetData(record);
+ CSNLogSetCSN(xlrec->xtop, xlrec->nsubxacts, xlrec->xsub, xlrec->xidcsn, false);
+ }
+ else if (info == XLOG_CSN_ZEROPAGE)
+ {
+ int pageno;
+ int slotno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ slotno = ZeroCSNLogPage(pageno, false);
+ SimpleLruWritePage(CsnlogCtl, slotno);
+ LWLockRelease(CSNLogControlLock);
+ Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+ }
+ else if (info == XLOG_CSN_TRUNCATE)
+ {
+ int pageno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ CsnlogCtl->shared->latest_page_number = pageno;
+ ZeroTruncateCSNLogPage(pageno, false);
+ }
+ else
+ elog(PANIC, "csnlog_redo: unknown op code %u", info);
+}
diff --git a/src/backend/access/transam/csn_snapshot.c b/src/backend/access/transam/csn_snapshot.c
index e2d4d2649e..a3d164d77e 100644
--- a/src/backend/access/transam/csn_snapshot.c
+++ b/src/backend/access/transam/csn_snapshot.c
@@ -31,6 +31,8 @@
/* Raise a warning if imported snapshot_csn exceeds ours by this value. */
#define SNAP_DESYNC_COMPLAIN (1*NSECS_PER_SEC) /* 1 second */
+TransactionId xmin_for_csn = InvalidTransactionId;
+
/*
* CSNSnapshotState
*
@@ -40,7 +42,9 @@
*/
typedef struct
{
- SnapshotCSN last_max_csn;
+ SnapshotCSN last_max_csn; /* Record the max csn till now */
+ XidCSN last_csn_log_wal; /* highest assigned CSN already logged to WAL */
+ TransactionId xmin_for_csn; /* start xid for CSN checks after switching from xid-snapshot to csn-snapshot */
volatile slock_t lock;
} CSNSnapshotState;
@@ -80,6 +84,7 @@ CSNSnapshotShmemInit()
if (!found)
{
csnState->last_max_csn = 0;
+ csnState->last_csn_log_wal = 0;
SpinLockInit(&csnState->lock);
}
}
@@ -116,6 +121,8 @@ GenerateCSN(bool locked)
else
csnState->last_max_csn = csn;
+ WriteAssignCSNXlogRec(csn);
+
if (!locked)
SpinLockRelease(&csnState->lock);
@@ -131,7 +138,7 @@ GenerateCSN(bool locked)
XidCSN
TransactionIdGetXidCSN(TransactionId xid)
{
- XidCSN xid_csn;
+ XidCSN xid_csn;
Assert(enable_csn_snapshot);
@@ -145,13 +152,35 @@ TransactionIdGetXidCSN(TransactionId xid)
Assert(false); /* Should not happend */
}
+ /*
+ * If we have just switched from xid-based to CSN-based snapshots, we need a
+ * start xid for CSN-based checks, in case a prepared transaction holds back
+ * TransactionXmin but has no CSN.
+ */
+ if(InvalidTransactionId == xmin_for_csn)
+ {
+ SpinLockAcquire(&csnState->lock);
+ if(InvalidTransactionId != csnState->xmin_for_csn)
+ xmin_for_csn = csnState->xmin_for_csn;
+ else
+ xmin_for_csn = FrozenTransactionId;
+
+ SpinLockRelease(&csnState->lock);
+ }
+
+ if ( FrozenTransactionId != xmin_for_csn ||
+ TransactionIdPrecedes(xmin_for_csn, TransactionXmin))
+ {
+ xmin_for_csn = TransactionXmin;
+ }
+
/*
* For xids which less then TransactionXmin CSNLog can be already
* trimmed but we know that such transaction is definetly not concurrently
* running according to any snapshot including timetravel ones. Callers
* should check TransactionDidCommit after.
*/
- if (TransactionIdPrecedes(xid, TransactionXmin))
+ if (TransactionIdPrecedes(xid, xmin_for_csn))
return FrozenXidCSN;
/* Read XidCSN from SLRU */
@@ -251,7 +280,7 @@ CSNSnapshotAbort(PGPROC *proc, TransactionId xid,
if (!enable_csn_snapshot)
return;
- CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN);
+ CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN, true);
/*
* Clean assignedXidCsn anyway, as it was possibly set in
@@ -292,7 +321,7 @@ CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid,
{
Assert(XidCSNIsInProgress(oldassignedXidCsn));
CSNLogSetCSN(xid, nsubxids,
- subxids, InDoubtXidCSN);
+ subxids, InDoubtXidCSN, true);
}
else
{
@@ -333,8 +362,39 @@ CSNSnapshotCommit(PGPROC *proc, TransactionId xid,
assigned_xid_csn = pg_atomic_read_u64(&proc->assignedXidCsn);
Assert(XidCSNIsNormal(assigned_xid_csn));
CSNLogSetCSN(xid, nsubxids,
- subxids, assigned_xid_csn);
+ subxids, assigned_xid_csn, true);
/* Reset for next transaction */
pg_atomic_write_u64(&proc->assignedXidCsn, InProgressXidCSN);
}
+
+void
+set_last_max_csn(XidCSN xidcsn)
+{
+ csnState->last_max_csn = xidcsn;
+}
+
+void
+set_last_log_wal_csn(XidCSN xidcsn)
+{
+ csnState->last_csn_log_wal = xidcsn;
+}
+
+XidCSN
+get_last_log_wal_csn(void)
+{
+ XidCSN last_csn_log_wal;
+
+ last_csn_log_wal = csnState->last_csn_log_wal;
+
+ return last_csn_log_wal;
+}
+
+/*
+ * Record 'xmin_for_csn' when switching from xid-snapshot to csn-snapshot
+ */
+void
+set_xmin_for_csn(void)
+{
+ csnState->xmin_for_csn = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 58091f6b52..b1e5ec350e 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -28,6 +28,7 @@
#include "replication/origin.h"
#include "storage/standby.h"
#include "utils/relmapper.h"
+#include "access/csn_log.h"
/* must be kept in sync with RmgrData definition in xlog_internal.h */
#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5b41aa58c3..a6e6c760b1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4604,6 +4604,7 @@ InitControlFile(uint64 sysidentifier)
ControlFile->wal_level = wal_level;
ControlFile->wal_log_hints = wal_log_hints;
ControlFile->track_commit_timestamp = track_commit_timestamp;
+ ControlFile->enable_csn_snapshot = enable_csn_snapshot;
ControlFile->data_checksum_version = bootstrap_data_checksum_version;
}
@@ -7061,7 +7062,6 @@ StartupXLOG(void)
* maintained during recovery and need not be started yet.
*/
StartupCLOG();
- StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
/*
@@ -7879,7 +7879,6 @@ StartupXLOG(void)
if (standbyState == STANDBY_DISABLED)
{
StartupCLOG();
- StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
}
@@ -9102,7 +9101,6 @@ CreateCheckPoint(int flags)
if (!RecoveryInProgress())
{
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
- TruncateCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
}
/* Real work is done, but log and update stats before releasing lock. */
@@ -9725,7 +9723,8 @@ XLogReportParameters(void)
max_wal_senders != ControlFile->max_wal_senders ||
max_prepared_xacts != ControlFile->max_prepared_xacts ||
max_locks_per_xact != ControlFile->max_locks_per_xact ||
- track_commit_timestamp != ControlFile->track_commit_timestamp)
+ track_commit_timestamp != ControlFile->track_commit_timestamp ||
+ enable_csn_snapshot != ControlFile->enable_csn_snapshot)
{
/*
* The change in number of backend slots doesn't need to be WAL-logged
@@ -9747,6 +9746,7 @@ XLogReportParameters(void)
xlrec.wal_level = wal_level;
xlrec.wal_log_hints = wal_log_hints;
xlrec.track_commit_timestamp = track_commit_timestamp;
+ xlrec.enable_csn_snapshot = enable_csn_snapshot;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -9757,6 +9757,9 @@ XLogReportParameters(void)
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ if (enable_csn_snapshot != ControlFile->enable_csn_snapshot)
+ set_xmin_for_csn();
+
ControlFile->MaxConnections = MaxConnections;
ControlFile->max_worker_processes = max_worker_processes;
ControlFile->max_wal_senders = max_wal_senders;
@@ -9765,6 +9768,7 @@ XLogReportParameters(void)
ControlFile->wal_level = wal_level;
ControlFile->wal_log_hints = wal_log_hints;
ControlFile->track_commit_timestamp = track_commit_timestamp;
+ ControlFile->enable_csn_snapshot = enable_csn_snapshot;
UpdateControlFile();
LWLockRelease(ControlFileLock);
@@ -10197,6 +10201,7 @@ xlog_redo(XLogReaderState *record)
CommitTsParameterChange(xlrec.track_commit_timestamp,
ControlFile->track_commit_timestamp);
ControlFile->track_commit_timestamp = xlrec.track_commit_timestamp;
+ ControlFile->enable_csn_snapshot = xlrec.enable_csn_snapshot;
UpdateControlFile();
LWLockRelease(ControlFileLock);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d32de23e62..e782c1ba96 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -53,7 +53,7 @@
#include "utils/memutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
-
+#include "access/csn_log.h"
/*
* GUC parameters
@@ -1632,6 +1632,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
TruncateCLOG(frozenXID, oldestxid_datoid);
TruncateCommitTs(frozenXID);
+ TruncateCSNLog(frozenXID);
TruncateMultiXact(minMulti, minmulti_datoid);
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 5f491cf6e9..f8db77ccd7 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1734,7 +1734,7 @@ GetSnapshotData(Snapshot snapshot)
* Take XidCSN under ProcArrayLock so the snapshot stays
* synchronized.
*/
- if (enable_csn_snapshot)
+ if (!snapshot->takenDuringRecovery && enable_csn_snapshot)
xid_csn = GenerateCSN(false);
LWLockRelease(ProcArrayLock);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index e2baeb9222..218f32e8ec 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -2265,7 +2265,7 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
if (XidInvisibleInCSNSnapshot(xid, snapshot))
{
XidCSN gcsn = TransactionIdGetXidCSN(xid);
- Assert(XidCSNIsAborted(gcsn));
+ Assert(XidCSNIsAborted(gcsn) || XidCSNIsInProgress(gcsn));
}
#endif
return false;
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df74..e7194124c7 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -306,6 +306,8 @@ main(int argc, char *argv[])
ControlFile->max_locks_per_xact);
printf(_("track_commit_timestamp setting: %s\n"),
ControlFile->track_commit_timestamp ? _("on") : _("off"));
+ printf(_("enable_csn_snapshot setting: %s\n"),
+ ControlFile->enable_csn_snapshot ? _("on") : _("off"));
printf(_("Maximum data alignment: %u\n"),
ControlFile->maxAlign);
/* we don't print floatFormat since can't say much useful about it */
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 70194eb096..863ee73d24 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -545,6 +545,11 @@ copy_xact_xlog_xid(void)
check_ok();
}
+ if(old_cluster.controldata.cat_ver > CSN_BASE_SNAPSHOT_ADD_VER)
+ {
+ copy_subdir_files("pg_csn", "pg_csn");
+ }
+
/* now reset the wal archives in the new cluster */
prep_status("Resetting WAL archives");
exec_prog(UTILITY_LOG_FILE, NULL, true, true,
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 8b90cefbe0..f35860dfc5 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -123,6 +123,8 @@ extern char *output_files[];
*/
#define JSONB_FORMAT_CHANGE_CAT_VER 201409291
+#define CSN_BASE_SNAPSHOT_ADD_VER 202002010
+
/*
* Each relation is represented by a relinfo structure.
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca4b1..282bae882a 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -31,6 +31,7 @@
#include "rmgrdesc.h"
#include "storage/standbydefs.h"
#include "utils/relmapper.h"
+#include "access/csn_log.h"
#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
{ name, desc, identify},
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
index 9b9611127d..b973e0c2ce 100644
--- a/src/include/access/csn_log.h
+++ b/src/include/access/csn_log.h
@@ -14,17 +14,42 @@
#include "access/xlog.h"
#include "utils/snapshot.h"
+/* XLOG stuff */
+#define XLOG_CSN_ASSIGNMENT 0x00
+#define XLOG_CSN_SETXIDCSN 0x10
+#define XLOG_CSN_ZEROPAGE 0x20
+#define XLOG_CSN_TRUNCATE 0x30
+
+typedef struct xl_xidcsn_set
+{
+ XidCSN xidcsn;
+ TransactionId xtop; /* XID's top-level XID */
+ int nsubxacts; /* number of subtransaction XIDs */
+ TransactionId xsub[FLEXIBLE_ARRAY_MEMBER]; /* assigned subxids */
+} xl_xidcsn_set;
+
+#define MinSizeOfXidCSNSet offsetof(xl_xidcsn_set, xsub)
+#define CSNAddByNanosec(csn,second) (csn + second * 1000000000L)
+
extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
- TransactionId *subxids, XidCSN csn);
+ TransactionId *subxids, XidCSN csn, bool write_xlog);
extern XidCSN CSNLogGetCSNByXid(TransactionId xid);
extern Size CSNLogShmemSize(void);
extern void CSNLogShmemInit(void);
extern void BootStrapCSNLog(void);
-extern void StartupCSNLog(TransactionId oldestActiveXID);
extern void ShutdownCSNLog(void);
extern void CheckPointCSNLog(void);
extern void ExtendCSNLog(TransactionId newestXact);
extern void TruncateCSNLog(TransactionId oldestXact);
+extern void csnlog_redo(XLogReaderState *record);
+extern void csnlog_desc(StringInfo buf, XLogReaderState *record);
+extern const char *csnlog_identify(uint8 info);
+extern void WriteAssignCSNXlogRec(XidCSN xidcsn);
+extern void set_last_max_csn(XidCSN xidcsn);
+extern void set_last_log_wal_csn(XidCSN xidcsn);
+extern XidCSN get_last_log_wal_csn(void);
+extern void set_xmin_for_csn(void);
+
#endif /* CSNLOG_H */
\ No newline at end of file
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 6c15df7e70..b2d12bfb27 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CSNLOG_ID, "CSN", csnlog_redo, csnlog_desc, csnlog_identify, NULL, NULL, NULL)
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index c8869d5226..729cf5bc56 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -236,6 +236,7 @@ typedef struct xl_parameter_change
int wal_level;
bool wal_log_hints;
bool track_commit_timestamp;
+ bool enable_csn_snapshot;
} xl_parameter_change;
/* logs restore point */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..9e5d4b0fc0 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -181,6 +181,7 @@ typedef struct ControlFileData
int max_prepared_xacts;
int max_locks_per_xact;
bool track_commit_timestamp;
+ bool enable_csn_snapshot;
/*
* This data is used to check for hardware-architecture compatibility of
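For reference, the SETXIDCSN record introduced by the patch above stores a fixed header followed by a variable-length array of subtransaction XIDs, and csnlog_desc() recovers the subxid count from the record length. The standalone sketch below mirrors that layout under simplified assumptions: demo_xidcsn_set, DEMO_MIN_SIZE, build_record and count_subxids are hypothetical names, and malloc() stands in for the WAL record buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;
typedef uint64_t XidCSN;

/* Hypothetical mirror of xl_xidcsn_set: fixed header plus flexible array. */
typedef struct demo_xidcsn_set
{
    XidCSN          xidcsn;
    TransactionId   xtop;       /* top-level XID */
    int             nsubxacts;  /* number of subtransaction XIDs */
    TransactionId   xsub[];     /* subtransaction XIDs follow the header */
} demo_xidcsn_set;

/* Size of the fixed part, analogous to MinSizeOfXidCSNSet. */
#define DEMO_MIN_SIZE offsetof(demo_xidcsn_set, xsub)

/* Build a record in one allocation; the caller frees it. Returns NULL on OOM. */
static demo_xidcsn_set *
build_record(XidCSN csn, TransactionId xtop,
             const TransactionId *subxids, int nsub, size_t *len)
{
    demo_xidcsn_set *rec;

    assert(nsub >= 0 && len != NULL);

    *len = DEMO_MIN_SIZE + (size_t) nsub * sizeof(TransactionId);
    rec = malloc(*len);
    if (rec == NULL)
        return NULL;

    rec->xidcsn = csn;
    rec->xtop = xtop;
    rec->nsubxacts = nsub;
    if (nsub > 0)
        memcpy(rec->xsub, subxids, (size_t) nsub * sizeof(TransactionId));
    return rec;
}

/* Recover the subxid count from the total record length alone. */
static int
count_subxids(size_t record_len)
{
    return (int) ((record_len - DEMO_MIN_SIZE) / sizeof(TransactionId));
}
```

Deriving the count from the length, as the desc routine does, keeps the description code independent of the nsubxacts field, while the redo routine in the patch reads nsubxacts directly.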
Attachment: 0003-snapshot-switch.patch (application/octet-stream)
Author: movead
Date: Fri Jun 12 17:20:21 2020 +0800
doc/src/sgml/config.sgml | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
src/backend/access/transam/csn_log.c | 159 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------------
src/backend/access/transam/csn_snapshot.c | 56 ++++++++++++++++++++++++++++++++------------------------
src/backend/access/transam/xlog.c | 10 ++++++++--
src/backend/storage/ipc/procarray.c | 2 +-
src/backend/utils/misc/guc.c | 2 +-
src/backend/utils/misc/postgresql.conf.sample | 2 ++
src/backend/utils/time/snapmgr.c | 3 ++-
src/include/access/csn_log.h | 9 ++++++++-
src/test/modules/Makefile | 1 +
src/test/modules/csnsnapshot/Makefile | 18 ++++++++++++++++++
src/test/modules/csnsnapshot/csn_snapshot.conf | 1 +
src/test/modules/csnsnapshot/expected/csnsnapshot.out | 1 +
src/test/modules/csnsnapshot/sql/csnsnapshot.sql | 1 +
src/test/modules/csnsnapshot/t/001_base.pl | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
src/test/modules/csnsnapshot/t/002_standby.pl | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
src/test/regress/expected/sysviews.out | 2 +-
17 files changed, 410 insertions(+), 75 deletions(-)
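The switch logic in this patch hinges on a shared flag that is flipped when the enable_csn_snapshot parameter change is applied, rather than on the GUC itself. The sketch below models only that transition rule as it reads in CSNlogParameterChange(): activate when the new value is on and the module is inactive, deactivate when the new value is off and the module is active. DemoCsnShared and the demo_* functions are hypothetical, and the counters stand in for the SLRU page creation and directory cleanup done by the real functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared-memory flag added by the patch. */
typedef struct DemoCsnShared
{
    bool    active;         /* mirrors csnSnapshotActive */
    int     activations;    /* counts simulated page creations */
    int     deactivations;  /* counts simulated pg_csn cleanups */
} DemoCsnShared;

static void
demo_activate(DemoCsnShared *s)
{
    if (s->active)
        return;             /* already active: nothing to do */
    s->activations++;       /* real code zeroes the current CSN log page */
    s->active = true;
}

static void
demo_deactivate(DemoCsnShared *s)
{
    s->active = false;
    s->deactivations++;     /* real code deletes all CSN log segments */
}

/* Apply a parameter change; repeated identical values are no-ops. */
static void
demo_parameter_change(DemoCsnShared *s, bool newvalue)
{
    assert(s != NULL);

    if (newvalue)
    {
        if (!s->active)
            demo_activate(s);
    }
    else if (s->active)
        demo_deactivate(s);
}
```

Keeping the transitions idempotent matters here because the same parameter-change record may be seen both at startup and during WAL replay on a standby.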
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2908821560..a50ad8bfbe 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9083,8 +9083,56 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</varlistentry>
</variablelist>
- </sect1>
+ <sect2 id="runtime-config-CSN-base-snapshot">
+ <title>CSN Based Snapshot</title>
+
+ <para>
+ By default, snapshots in <productname>PostgreSQL</productname> use the
+ XID (TransactionID) to identify the status of a transaction, the in-progress
+ transactions, and the future transactions for all visibility calculations.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</productname> also provides a CSN (commit sequence number)
+ based mechanism to identify past transactions and the ones that are yet to
+ be started or committed.
+ </para>
+
+ <variablelist>
+ <varlistentry id="guc-enable-csn-snapshot" xreflabel="enable_csn_snapshot">
+ <term><varname>enable_csn_snapshot</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_csn_snapshot</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+
+ <para>
+ Enables or disables CSN-based transaction visibility tracking for snapshots.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</productname> uses the clock timestamp as the CSN,
+ so enabling CSN-based snapshots can be useful for implementing global
+ snapshots and global transaction visibility.
+ </para>
+
+ <para>
+ When enabled, <productname>PostgreSQL</productname> creates the
+ <filename>pg_csn</filename> directory under <envar>PGDATA</envar> to keep
+ track of CSN and XID mappings.
+ </para>
+
+ <para>
+ The default value is off.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect2>
+ </sect1>
<sect1 id="runtime-config-compatible">
<title>Version and Platform Compatibility</title>
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
index 4577e61fc3..319e89c805 100644
--- a/src/backend/access/transam/csn_log.c
+++ b/src/backend/access/transam/csn_log.c
@@ -30,9 +30,28 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "utils/snapmgr.h"
+#include "storage/shmem.h"
bool enable_csn_snapshot;
+/*
+ * We use csnSnapshotActive, rather than enable_csn_snapshot, to judge whether
+ * CSN snapshots are enabled; this design is similar to 'track_commit_timestamp'.
+ *
+ * During replication, if the master changes 'enable_csn_snapshot' across a
+ * database restart, the standby applies the WAL record for the changed GUC,
+ * and it is difficult to notify all backends about that. Instead they read
+ * 'csnSnapshotActive', which lives in shared memory. Reading it does not
+ * acquire a lock, so there is no performance issue.
+ *
+ */
+typedef struct CSNshapshotShared
+{
+ bool csnSnapshotActive;
+} CSNshapshotShared;
+
+CSNshapshotShared *csnShared = NULL;
+
/*
* Defines for CSNLog page sizes. A page is the same BLCKSZ as is used
* everywhere else in Postgres.
@@ -94,9 +113,6 @@ CSNLogSetCSN(TransactionId xid, int nsubxids,
int i = 0;
int offset = 0;
- /* Callers of CSNLogSetCSN() must check GUC params */
- Assert(enable_csn_snapshot);
-
Assert(TransactionIdIsValid(xid));
pageno = TransactionIdToPage(xid); /* get page of parent */
@@ -167,7 +183,7 @@ static void
CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn, int slotno)
{
int entryno = TransactionIdToPgIndex(xid);
- XidCSN *ptr;
+ XidCSN *ptr;
Assert(LWLockHeldByMe(CSNLogControlLock));
@@ -192,9 +208,6 @@ CSNLogGetCSNByXid(TransactionId xid)
XidCSN *ptr;
XidCSN xid_csn;
- /* Callers of CSNLogGetCSNByXid() must check GUC params */
- Assert(enable_csn_snapshot);
-
/* lock is acquired by SimpleLruReadPage_ReadOnly */
slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
@@ -220,9 +233,6 @@ CSNLogShmemBuffers(void)
Size
CSNLogShmemSize(void)
{
- if (!enable_csn_snapshot)
- return 0;
-
return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
}
@@ -232,37 +242,25 @@ CSNLogShmemSize(void)
void
CSNLogShmemInit(void)
{
- if (!enable_csn_snapshot)
- return;
+ bool found;
+
CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
CSNLogControlLock, "pg_csn", LWTRANCHE_CSN_LOG_BUFFERS);
+
+ csnShared = ShmemInitStruct("CSNlog shared",
+ sizeof(CSNshapshotShared),
+ &found);
}
/*
- * This func must be called ONCE on system install. It creates the initial
- * CSNLog segment. The pg_csn directory is assumed to have been
- * created by initdb, and CSNLogShmemInit must have been called already.
+ * See ActivateCSNlog
*/
void
BootStrapCSNLog(void)
{
- int slotno;
-
- if (!enable_csn_snapshot)
- return;
-
- LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
-
- /* Create and zero the first page of the commit log */
- slotno = ZeroCSNLogPage(0, false);
-
- /* Make sure it's written out */
- SimpleLruWritePage(CsnlogCtl, slotno);
- Assert(!CsnlogCtl->shared->page_dirty[slotno]);
-
- LWLockRelease(CSNLogControlLock);
+ return;
}
/*
@@ -290,13 +288,94 @@ ZeroTruncateCSNLogPage(int pageno, bool write_xlog)
SimpleLruTruncate(CsnlogCtl, pageno);
}
+void
+ActivateCSNlog(void)
+{
+ int startPage;
+ TransactionId nextXid = InvalidTransactionId;
+
+ if (csnShared->csnSnapshotActive)
+ return;
+
+
+ nextXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+ startPage = TransactionIdToPage(nextXid);
+
+ /* Create the current segment file, if necessary */
+ if (!SimpleLruDoesPhysicalPageExist(CsnlogCtl, startPage))
+ {
+ int slotno;
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ slotno = ZeroCSNLogPage(startPage, false);
+ SimpleLruWritePage(CsnlogCtl, slotno);
+ LWLockRelease(CSNLogControlLock);
+ }
+ csnShared->csnSnapshotActive = true;
+}
+
+bool
+get_csnlog_status(void)
+{
+ if(!csnShared)
+ {
+ /* Should not be reached */
+ elog(ERROR, "csnShared pointer is not initialized");
+ }
+ return csnShared->csnSnapshotActive;
+}
+
+void
+DeactivateCSNlog(void)
+{
+ csnShared->csnSnapshotActive = false;
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ (void) SlruScanDirectory(CsnlogCtl, SlruScanDirCbDeleteAll, NULL);
+ LWLockRelease(CSNLogControlLock);
+}
+
+void
+StartupCSN(void)
+{
+ ActivateCSNlog();
+}
+
+void
+CompleteCSNInitialization(void)
+{
+ /*
+ * If the feature is not enabled, turn it off for good. This also removes
+ * any leftover data.
+ *
+ * Conversely, we activate the module if the feature is enabled. This is
+ * necessary for primary and standby as the activation depends on the
+ * control file contents at the beginning of recovery or when a
+ * XLOG_PARAMETER_CHANGE is replayed.
+ */
+ if (!get_csnlog_status())
+ DeactivateCSNlog();
+ else
+ ActivateCSNlog();
+}
+
+void
+CSNlogParameterChange(bool newvalue, bool oldvalue)
+{
+ if (newvalue)
+ {
+ if (!csnShared->csnSnapshotActive)
+ ActivateCSNlog();
+ }
+ else if (csnShared->csnSnapshotActive)
+ DeactivateCSNlog();
+}
+
/*
* This must be called ONCE during postmaster or standalone-backend shutdown
*/
void
ShutdownCSNLog(void)
{
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/*
@@ -316,7 +395,7 @@ ShutdownCSNLog(void)
void
CheckPointCSNLog(void)
{
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/*
@@ -344,7 +423,7 @@ ExtendCSNLog(TransactionId newestXact)
{
int pageno;
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/*
@@ -375,9 +454,9 @@ ExtendCSNLog(TransactionId newestXact)
void
TruncateCSNLog(TransactionId oldestXact)
{
- int cutoffPage;
+ int cutoffPage;
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/*
@@ -390,7 +469,6 @@ TruncateCSNLog(TransactionId oldestXact)
*/
TransactionIdRetreat(oldestXact);
cutoffPage = TransactionIdToPage(oldestXact);
-
ZeroTruncateCSNLogPage(cutoffPage, true);
}
@@ -443,7 +521,6 @@ WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
TransactionId *subxids, XidCSN csn)
{
xl_xidcsn_set xlrec;
- XLogRecPtr recptr;
xlrec.xtop = xid;
xlrec.nsubxacts = nsubxids;
@@ -452,8 +529,7 @@ WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, MinSizeOfXidCSNSet);
XLogRegisterData((char *) subxids, nsubxids * sizeof(TransactionId));
- recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_SETXIDCSN);
- XLogFlush(recptr);
+ XLogInsert(RM_CSNLOG_ID, XLOG_CSN_SETXIDCSN);
}
/*
@@ -473,12 +549,9 @@ WriteZeroCSNPageXlogRec(int pageno)
static void
WriteTruncateCSNXlogRec(int pageno)
{
- XLogRecPtr recptr;
- return;
XLogBeginInsert();
XLogRegisterData((char *) (&pageno), sizeof(int));
- recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_TRUNCATE);
- XLogFlush(recptr);
+ XLogInsert(RM_CSNLOG_ID, XLOG_CSN_TRUNCATE);
}
diff --git a/src/backend/access/transam/csn_snapshot.c b/src/backend/access/transam/csn_snapshot.c
index ec090fd499..d7d0b5e90f 100644
--- a/src/backend/access/transam/csn_snapshot.c
+++ b/src/backend/access/transam/csn_snapshot.c
@@ -62,10 +62,7 @@ CSNSnapshotShmemSize(void)
{
Size size = 0;
- if (enable_csn_snapshot)
- {
- size += MAXALIGN(sizeof(CSNSnapshotState));
- }
+ size += MAXALIGN(sizeof(CSNSnapshotState));
return size;
}
@@ -76,17 +73,14 @@ CSNSnapshotShmemInit()
{
bool found;
- if (enable_csn_snapshot)
+ csnState = ShmemInitStruct("csnState",
+ sizeof(CSNSnapshotState),
+ &found);
+ if (!found)
{
- csnState = ShmemInitStruct("csnState",
- sizeof(CSNSnapshotState),
- &found);
- if (!found)
- {
- csnState->last_max_csn = 0;
- csnState->last_csn_log_wal = 0;
- SpinLockInit(&csnState->lock);
- }
+ csnState->last_max_csn = 0;
+ csnState->last_csn_log_wal = 0;
+ SpinLockInit(&csnState->lock);
}
}
@@ -104,7 +98,7 @@ GenerateCSN(bool locked)
instr_time current_time;
SnapshotCSN csn;
- Assert(enable_csn_snapshot);
+ Assert(get_csnlog_status());
/*
* TODO: create some macro that add small random shift to current time.
@@ -140,7 +134,7 @@ TransactionIdGetXidCSN(TransactionId xid)
{
XidCSN xid_csn;
- Assert(enable_csn_snapshot);
+ Assert(get_csnlog_status());
/* Handle permanent TransactionId's for which we don't have mapping */
if (!TransactionIdIsNormal(xid))
@@ -222,7 +216,7 @@ XidInvisibleInCSNSnapshot(TransactionId xid, Snapshot snapshot)
{
XidCSN csn;
- Assert(enable_csn_snapshot);
+ Assert(get_csnlog_status());
csn = TransactionIdGetXidCSN(xid);
@@ -277,7 +271,7 @@ void
CSNSnapshotAbort(PGPROC *proc, TransactionId xid,
int nsubxids, TransactionId *subxids)
{
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN, true);
@@ -310,7 +304,7 @@ CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid,
XidCSN oldassignedXidCsn = InProgressXidCSN;
bool in_progress;
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/* Set InDoubt status if it is local transaction */
@@ -348,7 +342,7 @@ CSNSnapshotCommit(PGPROC *proc, TransactionId xid,
{
volatile XidCSN assigned_xid_csn;
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
if (!TransactionIdIsValid(xid))
@@ -391,10 +385,24 @@ get_last_log_wal_csn(void)
}
/*
- * 'xmin_for_csn' for when turn xid-snapshot to csn-snapshot
+ * Set up or tear down the CSN environment when enable_csn_snapshot changes.
*/
void
-set_xmin_for_csn(void)
+prepare_csn_env(bool enable)
{
- csnState->xmin_for_csn = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
-}
+ TransactionId nextxid = InvalidTransactionId;
+
+ if(enable)
+ {
+ nextxid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+ /* 'xmin_for_csn' for when turn xid-snapshot to csn-snapshot */
+ csnState->xmin_for_csn = nextxid;
+ /* produce the csnlog segment we want now and seek to current page */
+ ActivateCSNlog();
+ }
+ else
+ {
+ /* Try to drop all csnlog seg */
+ DeactivateCSNlog();
+ }
+}
\ No newline at end of file
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a6e6c760b1..e69085fdc0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -79,6 +79,7 @@
#include "utils/relmapper.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"
+#include "access/csn_log.h"
extern uint32 bootstrap_data_checksum_version;
@@ -6802,6 +6803,9 @@ StartupXLOG(void)
if (ControlFile->track_commit_timestamp)
StartupCommitTs();
+ if(ControlFile->enable_csn_snapshot)
+ StartupCSN();
+
/*
* Recover knowledge about replay progress of known replication partners.
*/
@@ -7918,6 +7922,7 @@ StartupXLOG(void)
* commit timestamp.
*/
CompleteCommitTsInitialization();
+ CompleteCSNInitialization();
/*
* All done with end-of-recovery actions.
@@ -9758,8 +9763,7 @@ XLogReportParameters(void)
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
if (enable_csn_snapshot != ControlFile->enable_csn_snapshot)
- set_xmin_for_csn();
-
+ prepare_csn_env(enable_csn_snapshot);
ControlFile->MaxConnections = MaxConnections;
ControlFile->max_worker_processes = max_worker_processes;
ControlFile->max_wal_senders = max_wal_senders;
@@ -10201,6 +10205,8 @@ xlog_redo(XLogReaderState *record)
CommitTsParameterChange(xlrec.track_commit_timestamp,
ControlFile->track_commit_timestamp);
ControlFile->track_commit_timestamp = xlrec.track_commit_timestamp;
+ CSNlogParameterChange(xlrec.enable_csn_snapshot,
+ ControlFile->enable_csn_snapshot);
ControlFile->enable_csn_snapshot = xlrec.enable_csn_snapshot;
UpdateControlFile();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index f8db77ccd7..e066801933 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1734,7 +1734,7 @@ GetSnapshotData(Snapshot snapshot)
* Take XidCSN under ProcArrayLock so the snapshot stays
* synchronized.
*/
- if (!snapshot->takenDuringRecovery && enable_csn_snapshot)
+ if (!snapshot->takenDuringRecovery && get_csnlog_status())
xid_csn = GenerateCSN(false);
LWLockRelease(ProcArrayLock);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6634804de6..e458ef6a09 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1170,7 +1170,7 @@ static struct config_bool ConfigureNamesBool[] =
gettext_noop("Used to achieve REPEATEBLE READ isolation level for postgres_fdw transactions.")
},
&enable_csn_snapshot,
- true, /* XXX: set true to simplify tesing. XXX2: Seems that RESOURCES_MEM isn't the best catagory */
+ false,
NULL, NULL, NULL
},
{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 3a25287a39..092a2743cd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -296,6 +296,8 @@
# (change requires restart)
#track_commit_timestamp = off # collect timestamp of transaction commit
# (change requires restart)
+#enable_csn_snapshot = off # enable CSN-based snapshots
+ # (change requires restart)
# - Master Server -
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 218f32e8ec..b50b8cbd30 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -52,6 +52,7 @@
#include "access/transam.h"
#include "access/xact.h"
#include "access/xlog.h"
+#include "access/csn_log.h"
#include "catalog/catalog.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
@@ -2244,7 +2245,7 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
in_snapshot = XidInLocalMVCCSnapshot(xid, snapshot);
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
{
Assert(XidCSNIsFrozen(snapshot->snapshot_csn));
return in_snapshot;
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
index b973e0c2ce..5838028a30 100644
--- a/src/include/access/csn_log.h
+++ b/src/include/access/csn_log.h
@@ -50,6 +50,13 @@ extern void WriteAssignCSNXlogRec(XidCSN xidcsn);
extern void set_last_max_csn(XidCSN xidcsn);
extern void set_last_log_wal_csn(XidCSN xidcsn);
extern XidCSN get_last_log_wal_csn(void);
-extern void set_xmin_for_csn(void);
+extern void prepare_csn_env(bool enable_csn_snapshot);
+extern void CatchCSNLog(void);
+extern void ActivateCSNlog(void);
+extern void DeactivateCSNlog(void);
+extern void StartupCSN(void);
+extern void CompleteCSNInitialization(void);
+extern void CSNlogParameterChange(bool newvalue, bool oldvalue);
+extern bool get_csnlog_status(void);
#endif /* CSNLOG_H */
\ No newline at end of file
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 29de73c060..86e114e934 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
SUBDIRS = \
brin \
commit_ts \
+ csnsnapshot \
dummy_index_am \
dummy_seclabel \
snapshot_too_old \
diff --git a/src/test/modules/csnsnapshot/Makefile b/src/test/modules/csnsnapshot/Makefile
new file mode 100644
index 0000000000..45c4221cd0
--- /dev/null
+++ b/src/test/modules/csnsnapshot/Makefile
@@ -0,0 +1,18 @@
+# src/test/modules/csnsnapshot/Makefile
+
+REGRESS = csnsnapshot
+REGRESS_OPTS = --temp-config=$(top_srcdir)/src/test/modules/csnsnapshot/csn_snapshot.conf
+NO_INSTALLCHECK = 1
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/csnsnapshot
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/csnsnapshot/csn_snapshot.conf b/src/test/modules/csnsnapshot/csn_snapshot.conf
new file mode 100644
index 0000000000..e9d3c35756
--- /dev/null
+++ b/src/test/modules/csnsnapshot/csn_snapshot.conf
@@ -0,0 +1 @@
+track_commit_timestamp = on
diff --git a/src/test/modules/csnsnapshot/expected/csnsnapshot.out b/src/test/modules/csnsnapshot/expected/csnsnapshot.out
new file mode 100644
index 0000000000..ac28e417b6
--- /dev/null
+++ b/src/test/modules/csnsnapshot/expected/csnsnapshot.out
@@ -0,0 +1 @@
+create table t1(i int, j int, k varchar);
diff --git a/src/test/modules/csnsnapshot/sql/csnsnapshot.sql b/src/test/modules/csnsnapshot/sql/csnsnapshot.sql
new file mode 100644
index 0000000000..91539b8c30
--- /dev/null
+++ b/src/test/modules/csnsnapshot/sql/csnsnapshot.sql
@@ -0,0 +1 @@
+create table t1(i int, j int, k varchar);
\ No newline at end of file
diff --git a/src/test/modules/csnsnapshot/t/001_base.pl b/src/test/modules/csnsnapshot/t/001_base.pl
new file mode 100644
index 0000000000..1c91f4d9f7
--- /dev/null
+++ b/src/test/modules/csnsnapshot/t/001_base.pl
@@ -0,0 +1,102 @@
+# Single-node test: value can be set, and is still present after recovery
+
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 5;
+use PostgresNode;
+
+my $node = get_new_node('csntest');
+$node->init;
+$node->append_conf('postgresql.conf', qq{
+ enable_csn_snapshot = on
+ csn_snapshot_defer_time = 10
+ max_prepared_transactions = 10
+ });
+$node->start;
+
+my $test_1 = 1;
+
+# Create a table
+$node->safe_psql('postgres', 'create table t1(i int, j int)');
+
+# insert test record
+$node->safe_psql('postgres', 'insert into t1 values(1,1)');
+# export csn snapshot
+my $test_snapshot = $node->safe_psql('postgres', 'select pg_csn_snapshot_export()');
+# insert test record
+$node->safe_psql('postgres', 'insert into t1 values(2,1)');
+
+my $count1 = $node->safe_psql('postgres', "select count(*) from t1");
+is($count1, '2', 'Get right number in normal query');
+my $count2 = $node->safe_psql('postgres', "
+ begin transaction isolation level repeatable read;
+ select pg_csn_snapshot_import($test_snapshot);
+ select count(*) from t1;
+ commit;"
+ );
+
+is($count2, '
+1', 'Get right number in csn import query');
+
+#prepare transaction test
+$node->safe_psql('postgres', "
+ begin;
+ insert into t1 values(3,1);
+ insert into t1 values(3,2);
+ prepare transaction 'pt3';
+ ");
+$node->safe_psql('postgres', "
+ begin;
+ insert into t1 values(4,1);
+ insert into t1 values(4,2);
+ prepare transaction 'pt4';
+ ");
+$node->safe_psql('postgres', "
+ begin;
+ insert into t1 values(5,1);
+ insert into t1 values(5,2);
+ prepare transaction 'pt5';
+ ");
+$node->safe_psql('postgres', "
+ begin;
+ insert into t1 values(6,1);
+ insert into t1 values(6,2);
+ prepare transaction 'pt6';
+ ");
+$node->safe_psql('postgres', "commit prepared 'pt4';");
+
+# restart with enable_csn_snapshot off
+$node->append_conf('postgresql.conf', "enable_csn_snapshot = off");
+$node->restart;
+$node->safe_psql('postgres', "
+ insert into t1 values(7,1);
+ insert into t1 values(7,2);
+ ");
+$node->safe_psql('postgres', "commit prepared 'pt3';");
+$count1 = $node->safe_psql('postgres', "select count(*) from t1");
+is($count1, '8', 'Get right number in normal query');
+
+
+# restart with enable_csn_snapshot on
+$node->append_conf('postgresql.conf', "enable_csn_snapshot = on");
+$node->restart;
+$node->safe_psql('postgres', "
+ insert into t1 values(8,1);
+ insert into t1 values(8,2);
+ ");
+$node->safe_psql('postgres', "commit prepared 'pt5';");
+$count1 = $node->safe_psql('postgres', "select count(*) from t1");
+is($count1, '12', 'Get right number in normal query');
+
+# restart once more with enable_csn_snapshot on
+$node->append_conf('postgresql.conf', "enable_csn_snapshot = on");
+$node->restart;
+$node->safe_psql('postgres', "
+ insert into t1 values(9,1);
+ insert into t1 values(9,2);
+ ");
+$node->safe_psql('postgres', "commit prepared 'pt6';");
+$count1 = $node->safe_psql('postgres', "select count(*) from t1");
+is($count1, '16', 'Get right number in normal query');
diff --git a/src/test/modules/csnsnapshot/t/002_standby.pl b/src/test/modules/csnsnapshot/t/002_standby.pl
new file mode 100644
index 0000000000..b7c4ea93b2
--- /dev/null
+++ b/src/test/modules/csnsnapshot/t/002_standby.pl
@@ -0,0 +1,66 @@
+# Test simple scenario involving a standby
+
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 6;
+use PostgresNode;
+
+my $bkplabel = 'backup';
+my $master = get_new_node('master');
+$master->init(allows_streaming => 1);
+
+$master->append_conf(
+ 'postgresql.conf', qq{
+ enable_csn_snapshot = on
+ max_wal_senders = 5
+ });
+$master->start;
+$master->backup($bkplabel);
+
+my $standby = get_new_node('standby');
+$standby->init_from_backup($master, $bkplabel, has_streaming => 1);
+$standby->start;
+
+$master->safe_psql('postgres', "create table t1(i int, j int)");
+
+my $guc_on_master = $master->safe_psql('postgres', 'show enable_csn_snapshot');
+is($guc_on_master, 'on', "GUC on master");
+
+my $guc_on_standby = $standby->safe_psql('postgres', 'show enable_csn_snapshot');
+is($guc_on_standby, 'on', "GUC on standby");
+
+$master->append_conf('postgresql.conf', 'enable_csn_snapshot = off');
+$master->restart;
+
+$guc_on_master = $master->safe_psql('postgres', 'show enable_csn_snapshot');
+is($guc_on_master, 'off', "GUC off master");
+
+$guc_on_standby = $standby->safe_psql('postgres', 'show enable_csn_snapshot');
+is($guc_on_standby, 'on', "GUC on standby");
+
+# Consume many transactions so that the CSN log advances past a page boundary
+for my $i (1 .. 4096) #4096
+{
+ $master->safe_psql('postgres', "insert into t1 values(1,$i)");
+}
+$master->safe_psql('postgres', "select pg_sleep(2)");
+$master->append_conf('postgresql.conf', 'enable_csn_snapshot = on');
+$master->restart;
+
+my $count_standby = $standby->safe_psql('postgres', 'select count(*) from t1');
+is($count_standby, '4096', "Ok for switch xid-base > csn-base"); #4096
+
+# Consume many transactions so that the CSN log advances past a page boundary
+for my $i (1 .. 4096) #4096
+{
+ $master->safe_psql('postgres', "insert into t1 values(1,$i)");
+}
+$master->safe_psql('postgres', "select pg_sleep(2)");
+
+$master->append_conf('postgresql.conf', 'enable_csn_snapshot = off');
+$master->restart;
+
+$count_standby = $standby->safe_psql('postgres', 'select count(*) from t1');
+is($count_standby, '8192', "Ok for switch csn-base > xid-base"); #8192
\ No newline at end of file
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 86a5df0cba..c9118db2b0 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -73,7 +73,7 @@ select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
enable_bitmapscan | on
- enable_csn_snapshot | on
+ enable_csn_snapshot | off
enable_gathermerge | on
enable_groupingsets_hash_disk | off
enable_hashagg | on
On Fri, Jun 12, 2020 at 3:11 PM movead.li@highgo.ca <movead.li@highgo.ca> wrote:
Hello hackers,
I have made some changes on top of the last version:
1. Caught up to the current commit (c2bd1fec32ab54).
2. Added regression tests and documentation.
3. Added support for switching from xid-based to CSN-based snapshots,
including on the standby side.
AFAIU, this patch is meant to improve scalability and will also be helpful
for the Global Snapshots work, is that right? If so, how much
performance/scalability benefit will this patch bring after Andres's
recent work on scalability [1]?
[1]: /messages/by-id/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2020/06/12 18:41, movead.li@highgo.ca wrote:
Hello hackers,
I have made some changes on top of the last version:
1. Caught up to the current commit (c2bd1fec32ab54).
2. Added regression tests and documentation.
3. Added support for switching from xid-based to CSN-based snapshots,
including on the standby side.
Andrey also seems to be proposing a similar patch [1] that introduces CSN
into core. Could you tell me what the difference is between his patch and yours?
If they are almost the same, shouldn't we focus on one together rather than
working separately?
Regards,
[1]: /messages/by-id/9964cf46-9294-34b9-4858-971e9029f5c7@postgrespro.ru
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2020/06/15 16:48, Fujii Masao wrote:
On 2020/06/12 18:41, movead.li@highgo.ca wrote:
Hello hackers,
I have made some changes on top of the last version:
1. Caught up to the current commit (c2bd1fec32ab54).
2. Added regression tests and documentation.
3. Added support for switching from xid-based to CSN-based snapshots,
including on the standby side.
It's probably not time for a code review yet, but when I glanced at the patch,
one question came up.
The 0002 patch changes GenerateCSN() so that it generates CSN-related WAL records
(and inserts them into WAL buffers). This means that a new WAL record is generated
whenever a CSN is assigned, e.g., in GetSnapshotData(). Is this WAL generation
really necessary for CSN?
BTW, GenerateCSN() is called while holding ProcArrayLock. It also inserts a new
WAL record in WriteXidCsnXlogRec() while holding a spinlock. Firstly, this is not
acceptable because spinlocks are intended for *very* short-term locks.
Secondly, I don't think that generating WAL while holding ProcArrayLock is a good
design, because ProcArrayLock is likely to be a bottleneck and should be held
only briefly for performance.
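To make the locking concern concrete, here is a minimal sketch. It is not code from the patch: the spinlock is a hypothetical one built on C11 atomic_flag, slow_log_write() is a stand-in for the WAL insert, and every name is invented for illustration. Only the counter update and the decision whether a record is needed happen under the lock; the slow write happens after the lock is released.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from the patch. */
static atomic_flag state_lock = ATOMIC_FLAG_INIT;
static bool     lock_held = false;       /* instrumentation for a single-threaded demo */
static uint64_t last_max_csn = 0;
static uint64_t logged_upto = 0;         /* highest CSN covered by the "log" */
static int      slow_writes = 0;
static bool     wrote_under_lock = false;

static void
spin_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&state_lock, memory_order_acquire))
        ;                                /* busy-wait, so hold times must stay tiny */
    lock_held = true;
}

static void
spin_release(void)
{
    lock_held = false;
    atomic_flag_clear_explicit(&state_lock, memory_order_release);
}

/* Stand-in for an expensive operation such as inserting a WAL record. */
static void
slow_log_write(uint64_t upto)
{
    assert(upto > 0);
    if (lock_held)
        wrote_under_lock = true;
    slow_writes++;
}

/*
 * Advance the counter under the lock, but only decide there whether a log
 * write is needed; perform the write after the lock has been released.
 */
static uint64_t
next_csn(uint64_t now, uint64_t log_gap)
{
    uint64_t csn;
    uint64_t need_log_upto = 0;

    spin_acquire();
    csn = (now > last_max_csn) ? now : last_max_csn + 1;
    last_max_csn = csn;
    if (csn > logged_upto)
    {
        logged_upto = csn + log_gap;
        need_log_upto = logged_upto;
    }
    spin_release();

    if (need_log_upto != 0)
        slow_log_write(need_log_upto);
    return csn;
}
```

Calling next_csn(100, 10) twice and then next_csn(200, 10) performs two log writes, neither of them while the lock is held. A real implementation would additionally have to make sure that no CSN above the previously logged bound is used before its record has been written; the sketch ignores that ordering problem.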
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Thanks for the reply.
It's probably not time for a code review yet, but when I glanced at the patch,
one question came up.
The 0002 patch changes GenerateCSN() so that it generates CSN-related WAL records
(and inserts them into WAL buffers). This means that a new WAL record is generated
whenever a CSN is assigned, e.g., in GetSnapshotData(). Is this WAL generation
really necessary for CSN?
This is designed for crash recovery: we record our newest CSN in WAL so that
an older CSN is not reused after a restart. It does not write to WAL every time, only
at intervals, as you can see in the CSNAddByNanosec() function.
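One plausible reading of the "gap" described above can be sketched as follows. This is a toy model, not the patch's code: CsnModel, model_generate_csn() and model_crash_and_recover() are hypothetical names, and logged_bound stands in for the value carried by the WAL record. A record is written only when a generated CSN exceeds the last logged bound, and the bound is pushed ahead by a fixed gap. After a crash the generator restarts from the logged bound, so it never reissues an older CSN even if the clock has moved backwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model; field and function names are not from the patch. */
typedef struct CsnModel
{
    uint64_t last_max_csn;   /* in memory only, lost on crash */
    uint64_t logged_bound;   /* survives a crash (stands in for the WAL record) */
    int      log_records;    /* number of "WAL records" written so far */
} CsnModel;

/* Hand out a CSN based on the clock, logging a new bound only when needed. */
static uint64_t
model_generate_csn(CsnModel *m, uint64_t clock_now, uint64_t gap)
{
    uint64_t csn;

    assert(m != NULL);
    csn = (clock_now > m->last_max_csn) ? clock_now : m->last_max_csn + 1;

    if (csn > m->logged_bound)
    {
        /* Log a bound ahead of the CSN so later calls need no record. */
        m->logged_bound = csn + gap;
        m->log_records++;
    }
    m->last_max_csn = csn;
    return csn;
}

/* After a crash only the logged bound is known; restart from there. */
static void
model_crash_and_recover(CsnModel *m)
{
    assert(m != NULL);
    m->last_max_csn = m->logged_bound;
}
```

With a gap of 5, ten consecutive CSNs starting at clock value 1000 need only two log records, and a CSN generated after model_crash_and_recover() is larger than every CSN issued before the crash.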
BTW, GenerateCSN() is called while holding ProcArrayLock. It also inserts a new
WAL record in WriteXidCsnXlogRec() while holding a spinlock. Firstly, this is not
acceptable because spinlocks are intended for *very* short-term locks.
Secondly, I don't think that generating WAL while holding ProcArrayLock is a good
design, because ProcArrayLock is likely to be a bottleneck and should be held
only briefly for performance.
Thanks for pointing this out; it helps me a lot. I will reconsider that.
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
Thanks for the reply.
AFAIU, this patch is meant to improve scalability and will also be helpful
for the Global Snapshots work, is that right? If so, how much
performance/scalability benefit will this patch bring after Andres's
recent work on scalability [1]?
The patch focuses on being infrastructure for a sharding feature. According
to my tests, performance is almost the same with and without the patch.
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
On 2020/06/19 12:12, movead.li@highgo.ca wrote:
Thanks for the reply.
It's probably not time for a code review yet, but when I glanced at the patch,
one question came up.
The 0002 patch changes GenerateCSN() so that it generates CSN-related WAL records
(and inserts them into WAL buffers). This means that a new WAL record is generated
whenever a CSN is assigned, e.g., in GetSnapshotData(). Is this WAL generation
really necessary for CSN?
This is designed for crash recovery: we record our newest CSN in WAL so that
an older CSN is not reused after a restart. It does not write to WAL every time, only
at intervals, as you can see in the CSNAddByNanosec() function.
You mean that the last generated CSN needs to be WAL-logged because any CSN smaller
than the last one must not be reused after crash recovery. Right?
If so, that WAL-logging does not seem necessary, because the CSN mechanism assumes
that CSN increases monotonically. IOW, even without that WAL-logging, a CSN after
crash recovery must be larger than any before it. No?
BTW, GenerateCSN() is called while holding ProcArrayLock. It also inserts a new
WAL record in WriteXidCsnXlogRec() while holding a spinlock. Firstly, this is not
acceptable because spinlocks are intended for *very* short-term locks.
Secondly, I don't think that generating WAL while holding ProcArrayLock is a good
design, because ProcArrayLock is likely to be a bottleneck and should be held
only briefly for performance.
Thanks for pointing this out; it helps me a lot. I will reconsider that.
Thanks for working on this!
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
You mean that the last generated CSN needs to be WAL-logged because any CSN smaller
than the last one must not be reused after crash recovery. Right?
Yes, that's it.
If so, that WAL-logging does not seem necessary, because the CSN mechanism assumes
that CSN increases monotonically. IOW, even without that WAL-logging, a CSN after
crash recovery must be larger than any before it. No?
In this patch the CSN is derived from the system time, but the system time is not
always reliable. The design also targets Global CSN (for sharding), where a node may
rely on a CSN from another node, generated on another machine.
So monotonicity cannot be taken for granted, and a node needs to keep its largest
CSN in WAL in case of a crash.
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
On 2020/06/19 13:36, movead.li@highgo.ca wrote:
You mean that the last generated CSN needs to be WAL-logged because any CSN smaller
than the last one must not be reused after crash recovery. Right?
Yes, that's it.
If so, that WAL-logging does not seem necessary, because the CSN mechanism assumes
that CSN increases monotonically. IOW, even without that WAL-logging, a CSN after
crash recovery must be larger than any before it. No?
In this patch the CSN is derived from the system time, but the system time is not
always reliable. The design also targets Global CSN (for sharding), where a node may
rely on a CSN from another node, generated on another machine.
So monotonicity cannot be taken for granted, and a node needs to keep its largest
CSN in WAL in case of a crash.
Thanks for the explanation! Understood.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 6/12/20 2:41 PM, movead.li@highgo.ca wrote:
Hello hackers,
I have made some changes on top of the last version:
1. Caught up to the current commit (c2bd1fec32ab54).
2. Added regression tests and documentation.
3. Added support for switching from xid-based to CSN-based snapshots,
including on the standby side.
Some remarks on your patch:
1. The variable last_max_csn can be an atomic variable.
2. GenerateCSN() routine: the case where csn < csnState->last_max_csn.
This happens when someone has changed the value of the system clock. I
think a WARNING should be written to the log file. (Maybe we can
synchronize with a time server.)
3. What about the global snapshot xmin? In the pgpro version of the patch we
had the GlobalSnapshotMapXmin() routine to maintain a circular buffer of
oldestXmins for several seconds in the past. This buffer allows oldestXmin
to be shifted into the past when a backend is importing a global transaction.
Otherwise, old versions of tuples that are needed for this transaction
can be recycled by other processes (vacuum, HOT, etc).
How do you implement protection from local pruning? I saw
SNAP_DESYNC_COMPLAIN, but it is not used anywhere.
4. The current version of the patch does not apply cleanly to current
master.
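Remarks 1 and 2 could be combined roughly as in the sketch below. It is an illustration only: it assumes C11 <stdatomic.h> rather than PostgreSQL's own atomics API, and generate_csn_atomic() and last_max_csn_atomic are hypothetical names. A compare-and-swap loop keeps the returned CSNs strictly increasing without a lock, and the caller is told when the clock reading fell behind the previous maximum so that it can decide whether to log a WARNING.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared state; in the patch this lives in shared memory. */
static _Atomic uint64_t last_max_csn_atomic = 0;

/*
 * Return a CSN strictly greater than every CSN returned before.
 * *clock_regressed is set to true when the clock reading was behind the
 * previous maximum, i.e. the case where a WARNING might be appropriate.
 */
static uint64_t
generate_csn_atomic(uint64_t clock_now, bool *clock_regressed)
{
    uint64_t prev;
    uint64_t csn;

    assert(clock_regressed != NULL);
    prev = atomic_load(&last_max_csn_atomic);
    do
    {
        csn = (clock_now > prev) ? clock_now : prev + 1;
        /* On failure prev is refreshed with the current value and we retry. */
    } while (!atomic_compare_exchange_weak(&last_max_csn_atomic, &prev, csn));

    /* After a successful exchange prev still holds the value we replaced. */
    *clock_regressed = (clock_now < prev);
    return csn;
}
```

A caller would typically rate-limit the resulting warning instead of emitting it on every call.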
--
regards,
Andrey Lepikhov
Postgres Professional
Thanks for the remarks,
Some remarks on your patch:
1. The variable last_max_csn can be an atomic variable.
Yes, I will consider that.
2. GenerateCSN() routine: the case where csn < csnState->last_max_csn.
This happens when someone has changed the value of the system clock. I
think a WARNING should be written to the log file. (Maybe we can
synchronize with a time server.)
Yes, good point. I will work out a way to report the warning; it should be
reported at intervals rather than every time a CSN is generated.
Do we really need a correct time? What is the downside if a node generates
CSNs that are simply monotonically increasing?
3. What about the global snapshot xmin? In the pgpro version of the patch we
had the GlobalSnapshotMapXmin() routine to maintain a circular buffer of
oldestXmins for several seconds in the past. This buffer allows oldestXmin
to be shifted into the past when a backend is importing a global transaction.
Otherwise, old versions of tuples that are needed for this transaction
can be recycled by other processes (vacuum, HOT, etc).
How do you implement protection from local pruning? I saw
SNAP_DESYNC_COMPLAIN, but it is not used anywhere.
I have studied your patch, which is great. In that patch only data
older than 'global_snapshot_defer_time' can be vacuumed, and it keeps dead
tuples even if no snapshot is imported at all, right?
I am thinking about whether we could start retaining dead tuples only just
before we import a CSN snapshot.
According to the Clock-SI paper, we should get the local CSN and then send it
to the shard nodes. Because we do not know whether a shard node's CSN is bigger
or smaller than that of the master node, we have to keep some dead tuples all
the time to support snapshot import at any time.
However, with a small change to the Clock-SI model, we would not use the
local CSN when the transaction starts. Instead, we contact every shard node to
request its CSN, the shard nodes start keeping dead tuples, and the master node
chooses the biggest CSN to send to the shard nodes.
With this approach, we do not need to keep dead tuples all the time and do
not need to manage a ring buffer; we can hand that job over to the 'snapshot
too old' feature. The trade-off is that almost all shard nodes need to wait.
I will send a more detailed explanation in a few days.
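The modified scheme proposed above can be sketched as follows. This is only an illustration of the idea as stated in this mail, with hypothetical function names and CSNs modelled as plain 64-bit integers; it says nothing about how the messages between nodes would be exchanged. The coordinator picks the largest CSN reported by the shards, and each shard can compute how far it is behind that value, which is the wait mentioned as the trade-off.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Coordinator side: pick the largest CSN reported by the shard nodes. */
static uint64_t
choose_snapshot_csn(const uint64_t *shard_csns, size_t nshards)
{
    uint64_t chosen = 0;

    assert(shard_csns != NULL && nshards > 0);
    for (size_t i = 0; i < nshards; i++)
    {
        if (shard_csns[i] > chosen)
            chosen = shard_csns[i];
    }
    return chosen;
}

/*
 * Shard side: how far the local CSN is behind the chosen snapshot CSN.
 * A shard with a non-zero result has to wait until its clock catches up.
 */
static uint64_t
shard_wait_needed(uint64_t local_csn, uint64_t chosen_csn)
{
    return (chosen_csn > local_csn) ? chosen_csn - local_csn : 0;
}
```

With shard CSNs 1000, 1040 and 1015, the chosen CSN is 1040; only the shard that reported 1040 does not wait, which matches the remark that almost all shard nodes need to wait.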
4. The current version of the patch does not apply cleanly to current
master.
Maybe it's because of the PG13 release, which caused some conflicts. I will
rebase it.
---
Regards,
Highgo Software (Canada/China/Pakistan)
URL : http://www.highgo.ca/
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
On 7/2/20 7:31 PM, Movead Li wrote:
Thanks for the remarks,
Some remarks on your patch:
1. The variable last_max_csn can be an atomic variable.
Yes, I will consider that.
2. GenerateCSN() routine: the case where csn < csnState->last_max_csn.
This happens when someone has changed the value of the system clock. I
think a WARNING should be written to the log file. (Maybe we can
synchronize with a time server.)
Yes, good point. I will work out a way to report the warning; it should be
reported at intervals rather than every time a CSN is generated.
Do we really need a correct time? What is the downside if a node generates
CSNs that are simply monotonically increasing?
Changes in time values can lead to bad effects, such as an old snapshot.
Adjusting the time can be a kind of defense.
3. What about global snapshot xmin? In the pgpro version of the patch we
had the GlobalSnapshotMapXmin() routine to maintain a circular buffer of
oldestXmins for several seconds in the past. This buffer allows shifting
oldestXmin into the past when a backend is importing a global transaction.
Otherwise old versions of tuples that were needed for this transaction
can be recycled by other processes (vacuum, HOT, etc).
How do you implement protection from local pruning? I saw
SNAP_DESYNC_COMPLAIN, but it is not used anywhere.
I have studied your patch, which is great. In the patch only data
older than 'global_snapshot_defer_time' can be vacuumed, and it keeps dead
tuples even if no snapshot is imported at all, right?
I am thinking about whether we can start retaining dead tuples just before
we import a csn snapshot.
Based on the Clock-SI paper, we should get the local CSN and then send it to the
shard nodes. Because we do not know whether the shard nodes' csn is bigger or
smaller than the master node's, we have to keep some dead tuples all the time
to support snapshot import at any time.
Then, if we make a small change to the Clock-SI model, we do not use the
local csn when the transaction starts; instead we contact every shard node to
request its csn, the shard nodes start keeping dead tuples, and the master node
chooses the biggest csn to send to the shard nodes.
With the new way, we do not need to keep dead tuples all the time and do
not need to manage a ring buffer; we can pass the ball to the 'snapshot too old'
feature. But as a trade-off, almost all shard nodes need to wait.
I will send a more detailed explanation in a few days.
I think, in the case of a distributed system with many servers, it can be a
bottleneck.
The main idea of "deferred time" is to reduce interference between DML
queries in the case of an intensive OLTP workload. This time can be reduced
if the bloating of a database prevails over the frequency of
transaction aborts.
4. The current version of the patch does not apply cleanly to current
master.
Maybe it's because of the PG13 release, which caused some conflicts; I will
rebase it.
Ok
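Remark 2 above can be illustrated with a minimal sketch, assuming a clock-derived CSN counted in nanoseconds. The struct, the generate_csn() name and the WARN_EVERY reporting gap are hypothetical and not taken from the patch; the sketch only shows a backwards clock falling back to a counter, with the warning rate-limited instead of raised for every CSN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t csn_t;

/* Assumed reporting gap: warn on the first adjusted CSN, then every 1000th. */
#define WARN_EVERY 1000

/* Hypothetical state; the patch keeps last_max_csn in shared memory. */
typedef struct
{
    csn_t    last_max_csn;  /* largest CSN handed out so far */
    uint64_t adjusted;      /* how many CSNs had to be bumped past the clock */
} csn_state;

/*
 * Return a CSN strictly greater than every CSN returned before.
 * now_ns is the wall-clock reading in nanoseconds.  *warn is set when the
 * clock did not advance and the reporting gap allows a warning.
 */
static csn_t
generate_csn(csn_state *st, csn_t now_ns, bool *warn)
{
    csn_t csn = now_ns;

    assert(st != NULL && warn != NULL);
    *warn = false;

    if (csn <= st->last_max_csn)
    {
        /* Clock stalled or went backwards: fall back to a plain counter. */
        csn = st->last_max_csn + 1;
        if (st->adjusted % WARN_EVERY == 0)
            *warn = true;
        st->adjusted++;
    }

    st->last_max_csn = csn;
    return csn;
}
```

A real implementation would take the lock (or use an atomic, as in remark 1) around the read-modify-write of last_max_csn; that is omitted here to keep the sketch short.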
---
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca <http://www.highgo.ca/>
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
--
regards,
Andrey Lepikhov
Postgres Professional
Hello Andrey
I have studied your patch, which is great. In the patch only data
older than 'global_snapshot_defer_time' can be vacuumed, and it keeps dead
tuples even if no snapshot is imported at all, right?
I am thinking about whether we can start retaining dead tuples just before
we import a csn snapshot.
Based on the Clock-SI paper, we should get the local CSN and then send it to the
shard nodes. Because we do not know whether the shard nodes' csn is bigger or
smaller than the master node's, we have to keep some dead tuples all the time
to support snapshot import at any time.
Then, if we make a small change to the Clock-SI model, we do not use the
local csn when the transaction starts; instead we contact every shard node to
request its csn, the shard nodes start keeping dead tuples, and the master node
chooses the biggest csn to send to the shard nodes.
With the new way, we do not need to keep dead tuples all the time and do
not need to manage a ring buffer; we can pass the ball to the 'snapshot too old'
feature. But as a trade-off, almost all shard nodes need to wait.
I will send a more detailed explanation in a few days.
I think, in the case of a distributed system with many servers, it can be a
bottleneck.
The main idea of "deferred time" is to reduce interference between DML
queries in the case of an intensive OLTP workload. This time can be reduced
if the bloating of a database prevails over the frequency of
transaction aborts.
OK, there may be a performance issue, and I have another question about Clock-SI.
For example, we have three nodes, shard1 (as master), shard2 and shard3, where
(time of node1) > (time of node2) > (time of node3), as you can see in this picture:
http://movead.gitee.io/picture/blog_img_bad/csn/clock_si_question.png
As far as I know about Clock-SI, the part to the left of the blue line will be set up as
the snapshot if the master requires a snapshot at time t1. But in fact data A should be
in the snapshot but is not, and data B should be outside the snapshot but is not.
Can this scenario appear with your original patch? Or is my understanding of
Clock-SI wrong?
On 7/4/20 7:56 PM, movead.li@highgo.ca wrote:
As far as I know about Clock-SI, the part to the left of the blue line will be
set up as the snapshot if the master requires a snapshot at time t1. But in fact
data A should be in the snapshot but is not, and data B should be outside the
snapshot but is not.
Can this scenario appear with your original patch? Or is my understanding of
Clock-SI wrong?
Sorry for the late answer.
I doubt that I fully understood your question, but still.
What real problems do you see here? Transaction t1 doesn't get the state of
shard2 until the time at the node with shard2 reaches the start time of t1.
If the transaction that inserted B wants to know its position in time
relative to t1, it will generate a CSN, attach to node1 and see
that t1 has not started yet.
Maybe you are talking about the case where someone who has a faster data
channel can use knowledge from node1 to change the state at node2?
If so, I think it is not a problem; otherwise, please explain your idea.
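The rule described above (a node does not serve a snapshot until its own clock reaches the snapshot's start time) can be sketched as follows. This is only an illustration under stated assumptions: CSNs are nanosecond timestamps, commit CSNs on a node come from that node's clock, and the function names are hypothetical rather than taken from either patch. The strict "<" in the visibility test is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t csn_t;

/*
 * Nanoseconds a node has to wait before it may serve reads for an imported
 * snapshot.  Zero if the local clock has already reached the snapshot time.
 */
static uint64_t
clock_si_wait_ns(csn_t snapshot_csn, csn_t local_clock_ns)
{
    return snapshot_csn > local_clock_ns ? snapshot_csn - local_clock_ns : 0;
}

/*
 * Visibility of a committed transaction for the imported snapshot.  The
 * caller must have waited first: once the local clock has passed
 * snapshot_csn, later local commits get larger CSNs, so the set of
 * visible commits can no longer change.
 */
static bool
clock_si_visible(csn_t commit_csn, csn_t snapshot_csn, csn_t local_clock_ns)
{
    assert(clock_si_wait_ns(snapshot_csn, local_clock_ns) == 0);
    return commit_csn < snapshot_csn;
}
```

With a snapshot taken at 1000 and a shard whose clock reads 900, the shard waits 100 ns before answering; a commit stamped 999 on that shard is then visible and a commit stamped 1000 or later is not, regardless of when it happened in "absolute" time.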
--
regards,
Andrey Lepikhov
Postgres Professional
I doubt that I fully understood your question, but still.
What real problems do you see here? Transaction t1 doesn't get the state of
shard2 until the time at the node with shard2 reaches the start time of t1.
If the transaction that inserted B wants to know its position in time
relative to t1, it will generate a CSN, attach to node1 and see
that t1 has not started yet.
Maybe you are talking about the case where someone who has a faster data
channel can use knowledge from node1 to change the state at node2?
If so, I think it is not a problem; otherwise, please explain your idea.
Sorry, I think this was my misunderstanding of Clock-SI. At first I expected
that we could get an absolute snapshot; for example, B should not be included in the
snapshot because it happened after time t1. However, Clock-SI cannot guarantee
that, and no design guarantees that at all.
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
On 2020/06/19 14:54, Fujii Masao wrote:
On 2020/06/19 13:36, movead.li@highgo.ca wrote:
>You mean that the last generated CSN needs to be WAL-logged because any smaller
CSN than the last one should not be reused after crash recovery. Right?
Yes, that's it.
If right, that WAL-logging seems unnecessary because the CSN mechanism assumes
CSN is increased monotonically. IOW, even without that WAL-logging, CSN after
crash recovery must be larger than that before. No?
The CSN is collected based on the system time in this patch, but time is not
reliable all the time. And it is designed for Global CSN (for sharding), where it
may rely on a CSN from another node, which is generated on another machine.
So monotonicity is not reliable, and it needs to keep its largest CSN in WAL in case
of a crash.
Thanks for the explanation! Understood.
I have another question about this patch;
When checking each tuple's visibility, we always have to get the CSN
corresponding to XMIN or XMAX from the CSN SLRU. In the past discussion,
there was a suggestion that the CSN should be stored in the tuple header
or somewhere (like a hint bit) to avoid the overhead of very frequent
lookups in the CSN SLRU. I'm not sure what the conclusion of that discussion was.
But this patch doesn't seem to adopt that idea. So did you confirm that
the performance overhead of the CSN SLRU lookups is negligible?
Of course I know that idea has a big issue, i.e., there is not enough space
to store the CSN in a tuple header if the CSN is 64 bits. If the CSN is 32 bits, we may
be able to replace XMIN or XMAX with the CSN corresponding to them. But
it means that we have to struggle with one more wraparound issue
(the CSN wraparound issue). So it's not easy to adopt that idea...
Sorry if this was already discussed and concluded...
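To make the 32-bit wraparound concern concrete, here is a small sketch of the modular ("circular") comparison that a 32-bit CSN would need, in the style PostgreSQL already uses for normal XIDs in TransactionIdPrecedes(). The csn32_* names are hypothetical. Note also that if a 32-bit CSN counted nanoseconds, as the timestamp-based CSN in this patch appears to, 2^32 ns is only about 4.29 seconds, so the ordering below would hold only for values less than about 2.15 seconds apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Circular comparison: a precedes b if b is ahead of a by less than 2^31.
 * The cast relies on two's-complement conversion, which is what all
 * mainstream compilers do.
 */
static bool
csn32_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/* Usage: the ordering survives one wrap of the 32-bit counter. */
static void
csn32_example(void)
{
    uint32_t before_wrap = UINT32_C(0xFFFFFFF0);
    uint32_t after_wrap = 5;    /* 21 steps later, after wrapping */

    assert(csn32_precedes(before_wrap, after_wrap));
    assert(!csn32_precedes(after_wrap, before_wrap));
}
```

The comparison is only meaningful while old values are removed (frozen) before they fall more than half the range behind, which is exactly the extra maintenance burden mentioned above.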
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
When checking each tuple's visibility, we always have to get the CSN
corresponding to XMIN or XMAX from the CSN SLRU. In the past discussion,
there was a suggestion that the CSN should be stored in the tuple header
or somewhere (like a hint bit) to avoid the overhead of very frequent
lookups in the CSN SLRU. I'm not sure what the conclusion of that discussion was.
But this patch doesn't seem to adopt that idea. So did you confirm that
the performance overhead of the CSN SLRU lookups is negligible?
This patch came from postgrespro's patch, which shows good performance.
I ran a simple test on the current patch and it showed no performance decline.
Also, not every tuple visibility check needs a lookup in the CSN SLRU; only xids
larger than 'TransactionXmin' need that. Maybe we have not yet hit the case
which causes bad performance, so it shows good performance for now.
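The fast path mentioned above can be sketched like this. It is a toy model, not the patch code: the array stands in for the CSN SLRU, TOY_FROZEN_CSN is a placeholder for the patch's FrozenXidCSN, and the plain "<" comparison ignores XID wraparound for brevity.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t xid_t;
typedef uint64_t csn_t;

#define TOY_FROZEN_CSN  UINT64_C(2)     /* placeholder special value */
#define TOY_LOG_SIZE    16

typedef struct
{
    xid_t    xmin;                      /* plays the role of TransactionXmin */
    csn_t    csn_log[TOY_LOG_SIZE];     /* stands in for the CSN SLRU */
    unsigned lookups;                   /* how often the log was consulted */
} toy_csn_log;

/*
 * Return the CSN for xid.  XIDs older than xmin cannot be concurrent with
 * any snapshot, so they skip the log entirely; the caller is expected to
 * check commit status separately for them.
 */
static csn_t
toy_xid_get_csn(toy_csn_log *log, xid_t xid)
{
    if (xid < log->xmin)                /* simplification: no wraparound */
        return TOY_FROZEN_CSN;

    assert(xid < TOY_LOG_SIZE);
    log->lookups++;
    return log->csn_log[xid];
}
```

With xmin = 5, asking about xid 3 returns the frozen marker without touching the log, while xid 7 costs one lookup. The SLRU cost therefore grows with the number of tuples written by transactions at or above TransactionXmin, which is the case that still needs testing.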
Of course I know that idea has a big issue, i.e., there is not enough space
to store the CSN in a tuple header if the CSN is 64 bits. If the CSN is 32 bits, we may
be able to replace XMIN or XMAX with the CSN corresponding to them. But
it means that we have to struggle with one more wraparound issue
(the CSN wraparound issue). So it's not easy to adopt that idea...
Sorry if this was already discussed and concluded...
I think your point about CSN in the tuple header is an exciting approach, but I have
not seen the discussion; can you point me to it?
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
On 2020/07/14 11:02, movead.li@highgo.ca wrote:
When checking each tuple's visibility, we always have to get the CSN
corresponding to XMIN or XMAX from the CSN SLRU. In the past discussion,
there was a suggestion that the CSN should be stored in the tuple header
or somewhere (like a hint bit) to avoid the overhead of very frequent
lookups in the CSN SLRU. I'm not sure what the conclusion of that discussion was.
But this patch doesn't seem to adopt that idea. So did you confirm that
the performance overhead of the CSN SLRU lookups is negligible?
This patch came from postgrespro's patch, which shows good performance.
I ran a simple test on the current patch and it showed no performance decline.
This is good news! When I read the past discussions about CSN, my impression
was that the performance overhead of CSN SLRU lookups might become one of
the show-stoppers for CSN. So I was worrying about this issue...
Also, not every tuple visibility check needs a lookup in the CSN SLRU; only xids
larger than 'TransactionXmin' need that. Maybe we have not yet hit the case
which causes bad performance, so it shows good performance for now.
Yes, we would need more tests in several cases.
Of course I know that idea has a big issue, i.e., there is not enough space
to store the CSN in a tuple header if the CSN is 64 bits. If the CSN is 32 bits, we may
be able to replace XMIN or XMAX with the CSN corresponding to them. But
it means that we have to struggle with one more wraparound issue
(the CSN wraparound issue). So it's not easy to adopt that idea...
Sorry if this was already discussed and concluded...
I think your point about CSN in the tuple header is an exciting approach, but I have
not seen the discussion; can you point me to it?
Probably you can find the discussion by searching with the keywords
"CSN" and "hint bit". For example,
/messages/by-id/CAPpHfdv7BMwGv=OfUg3S-jGVFKqHi79pR_ZK1Wsk-13oZ+cy5g@mail.gmail.com
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 7/13/20 11:46 AM, movead.li@highgo.ca wrote:
I am continuing to review your patch. See the attachment for some code improvements.
Questions:
* csnSnapshotActive is the only member of the CSNshapshotShared struct.
* The WriteAssignCSNXlogRec() routine. I didn't understand why you add 20
nanosec to the current CSN and write this into the WAL. To simplify our
communication, I rewrote this routine in accordance with my opinion (see
patch in attachment).
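One possible reading of the increment, offered here only as an assumption: the WAL record stores an upper bound ahead of the current CSN, so that later CSNs below that bound need no new record, and after a crash CSN generation can restart from the logged bound. The 0002-Wal-for-csn.patch later in this thread logs the CSN advanced by CSN_ASSIGN_TIME_INTERVAL, with a comment saying 5s. The toy model below uses hypothetical names and a counter in place of XLogInsert().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t csn_t;

/* Assumed look-ahead of 5 seconds, expressed in nanoseconds. */
#define TOY_LOG_AHEAD_NS UINT64_C(5000000000)

typedef struct
{
    csn_t    last_logged_csn;   /* upper bound already recorded in WAL */
    unsigned wal_records;       /* stands in for calls to XLogInsert() */
} toy_wal_state;

/*
 * Record a new upper bound only when csn exceeds the one already logged.
 * All CSNs up to the new bound are then covered by a single WAL record.
 */
static void
toy_log_csn_ahead(toy_wal_state *st, csn_t csn)
{
    assert(st != NULL);

    if (csn <= st->last_logged_csn)
        return;                 /* already covered by an earlier record */

    st->last_logged_csn = csn + TOY_LOG_AHEAD_NS;
    st->wal_records++;
}
```

With a look-ahead of zero, as in the rewritten routine, every CSN larger than the last logged one produces a WAL record; with a larger look-ahead, at most one record is written per interval at the cost of a jump in CSN values after crash recovery.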
In general, maybe we could add your CSN WAL-writing machinery + TAP tests
to the patch from the thread [1] and work on it together?
[1]: /messages/by-id/07b2c899-4ed0-4c87-1327-23c750311248@postgrespro.ru
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
0001-improvements.patch (text/x-patch; charset=UTF-8)
From 9a1595507c83b5fde61a6a3cc30f6df9df410e76 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Date: Wed, 15 Jul 2020 11:55:00 +0500
Subject: [PATCH] 1
---
src/backend/access/transam/csn_log.c | 35 ++++++++++++--------------
src/include/access/csn_log.h | 8 +++---
src/test/regress/expected/sysviews.out | 3 ++-
3 files changed, 22 insertions(+), 24 deletions(-)
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
index 319e89c805..53d3877851 100644
--- a/src/backend/access/transam/csn_log.c
+++ b/src/backend/access/transam/csn_log.c
@@ -150,8 +150,8 @@ CSNLogSetCSN(TransactionId xid, int nsubxids,
*/
static void
CSNLogSetPageStatus(TransactionId xid, int nsubxids,
- TransactionId *subxids,
- XidCSN csn, int pageno)
+ TransactionId *subxids,
+ XidCSN csn, int pageno)
{
int slotno;
int i;
@@ -187,8 +187,8 @@ CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn, int slotno)
Assert(LWLockHeldByMe(CSNLogControlLock));
- ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
-
+ ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] +
+ entryno * sizeof(XidCSN));
*ptr = csn;
}
@@ -205,17 +205,16 @@ CSNLogGetCSNByXid(TransactionId xid)
int pageno = TransactionIdToPage(xid);
int entryno = TransactionIdToPgIndex(xid);
int slotno;
- XidCSN *ptr;
- XidCSN xid_csn;
+ XidCSN csn;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
- ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
- xid_csn = *ptr;
+ csn = *(XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] +
+ entryno * sizeof(XidCSN));
LWLockRelease(CSNLogControlLock);
- return xid_csn;
+ return csn;
}
/*
@@ -501,15 +500,15 @@ WriteAssignCSNXlogRec(XidCSN xidcsn)
{
XidCSN log_csn = 0;
- if(xidcsn > get_last_log_wal_csn())
- {
- log_csn = CSNAddByNanosec(xidcsn, 20);
- set_last_log_wal_csn(log_csn);
- }
- else
- {
+ if(xidcsn <= get_last_log_wal_csn())
+ /*
+ * WAL-write related code. If concurrent backend already wrote into WAL
+ * its CSN with bigger value it isn't needed to write this value.
+ */
return;
- }
+
+ log_csn = CSNAddByNanosec(xidcsn, 0);
+ set_last_log_wal_csn(log_csn);
XLogBeginInsert();
XLogRegisterData((char *) (&log_csn), sizeof(XidCSN));
@@ -571,7 +570,6 @@ csnlog_redo(XLogReaderState *record)
LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
set_last_max_csn(csn);
LWLockRelease(CSNLogControlLock);
-
}
else if (info == XLOG_CSN_SETXIDCSN)
{
@@ -589,7 +587,6 @@ csnlog_redo(XLogReaderState *record)
SimpleLruWritePage(CsnlogCtl, slotno);
LWLockRelease(CSNLogControlLock);
Assert(!CsnlogCtl->shared->page_dirty[slotno]);
-
}
else if (info == XLOG_CSN_TRUNCATE)
{
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
index 5838028a30..c23a71446a 100644
--- a/src/include/access/csn_log.h
+++ b/src/include/access/csn_log.h
@@ -15,10 +15,10 @@
#include "utils/snapshot.h"
/* XLOG stuff */
-#define XLOG_CSN_ASSIGNMENT 0x00
-#define XLOG_CSN_SETXIDCSN 0x10
-#define XLOG_CSN_ZEROPAGE 0x20
-#define XLOG_CSN_TRUNCATE 0x30
+#define XLOG_CSN_ASSIGNMENT 0x00
+#define XLOG_CSN_SETXIDCSN 0x10
+#define XLOG_CSN_ZEROPAGE 0x20
+#define XLOG_CSN_TRUNCATE 0x30
typedef struct xl_xidcsn_set
{
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 06c4c3e476..cc169a1999 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -73,6 +73,7 @@ select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
enable_bitmapscan | on
+ enable_csn_snapshot | off
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
@@ -90,7 +91,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(18 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--
2.25.1
Currently, we are developing and testing global snapshots on the branch [1] created by
Andrey. I want to keep the latest patch set on this thread so that hackers can easily
follow every change in this area.
This version makes a small change raised by Fujii Masao, namely that WriteXidCsnXlogRec()
should be called outside of spinlocks, adds comments for CSNAddByNanosec(), and
includes other fine-tuning.
[1]: https://github.com/danolivo/pgClockSI
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
Attachments:
0002-Wal-for-csn.patch (application/octet-stream)
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index f88d72fd86..15fc36f7b4 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -11,6 +11,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
brindesc.o \
clogdesc.o \
+ csnlogdesc.o \
committsdesc.o \
dbasedesc.o \
genericdesc.o \
diff --git a/src/backend/access/rmgrdesc/csnlogdesc.c b/src/backend/access/rmgrdesc/csnlogdesc.c
new file mode 100644
index 0000000000..e96b056325
--- /dev/null
+++ b/src/backend/access/rmgrdesc/csnlogdesc.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * clogdesc.c
+ * rmgr descriptor routines for access/transam/csn_log.c
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/csnlogdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/csn_log.h"
+
+
+void
+csnlog_desc(StringInfo buf, XLogReaderState *record)
+{
+ char *rec = XLogRecGetData(record);
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CSN_ZEROPAGE)
+ {
+ int pageno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ appendStringInfo(buf, "pageno %d", pageno);
+ }
+ else if (info == XLOG_CSN_TRUNCATE)
+ {
+ int pageno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ appendStringInfo(buf, "pageno %d", pageno);
+ }
+ else if (info == XLOG_CSN_ASSIGNMENT)
+ {
+ XidCSN csn;
+
+ memcpy(&csn, XLogRecGetData(record), sizeof(XidCSN));
+ appendStringInfo(buf, "assign "INT64_FORMAT"", csn);
+ }
+ else if (info == XLOG_CSN_SETXIDCSN)
+ {
+ xl_xidcsn_set *xlrec = (xl_xidcsn_set *) rec;
+ int nsubxids;
+
+ appendStringInfo(buf, "set "INT64_FORMAT" for: %u",
+ xlrec->xidcsn,
+ xlrec->xtop);
+ nsubxids = ((XLogRecGetDataLen(record) - MinSizeOfXidCSNSet) /
+ sizeof(TransactionId));
+ if (nsubxids > 0)
+ {
+ int i;
+ TransactionId *subxids;
+
+ subxids = palloc(sizeof(TransactionId) * nsubxids);
+ memcpy(subxids,
+ XLogRecGetData(record) + MinSizeOfXidCSNSet,
+ sizeof(TransactionId) * nsubxids);
+ for (i = 0; i < nsubxids; i++)
+ appendStringInfo(buf, ", %u", subxids[i]);
+ pfree(subxids);
+ }
+ }
+}
+
+const char *
+csnlog_identify(uint8 info)
+{
+ const char *id = NULL;
+
+ switch (info & ~XLR_INFO_MASK)
+ {
+ case XLOG_CSN_ASSIGNMENT:
+ id = "ASSIGNMENT";
+ break;
+ case XLOG_CSN_SETXIDCSN:
+ id = "SETXIDCSN";
+ break;
+ case XLOG_CSN_ZEROPAGE:
+ id = "ZEROPAGE";
+ break;
+ case XLOG_CSN_TRUNCATE:
+ id = "TRUNCATE";
+ break;
+ }
+
+ return id;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 1cd97852e8..44e2e8ecec 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -114,7 +114,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
"max_wal_senders=%d max_prepared_xacts=%d "
"max_locks_per_xact=%d wal_level=%s "
- "wal_log_hints=%s track_commit_timestamp=%s",
+ "wal_log_hints=%s track_commit_timestamp=%s "
+ "enable_csn_snapshot=%s",
xlrec.MaxConnections,
xlrec.max_worker_processes,
xlrec.max_wal_senders,
@@ -122,7 +123,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
xlrec.max_locks_per_xact,
wal_level_str,
xlrec.wal_log_hints ? "on" : "off",
- xlrec.track_commit_timestamp ? "on" : "off");
+ xlrec.track_commit_timestamp ? "on" : "off",
+ xlrec.enable_csn_snapshot ? "on" : "off");
}
else if (info == XLOG_FPW_CHANGE)
{
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
index 4e0b8d64e4..22a95cb5d3 100644
--- a/src/backend/access/transam/csn_log.c
+++ b/src/backend/access/transam/csn_log.c
@@ -9,6 +9,11 @@
* transactions. Because of same lifetime and persistancy requirements
* this module is quite similar to subtrans.c
*
+ * If we switch database from CSN-base snapshot to xid-base snapshot then,
+ * nothing wrong. But if we switch xid-base snapshot to CSN-base snapshot
+ * it should decide a new xid whwich begin csn-base check. It can not be
+ * oldestActiveXID because of prepared transaction.
+ *
* Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -52,7 +57,8 @@ bool enable_csn_snapshot;
static SlruCtlData CSNLogCtlData;
#define CsnlogCtl (&CSNLogCtlData)
-static int ZeroCSNLogPage(int pageno);
+static int ZeroCSNLogPage(int pageno, bool write_xlog);
+static void ZeroTruncateCSNLogPage(int pageno, bool write_xlog);
static bool CSNLogPagePrecedes(int page1, int page2);
static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
TransactionId *subxids,
@@ -60,6 +66,11 @@ static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
static void CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn,
int slotno);
+static void WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn);
+static void WriteZeroCSNPageXlogRec(int pageno);
+static void WriteTruncateCSNXlogRec(int pageno);
+
/*
* CSNLogSetCSN
*
@@ -77,7 +88,7 @@ static void CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn,
*/
void
CSNLogSetCSN(TransactionId xid, int nsubxids,
- TransactionId *subxids, XidCSN csn)
+ TransactionId *subxids, XidCSN csn, bool write_xlog)
{
int pageno;
int i = 0;
@@ -89,6 +100,10 @@ CSNLogSetCSN(TransactionId xid, int nsubxids,
Assert(TransactionIdIsValid(xid));
pageno = TransactionIdToPage(xid); /* get page of parent */
+
+ if(write_xlog)
+ WriteXidCsnXlogRec(xid, nsubxids, subxids, csn);
+
for (;;)
{
int num_on_page = 0;
@@ -151,12 +166,12 @@ CSNLogSetPageStatus(TransactionId xid, int nsubxids,
static void
CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn, int slotno)
{
- int entryno = TransactionIdToPgIndex(xid);
- XidCSN *ptr;
+ int entryno = TransactionIdToPgIndex(xid);
+ XidCSN *ptr;
Assert(LWLockHeldByMe(CSNLogControlLock));
- ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+ ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XidCSN));
*ptr = csn;
}
@@ -171,27 +186,21 @@ CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn, int slotno)
XidCSN
CSNLogGetCSNByXid(TransactionId xid)
{
- int pageno = TransactionIdToPage(xid);
- int entryno = TransactionIdToPgIndex(xid);
- int slotno;
- XidCSN *ptr;
- XidCSN xid_csn;
+ int pageno = TransactionIdToPage(xid);
+ int entryno = TransactionIdToPgIndex(xid);
+ int slotno;
+ XidCSN csn;
/* Callers of CSNLogGetCSNByXid() must check GUC params */
Assert(enable_csn_snapshot);
- /* Can't ask about stuff that might not be around anymore */
- Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
-
/* lock is acquired by SimpleLruReadPage_ReadOnly */
-
slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
- ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
- xid_csn = *ptr;
+ csn = *(XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
LWLockRelease(CSNLogControlLock);
- return xid_csn;
+ return csn;
}
/*
@@ -245,7 +254,7 @@ BootStrapCSNLog(void)
LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
/* Create and zero the first page of the commit log */
- slotno = ZeroCSNLogPage(0);
+ slotno = ZeroCSNLogPage(0, false);
/* Make sure it's written out */
SimpleLruWritePage(CsnlogCtl, slotno);
@@ -263,50 +272,20 @@ BootStrapCSNLog(void)
* Control lock must be held at entry, and will be held at exit.
*/
static int
-ZeroCSNLogPage(int pageno)
+ZeroCSNLogPage(int pageno, bool write_xlog)
{
Assert(LWLockHeldByMe(CSNLogControlLock));
+ if(write_xlog)
+ WriteZeroCSNPageXlogRec(pageno);
return SimpleLruZeroPage(CsnlogCtl, pageno);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend startup,
- * after StartupXLOG has initialized ShmemVariableCache->nextXid.
- *
- * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
- * if there are none.
- */
-void
-StartupCSNLog(TransactionId oldestActiveXID)
+static void
+ZeroTruncateCSNLogPage(int pageno, bool write_xlog)
{
- int startPage;
- int endPage;
-
- if (!enable_csn_snapshot)
- return;
-
- /*
- * Since we don't expect pg_csn to be valid across crashes, we
- * initialize the currently-active page(s) to zeroes during startup.
- * Whenever we advance into a new page, ExtendCSNLog will likewise
- * zero the new page without regard to whatever was previously on disk.
- */
- LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
-
- startPage = TransactionIdToPage(oldestActiveXID);
- endPage = TransactionIdToPage(XidFromFullTransactionId(ShmemVariableCache->nextFullXid));
-
- while (startPage != endPage)
- {
- (void) ZeroCSNLogPage(startPage);
- startPage++;
- /* must account for wraparound */
- if (startPage > TransactionIdToPage(MaxTransactionId))
- startPage = 0;
- }
- (void) ZeroCSNLogPage(startPage);
-
- LWLockRelease(CSNLogControlLock);
+ if(write_xlog)
+ WriteTruncateCSNXlogRec(pageno);
+ SimpleLruTruncate(CsnlogCtl, pageno);
}
/*
@@ -379,7 +358,7 @@ ExtendCSNLog(TransactionId newestXact)
LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
- ZeroCSNLogPage(pageno);
+ ZeroCSNLogPage(pageno, !InRecovery);
LWLockRelease(CSNLogControlLock);
}
@@ -410,7 +389,7 @@ TruncateCSNLog(TransactionId oldestXact)
TransactionIdRetreat(oldestXact);
cutoffPage = TransactionIdToPage(oldestXact);
- SimpleLruTruncate(CsnlogCtl, cutoffPage);
+ ZeroTruncateCSNLogPage(cutoffPage, true);
}
/*
@@ -436,3 +415,121 @@ CSNLogPagePrecedes(int page1, int page2)
return TransactionIdPrecedes(xid1, xid2);
}
+
+void
+WriteAssignCSNXlogRec(XidCSN xidcsn)
+{
+ XidCSN log_csn = 0;
+
+ if(xidcsn <= get_last_log_wal_csn())
+ {
+ /*
+ * WAL-write related code. If concurrent backend already wrote into WAL
+ * its CSN with bigger value it isn't needed to write this value.
+ */
+ return;
+ }
+
+ /*
+ * We log the CSN 5s greater than generated, you can see comments on
+ * CSN_ASSIGN_TIME_INTERVAL define.
+ */
+ log_csn = CSNAddByNanosec(xidcsn, CSN_ASSIGN_TIME_INTERVAL);
+ set_last_log_wal_csn(log_csn);
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&log_csn), sizeof(XidCSN));
+ XLogInsert(RM_CSNLOG_ID, XLOG_CSN_ASSIGNMENT);
+}
+
+static void
+WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn)
+{
+ xl_xidcsn_set xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.xtop = xid;
+ xlrec.nsubxacts = nsubxids;
+ xlrec.xidcsn = csn;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, MinSizeOfXidCSNSet);
+ XLogRegisterData((char *) subxids, nsubxids * sizeof(TransactionId));
+ recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_SETXIDCSN);
+ XLogFlush(recptr);
+}
+
+/*
+ * Write a ZEROPAGE xlog record
+ */
+static void
+WriteZeroCSNPageXlogRec(int pageno)
+{
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&pageno), sizeof(int));
+ (void) XLogInsert(RM_CSNLOG_ID, XLOG_CSN_ZEROPAGE);
+}
+
+/*
+ * Write a TRUNCATE xlog record
+ */
+static void
+WriteTruncateCSNXlogRec(int pageno)
+{
+ XLogRecPtr recptr;
+ return;
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&pageno), sizeof(int));
+ recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_TRUNCATE);
+ XLogFlush(recptr);
+}
+
+
+void
+csnlog_redo(XLogReaderState *record)
+{
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ /* Backup blocks are not used in csnlog records */
+ Assert(!XLogRecHasAnyBlockRefs(record));
+
+ if (info == XLOG_CSN_ASSIGNMENT)
+ {
+ XidCSN csn;
+
+ memcpy(&csn, XLogRecGetData(record), sizeof(XidCSN));
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ set_last_max_csn(csn);
+ LWLockRelease(CSNLogControlLock);
+
+ }
+ else if (info == XLOG_CSN_SETXIDCSN)
+ {
+ xl_xidcsn_set *xlrec = (xl_xidcsn_set *) XLogRecGetData(record);
+ CSNLogSetCSN(xlrec->xtop, xlrec->nsubxacts, xlrec->xsub, xlrec->xidcsn, false);
+ }
+ else if (info == XLOG_CSN_ZEROPAGE)
+ {
+ int pageno;
+ int slotno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ slotno = ZeroCSNLogPage(pageno, false);
+ SimpleLruWritePage(CsnlogCtl, slotno);
+ LWLockRelease(CSNLogControlLock);
+ Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+ }
+ else if (info == XLOG_CSN_TRUNCATE)
+ {
+ int pageno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ CsnlogCtl->shared->latest_page_number = pageno;
+ ZeroTruncateCSNLogPage(pageno, false);
+ }
+ else
+ elog(PANIC, "csnlog_redo: unknown op code %u", info);
+}
diff --git a/src/backend/access/transam/csn_snapshot.c b/src/backend/access/transam/csn_snapshot.c
index bcc5bac757..99e4a2f1ed 100644
--- a/src/backend/access/transam/csn_snapshot.c
+++ b/src/backend/access/transam/csn_snapshot.c
@@ -31,6 +31,8 @@
/* Raise a warning if imported snapshot_csn exceeds ours by this value. */
#define SNAP_DESYNC_COMPLAIN (1*NSECS_PER_SEC) /* 1 second */
+TransactionId xmin_for_csn = InvalidTransactionId;
+
/*
* CSNSnapshotState
*
@@ -40,7 +42,9 @@
*/
typedef struct
{
- SnapshotCSN last_max_csn;
+ SnapshotCSN last_max_csn; /* Record the max csn till now */
+ XidCSN last_csn_log_wal; /* for interval we log the assign csn to wal */
+ TransactionId xmin_for_csn; /*'xmin_for_csn' for when turn xid-snapshot to csn-snapshot*/
volatile slock_t lock;
} CSNSnapshotState;
@@ -80,6 +84,7 @@ CSNSnapshotShmemInit()
if (!found)
{
csnState->last_max_csn = 0;
+ csnState->last_csn_log_wal = 0;
SpinLockInit(&csnState->lock);
}
}
@@ -119,6 +124,8 @@ GenerateCSN(bool locked)
if (!locked)
SpinLockRelease(&csnState->lock);
+ WriteAssignCSNXlogRec(csn);
+
return csn;
}
@@ -131,7 +138,7 @@ GenerateCSN(bool locked)
XidCSN
TransactionIdGetXidCSN(TransactionId xid)
{
- XidCSN xid_csn;
+ XidCSN xid_csn;
Assert(enable_csn_snapshot);
@@ -145,13 +152,35 @@ TransactionIdGetXidCSN(TransactionId xid)
Assert(false); /* Should not happend */
}
+ /*
+ * If we just switch a xid-snapsot to a csn_snapshot, we should handle a start
+ * xid for csn basse check. Just in case we have prepared transaction which
+ * hold the TransactionXmin but without CSN.
+ */
+ if(InvalidTransactionId == xmin_for_csn)
+ {
+ SpinLockAcquire(&csnState->lock);
+ if(InvalidTransactionId != csnState->xmin_for_csn)
+ xmin_for_csn = csnState->xmin_for_csn;
+ else
+ xmin_for_csn = FrozenTransactionId;
+
+ SpinLockRelease(&csnState->lock);
+ }
+
+ if ( FrozenTransactionId != xmin_for_csn ||
+ TransactionIdPrecedes(xmin_for_csn, TransactionXmin))
+ {
+ xmin_for_csn = TransactionXmin;
+ }
+
/*
* For xids which less then TransactionXmin CSNLog can be already
* trimmed but we know that such transaction is definetly not concurrently
* running according to any snapshot including timetravel ones. Callers
* should check TransactionDidCommit after.
*/
- if (TransactionIdPrecedes(xid, TransactionXmin))
+ if (TransactionIdPrecedes(xid, xmin_for_csn))
return FrozenXidCSN;
/* Read XidCSN from SLRU */
@@ -251,7 +280,7 @@ CSNSnapshotAbort(PGPROC *proc, TransactionId xid,
if (!enable_csn_snapshot)
return;
- CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN);
+ CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN, true);
/*
* Clean assignedXidCsn anyway, as it was possibly set in
@@ -292,7 +321,7 @@ CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid,
{
Assert(XidCSNIsInProgress(oldassignedXidCsn));
CSNLogSetCSN(xid, nsubxids,
- subxids, InDoubtXidCSN);
+ subxids, InDoubtXidCSN, true);
}
else
{
@@ -333,8 +362,39 @@ CSNSnapshotCommit(PGPROC *proc, TransactionId xid,
assigned_xid_csn = pg_atomic_read_u64(&proc->assignedXidCsn);
Assert(XidCSNIsNormal(assigned_xid_csn));
CSNLogSetCSN(xid, nsubxids,
- subxids, assigned_xid_csn);
+ subxids, assigned_xid_csn, true);
/* Reset for next transaction */
pg_atomic_write_u64(&proc->assignedXidCsn, InProgressXidCSN);
}
+
+void
+set_last_max_csn(XidCSN xidcsn)
+{
+ csnState->last_max_csn = xidcsn;
+}
+
+void
+set_last_log_wal_csn(XidCSN xidcsn)
+{
+ csnState->last_csn_log_wal = xidcsn;
+}
+
+XidCSN
+get_last_log_wal_csn(void)
+{
+ XidCSN last_csn_log_wal;
+
+ last_csn_log_wal = csnState->last_csn_log_wal;
+
+ return last_csn_log_wal;
+}
+
+/*
+ * Record 'xmin_for_csn' when switching from xid-snapshot to csn-snapshot
+ */
+void
+set_xmin_for_csn(void)
+{
+ csnState->xmin_for_csn = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+}
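Reader's note on the 'xmin_for_csn' check in TransactionIdGetXidCSN() above: the following is a minimal standalone sketch of the intent described in the comment, not code from the patch. The helper names (xid_precedes, csn_lookup_cutoff, xid_skips_csn_log) are hypothetical, and the sketch assumes the cutoff should be the later of the recorded switch point and TransactionXmin.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId ((TransactionId) 0)

/* Modulo-2^32 comparison for normal XIDs, in the style of TransactionIdPrecedes. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical helper: pick the XID below which the CSN log is not consulted.
 * switch_xid is the XID recorded when the cluster moved to CSN snapshots
 * (InvalidTransactionId if no switch was recorded); snapshot_xmin plays the
 * role of TransactionXmin.
 */
static TransactionId
csn_lookup_cutoff(TransactionId switch_xid, TransactionId snapshot_xmin)
{
    if (switch_xid == InvalidTransactionId ||
        xid_precedes(switch_xid, snapshot_xmin))
        return snapshot_xmin;
    return switch_xid;
}

/* True if xid is treated as frozen instead of being looked up in the CSN log. */
static bool
xid_skips_csn_log(TransactionId xid, TransactionId switch_xid,
                  TransactionId snapshot_xmin)
{
    return xid_precedes(xid, csn_lookup_cutoff(switch_xid, snapshot_xmin));
}
```

With a switch point of 150 and a TransactionXmin of 100 (held back by a prepared transaction), xid 120 has no CSN entry and is skipped, while xid 160 is looked up normally.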
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 58091f6b52..b1e5ec350e 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -28,6 +28,7 @@
#include "replication/origin.h"
#include "storage/standby.h"
#include "utils/relmapper.h"
+#include "access/csn_log.h"
/* must be kept in sync with RmgrData definition in xlog_internal.h */
#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f21e09a03..dc2e9ae874 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4607,6 +4607,7 @@ InitControlFile(uint64 sysidentifier)
ControlFile->wal_level = wal_level;
ControlFile->wal_log_hints = wal_log_hints;
ControlFile->track_commit_timestamp = track_commit_timestamp;
+ ControlFile->enable_csn_snapshot = enable_csn_snapshot;
ControlFile->data_checksum_version = bootstrap_data_checksum_version;
}
@@ -7064,7 +7065,6 @@ StartupXLOG(void)
* maintained during recovery and need not be started yet.
*/
StartupCLOG();
- StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
/*
@@ -7882,7 +7882,6 @@ StartupXLOG(void)
if (standbyState == STANDBY_DISABLED)
{
StartupCLOG();
- StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
}
@@ -9106,7 +9105,6 @@ CreateCheckPoint(int flags)
if (!RecoveryInProgress())
{
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
- TruncateCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
}
/* Real work is done, but log and update stats before releasing lock. */
@@ -9736,7 +9734,8 @@ XLogReportParameters(void)
max_wal_senders != ControlFile->max_wal_senders ||
max_prepared_xacts != ControlFile->max_prepared_xacts ||
max_locks_per_xact != ControlFile->max_locks_per_xact ||
- track_commit_timestamp != ControlFile->track_commit_timestamp)
+ track_commit_timestamp != ControlFile->track_commit_timestamp ||
+ enable_csn_snapshot != ControlFile->enable_csn_snapshot)
{
/*
* The change in number of backend slots doesn't need to be WAL-logged
@@ -9758,6 +9757,7 @@ XLogReportParameters(void)
xlrec.wal_level = wal_level;
xlrec.wal_log_hints = wal_log_hints;
xlrec.track_commit_timestamp = track_commit_timestamp;
+ xlrec.enable_csn_snapshot = enable_csn_snapshot;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -9768,6 +9768,9 @@ XLogReportParameters(void)
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ if (enable_csn_snapshot != ControlFile->enable_csn_snapshot)
+ set_xmin_for_csn();
+
ControlFile->MaxConnections = MaxConnections;
ControlFile->max_worker_processes = max_worker_processes;
ControlFile->max_wal_senders = max_wal_senders;
@@ -9776,6 +9779,7 @@ XLogReportParameters(void)
ControlFile->wal_level = wal_level;
ControlFile->wal_log_hints = wal_log_hints;
ControlFile->track_commit_timestamp = track_commit_timestamp;
+ ControlFile->enable_csn_snapshot = enable_csn_snapshot;
UpdateControlFile();
LWLockRelease(ControlFileLock);
@@ -10208,6 +10212,7 @@ xlog_redo(XLogReaderState *record)
CommitTsParameterChange(xlrec.track_commit_timestamp,
ControlFile->track_commit_timestamp);
ControlFile->track_commit_timestamp = xlrec.track_commit_timestamp;
+ ControlFile->enable_csn_snapshot = xlrec.enable_csn_snapshot;
UpdateControlFile();
LWLockRelease(ControlFileLock);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 576c7e63e9..083a226dce 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -53,7 +53,7 @@
#include "utils/memutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
-
+#include "access/csn_log.h"
/*
* GUC parameters
@@ -1632,6 +1632,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
TruncateCLOG(frozenXID, oldestxid_datoid);
TruncateCommitTs(frozenXID);
+ TruncateCSNLog(frozenXID);
TruncateMultiXact(minMulti, minmulti_datoid);
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index d715750437..9283021c7b 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1734,7 +1734,7 @@ GetSnapshotData(Snapshot snapshot)
* Take XidCSN under ProcArrayLock so the snapshot stays
* synchronized.
*/
- if (enable_csn_snapshot)
+ if (!snapshot->takenDuringRecovery && enable_csn_snapshot)
xid_csn = GenerateCSN(false);
LWLockRelease(ProcArrayLock);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 45fe574620..5fa195b913 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -2265,7 +2265,7 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
if (XidInvisibleInCSNSnapshot(xid, snapshot))
{
XidCSN gcsn = TransactionIdGetXidCSN(xid);
- Assert(XidCSNIsAborted(gcsn));
+ Assert(XidCSNIsAborted(gcsn) || XidCSNIsInProgress(gcsn));
}
#endif
return false;
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df74..e7194124c7 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -306,6 +306,8 @@ main(int argc, char *argv[])
ControlFile->max_locks_per_xact);
printf(_("track_commit_timestamp setting: %s\n"),
ControlFile->track_commit_timestamp ? _("on") : _("off"));
+ printf(_("enable_csn_snapshot setting: %s\n"),
+ ControlFile->enable_csn_snapshot ? _("on") : _("off"));
printf(_("Maximum data alignment: %u\n"),
ControlFile->maxAlign);
/* we don't print floatFormat since can't say much useful about it */
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 70194eb096..863ee73d24 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -545,6 +545,11 @@ copy_xact_xlog_xid(void)
check_ok();
}
+ if(old_cluster.controldata.cat_ver > CSN_BASE_SNAPSHOT_ADD_VER)
+ {
+ copy_subdir_files("pg_csn", "pg_csn");
+ }
+
/* now reset the wal archives in the new cluster */
prep_status("Resetting WAL archives");
exec_prog(UTILITY_LOG_FILE, NULL, true, true,
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 8b90cefbe0..f35860dfc5 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -123,6 +123,8 @@ extern char *output_files[];
*/
#define JSONB_FORMAT_CHANGE_CAT_VER 201409291
+#define CSN_BASE_SNAPSHOT_ADD_VER 202002010
+
/*
* Each relation is represented by a relinfo structure.
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca4b1..282bae882a 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -31,6 +31,7 @@
#include "rmgrdesc.h"
#include "storage/standbydefs.h"
#include "utils/relmapper.h"
+#include "access/csn_log.h"
#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
{ name, desc, identify},
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
index 9b9611127d..cc5c51c53f 100644
--- a/src/include/access/csn_log.h
+++ b/src/include/access/csn_log.h
@@ -14,17 +14,59 @@
#include "access/xlog.h"
#include "utils/snapshot.h"
+/* XLOG stuff */
+#define XLOG_CSN_ASSIGNMENT 0x00
+#define XLOG_CSN_SETXIDCSN 0x10
+#define XLOG_CSN_ZEROPAGE 0x20
+#define XLOG_CSN_TRUNCATE 0x30
+
+/*
+ * We must log the maximum generated CSN to WAL so that the database does not
+ * generate a historical CSN after a restart. That could otherwise happen if
+ * the system time is turned back.
+ *
+ * However, we cannot log the maximum CSN every time one is generated, because
+ * that would produce too much WAL, so we log a value 5s in the future.
+ *
+ * As a trade-off, after a database restart there will be 5s of degraded
+ * performance for time synchronization among sharding nodes.
+ *
+ * This could be turned into a configuration parameter, so that the user
+ * can decide which behavior they prefer.
+ *
+ */
+#define CSN_ASSIGN_TIME_INTERVAL 5
+
+typedef struct xl_xidcsn_set
+{
+ XidCSN xidcsn;
+ TransactionId xtop; /* XID's top-level XID */
+ int nsubxacts; /* number of subtransaction XIDs */
+ TransactionId xsub[FLEXIBLE_ARRAY_MEMBER]; /* assigned subxids */
+} xl_xidcsn_set;
+
+#define MinSizeOfXidCSNSet offsetof(xl_xidcsn_set, xsub)
+#define CSNAddByNanosec(csn,second) ((csn) + (second) * INT64CONST(1000000000))
+
extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
- TransactionId *subxids, XidCSN csn);
+ TransactionId *subxids, XidCSN csn, bool write_xlog);
extern XidCSN CSNLogGetCSNByXid(TransactionId xid);
extern Size CSNLogShmemSize(void);
extern void CSNLogShmemInit(void);
extern void BootStrapCSNLog(void);
-extern void StartupCSNLog(TransactionId oldestActiveXID);
extern void ShutdownCSNLog(void);
extern void CheckPointCSNLog(void);
extern void ExtendCSNLog(TransactionId newestXact);
extern void TruncateCSNLog(TransactionId oldestXact);
+extern void csnlog_redo(XLogReaderState *record);
+extern void csnlog_desc(StringInfo buf, XLogReaderState *record);
+extern const char *csnlog_identify(uint8 info);
+extern void WriteAssignCSNXlogRec(XidCSN xidcsn);
+extern void set_last_max_csn(XidCSN xidcsn);
+extern void set_last_log_wal_csn(XidCSN xidcsn);
+extern XidCSN get_last_log_wal_csn(void);
+extern void set_xmin_for_csn(void);
+
#endif /* CSNLOG_H */
\ No newline at end of file
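Reader's note on CSN_ASSIGN_TIME_INTERVAL above: a minimal standalone sketch of the "log a ceiling a few seconds ahead" idea from the header comment. It is an illustration under assumptions, not the patch's WriteAssignCSNXlogRec(); the struct CsnWalState and both functions are hypothetical, and the WAL write is replaced by a counter.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XidCSN;

#define NSEC_PER_SEC UINT64_C(1000000000)
#define ASSIGN_INTERVAL_SEC 5   /* mirrors CSN_ASSIGN_TIME_INTERVAL */

typedef struct CsnWalState
{
    XidCSN last_logged;  /* highest CSN ceiling already written to WAL */
    int    wal_writes;   /* stand-in for emitting a WAL record */
} CsnWalState;

/*
 * Call for every generated CSN. A record is only needed when the CSN exceeds
 * the ceiling logged earlier; the new ceiling is placed ASSIGN_INTERVAL_SEC
 * seconds ahead so that the following CSNs are covered without further
 * logging. Returns true if a record was "written".
 */
static bool
csn_maybe_log_ceiling(CsnWalState *state, XidCSN csn)
{
    if (csn <= state->last_logged)
        return false;
    state->last_logged = csn + ASSIGN_INTERVAL_SEC * NSEC_PER_SEC;
    state->wal_writes++;
    return true;
}

/* After a restart, no CSN at or below the recovered ceiling may be handed out. */
static XidCSN
csn_after_restart(const CsnWalState *state, XidCSN clock_now)
{
    return clock_now > state->last_logged ? clock_now : state->last_logged + 1;
}
```

The second function shows where the trade-off in the comment comes from: if the clock is behind the recovered ceiling after a restart, generated CSNs run ahead of the clock until real time catches up.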
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 6c15df7e70..b2d12bfb27 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CSNLOG_ID, "CSN", csnlog_redo, csnlog_desc, csnlog_identify, NULL, NULL, NULL)
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 88f3d76700..02be3087ac 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -243,6 +243,7 @@ typedef struct xl_parameter_change
int wal_level;
bool wal_log_hints;
bool track_commit_timestamp;
+ bool enable_csn_snapshot;
} xl_parameter_change;
/* logs restore point */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..9e5d4b0fc0 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -181,6 +181,7 @@ typedef struct ControlFileData
int max_prepared_xacts;
int max_locks_per_xact;
bool track_commit_timestamp;
+ bool enable_csn_snapshot;
/*
* This data is used to check for hardware-architecture compatibility of
Attachment: 0003-snapshot-switch.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6ce5907896..8f296d9abb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9102,8 +9102,56 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</varlistentry>
</variablelist>
- </sect1>
+ <sect2 id="runtime-config-CSN-base-snapshot">
+ <title>CSN Based Snapshot</title>
+
+ <para>
+ By default, snapshots in <productname>PostgreSQL</productname> use the
+ XID (transaction ID) to identify the status of a transaction, the in-progress
+ transactions, and the future transactions for all visibility calculations.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</productname> also provides a CSN (commit sequence number)
+ based mechanism to identify past transactions and those that have yet to
+ be started or committed.
+ </para>
+
+ <variablelist>
+ <varlistentry id="guc-enable-csn-snapshot" xreflabel="enable_csn_snapshot">
+ <term><varname>enable_csn_snapshot</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_csn_snapshot</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+
+ <para>
+ Enables or disables CSN-based transaction visibility tracking for snapshots.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</productname> uses the clock timestamp as a CSN,
+ so enabling CSN-based snapshots can be useful for implementing global
+ snapshots and global transaction visibility.
+ </para>
+
+ <para>
+ When enabled, <productname>PostgreSQL</productname> creates the
+ <filename>pg_csn</filename> directory under <envar>PGDATA</envar> to keep
+ track of CSN and XID mappings.
+ </para>
+
+ <para>
+ The default value is off.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect2>
+ </sect1>
<sect1 id="runtime-config-compatible">
<title>Version and Platform Compatibility</title>
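Reader's note on the documentation hunk above, which states that the clock timestamp is used as a CSN: a minimal standalone sketch of how a timestamp-derived CSN can be kept strictly increasing even if the clock stalls or steps back. This is an assumption about the general technique, not the patch's GenerateCSN(); csn_next and wall_clock_ns are hypothetical names, and locking is omitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

typedef uint64_t XidCSN;

/* Pure step: never return a value at or below the last one handed out. */
static XidCSN
csn_next(XidCSN *last_max, XidCSN clock_ns)
{
    if (clock_ns <= *last_max)
        clock_ns = *last_max + 1;
    *last_max = clock_ns;
    return clock_ns;
}

/* Read the wall clock as nanoseconds since the Unix epoch (POSIX clock_gettime). */
static XidCSN
wall_clock_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_REALTIME, &ts);
    return (XidCSN) ts.tv_sec * UINT64_C(1000000000) + (XidCSN) ts.tv_nsec;
}
```

A caller would use csn_next(&last, wall_clock_ns()) under whatever lock protects the shared "last" value; in a real server that value would live in shared memory.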
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
index 9b006b744b..1df159bc8f 100644
--- a/src/backend/access/transam/csn_log.c
+++ b/src/backend/access/transam/csn_log.c
@@ -30,9 +30,28 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "utils/snapmgr.h"
+#include "storage/shmem.h"
bool enable_csn_snapshot;
+/*
+ * We use csnSnapshotActive, rather than enable_csn_snapshot, to judge whether
+ * CSN snapshots are enabled; this design is similar to 'track_commit_timestamp'.
+ *
+ * During replication, if the master changes 'enable_csn_snapshot' across a
+ * database restart, the standby must apply the WAL record for the GUC change,
+ * and it is difficult to notify all backends about that. Instead they read
+ * 'csnSnapshotActive', which lives in shared memory. Reading it does not
+ * acquire a lock, so there is no performance issue.
+ *
+ */
+typedef struct CSNshapshotShared
+{
+ bool csnSnapshotActive;
+} CSNshapshotShared;
+
+CSNshapshotShared *csnShared = NULL;
+
/*
* Defines for CSNLog page sizes. A page is the same BLCKSZ as is used
* everywhere else in Postgres.
@@ -94,9 +113,6 @@ CSNLogSetCSN(TransactionId xid, int nsubxids,
int i = 0;
int offset = 0;
- /* Callers of CSNLogSetCSN() must check GUC params */
- Assert(enable_csn_snapshot);
-
Assert(TransactionIdIsValid(xid));
pageno = TransactionIdToPage(xid); /* get page of parent */
@@ -191,9 +207,6 @@ CSNLogGetCSNByXid(TransactionId xid)
int slotno;
XidCSN csn;
- /* Callers of CSNLogGetCSNByXid() must check GUC params */
- Assert(enable_csn_snapshot);
-
/* lock is acquired by SimpleLruReadPage_ReadOnly */
slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
csn = *(XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
@@ -218,9 +231,6 @@ CSNLogShmemBuffers(void)
Size
CSNLogShmemSize(void)
{
- if (!enable_csn_snapshot)
- return 0;
-
return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
}
@@ -230,37 +240,25 @@ CSNLogShmemSize(void)
void
CSNLogShmemInit(void)
{
- if (!enable_csn_snapshot)
- return;
+ bool found;
+
CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
CSNLogControlLock, "pg_csn", LWTRANCHE_CSN_LOG_BUFFERS);
+
+ csnShared = ShmemInitStruct("CSNlog shared",
+ sizeof(CSNshapshotShared),
+ &found);
}
/*
- * This func must be called ONCE on system install. It creates the initial
- * CSNLog segment. The pg_csn directory is assumed to have been
- * created by initdb, and CSNLogShmemInit must have been called already.
+ * See ActivateCSNlog
*/
void
BootStrapCSNLog(void)
{
- int slotno;
-
- if (!enable_csn_snapshot)
- return;
-
- LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
-
- /* Create and zero the first page of the commit log */
- slotno = ZeroCSNLogPage(0, false);
-
- /* Make sure it's written out */
- SimpleLruWritePage(CsnlogCtl, slotno);
- Assert(!CsnlogCtl->shared->page_dirty[slotno]);
-
- LWLockRelease(CSNLogControlLock);
+ return;
}
/*
@@ -288,13 +286,94 @@ ZeroTruncateCSNLogPage(int pageno, bool write_xlog)
SimpleLruTruncate(CsnlogCtl, pageno);
}
+void
+ActivateCSNlog(void)
+{
+ int startPage;
+ TransactionId nextXid = InvalidTransactionId;
+
+ if (csnShared->csnSnapshotActive)
+ return;
+
+
+ nextXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+ startPage = TransactionIdToPage(nextXid);
+
+ /* Create the current segment file, if necessary */
+ if (!SimpleLruDoesPhysicalPageExist(CsnlogCtl, startPage))
+ {
+ int slotno;
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ slotno = ZeroCSNLogPage(startPage, false);
+ SimpleLruWritePage(CsnlogCtl, slotno);
+ LWLockRelease(CSNLogControlLock);
+ }
+ csnShared->csnSnapshotActive = true;
+}
+
+bool
+get_csnlog_status(void)
+{
+ if (!csnShared)
+ {
+ /* Should not happen */
+ elog(ERROR, "csnShared pointer is not initialized");
+ }
+ return csnShared->csnSnapshotActive;
+}
+
+void
+DeactivateCSNlog(void)
+{
+ csnShared->csnSnapshotActive = false;
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ (void) SlruScanDirectory(CsnlogCtl, SlruScanDirCbDeleteAll, NULL);
+ LWLockRelease(CSNLogControlLock);
+}
+
+void
+StartupCSN(void)
+{
+ ActivateCSNlog();
+}
+
+void
+CompleteCSNInitialization(void)
+{
+ /*
+ * If the feature is not enabled, turn it off for good. This also removes
+ * any leftover data.
+ *
+ * Conversely, we activate the module if the feature is enabled. This is
+ * necessary for primary and standby as the activation depends on the
+ * control file contents at the beginning of recovery or when a
+ * XLOG_PARAMETER_CHANGE is replayed.
+ */
+ if (!get_csnlog_status())
+ DeactivateCSNlog();
+ else
+ ActivateCSNlog();
+}
+
+void
+CSNlogParameterChange(bool newvalue, bool oldvalue)
+{
+ if (newvalue)
+ {
+ if (!csnShared->csnSnapshotActive)
+ ActivateCSNlog();
+ }
+ else if (csnShared->csnSnapshotActive)
+ DeactivateCSNlog();
+}
+
/*
* This must be called ONCE during postmaster or standalone-backend shutdown
*/
void
ShutdownCSNLog(void)
{
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/*
@@ -314,7 +393,7 @@ ShutdownCSNLog(void)
void
CheckPointCSNLog(void)
{
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/*
@@ -342,7 +421,7 @@ ExtendCSNLog(TransactionId newestXact)
{
int pageno;
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/*
@@ -373,9 +452,9 @@ ExtendCSNLog(TransactionId newestXact)
void
TruncateCSNLog(TransactionId oldestXact)
{
- int cutoffPage;
+ int cutoffPage;
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/*
@@ -388,7 +467,6 @@ TruncateCSNLog(TransactionId oldestXact)
*/
TransactionIdRetreat(oldestXact);
cutoffPage = TransactionIdToPage(oldestXact);
-
ZeroTruncateCSNLogPage(cutoffPage, true);
}
@@ -443,7 +521,6 @@ WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
TransactionId *subxids, XidCSN csn)
{
xl_xidcsn_set xlrec;
- XLogRecPtr recptr;
xlrec.xtop = xid;
xlrec.nsubxacts = nsubxids;
@@ -452,8 +529,7 @@ WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, MinSizeOfXidCSNSet);
XLogRegisterData((char *) subxids, nsubxids * sizeof(TransactionId));
- recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_SETXIDCSN);
- XLogFlush(recptr);
+ XLogInsert(RM_CSNLOG_ID, XLOG_CSN_SETXIDCSN);
}
/*
@@ -473,12 +549,9 @@ WriteZeroCSNPageXlogRec(int pageno)
static void
WriteTruncateCSNXlogRec(int pageno)
{
- XLogRecPtr recptr;
- return;
XLogBeginInsert();
XLogRegisterData((char *) (&pageno), sizeof(int));
- recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_TRUNCATE);
- XLogFlush(recptr);
+ XLogInsert(RM_CSNLOG_ID, XLOG_CSN_TRUNCATE);
}
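Reader's note on ActivateCSNlog(), DeactivateCSNlog() and CSNlogParameterChange() above: a minimal standalone model of the state transitions, useful for checking that repeated parameter changes are idempotent. The struct CsnLogModel and the model_* functions are hypothetical; the segment counter is a stand-in for the files under pg_csn, and locking is omitted.

```c
#include <stdbool.h>

typedef struct CsnLogModel
{
    bool active;    /* plays the role of csnShared->csnSnapshotActive */
    int  segments;  /* number of pg_csn segment files that exist */
} CsnLogModel;

/* Activation creates the current segment if it is missing, then sets the flag. */
static void
model_activate(CsnLogModel *m)
{
    if (m->active)
        return;
    if (m->segments == 0)
        m->segments = 1;
    m->active = true;
}

/* Deactivation clears the flag first, then removes every segment. */
static void
model_deactivate(CsnLogModel *m)
{
    m->active = false;
    m->segments = 0;
}

/* Same branching as CSNlogParameterChange(): only act on a real transition. */
static void
model_parameter_change(CsnLogModel *m, bool newvalue)
{
    if (newvalue)
    {
        if (!m->active)
            model_activate(m);
    }
    else if (m->active)
        model_deactivate(m);
}
```

Because the decision is taken from the shared flag and not from the GUC, replaying the same parameter-change record twice leaves the model in the same state.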
diff --git a/src/backend/access/transam/csn_snapshot.c b/src/backend/access/transam/csn_snapshot.c
index ec090fd499..d7d0b5e90f 100644
--- a/src/backend/access/transam/csn_snapshot.c
+++ b/src/backend/access/transam/csn_snapshot.c
@@ -62,10 +62,7 @@ CSNSnapshotShmemSize(void)
{
Size size = 0;
- if (enable_csn_snapshot)
- {
- size += MAXALIGN(sizeof(CSNSnapshotState));
- }
+ size += MAXALIGN(sizeof(CSNSnapshotState));
return size;
}
@@ -76,17 +73,14 @@ CSNSnapshotShmemInit()
{
bool found;
- if (enable_csn_snapshot)
+ csnState = ShmemInitStruct("csnState",
+ sizeof(CSNSnapshotState),
+ &found);
+ if (!found)
{
- csnState = ShmemInitStruct("csnState",
- sizeof(CSNSnapshotState),
- &found);
- if (!found)
- {
- csnState->last_max_csn = 0;
- csnState->last_csn_log_wal = 0;
- SpinLockInit(&csnState->lock);
- }
+ csnState->last_max_csn = 0;
+ csnState->last_csn_log_wal = 0;
+ SpinLockInit(&csnState->lock);
}
}
@@ -104,7 +98,7 @@ GenerateCSN(bool locked)
instr_time current_time;
SnapshotCSN csn;
- Assert(enable_csn_snapshot);
+ Assert(get_csnlog_status());
/*
* TODO: create some macro that add small random shift to current time.
@@ -140,7 +134,7 @@ TransactionIdGetXidCSN(TransactionId xid)
{
XidCSN xid_csn;
- Assert(enable_csn_snapshot);
+ Assert(get_csnlog_status());
/* Handle permanent TransactionId's for which we don't have mapping */
if (!TransactionIdIsNormal(xid))
@@ -222,7 +216,7 @@ XidInvisibleInCSNSnapshot(TransactionId xid, Snapshot snapshot)
{
XidCSN csn;
- Assert(enable_csn_snapshot);
+ Assert(get_csnlog_status());
csn = TransactionIdGetXidCSN(xid);
@@ -277,7 +271,7 @@ void
CSNSnapshotAbort(PGPROC *proc, TransactionId xid,
int nsubxids, TransactionId *subxids)
{
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN, true);
@@ -310,7 +304,7 @@ CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid,
XidCSN oldassignedXidCsn = InProgressXidCSN;
bool in_progress;
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/* Set InDoubt status if it is local transaction */
@@ -348,7 +342,7 @@ CSNSnapshotCommit(PGPROC *proc, TransactionId xid,
{
volatile XidCSN assigned_xid_csn;
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
if (!TransactionIdIsValid(xid))
@@ -391,10 +385,24 @@ get_last_log_wal_csn(void)
}
/*
- * 'xmin_for_csn' for when turn xid-snapshot to csn-snapshot
+ * Prepare the CSN environment when 'enable_csn_snapshot' changes
*/
void
-set_xmin_for_csn(void)
+prepare_csn_env(bool enable)
{
- csnState->xmin_for_csn = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
-}
+ TransactionId nextxid = InvalidTransactionId;
+
+ if(enable)
+ {
+ nextxid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+ /* record 'xmin_for_csn' when switching from xid-snapshot to csn-snapshot */
+ csnState->xmin_for_csn = nextxid;
+ /* produce the csnlog segment we want now and seek to current page */
+ ActivateCSNlog();
+ }
+ else
+ {
+ /* Try to drop all csnlog segments */
+ DeactivateCSNlog();
+ }
+}
\ No newline at end of file
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fa6adc09e8..8b459815ab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -79,6 +79,7 @@
#include "utils/relmapper.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"
+#include "access/csn_log.h"
extern uint32 bootstrap_data_checksum_version;
@@ -6805,6 +6806,9 @@ StartupXLOG(void)
if (ControlFile->track_commit_timestamp)
StartupCommitTs();
+ if (ControlFile->enable_csn_snapshot)
+ StartupCSN();
+
/*
* Recover knowledge about replay progress of known replication partners.
*/
@@ -7921,6 +7925,7 @@ StartupXLOG(void)
* commit timestamp.
*/
CompleteCommitTsInitialization();
+ CompleteCSNInitialization();
/*
* All done with end-of-recovery actions.
@@ -9775,8 +9780,7 @@ XLogReportParameters(void)
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
if (enable_csn_snapshot != ControlFile->enable_csn_snapshot)
- set_xmin_for_csn();
-
+ prepare_csn_env(enable_csn_snapshot);
ControlFile->MaxConnections = MaxConnections;
ControlFile->max_worker_processes = max_worker_processes;
ControlFile->max_wal_senders = max_wal_senders;
@@ -10218,6 +10222,8 @@ xlog_redo(XLogReaderState *record)
CommitTsParameterChange(xlrec.track_commit_timestamp,
ControlFile->track_commit_timestamp);
ControlFile->track_commit_timestamp = xlrec.track_commit_timestamp;
+ CSNlogParameterChange(xlrec.enable_csn_snapshot,
+ ControlFile->enable_csn_snapshot);
ControlFile->enable_csn_snapshot = xlrec.enable_csn_snapshot;
UpdateControlFile();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 9283021c7b..e326b431c2 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1734,7 +1734,7 @@ GetSnapshotData(Snapshot snapshot)
* Take XidCSN under ProcArrayLock so the snapshot stays
* synchronized.
*/
- if (!snapshot->takenDuringRecovery && enable_csn_snapshot)
+ if (!snapshot->takenDuringRecovery && get_csnlog_status())
xid_csn = GenerateCSN(false);
LWLockRelease(ProcArrayLock);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fe6f694b3d..8c1bb48717 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1165,7 +1165,7 @@ static struct config_bool ConfigureNamesBool[] =
gettext_noop("Used to achieve REPEATEBLE READ isolation level for postgres_fdw transactions.")
},
&enable_csn_snapshot,
- true, /* XXX: set true to simplify tesing. XXX2: Seems that RESOURCES_MEM isn't the best catagory */
+ false,
NULL, NULL, NULL
},
{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5a0b8e9821..e1c264b300 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -298,6 +298,8 @@
# (change requires restart)
#track_commit_timestamp = off # collect timestamp of transaction commit
# (change requires restart)
+#enable_csn_snapshot = off # enable CSN-based snapshots
+ # (change requires restart)
# - Primary Server -
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5fa195b913..2a31366930 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -52,6 +52,7 @@
#include "access/transam.h"
#include "access/xact.h"
#include "access/xlog.h"
+#include "access/csn_log.h"
#include "catalog/catalog.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
@@ -2244,7 +2245,7 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
in_snapshot = XidInLocalMVCCSnapshot(xid, snapshot);
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
{
Assert(XidCSNIsFrozen(snapshot->snapshot_csn));
return in_snapshot;
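Reader's note on the CSN visibility path used by XidInMVCCSnapshot() above: a minimal standalone sketch of deciding visibility from a single CSN comparison. The sentinel values and the function csn_invisible are assumptions made for illustration (the real constants are defined in the patch's headers), and waiting on in-doubt transactions is left out.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XidCSN;

/* Illustrative sentinel values only; not copied from the patch. */
#define CSN_IN_PROGRESS  UINT64_C(0)
#define CSN_ABORTED      UINT64_C(1)
#define CSN_FROZEN       UINT64_C(2)
#define CSN_FIRST_NORMAL UINT64_C(4)

/*
 * Returns true if a transaction with commit sequence number xid_csn must be
 * hidden from a snapshot taken at snapshot_csn.
 */
static bool
csn_invisible(XidCSN xid_csn, XidCSN snapshot_csn)
{
    if (xid_csn >= CSN_FIRST_NORMAL)
        return xid_csn >= snapshot_csn;  /* committed at or after the snapshot */
    if (xid_csn == CSN_FROZEN)
        return false;                    /* old enough to be visible to everyone */
    return true;                         /* aborted or still in progress */
}
```

This is why exporting a snapshot only needs one number: the importing side compares each transaction's CSN against it instead of consulting xmin, xmax and the xip array.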
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
index 2db184b2ee..f8baad6806 100644
--- a/src/include/access/csn_log.h
+++ b/src/include/access/csn_log.h
@@ -50,6 +50,13 @@ extern void WriteAssignCSNXlogRec(XidCSN xidcsn);
extern void set_last_max_csn(XidCSN xidcsn);
extern void set_last_log_wal_csn(XidCSN xidcsn);
extern XidCSN get_last_log_wal_csn(void);
-extern void set_xmin_for_csn(void);
+extern void prepare_csn_env(bool enable_csn_snapshot);
+extern void CatchCSNLog(void);
+extern void ActivateCSNlog(void);
+extern void DeactivateCSNlog(void);
+extern void StartupCSN(void);
+extern void CompleteCSNInitialization(void);
+extern void CSNlogParameterChange(bool newvalue, bool oldvalue);
+extern bool get_csnlog_status(void);
#endif /* CSNLOG_H */
\ No newline at end of file
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 29de73c060..86e114e934 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
SUBDIRS = \
brin \
commit_ts \
+ csnsnapshot \
dummy_index_am \
dummy_seclabel \
snapshot_too_old \
diff --git a/src/test/modules/csnsnapshot/Makefile b/src/test/modules/csnsnapshot/Makefile
new file mode 100644
index 0000000000..45c4221cd0
--- /dev/null
+++ b/src/test/modules/csnsnapshot/Makefile
@@ -0,0 +1,18 @@
+# src/test/modules/csnsnapshot/Makefile
+
+REGRESS = csnsnapshot
+REGRESS_OPTS = --temp-config=$(top_srcdir)/src/test/modules/csnsnapshot/csn_snapshot.conf
+NO_INSTALLCHECK = 1
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/csnsnapshot
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/csnsnapshot/csn_snapshot.conf b/src/test/modules/csnsnapshot/csn_snapshot.conf
new file mode 100644
index 0000000000..e9d3c35756
--- /dev/null
+++ b/src/test/modules/csnsnapshot/csn_snapshot.conf
@@ -0,0 +1 @@
+track_commit_timestamp = on
diff --git a/src/test/modules/csnsnapshot/expected/csnsnapshot.out b/src/test/modules/csnsnapshot/expected/csnsnapshot.out
new file mode 100644
index 0000000000..ac28e417b6
--- /dev/null
+++ b/src/test/modules/csnsnapshot/expected/csnsnapshot.out
@@ -0,0 +1 @@
+create table t1(i int, j int, k varchar);
diff --git a/src/test/modules/csnsnapshot/sql/csnsnapshot.sql b/src/test/modules/csnsnapshot/sql/csnsnapshot.sql
new file mode 100644
index 0000000000..91539b8c30
--- /dev/null
+++ b/src/test/modules/csnsnapshot/sql/csnsnapshot.sql
@@ -0,0 +1 @@
+create table t1(i int, j int, k varchar);
\ No newline at end of file
diff --git a/src/test/modules/csnsnapshot/t/001_base.pl b/src/test/modules/csnsnapshot/t/001_base.pl
new file mode 100644
index 0000000000..1c91f4d9f7
--- /dev/null
+++ b/src/test/modules/csnsnapshot/t/001_base.pl
@@ -0,0 +1,102 @@
+# Single-node test: value can be set, and is still present after recovery
+
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 5;
+use PostgresNode;
+
+my $node = get_new_node('csntest');
+$node->init;
+$node->append_conf('postgresql.conf', qq{
+ enable_csn_snapshot = on
+ csn_snapshot_defer_time = 10
+ max_prepared_transactions = 10
+ });
+$node->start;
+
+my $test_1 = 1;
+
+# Create a table
+$node->safe_psql('postgres', 'create table t1(i int, j int)');
+
+# insert test record
+$node->safe_psql('postgres', 'insert into t1 values(1,1)');
+# export csn snapshot
+my $test_snapshot = $node->safe_psql('postgres', 'select pg_csn_snapshot_export()');
+# insert test record
+$node->safe_psql('postgres', 'insert into t1 values(2,1)');
+
+my $count1 = $node->safe_psql('postgres', "select count(*) from t1");
+is($count1, '2', 'Get right number in normal query');
+my $count2 = $node->safe_psql('postgres', "
+ begin transaction isolation level repeatable read;
+ select pg_csn_snapshot_import($test_snapshot);
+ select count(*) from t1;
+ commit;"
+ );
+
+is($count2, '
+1', 'Get right number in csn import query');
+
+#prepare transaction test
+$node->safe_psql('postgres', "
+ begin;
+ insert into t1 values(3,1);
+ insert into t1 values(3,2);
+ prepare transaction 'pt3';
+ ");
+$node->safe_psql('postgres', "
+ begin;
+ insert into t1 values(4,1);
+ insert into t1 values(4,2);
+ prepare transaction 'pt4';
+ ");
+$node->safe_psql('postgres', "
+ begin;
+ insert into t1 values(5,1);
+ insert into t1 values(5,2);
+ prepare transaction 'pt5';
+ ");
+$node->safe_psql('postgres', "
+ begin;
+ insert into t1 values(6,1);
+ insert into t1 values(6,2);
+ prepare transaction 'pt6';
+ ");
+$node->safe_psql('postgres', "commit prepared 'pt4';");
+
+# restart with enable_csn_snapshot off
+$node->append_conf('postgresql.conf', "enable_csn_snapshot = off");
+$node->restart;
+$node->safe_psql('postgres', "
+ insert into t1 values(7,1);
+ insert into t1 values(7,2);
+ ");
+$node->safe_psql('postgres', "commit prepared 'pt3';");
+$count1 = $node->safe_psql('postgres', "select count(*) from t1");
+is($count1, '8', 'Get right number in normal query');
+
+
+# restart with enable_csn_snapshot on
+$node->append_conf('postgresql.conf', "enable_csn_snapshot = on");
+$node->restart;
+$node->safe_psql('postgres', "
+ insert into t1 values(8,1);
+ insert into t1 values(8,2);
+ ");
+$node->safe_psql('postgres', "commit prepared 'pt5';");
+$count1 = $node->safe_psql('postgres', "select count(*) from t1");
+is($count1, '12', 'Get right number in normal query');
+
+# restart again with enable_csn_snapshot still on
+$node->append_conf('postgresql.conf', "enable_csn_snapshot = on");
+$node->restart;
+$node->safe_psql('postgres', "
+ insert into t1 values(9,1);
+ insert into t1 values(9,2);
+ ");
+$node->safe_psql('postgres', "commit prepared 'pt6';");
+$count1 = $node->safe_psql('postgres', "select count(*) from t1");
+is($count1, '16', 'Get right number in normal query');
diff --git a/src/test/modules/csnsnapshot/t/002_standby.pl b/src/test/modules/csnsnapshot/t/002_standby.pl
new file mode 100644
index 0000000000..b7c4ea93b2
--- /dev/null
+++ b/src/test/modules/csnsnapshot/t/002_standby.pl
@@ -0,0 +1,66 @@
+# Test simple scenario involving a standby
+
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 6;
+use PostgresNode;
+
+my $bkplabel = 'backup';
+my $master = get_new_node('master');
+$master->init(allows_streaming => 1);
+
+$master->append_conf(
+ 'postgresql.conf', qq{
+ enable_csn_snapshot = on
+ max_wal_senders = 5
+ });
+$master->start;
+$master->backup($bkplabel);
+
+my $standby = get_new_node('standby');
+$standby->init_from_backup($master, $bkplabel, has_streaming => 1);
+$standby->start;
+
+$master->safe_psql('postgres', "create table t1(i int, j int)");
+
+my $guc_on_master = $master->safe_psql('postgres', 'show enable_csn_snapshot');
+is($guc_on_master, 'on', "GUC on master");
+
+my $guc_on_standby = $standby->safe_psql('postgres', 'show enable_csn_snapshot');
+is($guc_on_standby, 'on', "GUC on standby");
+
+$master->append_conf('postgresql.conf', 'enable_csn_snapshot = off');
+$master->restart;
+
+$guc_on_master = $master->safe_psql('postgres', 'show enable_csn_snapshot');
+is($guc_on_master, 'off', "GUC off master");
+
+$guc_on_standby = $standby->safe_psql('postgres', 'show enable_csn_snapshot');
+is($guc_on_standby, 'on', "GUC on standby");
+
+# Consume a large number of transactions so that we skip past a page
+for my $i (1 .. 4096) #4096
+{
+ $master->safe_psql('postgres', "insert into t1 values(1,$i)");
+}
+$master->safe_psql('postgres', "select pg_sleep(2)");
+$master->append_conf('postgresql.conf', 'enable_csn_snapshot = on');
+$master->restart;
+
+my $count_standby = $standby->safe_psql('postgres', 'select count(*) from t1');
+is($count_standby, '4096', "Ok for switch xid-base > csn-base"); #4096
+
+# Consume a large number of transactions so that we skip past a page
+for my $i (1 .. 4096) #4096
+{
+ $master->safe_psql('postgres', "insert into t1 values(1,$i)");
+}
+$master->safe_psql('postgres', "select pg_sleep(2)");
+
+$master->append_conf('postgresql.conf', 'enable_csn_snapshot = off');
+$master->restart;
+
+$count_standby = $standby->safe_psql('postgres', 'select count(*) from t1');
+is($count_standby, '8192', "Ok for switch csn-base > xid-base"); #8192
\ No newline at end of file
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index da2e5aa38b..cc169a1999 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -73,7 +73,7 @@ select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
enable_bitmapscan | on
- enable_csn_snapshot | on
+ enable_csn_snapshot | off
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
Attachment: 0004-globale-snapshot-infrastructure.patch (application/octet-stream)
diff --git a/src/backend/access/transam/csn_snapshot.c b/src/backend/access/transam/csn_snapshot.c
index d7d0b5e90f..eddd1b9d5a 100644
--- a/src/backend/access/transam/csn_snapshot.c
+++ b/src/backend/access/transam/csn_snapshot.c
@@ -50,10 +50,62 @@ typedef struct
static CSNSnapshotState *csnState;
+
/*
- * Enables this module.
+ * GUC to delay the advance of oldestXid for this amount of time. Also
+ * determines the size of the CSNSnapshotXidMap circular buffer.
*/
-extern bool enable_csn_snapshot;
+int csn_snapshot_defer_time;
+
+
+/*
+ * CSNSnapshotXidMap
+ *
+ * To be able to install a csn snapshot that points to the past we need to
+ * keep old versions of tuples and therefore delay the advance of oldestXid.
+ * Here we keep track of the correspondence between a snapshot's snapshot_csn
+ * and the oldestXid that was set at the time the snapshot was taken. This is
+ * much like what snapshot-too-old's OldSnapshotControlData does, but with
+ * finer, per-second granularity.
+ *
+ * Different strategies can be employed to hold oldestXid (e.g. we can track
+ * the oldest csn-based snapshot among cluster nodes and map it to oldestXid
+ * on each node).
+ *
+ * On each snapshot acquisition CSNSnapshotMapXmin() is called and stores
+ * correspondence between current snapshot_csn and oldestXmin in a sparse way:
+ * snapshot_csn is rounded to seconds (and here we use the fact that snapshot_csn
+ * is just a timestamp) and oldestXmin is stored in the circular buffer where
+ * rounded snapshot_csn acts as an offset from current circular buffer head.
+ * Size of the circular buffer is controlled by csn_snapshot_defer_time GUC.
+ *
+ * When a csn snapshot arrives we check that its snapshot_csn is still in
+ * our map; otherwise we error out with a "snapshot too old" message. If
+ * snapshot_csn is successfully mapped to oldestXid we save the backend's
+ * pgxact->xmin in proc->originalXmin and set pgxact->xmin to the mapped
+ * oldestXid. That way GetOldestXmin() can take into account backends with
+ * an imported csn snapshot, and old tuple versions will be preserved.
+ *
+ * Also, while calculating oldestXmin for our map in the presence of imported
+ * csn snapshots we should use proc->originalXmin instead of the pgxact->xmin
+ * that was set during import. Otherwise we can create a feedback loop: the
+ * xmins of imported csn snapshots were calculated using our map, and new
+ * map entries would be calculated from those xmins, so there is a risk of
+ * getting stuck forever with one non-increasing oldestXmin. All other
+ * callers of GetOldestXmin() use pgxact->xmin so that the old tuple versions
+ * are preserved.
+ */
+typedef struct CSNSnapshotXidMap
+{
+ int head; /* offset of current freshest value */
+ int size; /* total size of circular buffer */
+ CSN_atomic last_csn_seconds; /* last rounded csn that changed
+ * xmin_by_second[] */
+ TransactionId *xmin_by_second; /* circular buffer of oldestXmin's */
+}
+CSNSnapshotXidMap;
+
+static CSNSnapshotXidMap *csnXidMap;
/* Estimate shared memory space needed */
@@ -64,6 +116,13 @@ CSNSnapshotShmemSize(void)
size += MAXALIGN(sizeof(CSNSnapshotState));
+ if (csn_snapshot_defer_time > 0)
+ {
+ size += sizeof(CSNSnapshotXidMap);
+ size += csn_snapshot_defer_time*sizeof(TransactionId);
+ size = MAXALIGN(size);
+ }
+
return size;
}
@@ -73,15 +132,232 @@ CSNSnapshotShmemInit()
{
bool found;
- csnState = ShmemInitStruct("csnState",
- sizeof(CSNSnapshotState),
- &found);
- if (!found)
+ if (true)
+ {
+ csnState = ShmemInitStruct("csnState",
+ sizeof(CSNSnapshotState),
+ &found);
+ if (!found)
+ {
+ csnState->last_max_csn = 0;
+ csnState->last_csn_log_wal = 0;
+ SpinLockInit(&csnState->lock);
+ }
+ }
+
+ if (csn_snapshot_defer_time > 0)
+ {
+ csnXidMap = ShmemInitStruct("csnXidMap",
+ sizeof(CSNSnapshotXidMap),
+ &found);
+ if (!found)
+ {
+ int i;
+
+ pg_atomic_init_u64(&csnXidMap->last_csn_seconds, 0);
+ csnXidMap->head = 0;
+ csnXidMap->size = csn_snapshot_defer_time;
+ csnXidMap->xmin_by_second =
+ ShmemAlloc(sizeof(TransactionId)*csnXidMap->size);
+
+ for (i = 0; i < csnXidMap->size; i++)
+ csnXidMap->xmin_by_second[i] = InvalidTransactionId;
+ }
+ }
+}
+
+/*
+ * CSNSnapshotStartup
+ *
+ * Set csnXidMap entries to oldestActiveXID during startup.
+ */
+void
+CSNSnapshotStartup(TransactionId oldestActiveXID)
+{
+ /*
+ * Run only if we have initialized shared memory and csnXidMap
+ * is enabled.
+ */
+ if (IsNormalProcessingMode() && csn_snapshot_defer_time > 0)
+ {
+ int i;
+
+ Assert(TransactionIdIsValid(oldestActiveXID));
+ for (i = 0; i < csnXidMap->size; i++)
+ csnXidMap->xmin_by_second[i] = oldestActiveXID;
+ ProcArraySetCSNSnapshotXmin(oldestActiveXID);
+ }
+}
+
+/*
+ * CSNSnapshotMapXmin
+ *
+ * Maintain a circular buffer of oldestXmins for several seconds in the past.
+ * This buffer allows us to shift oldestXmin into the past when a backend is
+ * importing a CSN snapshot. Otherwise old versions of tuples that are needed
+ * by this transaction can be recycled by other processes (vacuum, HOT, etc).
+ *
+ * Locking here is not trivial. Called upon each snapshot creation after
+ * ProcArrayLock is released. Such usage creates several race conditions. It
+ * is possible that a backend that got its csn calls CSNSnapshotMapXmin()
+ * only after other backends have managed to get a snapshot and complete
+ * their CSNSnapshotMapXmin() call, or even committed. This is safe because
+ *
+ * * We already hold our xmin in MyPgXact, so our snapshot will not be
+ * harmed even though ProcArrayLock is released.
+ *
+ * * snapshot_csn is always pessimistically rounded up to the next
+ * second.
+ *
+ * * For performance reasons, the xmin value for a particular second is
+ * filled only once. Because of that, instead of writing just our xmin to
+ * the buffer (which is enough for our snapshot), we write oldestXmin
+ * there -- it mitigates the possibility of damaging someone else's
+ * snapshot by writing a too-advanced value to the buffer when another
+ * backend generated its csn earlier but was slow and didn't manage to
+ * insert it before us.
+ *
+ * * If CSNSnapshotMapXmin() finds a gap of several seconds between the
+ * current call and the latest completed call then it should fill that gap
+ * with the latest known values instead of the new one. Otherwise it is
+ * possible (however highly unlikely) that this gap also happened
+ * between taking a snapshot and the call to CSNSnapshotMapXmin() for some
+ * backend, and we are at risk of filling the circular buffer with
+ * oldestXmins that are bigger than they actually were.
+ */
+void
+CSNSnapshotMapXmin(SnapshotCSN snapshot_csn)
+{
+ int offset, gap, i;
+ SnapshotCSN csn_seconds;
+ SnapshotCSN last_csn_seconds;
+ volatile TransactionId oldest_deferred_xmin;
+ TransactionId current_oldest_xmin, previous_oldest_xmin;
+
+ /* Callers should check config values */
+ Assert(csn_snapshot_defer_time > 0);
+ Assert(csnXidMap != NULL);
+ /*
+ * Round up snapshot_csn to the next second -- pessimistically and safely.
+ */
+ csn_seconds = (snapshot_csn / NSECS_PER_SEC + 1);
+
+ /*
+ * Fast-path check. Avoid taking exclusive CSNSnapshotXidMapLock lock
+ * if oldestXid was already written to xmin_by_second[] for this rounded
+ * snapshot_csn.
+ */
+ if (pg_atomic_read_u64(&csnXidMap->last_csn_seconds) >= csn_seconds)
+ return;
+
+ /* Ok, we have new entry (or entries) */
+ LWLockAcquire(CSNSnapshotXidMapLock, LW_EXCLUSIVE);
+
+ /* Re-check last_csn_seconds under lock */
+ last_csn_seconds = pg_atomic_read_u64(&csnXidMap->last_csn_seconds);
+ if (last_csn_seconds >= csn_seconds)
+ {
+ LWLockRelease(CSNSnapshotXidMapLock);
+ return;
+ }
+ pg_atomic_write_u64(&csnXidMap->last_csn_seconds, csn_seconds);
+
+ /*
+ * Compute oldest_xmin.
+ *
+ * It would be possible to calculate oldest_xmin during the corresponding
+ * snapshot creation, but GetSnapshotData() intentionally reads only PgXact
+ * and not PgProc. We need originalXmin (see the comment on csnXidMap),
+ * which is stored in PgProc because the comments around PgXact warn
+ * against extending it with new fields. So just calculate oldest_xmin
+ * again; that happens quite rarely anyway.
+ */
+ current_oldest_xmin = GetOldestXmin(NULL, PROCARRAY_NON_IMPORTED_XMIN);
+
+ previous_oldest_xmin = csnXidMap->xmin_by_second[csnXidMap->head];
+
+ Assert(TransactionIdIsNormal(current_oldest_xmin));
+ Assert(TransactionIdIsNormal(previous_oldest_xmin));
+
+ gap = csn_seconds - last_csn_seconds;
+ offset = csn_seconds % csnXidMap->size;
+
+ /* Sanity check before we update head and gap */
+ Assert( gap >= 1 );
+ Assert( (csnXidMap->head + gap) % csnXidMap->size == offset );
+
+ gap = gap > csnXidMap->size ? csnXidMap->size : gap;
+ csnXidMap->head = offset;
+
+ /* Fill new entry with current_oldest_xmin */
+ csnXidMap->xmin_by_second[offset] = current_oldest_xmin;
+
+ /*
+ * If we have gap then fill it with previous_oldest_xmin for reasons
+ * outlined in comment above this function.
+ */
+ for (i = 1; i < gap; i++)
+ {
+ offset = (offset + csnXidMap->size - 1) % csnXidMap->size;
+ csnXidMap->xmin_by_second[offset] = previous_oldest_xmin;
+ }
+
+ oldest_deferred_xmin =
+ csnXidMap->xmin_by_second[ (csnXidMap->head + 1) % csnXidMap->size ];
+
+ LWLockRelease(CSNSnapshotXidMapLock);
+
+ /*
+ * Advance procArray->csn_snapshot_xmin after we released
+ * CSNSnapshotXidMapLock. Since we gather oldestXmin rather than xmin, it
+ * never goes backwards regardless of how slowly we get here.
+ */
+ Assert(TransactionIdFollowsOrEquals(oldest_deferred_xmin,
+ ProcArrayGetCSNSnapshotXmin()));
+ ProcArraySetCSNSnapshotXmin(oldest_deferred_xmin);
+}
+
+
+/*
+ * CSNSnapshotToXmin
+ *
+ * Get oldestXmin that took place when snapshot_csn was taken.
+ */
+TransactionId
+CSNSnapshotToXmin(SnapshotCSN snapshot_csn)
+{
+ TransactionId xmin;
+ SnapshotCSN csn_seconds;
+ volatile SnapshotCSN last_csn_seconds;
+
+ /* Callers should check config values */
+ Assert(csn_snapshot_defer_time > 0);
+ Assert(csnXidMap != NULL);
+
+ /* Round down to get conservative estimates */
+ csn_seconds = (snapshot_csn / NSECS_PER_SEC);
+
+ LWLockAcquire(CSNSnapshotXidMapLock, LW_SHARED);
+ last_csn_seconds = pg_atomic_read_u64(&csnXidMap->last_csn_seconds);
+ if (csn_seconds > last_csn_seconds)
+ {
+ /* we don't have entry for this snapshot_csn yet, return latest known */
+ xmin = csnXidMap->xmin_by_second[csnXidMap->head];
+ }
+ else if (last_csn_seconds - csn_seconds < csnXidMap->size)
{
- csnState->last_max_csn = 0;
- csnState->last_csn_log_wal = 0;
- SpinLockInit(&csnState->lock);
+ /* we are good, retrieve value from our map */
+ Assert(last_csn_seconds % csnXidMap->size == csnXidMap->head);
+ xmin = csnXidMap->xmin_by_second[csn_seconds % csnXidMap->size];
}
+ else
+ {
+ /* requested snapshot_csn is too old, let caller know */
+ xmin = InvalidTransactionId;
+ }
+ LWLockRelease(CSNSnapshotXidMapLock);
+
+ return xmin;
}
/*
@@ -98,7 +374,7 @@ GenerateCSN(bool locked)
instr_time current_time;
SnapshotCSN csn;
- Assert(get_csnlog_status());
+ Assert(get_csnlog_status() || csn_snapshot_defer_time > 0);
/*
* TODO: create some macro that add small random shift to current time.
@@ -123,6 +399,125 @@ GenerateCSN(bool locked)
return csn;
}
+/*
+ * CSNSnapshotPrepareCurrent
+ *
+ * Set InDoubt state for currently active transaction and return commit's
+ * global snapshot.
+ */
+SnapshotCSN
+CSNSnapshotPrepareCurrent(void)
+{
+ TransactionId xid = GetCurrentTransactionIdIfAny();
+
+ if (!enable_csn_snapshot)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not prepare transaction for global commit"),
+ errhint("Make sure the configuration parameter \"%s\" is enabled.",
+ "enable_csn_snapshot")));
+
+ if (TransactionIdIsValid(xid))
+ {
+ TransactionId *subxids;
+ int nsubxids = xactGetCommittedChildren(&subxids);
+ CSNLogSetCSN(xid, nsubxids, subxids, InDoubtXidCSN, true);
+ }
+
+ /* Nothing to write if we don't have xid */
+
+ return GenerateCSN(false);
+}
+
+
+/*
+ * CSNSnapshotAssignCsnCurrent
+ *
+ * Assign SnapshotCSN for the currently active transaction. SnapshotCSN is
+ * expected to be the maximum of the values returned by
+ * CSNSnapshotPrepareCurrent and pg_csn_snapshot_prepare.
+ */
+void
+CSNSnapshotAssignCsnCurrent(SnapshotCSN snapshot_csn)
+{
+ if (!enable_csn_snapshot)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not prepare transaction for global commit"),
+ errhint("Make sure the configuration parameter \"%s\" is enabled.",
+ "enable_csn_snapshot")));
+
+ if (!XidCSNIsNormal(snapshot_csn))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("pg_csn_snapshot_assign expects normal snapshot_csn")));
+
+ /* Skip empty transactions */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return;
+
+ /* Set snapshot_csn and defuse ProcArrayEndTransaction from assigning one */
+ pg_atomic_write_u64(&MyProc->assignedXidCsn, snapshot_csn);
+}
+
+/*
+ * CSNSnapshotSync
+ *
+ * Due to clock desynchronization between nodes we can receive a snapshot_csn
+ * that is ahead of the local clock on this node. To preserve proper isolation
+ * this node needs to wait until its local clock reaches that snapshot_csn.
+ *
+ * This should happen relatively rarely if nodes are running NTP/PTP/etc.
+ * Complain if wait time is more than SNAP_DESYNC_COMPLAIN.
+ */
+void
+CSNSnapshotSync(SnapshotCSN remote_csn)
+{
+ SnapshotCSN local_csn;
+ SnapshotCSN delta;
+
+ Assert(enable_csn_snapshot);
+
+ for(;;)
+ {
+ SpinLockAcquire(&csnState->lock);
+ if (csnState->last_max_csn > remote_csn)
+ {
+ /* Everything is fine */
+ SpinLockRelease(&csnState->lock);
+ return;
+ }
+ else if ((local_csn = GenerateCSN(true)) >= remote_csn)
+ {
+ /*
+ * Everything is fine too, but last_max_csn wasn't updated for
+ * some time.
+ */
+ SpinLockRelease(&csnState->lock);
+ return;
+ }
+ SpinLockRelease(&csnState->lock);
+
+ /* Okay we need to sleep now */
+ delta = remote_csn - local_csn;
+ if (delta > SNAP_DESYNC_COMPLAIN)
+ ereport(WARNING,
+ (errmsg("remote global snapshot exceeds ours by more than a second"),
+ errhint("Consider running NTPd on servers participating in global transaction")));
+
+ /* TODO: report this sleeptime somewhere? */
+ pg_usleep((long) (delta/NSECS_PER_USEC));
+
+ /*
+ * Loop that checks to ensure that we actually slept for specified
+ * amount of time.
+ */
+ }
+
+ Assert(false); /* Should not happen */
+ return;
+}
+
/*
* TransactionIdGetXidCSN
*
@@ -405,4 +800,4 @@ prepare_csn_env(bool enable)
/* Try to drop all csnlog seg */
DeactivateCSNlog();
}
-}
\ No newline at end of file
+}
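
The per-second circular buffer that CSNSnapshotMapXmin() and CSNSnapshotToXmin() maintain in the csn_snapshot.c changes above can be sketched in isolation. This is only a minimal, single-threaded illustration: the names XidMap, map_init, map_record and map_lookup are hypothetical, and the LWLock, the atomic last_csn_seconds and the GetOldestXmin() call from the patch are all left out. It shows the gap-filling rule (skipped seconds receive the previous head value, not the new one) and the "too old" lookup result.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for TransactionId / InvalidTransactionId. */
typedef uint32_t Xid;
#define INVALID_XID 0u

typedef struct
{
    int      size;           /* number of one-second slots */
    int      head;           /* slot holding the freshest second */
    uint64_t last_seconds;   /* freshest second recorded so far */
    Xid     *xmin_by_second; /* circular buffer of oldestXmin values */
} XidMap;

/* Allocate the buffer and fill every slot with the startup xmin. */
static void
map_init(XidMap *m, int size, Xid start_xmin)
{
    m->size = size;
    m->head = 0;
    m->last_seconds = 0;
    m->xmin_by_second = malloc(sizeof(Xid) * (size_t) size);
    assert(m->xmin_by_second != NULL);
    for (int i = 0; i < size; i++)
        m->xmin_by_second[i] = start_xmin;
}

/* Record oldest_xmin for `seconds`; skipped seconds keep the old head value. */
static void
map_record(XidMap *m, uint64_t seconds, Xid oldest_xmin)
{
    if (seconds <= m->last_seconds)
        return;                 /* this second is already filled */

    uint64_t gap = seconds - m->last_seconds;
    Xid      previous = m->xmin_by_second[m->head];
    int      offset = (int) (seconds % (uint64_t) m->size);

    if (gap > (uint64_t) m->size)
        gap = (uint64_t) m->size;   /* never wrap more than once */

    m->last_seconds = seconds;
    m->head = offset;
    m->xmin_by_second[offset] = oldest_xmin;

    /* Walk backwards over the gap, filling it with the previous value. */
    for (uint64_t i = 1; i < gap; i++)
    {
        offset = (offset + m->size - 1) % m->size;
        m->xmin_by_second[offset] = previous;
    }
}

/* Map a (rounded-down) second back to an xmin, or INVALID_XID if too old. */
static Xid
map_lookup(const XidMap *m, uint64_t seconds)
{
    if (seconds > m->last_seconds)
        return m->xmin_by_second[m->head];  /* not recorded yet: latest known */
    if (m->last_seconds - seconds < (uint64_t) m->size)
        return m->xmin_by_second[seconds % (uint64_t) m->size];
    return INVALID_XID;                     /* fell out of the buffer */
}
```

With a four-slot map initialised to xmin 100, recording (second 1, xmin 110) and then (second 4, xmin 140) leaves seconds 2 and 3 mapped to 110, and a lookup for second 0 reports INVALID_XID, which corresponds to the "csn snapshot too old" error in ImportCSNSnapshot().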
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 57bda5d422..7f90520beb 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2469,3 +2469,128 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
RemoveTwoPhaseFile(xid, giveWarning);
RemoveGXact(gxact);
}
+
+/*
+ * CSNSnapshotPrepareTwophase
+ *
+ * Set InDoubt state for the prepared transaction identified by gid and
+ * return commit's global snapshot.
+ */
+static SnapshotCSN
+CSNSnapshotPrepareTwophase(const char *gid)
+{
+ GlobalTransaction gxact;
+ PGXACT *pgxact;
+ char *buf;
+ TransactionId xid;
+ xl_xact_parsed_prepare parsed;
+
+ if (!enable_csn_snapshot)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not prepare transaction for global commit"),
+ errhint("Make sure the configuration parameter \"%s\" is enabled.",
+ "enable_csn_snapshot")));
+
+ /*
+ * Validate the GID, and lock the GXACT to ensure that two backends do not
+ * try to access the same GID at once.
+ */
+ gxact = LockGXact(gid, GetUserId());
+ pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
+ xid = pgxact->xid;
+
+ if (gxact->ondisk)
+ buf = ReadTwoPhaseFile(xid, true);
+ else
+ XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+
+ ParsePrepareRecord(0, (xl_xact_prepare *)buf, &parsed);
+
+ CSNLogSetCSN(xid, parsed.nsubxacts,
+ parsed.subxacts, InDoubtXidCSN, true);
+
+ /* Unlock our GXACT */
+ LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+ gxact->locking_backend = InvalidBackendId;
+ LWLockRelease(TwoPhaseStateLock);
+
+ pfree(buf);
+
+ return GenerateCSN(false);
+}
+
+/*
+ * CSNSnapshotAssignCsnTwoPhase
+ *
+ * Assign SnapshotCSN for the prepared transaction identified by gid.
+ * SnapshotCSN is expected to be the maximum of the values returned by
+ * CSNSnapshotPrepareCurrent and pg_csn_snapshot_prepare.
+ *
+ * This function is a counterpart of CSNSnapshotAssignCsnCurrent() for
+ * twophase transactions.
+ */
+static void
+CSNSnapshotAssignCsnTwoPhase(const char *gid, SnapshotCSN snapshot_csn)
+{
+ GlobalTransaction gxact;
+ PGPROC *proc;
+
+ if (!enable_csn_snapshot)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not prepare transaction for global commit"),
+ errhint("Make sure the configuration parameter \"%s\" is enabled.",
+ "enable_csn_snapshot")));
+
+ if (!XidCSNIsNormal(snapshot_csn))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("pg_csn_snapshot_assign expects normal snapshot_csn")));
+
+ /*
+ * Validate the GID, and lock the GXACT to ensure that two backends do not
+ * try to access the same GID at once.
+ */
+ gxact = LockGXact(gid, GetUserId());
+ proc = &ProcGlobal->allProcs[gxact->pgprocno];
+
+ /* Set snapshot_csn and defuse ProcArrayRemove from assigning one. */
+ pg_atomic_write_u64(&proc->assignedXidCsn, snapshot_csn);
+
+ /* Unlock our GXACT */
+ LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+ gxact->locking_backend = InvalidBackendId;
+ LWLockRelease(TwoPhaseStateLock);
+}
+
+/*
+ * SQL interface to CSNSnapshotPrepareTwophase()
+ *
+ * TODO: Rewrite this as PREPARE TRANSACTION 'gid' RETURNING SNAPSHOT
+ */
+Datum
+pg_csn_snapshot_prepare(PG_FUNCTION_ARGS)
+{
+ const char *gid = text_to_cstring(PG_GETARG_TEXT_PP(0));
+ SnapshotCSN snapshot_csn;
+
+ snapshot_csn = CSNSnapshotPrepareTwophase(gid);
+
+ PG_RETURN_INT64(snapshot_csn);
+}
+
+/*
+ * SQL interface to CSNSnapshotAssignCsnTwoPhase()
+ *
+ * TODO: Rewrite this as COMMIT PREPARED 'gid' SNAPSHOT 'global_csn'
+ */
+Datum
+pg_csn_snapshot_assign(PG_FUNCTION_ARGS)
+{
+ const char *gid = text_to_cstring(PG_GETARG_TEXT_PP(0));
+ SnapshotCSN snapshot_csn = PG_GETARG_INT64(1);
+
+ CSNSnapshotAssignCsnTwoPhase(gid, snapshot_csn);
+ PG_RETURN_VOID();
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7435987359..3ea3220a1a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7070,6 +7070,7 @@ StartupXLOG(void)
*/
StartupCLOG();
StartupSUBTRANS(oldestActiveXID);
+ CSNSnapshotStartup(oldestActiveXID);
/*
* If we're beginning at a shutdown checkpoint, we know that
@@ -7887,6 +7888,7 @@ StartupXLOG(void)
{
StartupCLOG();
StartupSUBTRANS(oldestActiveXID);
+ CSNSnapshotStartup(oldestActiveXID);
}
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e326b431c2..b36a85cd01 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -96,6 +96,8 @@ typedef struct ProcArrayStruct
TransactionId replication_slot_xmin;
/* oldest catalog xmin of any replication slot */
TransactionId replication_slot_catalog_xmin;
+ /* xmin of oldest active csn snapshot */
+ TransactionId csn_snapshot_xmin;
/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
int pgprocnos[FLEXIBLE_ARRAY_MEMBER];
@@ -251,6 +253,7 @@ CreateSharedProcArray(void)
procArray->lastOverflowedXid = InvalidTransactionId;
procArray->replication_slot_xmin = InvalidTransactionId;
procArray->replication_slot_catalog_xmin = InvalidTransactionId;
+ procArray->csn_snapshot_xmin = InvalidTransactionId;
}
allProcs = ProcGlobal->allProcs;
@@ -442,6 +445,8 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
proc->lxid = InvalidLocalTransactionId;
pgxact->xmin = InvalidTransactionId;
+ proc->originalXmin = InvalidTransactionId;
+
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
proc->delayChkpt = false; /* be sure this is cleared in abort */
@@ -464,6 +469,8 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
pgxact->xid = InvalidTransactionId;
proc->lxid = InvalidLocalTransactionId;
pgxact->xmin = InvalidTransactionId;
+ proc->originalXmin = InvalidTransactionId;
+
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
proc->delayChkpt = false; /* be sure this is cleared in abort */
@@ -630,6 +637,7 @@ ProcArrayClearTransaction(PGPROC *proc)
pgxact->xid = InvalidTransactionId;
proc->lxid = InvalidLocalTransactionId;
pgxact->xmin = InvalidTransactionId;
+ proc->originalXmin = InvalidTransactionId;
proc->recoveryConflictPending = false;
/* redundant, but just in case */
@@ -1332,6 +1340,7 @@ GetOldestXmin(Relation rel, int flags)
TransactionId replication_slot_xmin = InvalidTransactionId;
TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
+ TransactionId csn_snapshot_xmin = InvalidTransactionId;
/*
* If we're not computing a relation specific limit, or if a shared
@@ -1370,6 +1379,7 @@ GetOldestXmin(Relation rel, int flags)
{
/* Fetch xid just once - see GetNewTransactionId */
TransactionId xid = UINT32_ACCESS_ONCE(pgxact->xid);
+ TransactionId original_xmin = UINT32_ACCESS_ONCE(proc->originalXmin);
/* First consider the transaction's own Xid, if any */
if (TransactionIdIsNormal(xid) &&
@@ -1382,8 +1392,17 @@ GetOldestXmin(Relation rel, int flags)
* We must check both Xid and Xmin because a transaction might
* have an Xmin but not (yet) an Xid; conversely, if it has an
* Xid, that could determine some not-yet-set Xmin.
+ *
+ * In case of oldestXmin calculation for CSNSnapshotMapXmin()
+ * pgxact->xmin should be changed to proc->originalXmin. Details
+ * in comments to CSNSnapshotMapXmin.
*/
- xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+ if ((flags & PROCARRAY_NON_IMPORTED_XMIN) &&
+ TransactionIdIsValid(original_xmin))
+ xid = original_xmin;
+ else
+ xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+
if (TransactionIdIsNormal(xid) &&
TransactionIdPrecedes(xid, result))
result = xid;
@@ -1397,6 +1416,7 @@ GetOldestXmin(Relation rel, int flags)
*/
replication_slot_xmin = procArray->replication_slot_xmin;
replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+ csn_snapshot_xmin = procArray->csn_snapshot_xmin;
if (RecoveryInProgress())
{
@@ -1438,6 +1458,11 @@ GetOldestXmin(Relation rel, int flags)
result = FirstNormalTransactionId;
}
+ if (!(flags & PROCARRAY_NON_IMPORTED_XMIN) &&
+ TransactionIdIsValid(csn_snapshot_xmin) &&
+ NormalTransactionIdPrecedes(csn_snapshot_xmin, result))
+ result = csn_snapshot_xmin;
+
/*
* Check whether there are replication slots requiring an older xmin.
*/
@@ -1535,6 +1560,7 @@ GetSnapshotData(Snapshot snapshot)
XidCSN xid_csn = FrozenXidCSN;
TransactionId replication_slot_xmin = InvalidTransactionId;
TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
+ TransactionId csn_snapshot_xmin = InvalidTransactionId;
Assert(snapshot != NULL);
@@ -1726,6 +1752,7 @@ GetSnapshotData(Snapshot snapshot)
*/
replication_slot_xmin = procArray->replication_slot_xmin;
replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+ csn_snapshot_xmin = procArray->csn_snapshot_xmin;
if (!TransactionIdIsValid(MyPgXact->xmin))
MyPgXact->xmin = TransactionXmin = xmin;
@@ -1752,6 +1779,10 @@ GetSnapshotData(Snapshot snapshot)
if (!TransactionIdIsNormal(RecentGlobalXmin))
RecentGlobalXmin = FirstNormalTransactionId;
+ if (TransactionIdIsValid(csn_snapshot_xmin) &&
+ TransactionIdPrecedes(csn_snapshot_xmin, RecentGlobalXmin))
+ RecentGlobalXmin = csn_snapshot_xmin;
+
/* Check whether there's a replication slot requiring an older xmin. */
if (TransactionIdIsValid(replication_slot_xmin) &&
NormalTransactionIdPrecedes(replication_slot_xmin, RecentGlobalXmin))
@@ -1807,7 +1838,10 @@ GetSnapshotData(Snapshot snapshot)
MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
}
+ snapshot->imported_snapshot_csn = false;
snapshot->snapshot_csn = xid_csn;
+ if (csn_snapshot_defer_time > 0 && IsUnderPostmaster)
+ CSNSnapshotMapXmin(snapshot->snapshot_csn);
return snapshot;
}
@@ -3156,6 +3190,24 @@ ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
LWLockRelease(ProcArrayLock);
}
+/*
+ * ProcArraySetCSNSnapshotXmin
+ */
+void
+ProcArraySetCSNSnapshotXmin(TransactionId xmin)
+{
+ /* We rely on atomic fetch/store of xid */
+ procArray->csn_snapshot_xmin = xmin;
+}
+
+/*
+ * ProcArrayGetCSNSnapshotXmin
+ */
+TransactionId
+ProcArrayGetCSNSnapshotXmin(void)
+{
+ return procArray->csn_snapshot_xmin;
+}
#define XidCacheRemove(i) \
do { \
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 3c95ce4aac..e048a2276d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -51,3 +51,4 @@ OldSnapshotTimeMapLock 42
LogicalRepWorkerLock 43
XactTruncationLock 44
CSNLogControlLock 45
+CSNSnapshotXidMapLock 46
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 2a31366930..2bfafa69c1 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -2394,3 +2394,98 @@ XidInLocalMVCCSnapshot(TransactionId xid, Snapshot snapshot)
return false;
}
+
+
+/*
+ * ExportCSNSnapshot
+ *
+ * Export snapshot_csn so that caller can expand this transaction to other
+ * nodes.
+ *
+ * TODO: it's better to do this through EXPORT/IMPORT SNAPSHOT syntax and
+ * add checks that the transaction has not yet acquired an xid, but for the
+ * current iteration of this patch I don't want to hack on the parser.
+ */
+SnapshotCSN
+ExportCSNSnapshot()
+{
+ if (!get_csnlog_status())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not export csn snapshot"),
+ errhint("Make sure the configuration parameter \"%s\" is enabled.",
+ "enable_csn_snapshot")));
+
+ return CurrentSnapshot->snapshot_csn;
+}
+
+/* SQL accessor to ExportCSNSnapshot() */
+Datum
+pg_csn_snapshot_export(PG_FUNCTION_ARGS)
+{
+ SnapshotCSN export_csn = ExportCSNSnapshot();
+ PG_RETURN_UINT64(export_csn);
+}
+
+/*
+ * ImportCSNSnapshot
+ *
+ * Import csn and retract this backend's xmin to the value that was
+ * current when that csn was taken.
+ *
+ * TODO: it's better to do this through EXPORT/IMPORT SNAPSHOT syntax and
+ * add checks that the transaction has not yet acquired an xid, but for the
+ * current iteration of this patch I don't want to hack on the parser.
+ */
+void
+ImportCSNSnapshot(SnapshotCSN snapshot_csn)
+{
+ volatile TransactionId xmin;
+
+ if (!get_csnlog_status())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not import csn snapshot"),
+ errhint("Make sure the configuration parameter \"%s\" is enabled.",
+ "enable_csn_snapshot")));
+
+ if (csn_snapshot_defer_time <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not import csn snapshot"),
+ errhint("Make sure the configuration parameter \"%s\" is positive.",
+ "csn_snapshot_defer_time")));
+
+ /*
+ * Call CSNSnapshotToXmin under ProcArrayLock so that the resulting xmin
+ * cannot be evicted from the map before we set it as our backend's
+ * xmin.
+ */
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+ xmin = CSNSnapshotToXmin(snapshot_csn);
+ if (!TransactionIdIsValid(xmin))
+ {
+ LWLockRelease(ProcArrayLock);
+ elog(ERROR, "CSNSnapshotToXmin: csn snapshot too old");
+ }
+ MyProc->originalXmin = MyPgXact->xmin;
+ MyPgXact->xmin = TransactionXmin = xmin;
+ LWLockRelease(ProcArrayLock);
+
+ CurrentSnapshot->xmin = xmin; /* defuse SnapshotResetXmin() */
+ CurrentSnapshot->snapshot_csn = snapshot_csn;
+ CurrentSnapshot->imported_snapshot_csn = true;
+ CSNSnapshotSync(snapshot_csn);
+
+ Assert(TransactionIdPrecedesOrEquals(RecentGlobalXmin, xmin));
+ Assert(TransactionIdPrecedesOrEquals(RecentGlobalDataXmin, xmin));
+}
+
+/* SQL accessor to ImportCSNSnapshot() */
+Datum
+pg_csn_snapshot_import(PG_FUNCTION_ARGS)
+{
+ SnapshotCSN snapshot_csn = PG_GETARG_UINT64(0);
+ ImportCSNSnapshot(snapshot_csn);
+ PG_RETURN_VOID();
+}
\ No newline at end of file
diff --git a/src/include/access/csn_snapshot.h b/src/include/access/csn_snapshot.h
index 1894586204..91561683b3 100644
--- a/src/include/access/csn_snapshot.h
+++ b/src/include/access/csn_snapshot.h
@@ -37,10 +37,15 @@ typedef pg_atomic_uint64 CSN_atomic;
#define XidCSNIsNormal(csn) ((csn) >= FirstNormalXidCSN)
+extern int csn_snapshot_defer_time;
extern Size CSNSnapshotShmemSize(void);
extern void CSNSnapshotShmemInit(void);
+extern void CSNSnapshotStartup(TransactionId oldestActiveXID);
+
+extern void CSNSnapshotMapXmin(SnapshotCSN snapshot_csn);
+extern TransactionId CSNSnapshotToXmin(SnapshotCSN snapshot_csn);
extern SnapshotCSN GenerateCSN(bool locked);
@@ -54,5 +59,8 @@ extern void CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid, int nsubxids,
TransactionId *subxids);
extern void CSNSnapshotCommit(PGPROC *proc, TransactionId xid, int nsubxids,
TransactionId *subxids);
+extern void CSNSnapshotAssignCsnCurrent(SnapshotCSN snapshot_csn);
+extern SnapshotCSN CSNSnapshotPrepareCurrent(void);
+extern void CSNSnapshotSync(SnapshotCSN remote_csn);
#endif /* CSN_SNAPSHOT_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 95604e988a..17e85486ae 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10953,4 +10953,18 @@
proname => 'is_normalized', prorettype => 'bool', proargtypes => 'text text',
prosrc => 'unicode_is_normalized' },
+# csn snapshot handling
+{ oid => '4179', descr => 'export csn snapshot',
+ proname => 'pg_csn_snapshot_export', provolatile => 'v', proparallel => 'u',
+ prorettype => 'int8', proargtypes => '', prosrc => 'pg_csn_snapshot_export' },
+{ oid => '4180', descr => 'import csn snapshot',
+ proname => 'pg_csn_snapshot_import', provolatile => 'v', proparallel => 'u',
+ prorettype => 'void', proargtypes => 'int8', prosrc => 'pg_csn_snapshot_import' },
+{ oid => '4198', descr => 'prepare distributed transaction for commit, get global_csn',
+ proname => 'pg_csn_snapshot_prepare', provolatile => 'v', proparallel => 'u',
+ prorettype => 'int8', proargtypes => 'text', prosrc => 'pg_csn_snapshot_prepare' },
+{ oid => '4199', descr => 'assign global_csn to distributed transaction',
+ proname => 'pg_csn_snapshot_assign', provolatile => 'v', proparallel => 'u',
+ prorettype => 'void', proargtypes => 'text int8', prosrc => 'pg_csn_snapshot_assign' },
+
]
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 3ff7ea4fce..30bcbbfe15 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -222,6 +222,8 @@ struct PGPROC
*/
CSN_atomic assignedXidCsn;
+ /* Original xmin of this backend before csn snapshot was imported */
+ TransactionId originalXmin;
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a5c7d0c064..35dc1dcc40 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -36,6 +36,9 @@
#define PROCARRAY_SLOTS_XMIN 0x20 /* replication slot xmin,
* catalog_xmin */
+#define PROCARRAY_NON_IMPORTED_XMIN 0x80 /* use originalXmin instead
+ * of xmin to properly
+ * maintain csnXidMap */
/*
* Only flags in PROCARRAY_PROC_FLAGS_MASK are considered when matching
* PGXACT->vacuumFlags. Other flags are used for different purposes and
@@ -125,4 +128,6 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
TransactionId *catalog_xmin);
+extern void ProcArraySetCSNSnapshotXmin(TransactionId xmin);
+extern TransactionId ProcArrayGetCSNSnapshotXmin(void);
#endif /* PROCARRAY_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index ffb4ba3adf..0e37ebad07 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -127,6 +127,8 @@ extern void AtSubCommit_Snapshot(int level);
extern void AtSubAbort_Snapshot(int level);
extern void AtEOXact_Snapshot(bool isCommit, bool resetXmin);
+extern SnapshotCSN ExportCSNSnapshot(void);
+extern void ImportCSNSnapshot(SnapshotCSN snapshot_csn);
extern void ImportSnapshot(const char *idstr);
extern bool XactHasExportedSnapshots(void);
extern void DeleteAllExportedSnapshotFiles(void);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 9f622c76a7..2eef33c4b6 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -210,6 +210,8 @@ typedef struct SnapshotData
* Will be used only if enable_csn_snapshot is enabled.
*/
SnapshotCSN snapshot_csn;
+ /* Whether snapshot_csn is our own or was imported from a different node */
+ bool imported_snapshot_csn;
} SnapshotData;
#endif /* SNAPSHOT_H */
Attachment: 0001-CSN-base-snapshot.patch (application/octet-stream)
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..fc0321ee6b 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,8 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
clog.o \
commit_ts.o \
+ csn_log.o \
+ csn_snapshot.o \
generic_xlog.o \
multixact.o \
parallel.o \
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
new file mode 100644
index 0000000000..4e0b8d64e4
--- /dev/null
+++ b/src/backend/access/transam/csn_log.c
@@ -0,0 +1,438 @@
+/*-----------------------------------------------------------------------------
+ *
+ * csn_log.c
+ * Track commit sequence numbers of finished transactions
+ *
+ * This module provides an SLRU to store the CSN of each transaction. This
+ * mapping needs to be kept only for xids greater than oldestXid, but
+ * that can require arbitrarily large amounts of memory in case of long-lived
+ * transactions. Because of the same lifetime and persistence requirements,
+ * this module is quite similar to subtrans.c.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_log.c
+ *
+ *-----------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+bool enable_csn_snapshot;
+
+/*
+ * Defines for CSNLog page sizes. A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE, and CSNLog segment numbering at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT. We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCSNLog (see CSNLogPagePrecedes).
+ */
+
+/* We store the commit CSN for each xid */
+#define CSN_LOG_XACTS_PER_PAGE (BLCKSZ / sizeof(XidCSN))
+
+#define TransactionIdToPage(xid) ((xid) / (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CSNLog control
+ */
+static SlruCtlData CSNLogCtlData;
+#define CsnlogCtl (&CSNLogCtlData)
+
+static int ZeroCSNLogPage(int pageno);
+static bool CSNLogPagePrecedes(int page1, int page2);
+static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+ TransactionId *subxids,
+ XidCSN csn, int pageno);
+static void CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn,
+ int slotno);
+
+/*
+ * CSNLogSetCSN
+ *
+ * Record XidCSN of transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transactionid for a top level commit or abort. It can also be a
+ * subtransaction when we record transaction aborts.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ *
+ * csn is the commit sequence number of the transaction. It should be
+ * AbortedCSN for abort cases.
+ */
+void
+CSNLogSetCSN(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn)
+{
+ int pageno;
+ int i = 0;
+ int offset = 0;
+
+ /* Callers of CSNLogSetCSN() must check GUC params */
+ Assert(enable_csn_snapshot);
+
+ Assert(TransactionIdIsValid(xid));
+
+ pageno = TransactionIdToPage(xid); /* get page of parent */
+ for (;;)
+ {
+ int num_on_page = 0;
+
+ while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+ {
+ num_on_page++;
+ i++;
+ }
+
+ CSNLogSetPageStatus(xid,
+ num_on_page, subxids + offset,
+ csn, pageno);
+ if (i >= nsubxids)
+ break;
+
+ offset = i;
+ pageno = TransactionIdToPage(subxids[offset]);
+ xid = InvalidTransactionId;
+ }
+}
+
+/*
+ * Record the final state of transaction entries in the csn log for
+ * all entries on a single page. Atomic only on this page.
+ *
+ * Otherwise API is same as TransactionIdSetTreeStatus()
+ */
+static void
+CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+ TransactionId *subxids,
+ XidCSN csn, int pageno)
+{
+ int slotno;
+ int i;
+
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ slotno = SimpleLruReadPage(CsnlogCtl, pageno, true, xid);
+
+ /* Subtransactions first, if needed ... */
+ for (i = 0; i < nsubxids; i++)
+ {
+ Assert(CsnlogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+ CSNLogSetCSNInSlot(subxids[i], csn, slotno);
+ }
+
+ /* ... then the main transaction */
+ if (TransactionIdIsValid(xid))
+ CSNLogSetCSNInSlot(xid, csn, slotno);
+
+ CsnlogCtl->shared->page_dirty[slotno] = true;
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * Sets the commit status of a single transaction.
+ */
+static void
+CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn, int slotno)
+{
+ int entryno = TransactionIdToPgIndex(xid);
+ XidCSN *ptr;
+
+ Assert(LWLockHeldByMe(CSNLogControlLock));
+
+ ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XidCSN));
+
+ *ptr = csn;
+}
+
+/*
+ * Interrogate the state of a transaction in the log.
+ *
+ * NB: this is a low-level routine and is NOT the preferred entry point
+ * for most uses; TransactionIdGetXidCSN() in csn_snapshot.c is the
+ * intended caller.
+ */
+XidCSN
+CSNLogGetCSNByXid(TransactionId xid)
+{
+ int pageno = TransactionIdToPage(xid);
+ int entryno = TransactionIdToPgIndex(xid);
+ int slotno;
+ XidCSN *ptr;
+ XidCSN xid_csn;
+
+ /* Callers of CSNLogGetCSNByXid() must check GUC params */
+ Assert(enable_csn_snapshot);
+
+ /* Can't ask about stuff that might not be around anymore */
+ Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+
+ /* lock is acquired by SimpleLruReadPage_ReadOnly */
+
+ slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
+ ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XidCSN));
+ xid_csn = *ptr;
+
+ LWLockRelease(CSNLogControlLock);
+
+ return xid_csn;
+}
+
+/*
+ * Number of shared CSNLog buffers.
+ */
+static Size
+CSNLogShmemBuffers(void)
+{
+ return Min(32, Max(4, NBuffers / 512));
+}
+
+/*
+ * Reserve shared memory for CsnlogCtl.
+ */
+Size
+CSNLogShmemSize(void)
+{
+ if (!enable_csn_snapshot)
+ return 0;
+
+ return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
+}
+
+/*
+ * Initialization of shared memory for CSNLog.
+ */
+void
+CSNLogShmemInit(void)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
+ SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
+ CSNLogControlLock, "pg_csn", LWTRANCHE_CSN_LOG_BUFFERS);
+}
+
+/*
+ * This func must be called ONCE on system install. It creates the initial
+ * CSNLog segment. The pg_csn directory is assumed to have been
+ * created by initdb, and CSNLogShmemInit must have been called already.
+ */
+void
+BootStrapCSNLog(void)
+{
+ int slotno;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ /* Create and zero the first page of the CSN log */
+ slotno = ZeroCSNLogPage(0);
+
+ /* Make sure it's written out */
+ SimpleLruWritePage(CsnlogCtl, slotno);
+ Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * Initialize (or reinitialize) a page of CSNLog to zeroes.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCSNLogPage(int pageno)
+{
+ Assert(LWLockHeldByMe(CSNLogControlLock));
+ return SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * after StartupXLOG has initialized ShmemVariableCache->nextXid.
+ *
+ * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
+ * if there are none.
+ */
+void
+StartupCSNLog(TransactionId oldestActiveXID)
+{
+ int startPage;
+ int endPage;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * Since we don't expect pg_csn to be valid across crashes, we
+ * initialize the currently-active page(s) to zeroes during startup.
+ * Whenever we advance into a new page, ExtendCSNLog will likewise
+ * zero the new page without regard to whatever was previously on disk.
+ */
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ startPage = TransactionIdToPage(oldestActiveXID);
+ endPage = TransactionIdToPage(XidFromFullTransactionId(ShmemVariableCache->nextFullXid));
+
+ while (startPage != endPage)
+ {
+ (void) ZeroCSNLogPage(startPage);
+ startPage++;
+ /* must account for wraparound */
+ if (startPage > TransactionIdToPage(MaxTransactionId))
+ startPage = 0;
+ }
+ (void) ZeroCSNLogPage(startPage);
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCSNLog(void)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * Flush dirty CSNLog pages to disk.
+ *
+ * This is not actually necessary from a correctness point of view. We do
+ * it merely as a debugging aid.
+ */
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(false);
+ SimpleLruFlush(CsnlogCtl, false);
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCSNLog(void)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * Flush dirty CSNLog pages to disk.
+ *
+ * This is not actually necessary from a correctness point of view. We do
+ * it merely to improve the odds that writing of dirty pages is done by
+ * the checkpoint process and not by backends.
+ */
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(true);
+ SimpleLruFlush(CsnlogCtl, true);
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(true);
+}
+
+/*
+ * Make sure that CSNLog has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock. We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CSNLog or xlog page to make room
+ * in shared memory.
+ */
+void
+ExtendCSNLog(TransactionId newestXact)
+{
+ int pageno;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * No work except at first XID of a page. But beware: just after
+ * wraparound, the first XID of page zero is FirstNormalTransactionId.
+ */
+ if (TransactionIdToPgIndex(newestXact) != 0 &&
+ !TransactionIdEquals(newestXact, FirstNormalTransactionId))
+ return;
+
+ pageno = TransactionIdToPage(newestXact);
+
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ /* Zero the page */
+ ZeroCSNLogPage(pageno);
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * Remove all CSNLog segments before the one holding the passed
+ * transaction ID.
+ *
+ * This is normally called during checkpoint, with oldestXact being the
+ * oldest TransactionXmin of any running transaction.
+ */
+void
+TruncateCSNLog(TransactionId oldestXact)
+{
+ int cutoffPage;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * The cutoff point is the start of the segment containing oldestXact. We
+ * pass the *page* containing oldestXact to SimpleLruTruncate. We step
+ * back one transaction to avoid passing a cutoff page that hasn't been
+ * created yet in the rare case that oldestXact would be the first item on
+ * a page and oldestXact == next XID. In that case, if we didn't subtract
+ * one, we'd trigger SimpleLruTruncate's wraparound detection.
+ */
+ TransactionIdRetreat(oldestXact);
+ cutoffPage = TransactionIdToPage(oldestXact);
+
+ SimpleLruTruncate(CsnlogCtl, cutoffPage);
+}
+
+/*
+ * Decide which of two CSNLog page numbers is "older" for truncation
+ * purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic. However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs. So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CSNLogPagePrecedes(int page1, int page2)
+{
+ TransactionId xid1;
+ TransactionId xid2;
+
+ xid1 = ((TransactionId) page1) * CSN_LOG_XACTS_PER_PAGE;
+ xid1 += FirstNormalTransactionId;
+ xid2 = ((TransactionId) page2) * CSN_LOG_XACTS_PER_PAGE;
+ xid2 += FirstNormalTransactionId;
+
+ return TransactionIdPrecedes(xid1, xid2);
+}
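
For readers following the csn_log.c diff above, the xid-to-slot arithmetic and the wraparound-aware page comparison can be sketched in isolation. This is only an illustration, not code from the patch: it assumes 8 kB pages and 8-byte CSN entries, and every name with a sketch_ or SKETCH_ prefix is hypothetical. The constant 3 stands in for FirstNormalTransactionId.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: 8 kB pages holding 8-byte CSN entries. */
#define SKETCH_BLCKSZ 8192u
#define SKETCH_ENTRIES_PER_PAGE (SKETCH_BLCKSZ / (uint32_t) sizeof(uint64_t))

/* Page that holds the CSN slot of a 32-bit xid. */
static uint32_t
sketch_xid_to_page(uint32_t xid)
{
    return xid / SKETCH_ENTRIES_PER_PAGE;
}

/* Index of the slot within its page. */
static uint32_t
sketch_xid_to_index(uint32_t xid)
{
    return xid % SKETCH_ENTRIES_PER_PAGE;
}

/* Byte offset of the slot within its page buffer. */
static uint32_t
sketch_xid_to_byte_offset(uint32_t xid)
{
    return sketch_xid_to_index(xid) * (uint32_t) sizeof(uint64_t);
}

/*
 * Wraparound-aware "page1 is older than page2". Each page is mapped to a
 * representative xid, shifted by 3 so that page zero never yields a
 * special (non-normal) xid, and the two xids are compared modulo 2^32.
 */
static bool
sketch_page_precedes(uint32_t page1, uint32_t page2)
{
    uint32_t xid1 = page1 * SKETCH_ENTRIES_PER_PAGE + 3u;
    uint32_t xid2 = page2 * SKETCH_ENTRIES_PER_PAGE + 3u;

    return (int32_t) (xid1 - xid2) < 0;
}
```

Under these assumptions one page covers 1024 xids, and the last page before wraparound compares as older than page zero, which is the property the truncation logic relies on.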
diff --git a/src/backend/access/transam/csn_snapshot.c b/src/backend/access/transam/csn_snapshot.c
new file mode 100644
index 0000000000..bcc5bac757
--- /dev/null
+++ b/src/backend/access/transam/csn_snapshot.c
@@ -0,0 +1,340 @@
+/*-------------------------------------------------------------------------
+ *
+ * csn_snapshot.c
+ * Support for cross-node snapshot isolation.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_snapshot.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/csn_log.h"
+#include "access/csn_snapshot.h"
+#include "access/transam.h"
+#include "access/twophase.h"
+#include "access/xact.h"
+#include "portability/instr_time.h"
+#include "storage/lmgr.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+#include "miscadmin.h"
+
+/* Raise a warning if imported snapshot_csn exceeds ours by this value. */
+#define SNAP_DESYNC_COMPLAIN (1*NSECS_PER_SEC) /* 1 second */
+
+/*
+ * CSNSnapshotState
+ *
+ * Do not trust local clocks to be strictly monotonic: save the last acquired
+ * value so that the next timestamp can be compared with it. Accessed through
+ * GenerateCSN().
+ */
+typedef struct
+{
+ SnapshotCSN last_max_csn;
+ volatile slock_t lock;
+} CSNSnapshotState;
+
+static CSNSnapshotState *csnState;
+
+/*
+ * Enables this module.
+ */
+extern bool enable_csn_snapshot;
+
+
+/* Estimate shared memory space needed */
+Size
+CSNSnapshotShmemSize(void)
+{
+ Size size = 0;
+
+ if (enable_csn_snapshot)
+ {
+ size += MAXALIGN(sizeof(CSNSnapshotState));
+ }
+
+ return size;
+}
+
+/* Init shared memory structures */
+void
+CSNSnapshotShmemInit(void)
+{
+ bool found;
+
+ if (enable_csn_snapshot)
+ {
+ csnState = ShmemInitStruct("csnState",
+ sizeof(CSNSnapshotState),
+ &found);
+ if (!found)
+ {
+ csnState->last_max_csn = 0;
+ SpinLockInit(&csnState->lock);
+ }
+ }
+}
+
+/*
+ * GenerateCSN
+ *
+ * Generate a SnapshotCSN, which is actually the local time, forced to be
+ * strictly increasing. Since it is now not uncommon to have millions of
+ * read transactions per second, we try to use nanoseconds if such time
+ * resolution is available.
+ */
+SnapshotCSN
+GenerateCSN(bool locked)
+{
+ instr_time current_time;
+ SnapshotCSN csn;
+
+ Assert(enable_csn_snapshot);
+
+ /*
+ * TODO: create some macro that add small random shift to current time.
+ */
+ INSTR_TIME_SET_CURRENT(current_time);
+ csn = (SnapshotCSN) INSTR_TIME_GET_NANOSEC(current_time);
+
+ /* TODO: change to atomics? */
+ if (!locked)
+ SpinLockAcquire(&csnState->lock);
+
+ if (csn <= csnState->last_max_csn)
+ csn = ++csnState->last_max_csn;
+ else
+ csnState->last_max_csn = csn;
+
+ if (!locked)
+ SpinLockRelease(&csnState->lock);
+
+ return csn;
+}
+
+/*
+ * TransactionIdGetXidCSN
+ *
+ * Get XidCSN for specified TransactionId taking care about special xids,
+ * xids beyond TransactionXmin and InDoubt states.
+ */
+XidCSN
+TransactionIdGetXidCSN(TransactionId xid)
+{
+ XidCSN xid_csn;
+
+ Assert(enable_csn_snapshot);
+
+ /* Handle permanent TransactionId's for which we don't have mapping */
+ if (!TransactionIdIsNormal(xid))
+ {
+ if (xid == InvalidTransactionId)
+ return AbortedXidCSN;
+ if (xid == FrozenTransactionId || xid == BootstrapTransactionId)
+ return FrozenXidCSN;
+ Assert(false); /* Should not happen */
+ }
+
+ /*
+ * For xids that are less than TransactionXmin, CSNLog may already have been
+ * trimmed, but we know that such a transaction is definitely not concurrently
+ * running according to any snapshot, including time-travel ones. Callers
+ * should check TransactionDidCommit afterwards.
+ */
+ if (TransactionIdPrecedes(xid, TransactionXmin))
+ return FrozenXidCSN;
+
+ /* Read XidCSN from SLRU */
+ xid_csn = CSNLogGetCSNByXid(xid);
+
+ /*
+ * If we see the InDoubt state, the transaction is being committed and we
+ * should wait until its XidCSN is assigned so that the visibility check
+ * can decide whether the tuple is in the snapshot. See also the comments in
+ * CSNSnapshotPrecommit().
+ */
+ if (XidCSNIsInDoubt(xid_csn))
+ {
+ XactLockTableWait(xid, NULL, NULL, XLTW_None);
+ xid_csn = CSNLogGetCSNByXid(xid);
+ Assert(XidCSNIsNormal(xid_csn) ||
+ XidCSNIsAborted(xid_csn));
+ }
+
+ Assert(XidCSNIsNormal(xid_csn) ||
+ XidCSNIsInProgress(xid_csn) ||
+ XidCSNIsAborted(xid_csn));
+
+ return xid_csn;
+}
+
+/*
+ * XidInvisibleInCSNSnapshot
+ *
+ * Version of XidInMVCCSnapshot for transactions. For non-imported
+ * csn snapshots this should give the same results as XidInLocalMVCCSnapshot
+ * (except that aborts will be shown as invisible without going to clog). To
+ * ensure such behaviour, XidInMVCCSnapshot is covered with asserts that check
+ * that XidInvisibleInCSNSnapshot and XidInLocalMVCCSnapshot agree in the
+ * case of an ordinary snapshot.
+ */
+bool
+XidInvisibleInCSNSnapshot(TransactionId xid, Snapshot snapshot)
+{
+ XidCSN csn;
+
+ Assert(enable_csn_snapshot);
+
+ csn = TransactionIdGetXidCSN(xid);
+
+ if (XidCSNIsNormal(csn))
+ {
+ if (csn < snapshot->snapshot_csn)
+ return false;
+ else
+ return true;
+ }
+ else if (XidCSNIsFrozen(csn))
+ {
+ /* It is bootstrap or frozen transaction */
+ return false;
+ }
+ else
+ {
+ /* It is aborted or in-progress */
+ Assert(XidCSNIsAborted(csn) || XidCSNIsInProgress(csn));
+ if (XidCSNIsAborted(csn))
+ Assert(TransactionIdDidAbort(xid));
+ return true;
+ }
+}
+
+
+/*****************************************************************************
+ * Functions to handle transaction commit.
+ *
+ * For local transactions CSNSnapshotPrecommit sets the InDoubt state before
+ * ProcArrayEndTransaction is called and transaction data potentially becomes
+ * visible to other backends. ProcArrayEndTransaction (or ProcArrayRemove in
+ * the twophase case) then acquires xid_csn under ProcArray lock and stores it
+ * in proc->assignedXidCsn. It's important that xid_csn for commit is
+ * generated under ProcArray lock, otherwise snapshots won't
+ * be equivalent. A subsequent call to CSNSnapshotCommit will write
+ * proc->assignedXidCsn to CSNLog.
+ *
+ *
+ * CSNSnapshotAbort is slightly different from commit because abort
+ * can skip the InDoubt phase and can be called for a transaction subtree.
+ *****************************************************************************/
+
+
+/*
+ * CSNSnapshotAbort
+ *
+ * Abort transaction in CsnLog. We can skip the InDoubt state for aborts
+ * since no concurrent transaction is allowed to see aborted data anyway.
+ */
+void
+CSNSnapshotAbort(PGPROC *proc, TransactionId xid,
+ int nsubxids, TransactionId *subxids)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN);
+
+ /*
+ * Clean assignedXidCsn anyway, as it was possibly set in
+ * CSNSnapshotAssignCsnCurrent.
+ */
+ pg_atomic_write_u64(&proc->assignedXidCsn, InProgressXidCSN);
+}
+
+/*
+ * CSNSnapshotPrecommit
+ *
+ * Set InDoubt status for local transaction that we are going to commit.
+ * This step is needed to achieve consistency between local snapshots and
+ * csn-based snapshots. We don't hold ProcArray lock while writing
+ * csn for transaction in SLRU but instead we set InDoubt status before
+ * transaction is deleted from ProcArray so the readers who will read csn
+ * in the gap between ProcArray removal and XidCSN assignment can wait
+ * until XidCSN is finally assigned. See also TransactionIdGetXidCSN().
+ *
+ * This should be called only from parallel group leader before backend is
+ * deleted from ProcArray.
+ */
+void
+CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid,
+ int nsubxids, TransactionId *subxids)
+{
+ XidCSN oldassignedXidCsn = InProgressXidCSN;
+ bool in_progress;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /* Set InDoubt status if it is local transaction */
+ in_progress = pg_atomic_compare_exchange_u64(&proc->assignedXidCsn,
+ &oldassignedXidCsn,
+ InDoubtXidCSN);
+ if (in_progress)
+ {
+ Assert(XidCSNIsInProgress(oldassignedXidCsn));
+ CSNLogSetCSN(xid, nsubxids,
+ subxids, InDoubtXidCSN);
+ }
+ else
+ {
+ /* Otherwise we should have valid XidCSN by this time */
+ Assert(XidCSNIsNormal(oldassignedXidCsn));
+ Assert(XidCSNIsInDoubt(CSNLogGetCSNByXid(xid)));
+ }
+}
+
+/*
+ * CSNSnapshotCommit
+ *
+ * Write the XidCSN that was acquired earlier to CsnLog. Should be
+ * preceded by CSNSnapshotPrecommit() so readers can wait until we finally
+ * finished writing to SLRU.
+ *
+ * Should be called after ProcArrayEndTransaction, but before releasing
+ * transaction locks, so that TransactionIdGetXidCSN can wait on this
+ * lock for XidCSN.
+ */
+void
+CSNSnapshotCommit(PGPROC *proc, TransactionId xid,
+ int nsubxids, TransactionId *subxids)
+{
+ volatile XidCSN assigned_xid_csn;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ if (!TransactionIdIsValid(xid))
+ {
+ assigned_xid_csn = pg_atomic_read_u64(&proc->assignedXidCsn);
+ Assert(XidCSNIsInProgress(assigned_xid_csn));
+ return;
+ }
+
+ /* Finally write resulting XidCSN in SLRU */
+ assigned_xid_csn = pg_atomic_read_u64(&proc->assignedXidCsn);
+ Assert(XidCSNIsNormal(assigned_xid_csn));
+ CSNLogSetCSN(xid, nsubxids,
+ subxids, assigned_xid_csn);
+
+ /* Reset for next transaction */
+ pg_atomic_write_u64(&proc->assignedXidCsn, InProgressXidCSN);
+}
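
The two core rules in the csn_snapshot.c diff above, a clock reading clamped to be strictly increasing and a visibility test that compares an xid's CSN with the snapshot's CSN, can be sketched standalone as follows. This is a simplified illustration under stated assumptions, not the patch's implementation: locking is omitted, the reserved CSN values are hypothetical placeholders (the patch defines its own), and the caller is assumed to have resolved any in-doubt CSN first, which the patch does by waiting on the transaction lock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reserved values, chosen only for this sketch. */
#define SKETCH_CSN_IN_PROGRESS  UINT64_C(0)
#define SKETCH_CSN_ABORTED      UINT64_C(1)
#define SKETCH_CSN_FROZEN       UINT64_C(2)
#define SKETCH_CSN_FIRST_NORMAL UINT64_C(4)

typedef struct
{
    uint64_t last_max_csn;   /* highest CSN handed out so far */
} SketchCsnState;

/*
 * Turn a clock reading (nanoseconds) into a strictly increasing CSN.
 * If the clock stalls or steps backwards, hand out last_max_csn + 1.
 * A real implementation must serialize access to the shared state.
 */
static uint64_t
sketch_generate_csn(SketchCsnState *state, uint64_t clock_ns)
{
    if (clock_ns <= state->last_max_csn)
        clock_ns = ++state->last_max_csn;
    else
        state->last_max_csn = clock_ns;

    return clock_ns;
}

/*
 * Return true if a transaction with the given CSN is invisible to a
 * snapshot taken at snapshot_csn. In-doubt CSNs must be resolved by
 * the caller before this check.
 */
static bool
sketch_xid_invisible(uint64_t xid_csn, uint64_t snapshot_csn)
{
    if (xid_csn >= SKETCH_CSN_FIRST_NORMAL)
        return xid_csn >= snapshot_csn;   /* committed at or after the snapshot */
    if (xid_csn == SKETCH_CSN_FROZEN)
        return false;                     /* frozen or bootstrap: always visible */
    return true;                          /* aborted or still in progress */
}
```

A transaction whose CSN equals the snapshot's CSN is treated as invisible here, matching the strict `csn < snapshot->snapshot_csn` test in XidInvisibleInCSNSnapshot.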
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9b2e59bf0e..57bda5d422 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,8 @@
#include <unistd.h>
#include "access/commit_ts.h"
+#include "access/csn_snapshot.h"
+#include "access/csn_log.h"
#include "access/htup_details.h"
#include "access/subtrans.h"
#include "access/transam.h"
@@ -1479,8 +1481,34 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nabortrels, abortrels,
gid);
+ /*
+ * CSNSnapshot callbacks that should be called right before we are
+ * going to become visible. Details in comments to these functions.
+ */
+ if (isCommit)
+ CSNSnapshotPrecommit(proc, xid, hdr->nsubxacts, children);
+ else
+ CSNSnapshotAbort(proc, xid, hdr->nsubxacts, children);
+
+
ProcArrayRemove(proc, latestXid);
+ /*
+ * Stamp our transaction with XidCSN in CSNLog.
+ * Should be called after ProcArrayRemove, but before releasing
+ * transaction locks, since TransactionIdGetXidCSN relies on
+ * XactLockTableWait to await xid_csn.
+ */
+ if (isCommit)
+ {
+ CSNSnapshotCommit(proc, xid, hdr->nsubxacts, children);
+ }
+ else
+ {
+ Assert(XidCSNIsInProgress(
+ pg_atomic_read_u64(&proc->assignedXidCsn)));
+ }
+
/*
* In case we fail while running the callbacks, mark the gxact invalid so
* no one else will try to commit/rollback, and so it will be recycled if
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index e14b53bf9e..b045ed09f3 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -15,6 +15,7 @@
#include "access/clog.h"
#include "access/commit_ts.h"
+#include "access/csn_log.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
@@ -173,6 +174,7 @@ GetNewTransactionId(bool isSubXact)
* Extend pg_subtrans and pg_commit_ts too.
*/
ExtendCLOG(xid);
+ ExtendCSNLog(xid);
ExtendCommitTs(xid);
ExtendSUBTRANS(xid);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b3ee7fa7ea..9321634d60 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
#include <unistd.h>
#include "access/commit_ts.h"
+#include "access/csn_snapshot.h"
#include "access/multixact.h"
#include "access/parallel.h"
#include "access/subtrans.h"
@@ -1435,6 +1436,14 @@ RecordTransactionCommit(void)
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
+
+ /*
+ * Mark our transaction as InDoubt in CsnLog and get ready for
+ * commit.
+ */
+ if (markXidCommitted)
+ CSNSnapshotPrecommit(MyProc, xid, nchildren, children);
+
cleanup:
/* Clean up local data */
if (rels)
@@ -1696,6 +1705,11 @@ RecordTransactionAbort(bool isSubXact)
*/
TransactionIdAbortTree(xid, nchildren, children);
+ /*
+ * Mark our transaction as Aborted in CsnLog.
+ */
+ CSNSnapshotAbort(MyProc, xid, nchildren, children);
+
END_CRIT_SECTION();
/* Compute latestXid while we have the child XIDs handy */
@@ -2185,6 +2199,21 @@ CommitTransaction(void)
*/
ProcArrayEndTransaction(MyProc, latestXid);
+ /*
+ * Stamp our transaction with XidCSN in CsnLog.
+ * Should be called after ProcArrayEndTransaction, but before releasing
+ * transaction locks.
+ */
+ if (!is_parallel_worker)
+ {
+ TransactionId xid = GetTopTransactionIdIfAny();
+ TransactionId *subxids;
+ int nsubxids;
+
+ nsubxids = xactGetCommittedChildren(&subxids);
+ CSNSnapshotCommit(MyProc, xid, nsubxids, subxids);
+ }
+
/*
* This is all post-commit cleanup. Note that if an error is raised here,
* it's too late to abort the transaction. This should be just
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0a97b1d37f..8f21e09a03 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
#include "access/clog.h"
#include "access/commit_ts.h"
+#include "access/csn_log.h"
#include "access/heaptoast.h"
#include "access/multixact.h"
#include "access/rewriteheap.h"
@@ -5345,6 +5346,7 @@ BootStrapXLOG(void)
/* Bootstrap the commit log, too */
BootStrapCLOG();
+ BootStrapCSNLog();
BootStrapCommitTs();
BootStrapSUBTRANS();
BootStrapMultiXact();
@@ -7062,6 +7064,7 @@ StartupXLOG(void)
* maintained during recovery and need not be started yet.
*/
StartupCLOG();
+ StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
/*
@@ -7879,6 +7882,7 @@ StartupXLOG(void)
if (standbyState == STANDBY_DISABLED)
{
StartupCLOG();
+ StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
}
@@ -8527,6 +8531,7 @@ ShutdownXLOG(int code, Datum arg)
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
}
ShutdownCLOG();
+ ShutdownCSNLog();
ShutdownCommitTs();
ShutdownSUBTRANS();
ShutdownMultiXact();
@@ -9099,7 +9104,10 @@ CreateCheckPoint(int flags)
* StartupSUBTRANS hasn't been called yet.
*/
if (!RecoveryInProgress())
+ {
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ TruncateCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ }
/* Real work is done, but log and update stats before releasing lock. */
LogCheckpointEnd(false);
@@ -9175,6 +9183,7 @@ static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointCLOG();
+ CheckPointCSNLog();
CheckPointCommitTs();
CheckPointSUBTRANS();
CheckPointMultiXact();
@@ -9459,7 +9468,10 @@ CreateRestartPoint(int flags)
* this because StartupSUBTRANS hasn't been called yet.
*/
if (EnableHotStandby)
+ {
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ TruncateCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ }
/* Real work is done, but log and update before releasing lock. */
LogCheckpointEnd(true);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..7122babfd6 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,11 +16,13 @@
#include "access/clog.h"
#include "access/commit_ts.h"
+#include "access/csn_log.h"
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/subtrans.h"
#include "access/twophase.h"
+#include "access/csn_snapshot.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -125,6 +127,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, ProcGlobalShmemSize());
size = add_size(size, XLOGShmemSize());
size = add_size(size, CLOGShmemSize());
+ size = add_size(size, CSNLogShmemSize());
size = add_size(size, CommitTsShmemSize());
size = add_size(size, SUBTRANSShmemSize());
size = add_size(size, TwoPhaseShmemSize());
@@ -143,6 +146,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
+ size = add_size(size, CSNSnapshotShmemSize());
size = add_size(size, SnapMgrShmemSize());
size = add_size(size, BTreeShmemSize());
size = add_size(size, SyncScanShmemSize());
@@ -213,6 +217,7 @@ CreateSharedMemoryAndSemaphores(void)
*/
XLOGShmemInit();
CLOGShmemInit();
+ CSNLogShmemInit();
CommitTsShmemInit();
SUBTRANSShmemInit();
MultiXactShmemInit();
@@ -264,6 +269,7 @@ CreateSharedMemoryAndSemaphores(void)
SyncScanShmemInit();
AsyncShmemInit();
+ CSNSnapshotShmemInit();
#ifdef EXEC_BACKEND
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index b448533564..d715750437 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -46,6 +46,8 @@
#include <signal.h>
#include "access/clog.h"
+#include "access/csn_log.h"
+#include "access/csn_snapshot.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/twophase.h"
@@ -352,6 +354,14 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
latestXid))
ShmemVariableCache->latestCompletedXid = latestXid;
+
+ /*
+ * Assign xid csn while holding ProcArrayLock for
+ * COMMIT PREPARED. After the lock is released, the subsequent
+ * CSNSnapshotCommit() will write this value to CsnLog.
+ */
+ if (XidCSNIsInDoubt(pg_atomic_read_u64(&proc->assignedXidCsn)))
+ pg_atomic_write_u64(&proc->assignedXidCsn, GenerateCSN(false));
}
else
{
@@ -467,6 +477,16 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
latestXid))
ShmemVariableCache->latestCompletedXid = latestXid;
+
+ /*
+ * Assign xid csn while holding ProcArrayLock for
+ * COMMIT.
+ *
+ * TODO: in case of group commit we can generate one CSNSnapshot for
+ * the whole group to save time on timestamp acquisition.
+ */
+ if (XidCSNIsInDoubt(pg_atomic_read_u64(&proc->assignedXidCsn)))
+ pg_atomic_write_u64(&proc->assignedXidCsn, GenerateCSN(false));
}
/*
@@ -833,6 +853,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
while (TransactionIdPrecedes(latestObservedXid, running->nextXid))
{
ExtendSUBTRANS(latestObservedXid);
+ ExtendCSNLog(latestObservedXid);
TransactionIdAdvance(latestObservedXid);
}
TransactionIdRetreat(latestObservedXid); /* = running->nextXid - 1 */
@@ -1511,6 +1532,7 @@ GetSnapshotData(Snapshot snapshot)
int count = 0;
int subcount = 0;
bool suboverflowed = false;
+ XidCSN xid_csn = FrozenXidCSN;
TransactionId replication_slot_xmin = InvalidTransactionId;
TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
@@ -1708,6 +1730,13 @@ GetSnapshotData(Snapshot snapshot)
if (!TransactionIdIsValid(MyPgXact->xmin))
MyPgXact->xmin = TransactionXmin = xmin;
+ /*
+ * Take XidCSN under ProcArrayLock so the snapshot stays
+ * synchronized.
+ */
+ if (enable_csn_snapshot)
+ xid_csn = GenerateCSN(false);
+
LWLockRelease(ProcArrayLock);
/*
@@ -1778,6 +1807,8 @@ GetSnapshotData(Snapshot snapshot)
MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
}
+ snapshot->snapshot_csn = xid_csn;
+
return snapshot;
}
@@ -3335,6 +3366,7 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
while (TransactionIdPrecedes(next_expected_xid, xid))
{
TransactionIdAdvance(next_expected_xid);
+ ExtendCSNLog(next_expected_xid);
ExtendSUBTRANS(next_expected_xid);
}
Assert(next_expected_xid == xid);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..77b8426e71 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -134,6 +134,8 @@ static const char *const BuiltinTrancheNames[] = {
"CommitTSBuffer",
/* LWTRANCHE_SUBTRANS_BUFFER: */
"SubtransBuffer",
+ /* LWTRANCHE_CSN_LOG_BUFFERS */
+ "CsnLogBuffer",
/* LWTRANCHE_MULTIXACTOFFSET_BUFFER: */
"MultiXactOffsetBuffer",
/* LWTRANCHE_MULTIXACTMEMBER_BUFFER: */
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6985e8eed..3c95ce4aac 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ MultiXactTruncationLock 41
OldSnapshotTimeMapLock 42
LogicalRepWorkerLock 43
XactTruncationLock 44
+CSNLogControlLock 45
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e57fcd2538..a6b8625ce5 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -37,6 +37,7 @@
#include "access/transam.h"
#include "access/twophase.h"
+#include "access/csn_snapshot.h"
#include "access/xact.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -441,6 +442,8 @@ InitProcess(void)
MyProc->clogGroupMemberLsn = InvalidXLogRecPtr;
Assert(pg_atomic_read_u32(&MyProc->clogGroupNext) == INVALID_PGPROCNO);
+ pg_atomic_init_u64(&MyProc->assignedXidCsn, InProgressXidCSN);
+
/*
* Acquire ownership of the PGPROC's latch, so that we can use WaitLatch
* on it. That allows us to repoint the process latch, which so far
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 031ca0327f..1e9bcc7aee 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/csn_snapshot.h"
#include "access/rmgr.h"
#include "access/tableam.h"
#include "access/transam.h"
@@ -1153,6 +1154,15 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+ {
+ {"enable_csn_snapshot", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Enable CSN-based snapshots."),
+ gettext_noop("Used to achieve REPEATABLE READ isolation level for postgres_fdw transactions.")
+ },
+ &enable_csn_snapshot,
+ true, /* XXX: set true to simplify testing. XXX2: Seems that RESOURCES_MEM isn't the best category */
+ NULL, NULL, NULL
+ },
{
{"ssl", PGC_SIGHUP, CONN_AUTH_SSL,
gettext_noop("Enables SSL connections."),
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index a0b0458108..679c531622 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -77,6 +77,8 @@ provider postgresql {
probe clog__checkpoint__done(bool);
probe subtrans__checkpoint__start(bool);
probe subtrans__checkpoint__done(bool);
+ probe csnlog__checkpoint__start(bool);
+ probe csnlog__checkpoint__done(bool);
probe multixact__checkpoint__start(bool);
probe multixact__checkpoint__done(bool);
probe twophase__checkpoint__start();
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 6b6c8571e2..45fe574620 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -229,6 +229,7 @@ static TimestampTz AlignTimestampToMinuteBoundary(TimestampTz ts);
static Snapshot CopySnapshot(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
+static bool XidInLocalMVCCSnapshot(TransactionId xid, Snapshot snapshot);
/*
* Snapshot fields to be serialized.
@@ -247,6 +248,7 @@ typedef struct SerializedSnapshotData
CommandId curcid;
TimestampTz whenTaken;
XLogRecPtr lsn;
+ XidCSN xid_csn;
} SerializedSnapshotData;
Size
@@ -2115,6 +2117,7 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
serialized_snapshot.curcid = snapshot->curcid;
serialized_snapshot.whenTaken = snapshot->whenTaken;
serialized_snapshot.lsn = snapshot->lsn;
+ serialized_snapshot.xid_csn = snapshot->snapshot_csn;
/*
* Ignore the SubXID array if it has overflowed, unless the snapshot was
@@ -2189,6 +2192,7 @@ RestoreSnapshot(char *start_address)
snapshot->curcid = serialized_snapshot.curcid;
snapshot->whenTaken = serialized_snapshot.whenTaken;
snapshot->lsn = serialized_snapshot.lsn;
+ snapshot->snapshot_csn = serialized_snapshot.xid_csn;
/* Copy XIDs, if present. */
if (serialized_snapshot.xcnt > 0)
@@ -2229,6 +2233,47 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
/*
* XidInMVCCSnapshot
+ *
+ * Check whether this xid is in snapshot. When enable_csn_snapshot is
+ * switched off just call XidInLocalMVCCSnapshot().
+ */
+bool
+XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
+{
+ bool in_snapshot;
+
+ in_snapshot = XidInLocalMVCCSnapshot(xid, snapshot);
+
+ if (!enable_csn_snapshot)
+ {
+ Assert(XidCSNIsFrozen(snapshot->snapshot_csn));
+ return in_snapshot;
+ }
+
+ if (in_snapshot)
+ {
+ /*
+ * This xid may be already in unknown state and in that case
+ * we must wait and recheck.
+ */
+ return XidInvisibleInCSNSnapshot(xid, snapshot);
+ }
+ else
+ {
+#ifdef USE_ASSERT_CHECKING
+ /* Check that csn snapshot gives the same results as local one */
+ if (XidInvisibleInCSNSnapshot(xid, snapshot))
+ {
+ XidCSN gcsn = TransactionIdGetXidCSN(xid);
+ Assert(XidCSNIsAborted(gcsn));
+ }
+#endif
+ return false;
+ }
+}
+
+/*
+ * XidInLocalMVCCSnapshot
* Is the given XID still-in-progress according to the snapshot?
*
* Note: GetSnapshotData never stores either top xid or subxids of our own
@@ -2237,8 +2282,8 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
* TransactionIdIsCurrentTransactionId first, except when it's known the
* XID could not be ours anyway.
*/
-bool
-XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
+static bool
+XidInLocalMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
uint32 i;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 786672b1b6..a52c01889d 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -220,7 +220,8 @@ static const char *const subdirs[] = {
"pg_xact",
"pg_logical",
"pg_logical/snapshots",
- "pg_logical/mappings"
+ "pg_logical/mappings",
+ "pg_csn"
};
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
new file mode 100644
index 0000000000..9b9611127d
--- /dev/null
+++ b/src/include/access/csn_log.h
@@ -0,0 +1,30 @@
+/*
+ * csn_log.h
+ *
+ * Commit-Sequence-Number log.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_log.h
+ */
+#ifndef CSNLOG_H
+#define CSNLOG_H
+
+#include "access/xlog.h"
+#include "utils/snapshot.h"
+
+extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn);
+extern XidCSN CSNLogGetCSNByXid(TransactionId xid);
+
+extern Size CSNLogShmemSize(void);
+extern void CSNLogShmemInit(void);
+extern void BootStrapCSNLog(void);
+extern void StartupCSNLog(TransactionId oldestActiveXID);
+extern void ShutdownCSNLog(void);
+extern void CheckPointCSNLog(void);
+extern void ExtendCSNLog(TransactionId newestXact);
+extern void TruncateCSNLog(TransactionId oldestXact);
+
+#endif /* CSNLOG_H */
\ No newline at end of file
diff --git a/src/include/access/csn_snapshot.h b/src/include/access/csn_snapshot.h
new file mode 100644
index 0000000000..1894586204
--- /dev/null
+++ b/src/include/access/csn_snapshot.h
@@ -0,0 +1,58 @@
+/*-------------------------------------------------------------------------
+ *
+ * csn_snapshot.h
+ * Support for cross-node snapshot isolation.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_snapshot.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CSN_SNAPSHOT_H
+#define CSN_SNAPSHOT_H
+
+#include "port/atomics.h"
+#include "storage/lock.h"
+#include "utils/snapshot.h"
+#include "utils/guc.h"
+
+/*
+ * snapshot.h is used in frontend code so atomic variant of SnapshotCSN type
+ * is defined here.
+ */
+typedef pg_atomic_uint64 CSN_atomic;
+
+#define InProgressXidCSN UINT64CONST(0x0)
+#define AbortedXidCSN UINT64CONST(0x1)
+#define FrozenXidCSN UINT64CONST(0x2)
+#define InDoubtXidCSN UINT64CONST(0x3)
+#define FirstNormalXidCSN UINT64CONST(0x4)
+
+#define XidCSNIsInProgress(csn) ((csn) == InProgressXidCSN)
+#define XidCSNIsAborted(csn) ((csn) == AbortedXidCSN)
+#define XidCSNIsFrozen(csn) ((csn) == FrozenXidCSN)
+#define XidCSNIsInDoubt(csn) ((csn) == InDoubtXidCSN)
+#define XidCSNIsNormal(csn) ((csn) >= FirstNormalXidCSN)
+
+
+
+
+extern Size CSNSnapshotShmemSize(void);
+extern void CSNSnapshotShmemInit(void);
+
+extern SnapshotCSN GenerateCSN(bool locked);
+
+extern bool XidInvisibleInCSNSnapshot(TransactionId xid, Snapshot snapshot);
+
+extern XidCSN TransactionIdGetXidCSN(TransactionId xid);
+
+extern void CSNSnapshotAbort(PGPROC *proc, TransactionId xid, int nsubxids,
+ TransactionId *subxids);
+extern void CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid, int nsubxids,
+ TransactionId *subxids);
+extern void CSNSnapshotCommit(PGPROC *proc, TransactionId xid, int nsubxids,
+ TransactionId *subxids);
+
+#endif /* CSN_SNAPSHOT_H */
diff --git a/src/include/datatype/timestamp.h b/src/include/datatype/timestamp.h
index 6be6d35d1e..583b1beea5 100644
--- a/src/include/datatype/timestamp.h
+++ b/src/include/datatype/timestamp.h
@@ -93,6 +93,9 @@ typedef struct
#define USECS_PER_MINUTE INT64CONST(60000000)
#define USECS_PER_SEC INT64CONST(1000000)
+#define NSECS_PER_SEC INT64CONST(1000000000)
+#define NSECS_PER_USEC INT64CONST(1000)
+
/*
* We allow numeric timezone offsets up to 15:59:59 either way from Greenwich.
* Currently, the record holders for wackiest offsets in actual use are zones
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index f25068fae2..6c3f2c7655 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -280,6 +280,7 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
#define PG_GETARG_FLOAT4(n) DatumGetFloat4(PG_GETARG_DATUM(n))
#define PG_GETARG_FLOAT8(n) DatumGetFloat8(PG_GETARG_DATUM(n))
#define PG_GETARG_INT64(n) DatumGetInt64(PG_GETARG_DATUM(n))
+#define PG_GETARG_UINT64(n) DatumGetUInt64(PG_GETARG_DATUM(n))
/* use this if you want the raw, possibly-toasted input datum: */
#define PG_GETARG_RAW_VARLENA_P(n) ((struct varlena *) PG_GETARG_POINTER(n))
/* use this if you want the input datum de-toasted: */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327cc..4ac23da654 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -141,6 +141,9 @@ typedef struct timespec instr_time;
#define INSTR_TIME_GET_MICROSEC(t) \
(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (((uint64) (t).tv_sec * (uint64) 1000000000) + (uint64) ((t).tv_nsec))
+
#else /* !HAVE_CLOCK_GETTIME */
/* Use gettimeofday() */
@@ -205,6 +208,10 @@ typedef struct timeval instr_time;
#define INSTR_TIME_GET_MICROSEC(t) \
(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) (t).tv_usec)
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (((uint64) (t).tv_sec * (uint64) 1000000000) + \
+ (uint64) (t).tv_usec * (uint64) 1000)
+
#endif /* HAVE_CLOCK_GETTIME */
#else /* WIN32 */
@@ -237,6 +244,9 @@ typedef LARGE_INTEGER instr_time;
#define INSTR_TIME_GET_MICROSEC(t) \
((uint64) (((double) (t).QuadPart * 1000000.0) / GetTimerFrequency()))
+#define INSTR_TIME_GET_NANOSEC(t) \
+ ((uint64) (((double) (t).QuadPart * 1000000000.0) / GetTimerFrequency()))
+
static inline double
GetTimerFrequency(void)
{
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d..6188691fb2 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -196,6 +196,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_XACT_BUFFER = NUM_INDIVIDUAL_LWLOCKS,
LWTRANCHE_COMMITTS_BUFFER,
LWTRANCHE_SUBTRANS_BUFFER,
+ LWTRANCHE_CSN_LOG_BUFFERS,
LWTRANCHE_MULTIXACTOFFSET_BUFFER,
LWTRANCHE_MULTIXACTMEMBER_BUFFER,
LWTRANCHE_NOTIFY_BUFFER,
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index b20e2ad4f6..3ff7ea4fce 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -15,8 +15,10 @@
#define _PROC_H_
#include "access/clog.h"
+#include "access/csn_snapshot.h"
#include "access/xlogdefs.h"
#include "lib/ilist.h"
+#include "utils/snapshot.h"
#include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
@@ -210,6 +212,16 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * assignedXidCsn holds XidCSN for this transaction. It is generated
+ * under a ProcArray lock and later is written to a CSNLog. This
+ * variable is defined as atomic only for the case of group commit; in all
+ * other scenarios only the backend responsible for this proc entry works
+ * with this variable.
+ */
+ CSN_atomic assignedXidCsn;
+
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 4796edb63a..9f622c76a7 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -121,6 +121,9 @@ typedef enum SnapshotType
typedef struct SnapshotData *Snapshot;
#define InvalidSnapshot ((Snapshot) NULL)
+typedef uint64 XidCSN;
+typedef uint64 SnapshotCSN;
+extern bool enable_csn_snapshot;
/*
* Struct representing all kind of possible snapshots.
@@ -201,6 +204,12 @@ typedef struct SnapshotData
TimestampTz whenTaken; /* timestamp when snapshot was taken */
XLogRecPtr lsn; /* position in the WAL stream when taken */
+
+ /*
+ * SnapshotCSN for snapshot isolation support.
+ * Will be used only if enable_csn_snapshot is enabled.
+ */
+ SnapshotCSN snapshot_csn;
} SnapshotData;
#endif /* SNAPSHOT_H */
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 06c4c3e476..da2e5aa38b 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -73,6 +73,7 @@ select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
enable_bitmapscan | on
+ enable_csn_snapshot | on
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
@@ -90,7 +91,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(18 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
I found an issue in the snapshot-switch part of the last patch: the xmin_for_csn
value is wrong in the TransactionIdGetCSN() function. I now try to keep xmin_for_csn
in pg_control and add an UnclearCSN status for transaction IDs. New patches are attached.
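
To make the intent of the change easier to follow, here is a minimal standalone sketch of how such a cutoff could behave. It is only one possible reading of the description above, not the code in the attached patch. The names lookup_csn(), xid_invisible() and UnclearXidCSN, and the value chosen for UnclearXidCSN, are hypothetical. The other special values are copied from csn_snapshot.h in the earlier patch. Waiting on in-doubt transactions and xid wraparound are deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t XidCSN;

/* Special values as declared in csn_snapshot.h in the earlier patch. */
#define InProgressXidCSN  UINT64_C(0x0)
#define AbortedXidCSN     UINT64_C(0x1)
#define FrozenXidCSN      UINT64_C(0x2)
#define InDoubtXidCSN     UINT64_C(0x3)
#define FirstNormalXidCSN UINT64_C(0x4)

/* Hypothetical marker: no trustworthy CSN exists for this xid.
 * The value is chosen only for this sketch. */
#define UnclearXidCSN     UINT64_MAX

/*
 * Hypothetical lookup. Xids older than xmin_for_csn were assigned while
 * xid-based snapshots were in use, so whatever the CSN log holds for them
 * is not reliable. Real code would compare with TransactionIdPrecedes()
 * to handle wraparound.
 */
XidCSN
lookup_csn(TransactionId xid, TransactionId xmin_for_csn, XidCSN logged_csn)
{
    if (xid < xmin_for_csn)
        return UnclearXidCSN;
    return logged_csn;
}

/*
 * Returns true when the xid must be treated as invisible to a snapshot
 * taken at snapshot_csn. in_xip_snapshot is the answer the classic
 * xmin/xmax/xip check would give for the same xid.
 */
bool
xid_invisible(XidCSN xid_csn, XidCSN snapshot_csn, bool in_xip_snapshot)
{
    if (xid_csn == UnclearXidCSN)
        return in_xip_snapshot;     /* fall back to the xid-based answer */
    if (xid_csn == InProgressXidCSN || xid_csn == AbortedXidCSN)
        return true;
    if (xid_csn == InDoubtXidCSN)
        return true;                /* real code would wait and recheck */
    if (xid_csn == FrozenXidCSN)
        return false;
    /* Normal CSN: visible only if committed before the snapshot was taken. */
    return xid_csn >= snapshot_csn;
}
```

With xmin_for_csn = 100, xid 90 is reported as unclear and the xid-based result decides its visibility, while xid 120 is judged purely by comparing its CSN with the snapshot's CSN.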
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
Attachments:
0004-globale-snapshot-infrastructure.patch (application/octet-stream)
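
The patch that follows introduces CSNSnapshotXidMap, a per-second circular buffer that maps a rounded snapshot_csn to the oldestXmin seen at that time. As a reading aid, here is a minimal standalone sketch of the store and lookup arithmetic, without shared memory, locks or atomics. The names XidMap, map_xmin(), csn_to_xmin(), MAP_SIZE and INVALID_XID are stand-ins chosen for this sketch. The rounding rules and gap filling follow CSNSnapshotMapXmin() and CSNSnapshotToXmin() in the diff.

```c
#include <assert.h>
#include <stdint.h>

#define NSECS_PER_SEC UINT64_C(1000000000)
#define MAP_SIZE      4     /* stands in for csn_snapshot_defer_time */
#define INVALID_XID   0u    /* stands in for InvalidTransactionId */

typedef struct
{
    int      head;                      /* slot of the freshest entry */
    uint64_t last_csn_seconds;          /* last rounded CSN recorded */
    uint32_t xmin_by_second[MAP_SIZE];  /* oldestXmin for each second */
} XidMap;

/* Record oldest_xmin for the second that snapshot_csn rounds up to. */
void
map_xmin(XidMap *map, uint64_t snapshot_csn, uint32_t oldest_xmin)
{
    uint64_t csn_seconds = snapshot_csn / NSECS_PER_SEC + 1;
    uint64_t gap;
    uint64_t i;
    uint32_t previous;
    int      offset;

    if (map->last_csn_seconds >= csn_seconds)
        return;                 /* this second is already filled */

    gap = csn_seconds - map->last_csn_seconds;
    if (gap > MAP_SIZE)
        gap = MAP_SIZE;
    previous = map->xmin_by_second[map->head];
    offset = (int) (csn_seconds % MAP_SIZE);

    map->last_csn_seconds = csn_seconds;
    map->head = offset;
    map->xmin_by_second[offset] = oldest_xmin;

    /* Seconds skipped since the last call get the previous value. */
    for (i = 1; i < gap; i++)
    {
        offset = (offset + MAP_SIZE - 1) % MAP_SIZE;
        map->xmin_by_second[offset] = previous;
    }
}

/* Look up the xmin for a snapshot; INVALID_XID means "snapshot too old". */
uint32_t
csn_to_xmin(const XidMap *map, uint64_t snapshot_csn)
{
    uint64_t csn_seconds = snapshot_csn / NSECS_PER_SEC;   /* round down */

    if (csn_seconds > map->last_csn_seconds)
        return map->xmin_by_second[map->head];  /* latest known value */
    if (map->last_csn_seconds - csn_seconds < MAP_SIZE)
        return map->xmin_by_second[csn_seconds % MAP_SIZE];
    return INVALID_XID;
}
```

Storing rounds the CSN up and looking up rounds it down, so a lookup can only return an xmin that is as old as or older than the one in effect when the snapshot was taken. That is the conservative direction for keeping old tuple versions.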
diff --git a/src/backend/access/transam/csn_snapshot.c b/src/backend/access/transam/csn_snapshot.c
index cedce60a6f..19106cd93a 100644
--- a/src/backend/access/transam/csn_snapshot.c
+++ b/src/backend/access/transam/csn_snapshot.c
@@ -50,10 +50,62 @@ typedef struct
static CSNSnapshotState *csnState;
+
/*
- * Enables this module.
+ * GUC to delay advance of oldestXid for this amount of time. Also determines
+ * the size of the CSNSnapshotXidMap circular buffer.
*/
-extern bool enable_csn_snapshot;
+int csn_snapshot_defer_time;
+
+
+/*
+ * CSNSnapshotXidMap
+ *
+ * To be able to install a csn snapshot that points to the past we need to
+ * keep old versions of tuples and therefore delay the advance of oldestXid.
+ * Here we keep track of the correspondence between a snapshot's snapshot_csn
+ * and the oldestXid that was set when the snapshot was taken. This is much
+ * like what "snapshot too old"'s OldSnapshotControlData does, but with a
+ * finer granularity of one second.
+ *
+ * Different strategies can be employed to hold oldestXid (e.g. we can track
+ * the oldest csn-based snapshot among cluster nodes and map it to oldestXid
+ * on each node).
+ *
+ * On each snapshot acquisition CSNSnapshotMapXmin() is called and stores
+ * correspondence between current snapshot_csn and oldestXmin in a sparse way:
+ * snapshot_csn is rounded to seconds (and here we use the fact that snapshot_csn
+ * is just a timestamp) and oldestXmin is stored in the circular buffer where
+ * rounded snapshot_csn acts as an offset from current circular buffer head.
+ * Size of the circular buffer is controlled by csn_snapshot_defer_time GUC.
+ *
+ * When a csn snapshot arrives we check that its snapshot_csn is still in
+ * our map; otherwise we error out with a "snapshot too old" message. If
+ * snapshot_csn is successfully mapped to oldestXid we move the
+ * backend's pgxact->xmin to proc->originalXmin and set pgxact->xmin to the
+ * mapped oldestXid. That way GetOldestXmin() can take into account backends
+ * with an imported csn snapshot and old tuple versions will be preserved.
+ *
+ * Also, while calculating oldestXmin for our map in the presence of imported
+ * csn snapshots we should use proc->originalXmin instead of the pgxact->xmin
+ * that was set during import. Otherwise, we can create a feedback loop:
+ * xmins of imported csn snapshots were calculated using our map, and new
+ * entries in the map would be calculated based on those xmins, so there is
+ * a risk of getting stuck forever with one non-increasing oldestXmin. All
+ * other callers of GetOldestXmin() use pgxact->xmin, so the old tuple
+ * versions are preserved.
+ */
+typedef struct CSNSnapshotXidMap
+{
+ int head; /* offset of current freshest value */
+ int size; /* total size of circular buffer */
+ CSN_atomic last_csn_seconds; /* last rounded csn that changed
+ * xmin_by_second[] */
+ TransactionId *xmin_by_second; /* circular buffer of oldestXmin's */
+}
+CSNSnapshotXidMap;
+
+static CSNSnapshotXidMap *csnXidMap;
/* Estimate shared memory space needed */
@@ -64,25 +116,249 @@ CSNSnapshotShmemSize(void)
size += MAXALIGN(sizeof(CSNSnapshotState));
+ if (csn_snapshot_defer_time > 0)
+ {
+ size += sizeof(CSNSnapshotXidMap);
+ size += csn_snapshot_defer_time*sizeof(TransactionId);
+ size = MAXALIGN(size);
+ }
+
return size;
}
/* Init shared memory structures */
void
-CSNSnapshotShmemInit()
+CSNSnapshotShmemInit(void)
{
bool found;
- csnState = ShmemInitStruct("csnState",
- sizeof(CSNSnapshotState),
- &found);
- if (!found)
+ if (true)
+ {
+ csnState = ShmemInitStruct("csnState",
+ sizeof(CSNSnapshotState),
+ &found);
+ if (!found)
+ {
+ csnState->last_max_csn = 0;
+ csnState->last_csn_log_wal = 0;
+ csnState->xmin_for_csn = InvalidTransactionId;
+ SpinLockInit(&csnState->lock);
+ }
+ }
+
+ if (csn_snapshot_defer_time > 0)
+ {
+ csnXidMap = ShmemInitStruct("csnXidMap",
+ sizeof(CSNSnapshotXidMap),
+ &found);
+ if (!found)
+ {
+ int i;
+
+ pg_atomic_init_u64(&csnXidMap->last_csn_seconds, 0);
+ csnXidMap->head = 0;
+ csnXidMap->size = csn_snapshot_defer_time;
+ csnXidMap->xmin_by_second =
+ ShmemAlloc(sizeof(TransactionId)*csnXidMap->size);
+
+ for (i = 0; i < csnXidMap->size; i++)
+ csnXidMap->xmin_by_second[i] = InvalidTransactionId;
+ }
+ }
+}
+
+/*
+ * CSNSnapshotStartup
+ *
+ * Set csnXidMap entries to oldestActiveXID during startup.
+ */
+void
+CSNSnapshotStartup(TransactionId oldestActiveXID)
+{
+ /*
+ * Run only if we have initialized shared memory and csnXidMap
+ * is enabled.
+ */
+ if (IsNormalProcessingMode() &&
+ enable_csn_snapshot && csn_snapshot_defer_time > 0)
+ {
+ int i;
+
+ Assert(TransactionIdIsValid(oldestActiveXID));
+ for (i = 0; i < csnXidMap->size; i++)
+ csnXidMap->xmin_by_second[i] = oldestActiveXID;
+ ProcArraySetCSNSnapshotXmin(oldestActiveXID);
+ }
+}
+
+/*
+ * CSNSnapshotMapXmin
+ *
+ * Maintain a circular buffer of oldestXmins for several seconds in the past.
+ * This buffer allows oldestXmin to be shifted into the past when a backend is
+ * importing a CSN snapshot. Otherwise old versions of tuples that are needed
+ * by this transaction can be recycled by other processes (vacuum, HOT, etc).
+ *
+ * Locking here is not trivial. Called upon each snapshot creation after
+ * ProcArrayLock is released. Such usage creates several race conditions. It
+ * is possible that a backend that got its csn calls CSNSnapshotMapXmin()
+ * only after other backends have managed to get a snapshot and complete their
+ * CSNSnapshotMapXmin() call, or have even committed. This is safe because:
+ *
+ * * We already hold our xmin in MyPgXact, so our snapshot will not be
+ * harmed even though ProcArrayLock is released.
+ *
+ * * snapshot_csn is always pessimistically rounded up to the next
+ * second.
+ *
+ * * For performance reasons, xmin value for particular second is filled
+ * only once. Because of that instead of writing to buffer just our
+ * xmin (which is enough for our snapshot), we bump oldestXmin there --
+ * it mitigates the possibility of damaging someone else's snapshot by
+ * writing to the buffer too advanced value in case of slowness of
+ * another backend who generated csn earlier, but didn't manage to
+ * insert it before us.
+ *
+ * * If CSNSnapshotMapXmin() finds a gap of several seconds between the
+ * current call and the latest completed call then it should fill that
+ * gap with the latest known values instead of the new one. Otherwise it
+ * is possible (however highly unlikely) that this gap also happened
+ * between taking a snapshot and the call to CSNSnapshotMapXmin() for
+ * some backend, and we risk filling the circular buffer with
+ * oldestXmins that are larger than they actually were.
+ */
+void
+CSNSnapshotMapXmin(SnapshotCSN snapshot_csn)
+{
+ int offset, gap, i;
+ SnapshotCSN csn_seconds;
+ SnapshotCSN last_csn_seconds;
+ volatile TransactionId oldest_deferred_xmin;
+ TransactionId current_oldest_xmin, previous_oldest_xmin;
+
+ /* Callers should check config values */
+ Assert(csn_snapshot_defer_time > 0);
+ Assert(csnXidMap != NULL);
+ /*
+ * Round up snapshot_csn to the next second -- pessimistically and safely.
+ */
+ csn_seconds = (snapshot_csn / NSECS_PER_SEC + 1);
+
+ /*
+ * Fast-path check. Avoid taking exclusive CSNSnapshotXidMapLock lock
+ * if oldestXid was already written to xmin_by_second[] for this rounded
+ * snapshot_csn.
+ */
+ if (pg_atomic_read_u64(&csnXidMap->last_csn_seconds) >= csn_seconds)
+ return;
+
+ /* Ok, we have new entry (or entries) */
+ LWLockAcquire(CSNSnapshotXidMapLock, LW_EXCLUSIVE);
+
+ /* Re-check last_csn_seconds under lock */
+ last_csn_seconds = pg_atomic_read_u64(&csnXidMap->last_csn_seconds);
+ if (last_csn_seconds >= csn_seconds)
+ {
+ LWLockRelease(CSNSnapshotXidMapLock);
+ return;
+ }
+ pg_atomic_write_u64(&csnXidMap->last_csn_seconds, csn_seconds);
+
+ /*
+ * Compute oldest_xmin.
+ *
+ * It would be possible to calculate oldest_xmin during the corresponding
+ * snapshot creation, but GetSnapshotData() intentionally reads only PgXact,
+ * not PgProc. We need originalXmin (see the comment on csnXidMap), which is
+ * stored in PgProc because the comments around PgXact warn against
+ * extending it with new fields. So just calculate oldest_xmin again;
+ * this happens quite rarely anyway.
+ */
+ current_oldest_xmin = GetOldestXmin(NULL, PROCARRAY_NON_IMPORTED_XMIN);
+
+ previous_oldest_xmin = csnXidMap->xmin_by_second[csnXidMap->head];
+
+ Assert(TransactionIdIsNormal(current_oldest_xmin));
+ Assert(TransactionIdIsNormal(previous_oldest_xmin) || !enable_csn_snapshot);
+
+ gap = csn_seconds - last_csn_seconds;
+ offset = csn_seconds % csnXidMap->size;
+
+ /* Sanity check before we update head and gap */
+ Assert( gap >= 1 );
+ Assert( (csnXidMap->head + gap) % csnXidMap->size == offset );
+
+ gap = gap > csnXidMap->size ? csnXidMap->size : gap;
+ csnXidMap->head = offset;
+
+ /* Fill new entry with current_oldest_xmin */
+ csnXidMap->xmin_by_second[offset] = current_oldest_xmin;
+
+ /*
+ * If we have gap then fill it with previous_oldest_xmin for reasons
+ * outlined in comment above this function.
+ */
+ for (i = 1; i < gap; i++)
+ {
+ offset = (offset + csnXidMap->size - 1) % csnXidMap->size;
+ csnXidMap->xmin_by_second[offset] = previous_oldest_xmin;
+ }
+
+ oldest_deferred_xmin =
+ csnXidMap->xmin_by_second[ (csnXidMap->head + 1) % csnXidMap->size ];
+
+ LWLockRelease(CSNSnapshotXidMapLock);
+
+ /*
+ * Advance procArray->csn_snapshot_xmin after we released
+ * CSNSnapshotXidMapLock. Since we gather not xmin but oldestXmin, it
+ * never goes backwards regardless of how slow we can do that.
+ */
+ Assert(TransactionIdFollowsOrEquals(oldest_deferred_xmin,
+ ProcArrayGetCSNSnapshotXmin()));
+ ProcArraySetCSNSnapshotXmin(oldest_deferred_xmin);
+}
+
+/*
+ * CSNSnapshotToXmin
+ *
+ * Get oldestXmin that took place when snapshot_csn was taken.
+ */
+TransactionId
+CSNSnapshotToXmin(SnapshotCSN snapshot_csn)
+{
+ TransactionId xmin;
+ SnapshotCSN csn_seconds;
+ volatile SnapshotCSN last_csn_seconds;
+
+ /* Callers should check config values */
+ Assert(csn_snapshot_defer_time > 0);
+ Assert(csnXidMap != NULL);
+
+ /* Round down to get conservative estimates */
+ csn_seconds = (snapshot_csn / NSECS_PER_SEC);
+
+ LWLockAcquire(CSNSnapshotXidMapLock, LW_SHARED);
+ last_csn_seconds = pg_atomic_read_u64(&csnXidMap->last_csn_seconds);
+ if (csn_seconds > last_csn_seconds)
+ {
+ /* we don't have entry for this snapshot_csn yet, return latest known */
+ xmin = csnXidMap->xmin_by_second[csnXidMap->head];
+ }
+ else if (last_csn_seconds - csn_seconds < csnXidMap->size)
{
- csnState->last_max_csn = 0;
- csnState->last_csn_log_wal = 0;
- csnState->xmin_for_csn = InvalidTransactionId;
- SpinLockInit(&csnState->lock);
+ /* we are good, retrieve value from our map */
+ Assert(last_csn_seconds % csnXidMap->size == csnXidMap->head);
+ xmin = csnXidMap->xmin_by_second[csn_seconds % csnXidMap->size];
}
+ else
+ {
+ /* requested snapshot_csn is too old, let caller know */
+ xmin = InvalidTransactionId;
+ }
+ LWLockRelease(CSNSnapshotXidMapLock);
+
+ return xmin;
}
/*
@@ -99,7 +375,7 @@ GenerateCSN(bool locked)
instr_time current_time;
SnapshotCSN csn;
- Assert(get_csnlog_status());
+ Assert(get_csnlog_status() || csn_snapshot_defer_time > 0);
/*
* TODO: create some macro that add small random shift to current time.
@@ -124,6 +400,125 @@ GenerateCSN(bool locked)
return csn;
}
+/*
+ * CSNSnapshotPrepareCurrent
+ *
+ * Set InDoubt state for currently active transaction and return commit's
+ * global snapshot.
+ */
+SnapshotCSN
+CSNSnapshotPrepareCurrent(void)
+{
+ TransactionId xid = GetCurrentTransactionIdIfAny();
+
+ if (!enable_csn_snapshot)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not prepare transaction for global commit"),
+ errhint("Make sure the configuration parameter \"%s\" is enabled.",
+ "enable_csn_snapshot")));
+
+ if (TransactionIdIsValid(xid))
+ {
+ TransactionId *subxids;
+ int nsubxids = xactGetCommittedChildren(&subxids);
+ CSNLogSetCSN(xid, nsubxids, subxids, InDoubtXidCSN, true);
+ }
+
+ /* Nothing to write if we don't have xid */
+
+ return GenerateCSN(false);
+}
+
+
+/*
+ * CSNSnapshotAssignCsnCurrent
+ *
+ * Assign SnapshotCSN for currently active transaction. SnapshotCSN is supposedly
+ * maximal among the values returned by CSNSnapshotPrepareCurrent and
+ * pg_csn_snapshot_prepare.
+ */
+void
+CSNSnapshotAssignCsnCurrent(SnapshotCSN snapshot_csn)
+{
+ if (!enable_csn_snapshot)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not prepare transaction for global commit"),
+ errhint("Make sure the configuration parameter \"%s\" is enabled.",
+ "enable_csn_snapshot")));
+
+ if (!XidCSNIsNormal(snapshot_csn))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("pg_csn_snapshot_assign expects normal snapshot_csn")));
+
+ /* Skip empty transactions */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return;
+
+ /* Set global_csn and defuse ProcArrayEndTransaction from assigning one */
+ pg_atomic_write_u64(&MyProc->assignedXidCsn, snapshot_csn);
+}
+
+/*
+ * CSNSnapshotSync
+ *
+ * Due to clock desynchronization between nodes we can receive a snapshot_csn
+ * which is greater than snapshot_csn on this node. To preserve proper isolation
+ * this node needs to wait until such snapshot_csn is reached by the local clock.
+ *
+ * This should happen relatively rarely if nodes are running NTP/PTP/etc.
+ * Complain if wait time is more than SNAP_DESYNC_COMPLAIN.
+ */
+void
+CSNSnapshotSync(SnapshotCSN remote_csn)
+{
+ SnapshotCSN local_csn;
+ SnapshotCSN delta;
+
+ Assert(enable_csn_snapshot);
+
+ for(;;)
+ {
+ SpinLockAcquire(&csnState->lock);
+ if (csnState->last_max_csn > remote_csn)
+ {
+ /* Everything is fine */
+ SpinLockRelease(&csnState->lock);
+ return;
+ }
+ else if ((local_csn = GenerateCSN(true)) >= remote_csn)
+ {
+ /*
+ * Everything is fine too, but last_max_csn wasn't updated for
+ * some time.
+ */
+ SpinLockRelease(&csnState->lock);
+ return;
+ }
+ SpinLockRelease(&csnState->lock);
+
+ /* Okay we need to sleep now */
+ delta = remote_csn - local_csn;
+ if (delta > SNAP_DESYNC_COMPLAIN)
+ ereport(WARNING,
+ (errmsg("remote global snapshot exceeds ours by more than a second"),
+ errhint("Consider running NTPd on servers participating in global transaction")));
+
+ /* TODO: report this sleeptime somewhere? */
+ pg_usleep((long) (delta/NSECS_PER_USEC));
+
+ /*
+ * Loop that checks to ensure that we actually slept for specified
+ * amount of time.
+ */
+ }
+
+ Assert(false); /* Should not happen */
+ return;
+}
+
/*
* TransactionIdGetXidCSN
*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 57bda5d422..7f90520beb 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2469,3 +2469,128 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
RemoveTwoPhaseFile(xid, giveWarning);
RemoveGXact(gxact);
}
+
+/*
+ * CSNSnapshotPrepareTwophase
+ *
+ * Set InDoubt state for the prepared transaction identified by gid and return
+ * commit's global snapshot.
+ */
+static SnapshotCSN
+CSNSnapshotPrepareTwophase(const char *gid)
+{
+ GlobalTransaction gxact;
+ PGXACT *pgxact;
+ char *buf;
+ TransactionId xid;
+ xl_xact_parsed_prepare parsed;
+
+ if (!enable_csn_snapshot)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not prepare transaction for global commit"),
+ errhint("Make sure the configuration parameter \"%s\" is enabled.",
+ "enable_csn_snapshot")));
+
+ /*
+ * Validate the GID, and lock the GXACT to ensure that two backends do not
+ * try to access the same GID at once.
+ */
+ gxact = LockGXact(gid, GetUserId());
+ pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
+ xid = pgxact->xid;
+
+ if (gxact->ondisk)
+ buf = ReadTwoPhaseFile(xid, true);
+ else
+ XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+
+ ParsePrepareRecord(0, (xl_xact_prepare *)buf, &parsed);
+
+ CSNLogSetCSN(xid, parsed.nsubxacts,
+ parsed.subxacts, InDoubtXidCSN, true);
+
+ /* Unlock our GXACT */
+ LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+ gxact->locking_backend = InvalidBackendId;
+ LWLockRelease(TwoPhaseStateLock);
+
+ pfree(buf);
+
+ return GenerateCSN(false);
+}
+
+/*
+ * CSNSnapshotAssignCsnTwoPhase
+ *
+ * Assign SnapshotCSN for the prepared transaction identified by gid.
+ * SnapshotCSN is supposedly maximal among the values returned by
+ * CSNSnapshotPrepareCurrent and pg_csn_snapshot_prepare.
+ *
+ * This function is a counterpart of CSNSnapshotAssignCsnCurrent() for
+ * twophase transactions.
+ */
+static void
+CSNSnapshotAssignCsnTwoPhase(const char *gid, SnapshotCSN snapshot_csn)
+{
+ GlobalTransaction gxact;
+ PGPROC *proc;
+
+ if (!enable_csn_snapshot)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not prepare transaction for global commit"),
+ errhint("Make sure the configuration parameter \"%s\" is enabled.",
+ "enable_csn_snapshot")));
+
+ if (!XidCSNIsNormal(snapshot_csn))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("pg_csn_snapshot_assign expects normal snapshot_csn")));
+
+ /*
+ * Validate the GID, and lock the GXACT to ensure that two backends do not
+ * try to access the same GID at once.
+ */
+ gxact = LockGXact(gid, GetUserId());
+ proc = &ProcGlobal->allProcs[gxact->pgprocno];
+
+ /* Set snapshot_csn and defuse ProcArrayRemove from assigning one. */
+ pg_atomic_write_u64(&proc->assignedXidCsn, snapshot_csn);
+
+ /* Unlock our GXACT */
+ LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+ gxact->locking_backend = InvalidBackendId;
+ LWLockRelease(TwoPhaseStateLock);
+}
+
+/*
+ * SQL interface to CSNSnapshotPrepareTwophase()
+ *
+ * TODO: Rewrite this as PREPARE TRANSACTION 'gid' RETURNING SNAPSHOT
+ */
+Datum
+pg_csn_snapshot_prepare(PG_FUNCTION_ARGS)
+{
+ const char *gid = text_to_cstring(PG_GETARG_TEXT_PP(0));
+ SnapshotCSN snapshot_csn;
+
+ snapshot_csn = CSNSnapshotPrepareTwophase(gid);
+
+ PG_RETURN_INT64(snapshot_csn);
+}
+
+/*
+ * SQL interface to CSNSnapshotAssignCsnTwoPhase()
+ *
+ * TODO: Rewrite this as COMMIT PREPARED 'gid' SNAPSHOT 'global_csn'
+ */
+Datum
+pg_csn_snapshot_assign(PG_FUNCTION_ARGS)
+{
+ const char *gid = text_to_cstring(PG_GETARG_TEXT_PP(0));
+ SnapshotCSN snapshot_csn = PG_GETARG_INT64(1);
+
+ CSNSnapshotAssignCsnTwoPhase(gid, snapshot_csn);
+ PG_RETURN_VOID();
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 32f1e614b4..0a1c7d8615 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7071,6 +7071,7 @@ StartupXLOG(void)
*/
StartupCLOG();
StartupSUBTRANS(oldestActiveXID);
+ CSNSnapshotStartup(oldestActiveXID);
/*
* If we're beginning at a shutdown checkpoint, we know that
@@ -7888,6 +7889,7 @@ StartupXLOG(void)
{
StartupCLOG();
StartupSUBTRANS(oldestActiveXID);
+ CSNSnapshotStartup(oldestActiveXID);
}
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e326b431c2..b36a85cd01 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -96,6 +96,8 @@ typedef struct ProcArrayStruct
TransactionId replication_slot_xmin;
/* oldest catalog xmin of any replication slot */
TransactionId replication_slot_catalog_xmin;
+ /* xmin of oldest active csn snapshot */
+ TransactionId csn_snapshot_xmin;
/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
int pgprocnos[FLEXIBLE_ARRAY_MEMBER];
@@ -251,6 +253,7 @@ CreateSharedProcArray(void)
procArray->lastOverflowedXid = InvalidTransactionId;
procArray->replication_slot_xmin = InvalidTransactionId;
procArray->replication_slot_catalog_xmin = InvalidTransactionId;
+ procArray->csn_snapshot_xmin = InvalidTransactionId;
}
allProcs = ProcGlobal->allProcs;
@@ -442,6 +445,8 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
proc->lxid = InvalidLocalTransactionId;
pgxact->xmin = InvalidTransactionId;
+ proc->originalXmin = InvalidTransactionId;
+
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
proc->delayChkpt = false; /* be sure this is cleared in abort */
@@ -464,6 +469,8 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
pgxact->xid = InvalidTransactionId;
proc->lxid = InvalidLocalTransactionId;
pgxact->xmin = InvalidTransactionId;
+ proc->originalXmin = InvalidTransactionId;
+
/* must be cleared with xid/xmin: */
pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
proc->delayChkpt = false; /* be sure this is cleared in abort */
@@ -630,6 +637,7 @@ ProcArrayClearTransaction(PGPROC *proc)
pgxact->xid = InvalidTransactionId;
proc->lxid = InvalidLocalTransactionId;
pgxact->xmin = InvalidTransactionId;
+ proc->originalXmin = InvalidTransactionId;
proc->recoveryConflictPending = false;
/* redundant, but just in case */
@@ -1332,6 +1340,7 @@ GetOldestXmin(Relation rel, int flags)
TransactionId replication_slot_xmin = InvalidTransactionId;
TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
+ TransactionId csn_snapshot_xmin = InvalidTransactionId;
/*
* If we're not computing a relation specific limit, or if a shared
@@ -1370,6 +1379,7 @@ GetOldestXmin(Relation rel, int flags)
{
/* Fetch xid just once - see GetNewTransactionId */
TransactionId xid = UINT32_ACCESS_ONCE(pgxact->xid);
+ TransactionId original_xmin = UINT32_ACCESS_ONCE(proc->originalXmin);
/* First consider the transaction's own Xid, if any */
if (TransactionIdIsNormal(xid) &&
@@ -1382,8 +1392,17 @@ GetOldestXmin(Relation rel, int flags)
* We must check both Xid and Xmin because a transaction might
* have an Xmin but not (yet) an Xid; conversely, if it has an
* Xid, that could determine some not-yet-set Xmin.
+ *
+ * In case of oldestXmin calculation for CSNSnapshotMapXmin()
+ * pgxact->xmin should be changed to proc->originalXmin. Details
+ * in comments to CSNSnapshotMapXmin.
*/
- xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+ if ((flags & PROCARRAY_NON_IMPORTED_XMIN) &&
+ TransactionIdIsValid(original_xmin))
+ xid = original_xmin;
+ else
+ xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+
if (TransactionIdIsNormal(xid) &&
TransactionIdPrecedes(xid, result))
result = xid;
@@ -1397,6 +1416,7 @@ GetOldestXmin(Relation rel, int flags)
*/
replication_slot_xmin = procArray->replication_slot_xmin;
replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+ csn_snapshot_xmin = procArray->csn_snapshot_xmin;
if (RecoveryInProgress())
{
@@ -1438,6 +1458,11 @@ GetOldestXmin(Relation rel, int flags)
result = FirstNormalTransactionId;
}
+ if (!(flags & PROCARRAY_NON_IMPORTED_XMIN) &&
+ TransactionIdIsValid(csn_snapshot_xmin) &&
+ NormalTransactionIdPrecedes(csn_snapshot_xmin, result))
+ result = csn_snapshot_xmin;
+
/*
* Check whether there are replication slots requiring an older xmin.
*/
@@ -1535,6 +1560,7 @@ GetSnapshotData(Snapshot snapshot)
XidCSN xid_csn = FrozenXidCSN;
TransactionId replication_slot_xmin = InvalidTransactionId;
TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
+ TransactionId csn_snapshot_xmin = InvalidTransactionId;
Assert(snapshot != NULL);
@@ -1726,6 +1752,7 @@ GetSnapshotData(Snapshot snapshot)
*/
replication_slot_xmin = procArray->replication_slot_xmin;
replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+ csn_snapshot_xmin = procArray->csn_snapshot_xmin;
if (!TransactionIdIsValid(MyPgXact->xmin))
MyPgXact->xmin = TransactionXmin = xmin;
@@ -1752,6 +1779,10 @@ GetSnapshotData(Snapshot snapshot)
if (!TransactionIdIsNormal(RecentGlobalXmin))
RecentGlobalXmin = FirstNormalTransactionId;
+ if (TransactionIdIsValid(csn_snapshot_xmin) &&
+ TransactionIdPrecedes(csn_snapshot_xmin, RecentGlobalXmin))
+ RecentGlobalXmin = csn_snapshot_xmin;
+
/* Check whether there's a replication slot requiring an older xmin. */
if (TransactionIdIsValid(replication_slot_xmin) &&
NormalTransactionIdPrecedes(replication_slot_xmin, RecentGlobalXmin))
@@ -1807,7 +1838,10 @@ GetSnapshotData(Snapshot snapshot)
MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
}
+ snapshot->imported_snapshot_csn = false;
snapshot->snapshot_csn = xid_csn;
+ if (csn_snapshot_defer_time > 0 && IsUnderPostmaster)
+ CSNSnapshotMapXmin(snapshot->snapshot_csn);
return snapshot;
}
@@ -3156,6 +3190,24 @@ ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
LWLockRelease(ProcArrayLock);
}
+/*
+ * ProcArraySetCSNSnapshotXmin
+ */
+void
+ProcArraySetCSNSnapshotXmin(TransactionId xmin)
+{
+ /* We rely on atomic fetch/store of xid */
+ procArray->csn_snapshot_xmin = xmin;
+}
+
+/*
+ * ProcArrayGetCSNSnapshotXmin
+ */
+TransactionId
+ProcArrayGetCSNSnapshotXmin(void)
+{
+ return procArray->csn_snapshot_xmin;
+}
#define XidCacheRemove(i) \
do { \
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 3c95ce4aac..e048a2276d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -51,3 +51,4 @@ OldSnapshotTimeMapLock 42
LogicalRepWorkerLock 43
XactTruncationLock 44
CSNLogControlLock 45
+CSNSnapshotXidMapLock 46
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 2a31366930..2bfafa69c1 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -2394,3 +2394,98 @@ XidInLocalMVCCSnapshot(TransactionId xid, Snapshot snapshot)
return false;
}
+
+
+/*
+ * ExportCSNSnapshot
+ *
+ * Export snapshot_csn so that caller can expand this transaction to other
+ * nodes.
+ *
+ * TODO: it's better to do this through EXPORT/IMPORT SNAPSHOT syntax and
+ * add some additional checks that transaction has not yet acquired an xid, but
+ * for current iteration of this patch I don't want to hack on parser.
+ */
+SnapshotCSN
+ExportCSNSnapshot(void)
+{
+ if (!get_csnlog_status())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not export csn snapshot"),
+ errhint("Make sure the configuration parameter \"%s\" is enabled.",
+ "enable_csn_snapshot")));
+
+ return CurrentSnapshot->snapshot_csn;
+}
+
+/* SQL accessor to ExportCSNSnapshot() */
+Datum
+pg_csn_snapshot_export(PG_FUNCTION_ARGS)
+{
+ SnapshotCSN export_csn = ExportCSNSnapshot();
+ PG_RETURN_UINT64(export_csn);
+}
+
+/*
+ * ImportCSNSnapshot
+ *
+ * Import csn and retract this backend's xmin to the value that was
+ * actual when we had such csn.
+ *
+ * TODO: it's better to do this through EXPORT/IMPORT SNAPSHOT syntax and
+ * add some additional checks that transaction has not yet acquired an xid, but
+ * for current iteration of this patch I don't want to hack on parser.
+ */
+void
+ImportCSNSnapshot(SnapshotCSN snapshot_csn)
+{
+ volatile TransactionId xmin;
+
+ if (!get_csnlog_status())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not import csn snapshot"),
+ errhint("Make sure the configuration parameter \"%s\" is enabled.",
+ "enable_csn_snapshot")));
+
+ if (csn_snapshot_defer_time <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("could not import csn snapshot"),
+ errhint("Make sure the configuration parameter \"%s\" is positive.",
+ "csn_snapshot_defer_time")));
+
+ /*
+ * Call CSNSnapshotToXmin under ProcArrayLock to avoid situation that
+ * resulting xmin will be evicted from map before we will set it into our
+ * backend's xmin.
+ */
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+ xmin = CSNSnapshotToXmin(snapshot_csn);
+ if (!TransactionIdIsValid(xmin))
+ {
+ LWLockRelease(ProcArrayLock);
+ elog(ERROR, "CSNSnapshotToXmin: csn snapshot too old");
+ }
+ MyProc->originalXmin = MyPgXact->xmin;
+ MyPgXact->xmin = TransactionXmin = xmin;
+ LWLockRelease(ProcArrayLock);
+
+ CurrentSnapshot->xmin = xmin; /* defuse SnapshotResetXmin() */
+ CurrentSnapshot->snapshot_csn = snapshot_csn;
+ CurrentSnapshot->imported_snapshot_csn = true;
+ CSNSnapshotSync(snapshot_csn);
+
+ Assert(TransactionIdPrecedesOrEquals(RecentGlobalXmin, xmin));
+ Assert(TransactionIdPrecedesOrEquals(RecentGlobalDataXmin, xmin));
+}
+
+/* SQL accessor to ImportCSNSnapshot() */
+Datum
+pg_csn_snapshot_import(PG_FUNCTION_ARGS)
+{
+ SnapshotCSN snapshot_csn = PG_GETARG_UINT64(0);
+ ImportCSNSnapshot(snapshot_csn);
+ PG_RETURN_VOID();
+}
\ No newline at end of file
diff --git a/src/include/access/csn_snapshot.h b/src/include/access/csn_snapshot.h
index a768f054f5..92bc9c77bf 100644
--- a/src/include/access/csn_snapshot.h
+++ b/src/include/access/csn_snapshot.h
@@ -38,11 +38,13 @@ typedef pg_atomic_uint64 CSN_atomic;
#define CSNIsUnclear(csn) ((csn) == UnclearCSN)
#define XidCSNIsNormal(csn) ((csn) >= FirstNormalXidCSN)
-
-
+extern int csn_snapshot_defer_time;
extern Size CSNSnapshotShmemSize(void);
extern void CSNSnapshotShmemInit(void);
+extern void CSNSnapshotStartup(TransactionId oldestActiveXID);
+extern void CSNSnapshotMapXmin(SnapshotCSN snapshot_csn);
+extern TransactionId CSNSnapshotToXmin(SnapshotCSN snapshot_csn);
extern SnapshotCSN GenerateCSN(bool locked);
@@ -56,5 +58,8 @@ extern void CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid, int nsubxids,
TransactionId *subxids);
extern void CSNSnapshotCommit(PGPROC *proc, TransactionId xid, int nsubxids,
TransactionId *subxids);
+extern void CSNSnapshotAssignCsnCurrent(SnapshotCSN snapshot_csn);
+extern SnapshotCSN CSNSnapshotPrepareCurrent(void);
+extern void CSNSnapshotSync(SnapshotCSN remote_csn);
#endif /* CSN_SNAPSHOT_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 95604e988a..17e85486ae 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10953,4 +10953,18 @@
proname => 'is_normalized', prorettype => 'bool', proargtypes => 'text text',
prosrc => 'unicode_is_normalized' },
+# csn snapshot handling
+{ oid => '4179', descr => 'export csn snapshot',
+ proname => 'pg_csn_snapshot_export', provolatile => 'v', proparallel => 'u',
+ prorettype => 'int8', proargtypes => '', prosrc => 'pg_csn_snapshot_export' },
+{ oid => '4180', descr => 'import csn snapshot',
+ proname => 'pg_csn_snapshot_import', provolatile => 'v', proparallel => 'u',
+ prorettype => 'void', proargtypes => 'int8', prosrc => 'pg_csn_snapshot_import' },
+{ oid => '4198', descr => 'prepare distributed transaction for commit, get global_csn',
+ proname => 'pg_csn_snapshot_prepare', provolatile => 'v', proparallel => 'u',
+ prorettype => 'int8', proargtypes => 'text', prosrc => 'pg_csn_snapshot_prepare' },
+{ oid => '4199', descr => 'assign global_csn to distributed transaction',
+ proname => 'pg_csn_snapshot_assign', provolatile => 'v', proparallel => 'u',
+ prorettype => 'void', proargtypes => 'text int8', prosrc => 'pg_csn_snapshot_assign' },
+
]
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 3ff7ea4fce..30bcbbfe15 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -222,6 +222,8 @@ struct PGPROC
*/
CSN_atomic assignedXidCsn;
+ /* Original xmin of this backend before csn snapshot was imported */
+ TransactionId originalXmin;
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a5c7d0c064..35dc1dcc40 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -36,6 +36,9 @@
#define PROCARRAY_SLOTS_XMIN 0x20 /* replication slot xmin,
* catalog_xmin */
+#define PROCARRAY_NON_IMPORTED_XMIN 0x80 /* use originalXmin instead
+ * of xmin to properly
+ * maintain csnXidMap */
/*
* Only flags in PROCARRAY_PROC_FLAGS_MASK are considered when matching
* PGXACT->vacuumFlags. Other flags are used for different purposes and
@@ -125,4 +128,6 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
TransactionId *catalog_xmin);
+extern void ProcArraySetCSNSnapshotXmin(TransactionId xmin);
+extern TransactionId ProcArrayGetCSNSnapshotXmin(void);
#endif /* PROCARRAY_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index ffb4ba3adf..0e37ebad07 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -127,6 +127,8 @@ extern void AtSubCommit_Snapshot(int level);
extern void AtSubAbort_Snapshot(int level);
extern void AtEOXact_Snapshot(bool isCommit, bool resetXmin);
+extern SnapshotCSN ExportCSNSnapshot(void);
+extern void ImportCSNSnapshot(SnapshotCSN snapshot_csn);
extern void ImportSnapshot(const char *idstr);
extern bool XactHasExportedSnapshots(void);
extern void DeleteAllExportedSnapshotFiles(void);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 9f622c76a7..2eef33c4b6 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -210,6 +210,8 @@ typedef struct SnapshotData
* Will be used only if enable_csn_snapshot is enabled.
*/
SnapshotCSN snapshot_csn;
+ /* Did we have our own snapshot_csn or imported one from different node */
+ bool imported_snapshot_csn;
} SnapshotData;
#endif /* SNAPSHOT_H */
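
For anyone reviewing CSNSnapshotToXmin() in the patch above, here is a small standalone sketch of the per-second ring buffer lookup it performs. This is only an illustration under my own assumptions: the names SketchXidMap, sketch_map_xmin and sketch_csn_to_xmin are hypothetical, the map size is fixed at 8, locking is left out, and the fill logic in sketch_map_xmin is a guess at what the writer side might do (CSNSnapshotMapXmin() itself is not shown here). Only the three-way branch in the lookup follows the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSECS_PER_SEC UINT64_C(1000000000)
#define SKETCH_MAP_SIZE      8	/* hypothetical, stands in for csnXidMap->size */
#define SKETCH_INVALID_XID   0u	/* mirrors InvalidTransactionId */

/* Hypothetical stand-in for csnXidMap: one xmin per wall-clock second. */
typedef struct SketchXidMap
{
	uint64_t	last_csn_seconds;	/* newest second recorded so far */
	uint32_t	xmin_by_second[SKETCH_MAP_SIZE];
} SketchXidMap;

/*
 * Writer side (assumed behaviour): record xmin for the second that contains
 * csn, filling any skipped seconds with the same xmin. At most
 * SKETCH_MAP_SIZE slots are touched, since older slots would be overwritten.
 */
void
sketch_map_xmin(SketchXidMap *map, uint64_t csn, uint32_t xmin)
{
	uint64_t	sec = csn / SKETCH_NSECS_PER_SEC;
	uint64_t	start;
	uint64_t	s;

	if (sec <= map->last_csn_seconds)
		return;					/* this second is already recorded */

	if (sec - map->last_csn_seconds > SKETCH_MAP_SIZE)
		start = sec - SKETCH_MAP_SIZE + 1;
	else
		start = map->last_csn_seconds + 1;

	for (s = start; s <= sec; s++)
		map->xmin_by_second[s % SKETCH_MAP_SIZE] = xmin;
	map->last_csn_seconds = sec;
}

/*
 * Reader side: same three cases as CSNSnapshotToXmin() in the patch.
 */
uint32_t
sketch_csn_to_xmin(const SketchXidMap *map, uint64_t csn)
{
	/* Round down to get a conservative estimate */
	uint64_t	sec = csn / SKETCH_NSECS_PER_SEC;

	if (sec > map->last_csn_seconds)
	{
		/* no entry for this csn yet, return the latest known xmin */
		return map->xmin_by_second[map->last_csn_seconds % SKETCH_MAP_SIZE];
	}
	else if (map->last_csn_seconds - sec < SKETCH_MAP_SIZE)
	{
		/* still inside the window, read the slot for that second */
		return map->xmin_by_second[sec % SKETCH_MAP_SIZE];
	}

	/* requested csn is too old, let the caller know */
	return SKETCH_INVALID_XID;
}
```

With a window of 8 seconds, a csn from 8 or more seconds before the newest recorded second maps to the invalid xid, which is the case where ImportCSNSnapshot() reports "csn snapshot too old".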
0001-CSN-base-snapshot.patch (application/octet-stream)
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..fc0321ee6b 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,8 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
clog.o \
commit_ts.o \
+ csn_log.o \
+ csn_snapshot.o \
generic_xlog.o \
multixact.o \
parallel.o \
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
new file mode 100644
index 0000000000..4e0b8d64e4
--- /dev/null
+++ b/src/backend/access/transam/csn_log.c
@@ -0,0 +1,438 @@
+/*-----------------------------------------------------------------------------
+ *
+ * csn_log.c
+ * Track commit sequence numbers of finished transactions
+ *
+ * This module provides SLRU to store CSN for each transaction. This
+ * mapping needs to be kept only for xids greater than oldestXid, but
+ * that can require arbitrarily large amounts of memory in case of long-lived
+ * transactions. Because of the same lifetime and persistence requirements
+ * this module is quite similar to subtrans.c
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_log.c
+ *
+ *-----------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+bool enable_csn_snapshot;
+
+/*
+ * Defines for CSNLog page sizes. A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE, and CSNLog segment numbering at
+ * 0xFFFFFFFF/CSN_LOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT. We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCSNLog (see CSNLogPagePrecedes).
+ */
+
+/* We store the commit CSN for each xid */
+#define CSN_LOG_XACTS_PER_PAGE (BLCKSZ / sizeof(XidCSN))
+
+#define TransactionIdToPage(xid) ((xid) / (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CSN_LOG_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CSNLog control
+ */
+static SlruCtlData CSNLogCtlData;
+#define CsnlogCtl (&CSNLogCtlData)
+
+static int ZeroCSNLogPage(int pageno);
+static bool CSNLogPagePrecedes(int page1, int page2);
+static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+ TransactionId *subxids,
+ XidCSN csn, int pageno);
+static void CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn,
+ int slotno);
+
+/*
+ * CSNLogSetCSN
+ *
+ * Record XidCSN of transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transactionid for a top level commit or abort. It can also be a
+ * subtransaction when we record transaction aborts.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ *
+ * csn is the commit sequence number of the transaction. It should be
+ * AbortedCSN for abort cases.
+ */
+void
+CSNLogSetCSN(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn)
+{
+ int pageno;
+ int i = 0;
+ int offset = 0;
+
+ /* Callers of CSNLogSetCSN() must check GUC params */
+ Assert(enable_csn_snapshot);
+
+ Assert(TransactionIdIsValid(xid));
+
+ pageno = TransactionIdToPage(xid); /* get page of parent */
+ for (;;)
+ {
+ int num_on_page = 0;
+
+ while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+ {
+ num_on_page++;
+ i++;
+ }
+
+ CSNLogSetPageStatus(xid,
+ num_on_page, subxids + offset,
+ csn, pageno);
+ if (i >= nsubxids)
+ break;
+
+ offset = i;
+ pageno = TransactionIdToPage(subxids[offset]);
+ xid = InvalidTransactionId;
+ }
+}
+
+/*
+ * Record the final state of transaction entries in the csn log for
+ * all entries on a single page. Atomic only on this page.
+ *
+ * Otherwise API is same as TransactionIdSetTreeStatus()
+ */
+static void
+CSNLogSetPageStatus(TransactionId xid, int nsubxids,
+ TransactionId *subxids,
+ XidCSN csn, int pageno)
+{
+ int slotno;
+ int i;
+
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ slotno = SimpleLruReadPage(CsnlogCtl, pageno, true, xid);
+
+ /* Subtransactions first, if needed ... */
+ for (i = 0; i < nsubxids; i++)
+ {
+ Assert(CsnlogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+ CSNLogSetCSNInSlot(subxids[i], csn, slotno);
+ }
+
+ /* ... then the main transaction */
+ if (TransactionIdIsValid(xid))
+ CSNLogSetCSNInSlot(xid, csn, slotno);
+
+ CsnlogCtl->shared->page_dirty[slotno] = true;
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * Sets the commit status of a single transaction.
+ */
+static void
+CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn, int slotno)
+{
+ int entryno = TransactionIdToPgIndex(xid);
+ XidCSN *ptr;
+
+ Assert(LWLockHeldByMe(CSNLogControlLock));
+
+ ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XidCSN));
+
+ *ptr = csn;
+}
+
+/*
+ * Interrogate the state of a transaction in the log.
+ *
+ * NB: this is a low-level routine and is NOT the preferred entry point
+ * for most uses; TransactionIdGetXidCSN() in csn_snapshot.c is the
+ * intended caller.
+ */
+XidCSN
+CSNLogGetCSNByXid(TransactionId xid)
+{
+ int pageno = TransactionIdToPage(xid);
+ int entryno = TransactionIdToPgIndex(xid);
+ int slotno;
+ XidCSN *ptr;
+ XidCSN xid_csn;
+
+ /* Callers of CSNLogGetCSNByXid() must check GUC params */
+ Assert(enable_csn_snapshot);
+
+ /* Can't ask about stuff that might not be around anymore */
+ Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+
+ /* lock is acquired by SimpleLruReadPage_ReadOnly */
+
+ slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
+ ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XidCSN));
+ xid_csn = *ptr;
+
+ LWLockRelease(CSNLogControlLock);
+
+ return xid_csn;
+}
+
+/*
+ * Number of shared CSNLog buffers.
+ */
+static Size
+CSNLogShmemBuffers(void)
+{
+ return Min(32, Max(4, NBuffers / 512));
+}
+
+/*
+ * Reserve shared memory for CsnlogCtl.
+ */
+Size
+CSNLogShmemSize(void)
+{
+ if (!enable_csn_snapshot)
+ return 0;
+
+ return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
+}
+
+/*
+ * Initialization of shared memory for CSNLog.
+ */
+void
+CSNLogShmemInit(void)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
+ SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
+ CSNLogControlLock, "pg_csn", LWTRANCHE_CSN_LOG_BUFFERS);
+}
+
+/*
+ * This func must be called ONCE on system install. It creates the initial
+ * CSNLog segment. The pg_csn directory is assumed to have been
+ * created by initdb, and CSNLogShmemInit must have been called already.
+ */
+void
+BootStrapCSNLog(void)
+{
+ int slotno;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ /* Create and zero the first page of the CSN log */
+ slotno = ZeroCSNLogPage(0);
+
+ /* Make sure it's written out */
+ SimpleLruWritePage(CsnlogCtl, slotno);
+ Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * Initialize (or reinitialize) a page of CSNLog to zeroes.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCSNLogPage(int pageno)
+{
+ Assert(LWLockHeldByMe(CSNLogControlLock));
+ return SimpleLruZeroPage(CsnlogCtl, pageno);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * after StartupXLOG has initialized ShmemVariableCache->nextXid.
+ *
+ * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
+ * if there are none.
+ */
+void
+StartupCSNLog(TransactionId oldestActiveXID)
+{
+ int startPage;
+ int endPage;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * Since we don't expect pg_csn to be valid across crashes, we
+ * initialize the currently-active page(s) to zeroes during startup.
+ * Whenever we advance into a new page, ExtendCSNLog will likewise
+ * zero the new page without regard to whatever was previously on disk.
+ */
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ startPage = TransactionIdToPage(oldestActiveXID);
+ endPage = TransactionIdToPage(XidFromFullTransactionId(ShmemVariableCache->nextFullXid));
+
+ while (startPage != endPage)
+ {
+ (void) ZeroCSNLogPage(startPage);
+ startPage++;
+ /* must account for wraparound */
+ if (startPage > TransactionIdToPage(MaxTransactionId))
+ startPage = 0;
+ }
+ (void) ZeroCSNLogPage(startPage);
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCSNLog(void)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * Flush dirty CSNLog pages to disk.
+ *
+ * This is not actually necessary from a correctness point of view. We do
+ * it merely as a debugging aid.
+ */
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(false);
+ SimpleLruFlush(CsnlogCtl, false);
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCSNLog(void)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * Flush dirty CSNLog pages to disk.
+ *
+ * This is not actually necessary from a correctness point of view. We do
+ * it merely to improve the odds that writing of dirty pages is done by
+ * the checkpoint process and not by backends.
+ */
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_START(true);
+ SimpleLruFlush(CsnlogCtl, true);
+ TRACE_POSTGRESQL_CSNLOG_CHECKPOINT_DONE(true);
+}
+
+/*
+ * Make sure that CSNLog has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock. We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty clog or xlog page to make room
+ * in shared memory.
+ */
+void
+ExtendCSNLog(TransactionId newestXact)
+{
+ int pageno;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * No work except at first XID of a page. But beware: just after
+ * wraparound, the first XID of page zero is FirstNormalTransactionId.
+ */
+ if (TransactionIdToPgIndex(newestXact) != 0 &&
+ !TransactionIdEquals(newestXact, FirstNormalTransactionId))
+ return;
+
+ pageno = TransactionIdToPage(newestXact);
+
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+
+ /* Zero the page and make an XLOG entry about it */
+ ZeroCSNLogPage(pageno);
+
+ LWLockRelease(CSNLogControlLock);
+}
+
+/*
+ * Remove all CSNLog segments before the one holding the passed
+ * transaction ID.
+ *
+ * This is normally called during checkpoint, with oldestXact being the
+ * oldest TransactionXmin of any running transaction.
+ */
+void
+TruncateCSNLog(TransactionId oldestXact)
+{
+ int cutoffPage;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /*
+ * The cutoff point is the start of the segment containing oldestXact. We
+ * pass the *page* containing oldestXact to SimpleLruTruncate. We step
+ * back one transaction to avoid passing a cutoff page that hasn't been
+ * created yet in the rare case that oldestXact would be the first item on
+ * a page and oldestXact == next XID. In that case, if we didn't subtract
+ * one, we'd trigger SimpleLruTruncate's wraparound detection.
+ */
+ TransactionIdRetreat(oldestXact);
+ cutoffPage = TransactionIdToPage(oldestXact);
+
+ SimpleLruTruncate(CsnlogCtl, cutoffPage);
+}
+
+/*
+ * Decide which of two CSNLog page numbers is "older" for truncation
+ * purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic. However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs. So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CSNLogPagePrecedes(int page1, int page2)
+{
+ TransactionId xid1;
+ TransactionId xid2;
+
+ xid1 = ((TransactionId) page1) * CSN_LOG_XACTS_PER_PAGE;
+ xid1 += FirstNormalTransactionId;
+ xid2 = ((TransactionId) page2) * CSN_LOG_XACTS_PER_PAGE;
+ xid2 += FirstNormalTransactionId;
+
+ return TransactionIdPrecedes(xid1, xid2);
+}
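
The page comparison in CSNLogPagePrecedes can be tried in isolation. What follows is a stand-alone model, not code from the patch: xid_precedes imitates the modulo-2^32 comparison that TransactionIdPrecedes applies to normal XIDs, and XACTS_PER_PAGE is assumed to be 1024 (an 8 kB page holding 8-byte CSNs). The real CSN_LOG_XACTS_PER_PAGE is defined earlier in csn_log.c and may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only; not taken from the patch. */
#define XACTS_PER_PAGE   1024u	/* hypothetical: 8192-byte page / 8-byte CSN */
#define FIRST_NORMAL_XID 3u		/* mirrors FirstNormalTransactionId */

/*
 * Circular comparison for normal XIDs: true if a is logically before b.
 * Relies on the usual two's-complement conversion to int32_t.
 */
static bool
xid_precedes(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Map each page to the first XID on it, shifted by FIRST_NORMAL_XID so that
 * page 0 never yields a special (non-normal) XID, then compare circularly.
 */
static bool
page_precedes(uint32_t page1, uint32_t page2)
{
	uint32_t	xid1 = page1 * XACTS_PER_PAGE + FIRST_NORMAL_XID;
	uint32_t	xid2 = page2 * XACTS_PER_PAGE + FIRST_NORMAL_XID;

	return xid_precedes(xid1, xid2);
}
```

With these assumed constants the last page before wraparound (4194303) compares as older than page 0, which is the case a plain integer comparison would get wrong.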
diff --git a/src/backend/access/transam/csn_snapshot.c b/src/backend/access/transam/csn_snapshot.c
new file mode 100644
index 0000000000..bcc5bac757
--- /dev/null
+++ b/src/backend/access/transam/csn_snapshot.c
@@ -0,0 +1,340 @@
+/*-------------------------------------------------------------------------
+ *
+ * csn_snapshot.c
+ * Support for cross-node snapshot isolation.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/csn_snapshot.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/csn_log.h"
+#include "access/csn_snapshot.h"
+#include "access/transam.h"
+#include "access/twophase.h"
+#include "access/xact.h"
+#include "portability/instr_time.h"
+#include "storage/lmgr.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+#include "miscadmin.h"
+
+/* Raise a warning if imported snapshot_csn exceeds ours by this value. */
+#define SNAP_DESYNC_COMPLAIN (1*NSECS_PER_SEC) /* 1 second */
+
+/*
+ * CSNSnapshotState
+ *
+ * Do not trust local clocks to be strictly monotonic; save the last acquired
+ * value so that the next timestamp can be compared with it. Accessed through
+ * GenerateCSN().
+ */
+typedef struct
+{
+ SnapshotCSN last_max_csn;
+ volatile slock_t lock;
+} CSNSnapshotState;
+
+static CSNSnapshotState *csnState;
+
+/*
+ * Enables this module.
+ */
+extern bool enable_csn_snapshot;
+
+
+/* Estimate shared memory space needed */
+Size
+CSNSnapshotShmemSize(void)
+{
+ Size size = 0;
+
+ if (enable_csn_snapshot)
+ {
+ size += MAXALIGN(sizeof(CSNSnapshotState));
+ }
+
+ return size;
+}
+
+/* Init shared memory structures */
+void
+CSNSnapshotShmemInit(void)
+{
+ bool found;
+
+ if (enable_csn_snapshot)
+ {
+ csnState = ShmemInitStruct("csnState",
+ sizeof(CSNSnapshotState),
+ &found);
+ if (!found)
+ {
+ csnState->last_max_csn = 0;
+ SpinLockInit(&csnState->lock);
+ }
+ }
+}
+
+/*
+ * GenerateCSN
+ *
+ * Generate a SnapshotCSN, which is actually the local time, forced to be
+ * strictly increasing. Since it is now not uncommon to have millions of
+ * read transactions per second, we use nanoseconds if such time resolution
+ * is available.
+ */
+SnapshotCSN
+GenerateCSN(bool locked)
+{
+ instr_time current_time;
+ SnapshotCSN csn;
+
+ Assert(enable_csn_snapshot);
+
+ /*
+ * TODO: create a macro that adds a small random shift to the current time.
+ */
+ INSTR_TIME_SET_CURRENT(current_time);
+ csn = (SnapshotCSN) INSTR_TIME_GET_NANOSEC(current_time);
+
+ /* TODO: change to atomics? */
+ if (!locked)
+ SpinLockAcquire(&csnState->lock);
+
+ if (csn <= csnState->last_max_csn)
+ csn = ++csnState->last_max_csn;
+ else
+ csnState->last_max_csn = csn;
+
+ if (!locked)
+ SpinLockRelease(&csnState->lock);
+
+ return csn;
+}
+
+/*
+ * TransactionIdGetXidCSN
+ *
+ * Get the XidCSN for the specified TransactionId, taking care of special
+ * xids, xids older than TransactionXmin, and InDoubt states.
+ */
+XidCSN
+TransactionIdGetXidCSN(TransactionId xid)
+{
+ XidCSN xid_csn;
+
+ Assert(enable_csn_snapshot);
+
+ /* Handle permanent TransactionId's for which we don't have mapping */
+ if (!TransactionIdIsNormal(xid))
+ {
+ if (xid == InvalidTransactionId)
+ return AbortedXidCSN;
+ if (xid == FrozenTransactionId || xid == BootstrapTransactionId)
+ return FrozenXidCSN;
+ Assert(false); /* Should not happen */
+ }
+
+ /*
+ * For xids older than TransactionXmin the CSNLog may already have been
+ * truncated, but we know that such a transaction is definitely not running
+ * concurrently according to any snapshot, including time-travel ones.
+ * Callers should check TransactionIdDidCommit afterwards.
+ */
+ if (TransactionIdPrecedes(xid, TransactionXmin))
+ return FrozenXidCSN;
+
+ /* Read XidCSN from SLRU */
+ xid_csn = CSNLogGetCSNByXid(xid);
+
+ /*
+ * If we see the InDoubt state, the transaction is being committed and we
+ * must wait until its XidCSN is assigned so that the visibility check
+ * can decide whether the tuple is in the snapshot. See also the comments in
+ * CSNSnapshotPrecommit().
+ */
+ if (XidCSNIsInDoubt(xid_csn))
+ {
+ XactLockTableWait(xid, NULL, NULL, XLTW_None);
+ xid_csn = CSNLogGetCSNByXid(xid);
+ Assert(XidCSNIsNormal(xid_csn) ||
+ XidCSNIsAborted(xid_csn));
+ }
+
+ Assert(XidCSNIsNormal(xid_csn) ||
+ XidCSNIsInProgress(xid_csn) ||
+ XidCSNIsAborted(xid_csn));
+
+ return xid_csn;
+}
+
+/*
+ * XidInvisibleInCSNSnapshot
+ *
+ * Version of XidInMVCCSnapshot for transactions. For non-imported
+ * csn snapshots this should give the same results as XidInLocalMVCCSnapshot
+ * (except that aborts are shown as invisible without going to clog). To
+ * ensure this behaviour, XidInMVCCSnapshot contains asserts that check
+ * that XidInvisibleInCSNSnapshot and XidInLocalMVCCSnapshot agree in the
+ * case of an ordinary snapshot.
+ */
+bool
+XidInvisibleInCSNSnapshot(TransactionId xid, Snapshot snapshot)
+{
+ XidCSN csn;
+
+ Assert(enable_csn_snapshot);
+
+ csn = TransactionIdGetXidCSN(xid);
+
+ if (XidCSNIsNormal(csn))
+ {
+ if (csn < snapshot->snapshot_csn)
+ return false;
+ else
+ return true;
+ }
+ else if (XidCSNIsFrozen(csn))
+ {
+ /* It is bootstrap or frozen transaction */
+ return false;
+ }
+ else
+ {
+ /* It is aborted or in-progress */
+ Assert(XidCSNIsAborted(csn) || XidCSNIsInProgress(csn));
+ if (XidCSNIsAborted(csn))
+ Assert(TransactionIdDidAbort(xid));
+ return true;
+ }
+}
+
+
+/*****************************************************************************
+ * Functions to handle transaction commit.
+ *
+ * For local transactions CSNSnapshotPrecommit sets the InDoubt state before
+ * ProcArrayEndTransaction is called and transaction data potentially becomes
+ * visible to other backends. ProcArrayEndTransaction (or ProcArrayRemove in
+ * the twophase case) then acquires xid_csn under ProcArray lock and stores it
+ * in proc->assignedXidCsn. It's important that xid_csn for commit is
+ * generated under ProcArray lock, otherwise snapshots won't
+ * be equivalent. A subsequent call to CSNSnapshotCommit will write
+ * proc->assignedXidCsn to CSNLog.
+ *
+ *
+ * CSNSnapshotAbort is slightly different from commit because abort
+ * can skip the InDoubt phase and can be called for a transaction subtree.
+ *****************************************************************************/
+
+
+/*
+ * CSNSnapshotAbort
+ *
+ * Abort the transaction in CsnLog. We can skip the InDoubt state for aborts
+ * since no concurrent transaction is allowed to see aborted data anyway.
+ */
+void
+CSNSnapshotAbort(PGPROC *proc, TransactionId xid,
+ int nsubxids, TransactionId *subxids)
+{
+ if (!enable_csn_snapshot)
+ return;
+
+ CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN);
+
+ /*
+ * Clean assignedXidCsn anyway, as it was possibly set in
+ * XidSnapshotAssignCsnCurrent.
+ */
+ pg_atomic_write_u64(&proc->assignedXidCsn, InProgressXidCSN);
+}
+
+/*
+ * CSNSnapshotPrecommit
+ *
+ * Set the InDoubt status for a local transaction that we are going to commit.
+ * This step is needed to achieve consistency between local snapshots and
+ * csn-based snapshots. We don't hold ProcArray lock while writing the
+ * csn for the transaction to the SLRU; instead we set the InDoubt status
+ * before the transaction is deleted from ProcArray, so that readers who read
+ * the csn in the gap between ProcArray removal and XidCSN assignment can wait
+ * until the XidCSN is finally assigned. See also TransactionIdGetXidCSN().
+ *
+ * This should be called only from the parallel group leader, before the
+ * backend is deleted from ProcArray.
+ */
+void
+CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid,
+ int nsubxids, TransactionId *subxids)
+{
+ XidCSN oldassignedXidCsn = InProgressXidCSN;
+ bool in_progress;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ /* Set InDoubt status if it is a local transaction */
+ in_progress = pg_atomic_compare_exchange_u64(&proc->assignedXidCsn,
+ &oldassignedXidCsn,
+ InDoubtXidCSN);
+ if (in_progress)
+ {
+ Assert(XidCSNIsInProgress(oldassignedXidCsn));
+ CSNLogSetCSN(xid, nsubxids,
+ subxids, InDoubtXidCSN);
+ }
+ else
+ {
+ /* Otherwise we should have valid XidCSN by this time */
+ Assert(XidCSNIsNormal(oldassignedXidCsn));
+ Assert(XidCSNIsInDoubt(CSNLogGetCSNByXid(xid)));
+ }
+}
+
+/*
+ * CSNSnapshotCommit
+ *
+ * Write the XidCSN that was acquired earlier to CsnLog. Should be
+ * preceded by CSNSnapshotPrecommit() so readers can wait until we have
+ * finished writing to the SLRU.
+ *
+ * Should be called after ProcArrayEndTransaction, but before releasing
+ * transaction locks, so that TransactionIdGetXidCSN can wait on this
+ * lock for XidCSN.
+ */
+void
+CSNSnapshotCommit(PGPROC *proc, TransactionId xid,
+ int nsubxids, TransactionId *subxids)
+{
+ volatile XidCSN assigned_xid_csn;
+
+ if (!enable_csn_snapshot)
+ return;
+
+ if (!TransactionIdIsValid(xid))
+ {
+ assigned_xid_csn = pg_atomic_read_u64(&proc->assignedXidCsn);
+ Assert(XidCSNIsInProgress(assigned_xid_csn));
+ return;
+ }
+
+ /* Finally write the resulting XidCSN to the SLRU */
+ assigned_xid_csn = pg_atomic_read_u64(&proc->assignedXidCsn);
+ Assert(XidCSNIsNormal(assigned_xid_csn));
+ CSNLogSetCSN(xid, nsubxids,
+ subxids, assigned_xid_csn);
+
+ /* Reset for next transaction */
+ pg_atomic_write_u64(&proc->assignedXidCsn, InProgressXidCSN);
+}
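
To make the three-step commit handshake above easier to follow, here is a toy model of it. It is only a sketch under stated assumptions: ToyXact and the toy_* functions are hypothetical names, C11 atomics stand in for pg_atomic_*, a plain field stands in for the CSNLog entry, and no locking or waiting is modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Special values, mirroring the ones in csn_snapshot.h. */
#define TOY_IN_PROGRESS UINT64_C(0)
#define TOY_IN_DOUBT    UINT64_C(3)

/* Hypothetical stand-in for one backend's commit state. */
typedef struct
{
	_Atomic uint64_t assigned_csn;	/* plays the role of proc->assignedXidCsn */
	uint64_t	log_csn;			/* plays the role of the CSNLog entry */
} ToyXact;

static void
toy_init(ToyXact *x)
{
	atomic_init(&x->assigned_csn, TOY_IN_PROGRESS);
	x->log_csn = TOY_IN_PROGRESS;
}

/* Step 1 (like CSNSnapshotPrecommit): mark InDoubt before becoming visible. */
static bool
toy_precommit(ToyXact *x)
{
	uint64_t	expected = TOY_IN_PROGRESS;

	if (!atomic_compare_exchange_strong(&x->assigned_csn, &expected,
										TOY_IN_DOUBT))
		return false;
	x->log_csn = TOY_IN_DOUBT;
	return true;
}

/* Step 2 (like ProcArrayEndTransaction): pick the commit CSN "under lock". */
static void
toy_assign(ToyXact *x, uint64_t csn)
{
	if (atomic_load(&x->assigned_csn) == TOY_IN_DOUBT)
		atomic_store(&x->assigned_csn, csn);
}

/* Step 3 (like CSNSnapshotCommit): publish to the log, reset for next xact. */
static void
toy_commit(ToyXact *x)
{
	x->log_csn = atomic_load(&x->assigned_csn);
	atomic_store(&x->assigned_csn, TOY_IN_PROGRESS);
}
```

Between step 1 and step 3 the log entry reads InDoubt, which is the window in which TransactionIdGetXidCSN has to wait on the transaction lock.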
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9b2e59bf0e..57bda5d422 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,8 @@
#include <unistd.h>
#include "access/commit_ts.h"
+#include "access/csn_snapshot.h"
+#include "access/csn_log.h"
#include "access/htup_details.h"
#include "access/subtrans.h"
#include "access/transam.h"
@@ -1479,8 +1481,34 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nabortrels, abortrels,
gid);
+ /*
+ * CSNSnapshot callbacks that should be called right before we
+ * become visible. See the comments on these functions for details.
+ */
+ if (isCommit)
+ CSNSnapshotPrecommit(proc, xid, hdr->nsubxacts, children);
+ else
+ CSNSnapshotAbort(proc, xid, hdr->nsubxacts, children);
+
+
ProcArrayRemove(proc, latestXid);
+ /*
+ * Stamp our transaction with XidCSN in CSNLog.
+ * Should be called after ProcArrayRemove, but before releasing
+ * transaction locks, since TransactionIdGetXidCSN relies on
+ * XactLockTableWait to await xid_csn.
+ */
+ if (isCommit)
+ {
+ CSNSnapshotCommit(proc, xid, hdr->nsubxacts, children);
+ }
+ else
+ {
+ Assert(XidCSNIsInProgress(
+ pg_atomic_read_u64(&proc->assignedXidCsn)));
+ }
+
/*
* In case we fail while running the callbacks, mark the gxact invalid so
* no one else will try to commit/rollback, and so it will be recycled if
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index e14b53bf9e..b045ed09f3 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -15,6 +15,7 @@
#include "access/clog.h"
#include "access/commit_ts.h"
+#include "access/csn_log.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
@@ -173,6 +174,7 @@ GetNewTransactionId(bool isSubXact)
* Extend pg_subtrans and pg_commit_ts too.
*/
ExtendCLOG(xid);
+ ExtendCSNLog(xid);
ExtendCommitTs(xid);
ExtendSUBTRANS(xid);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b3ee7fa7ea..9321634d60 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
#include <unistd.h>
#include "access/commit_ts.h"
+#include "access/csn_snapshot.h"
#include "access/multixact.h"
#include "access/parallel.h"
#include "access/subtrans.h"
@@ -1435,6 +1436,14 @@ RecordTransactionCommit(void)
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
+
+ /*
+ * Mark our transaction as InDoubt in CsnLog and get ready for
+ * commit.
+ */
+ if (markXidCommitted)
+ CSNSnapshotPrecommit(MyProc, xid, nchildren, children);
+
cleanup:
/* Clean up local data */
if (rels)
@@ -1696,6 +1705,11 @@ RecordTransactionAbort(bool isSubXact)
*/
TransactionIdAbortTree(xid, nchildren, children);
+ /*
+ * Mark our transaction as Aborted in CsnLog.
+ */
+ CSNSnapshotAbort(MyProc, xid, nchildren, children);
+
END_CRIT_SECTION();
/* Compute latestXid while we have the child XIDs handy */
@@ -2185,6 +2199,21 @@ CommitTransaction(void)
*/
ProcArrayEndTransaction(MyProc, latestXid);
+ /*
+ * Stamp our transaction with XidCSN in CsnLog.
+ * Should be called after ProcArrayEndTransaction, but before releasing
+ * transaction locks.
+ */
+ if (!is_parallel_worker)
+ {
+ TransactionId xid = GetTopTransactionIdIfAny();
+ TransactionId *subxids;
+ int nsubxids;
+
+ nsubxids = xactGetCommittedChildren(&subxids);
+ CSNSnapshotCommit(MyProc, xid, nsubxids, subxids);
+ }
+
/*
* This is all post-commit cleanup. Note that if an error is raised here,
* it's too late to abort the transaction. This should be just
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0a97b1d37f..8f21e09a03 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
#include "access/clog.h"
#include "access/commit_ts.h"
+#include "access/csn_log.h"
#include "access/heaptoast.h"
#include "access/multixact.h"
#include "access/rewriteheap.h"
@@ -5345,6 +5346,7 @@ BootStrapXLOG(void)
/* Bootstrap the commit log, too */
BootStrapCLOG();
+ BootStrapCSNLog();
BootStrapCommitTs();
BootStrapSUBTRANS();
BootStrapMultiXact();
@@ -7062,6 +7064,7 @@ StartupXLOG(void)
* maintained during recovery and need not be started yet.
*/
StartupCLOG();
+ StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
/*
@@ -7879,6 +7882,7 @@ StartupXLOG(void)
if (standbyState == STANDBY_DISABLED)
{
StartupCLOG();
+ StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
}
@@ -8527,6 +8531,7 @@ ShutdownXLOG(int code, Datum arg)
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
}
ShutdownCLOG();
+ ShutdownCSNLog();
ShutdownCommitTs();
ShutdownSUBTRANS();
ShutdownMultiXact();
@@ -9099,7 +9104,10 @@ CreateCheckPoint(int flags)
* StartupSUBTRANS hasn't been called yet.
*/
if (!RecoveryInProgress())
+ {
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ TruncateCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ }
/* Real work is done, but log and update stats before releasing lock. */
LogCheckpointEnd(false);
@@ -9175,6 +9183,7 @@ static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointCLOG();
+ CheckPointCSNLog();
CheckPointCommitTs();
CheckPointSUBTRANS();
CheckPointMultiXact();
@@ -9459,7 +9468,10 @@ CreateRestartPoint(int flags)
* this because StartupSUBTRANS hasn't been called yet.
*/
if (EnableHotStandby)
+ {
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ TruncateCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+ }
/* Real work is done, but log and update before releasing lock. */
LogCheckpointEnd(true);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..7122babfd6 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,11 +16,13 @@
#include "access/clog.h"
#include "access/commit_ts.h"
+#include "access/csn_log.h"
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/subtrans.h"
#include "access/twophase.h"
+#include "access/csn_snapshot.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -125,6 +127,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, ProcGlobalShmemSize());
size = add_size(size, XLOGShmemSize());
size = add_size(size, CLOGShmemSize());
+ size = add_size(size, CSNLogShmemSize());
size = add_size(size, CommitTsShmemSize());
size = add_size(size, SUBTRANSShmemSize());
size = add_size(size, TwoPhaseShmemSize());
@@ -143,6 +146,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
+ size = add_size(size, CSNSnapshotShmemSize());
size = add_size(size, SnapMgrShmemSize());
size = add_size(size, BTreeShmemSize());
size = add_size(size, SyncScanShmemSize());
@@ -213,6 +217,7 @@ CreateSharedMemoryAndSemaphores(void)
*/
XLOGShmemInit();
CLOGShmemInit();
+ CSNLogShmemInit();
CommitTsShmemInit();
SUBTRANSShmemInit();
MultiXactShmemInit();
@@ -264,6 +269,7 @@ CreateSharedMemoryAndSemaphores(void)
SyncScanShmemInit();
AsyncShmemInit();
+ CSNSnapshotShmemInit();
#ifdef EXEC_BACKEND
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index b448533564..d715750437 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -46,6 +46,8 @@
#include <signal.h>
#include "access/clog.h"
+#include "access/csn_log.h"
+#include "access/csn_snapshot.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/twophase.h"
@@ -352,6 +354,14 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
latestXid))
ShmemVariableCache->latestCompletedXid = latestXid;
+
+ /*
+ * Assign xid csn while holding ProcArrayLock for
+ * COMMIT PREPARED. After the lock is released, a subsequent
+ * CSNSnapshotCommit() will write this value to CsnLog.
+ */
+ if (XidCSNIsInDoubt(pg_atomic_read_u64(&proc->assignedXidCsn)))
+ pg_atomic_write_u64(&proc->assignedXidCsn, GenerateCSN(false));
}
else
{
@@ -467,6 +477,16 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
latestXid))
ShmemVariableCache->latestCompletedXid = latestXid;
+
+ /*
+ * Assign xid csn while holding ProcArrayLock for
+ * COMMIT.
+ *
+ * TODO: in case of group commit we can generate one CSNSnapshot for
+ * the whole group to save time on timestamp acquisition.
+ */
+ if (XidCSNIsInDoubt(pg_atomic_read_u64(&proc->assignedXidCsn)))
+ pg_atomic_write_u64(&proc->assignedXidCsn, GenerateCSN(false));
}
/*
@@ -833,6 +853,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
while (TransactionIdPrecedes(latestObservedXid, running->nextXid))
{
ExtendSUBTRANS(latestObservedXid);
+ ExtendCSNLog(latestObservedXid);
TransactionIdAdvance(latestObservedXid);
}
TransactionIdRetreat(latestObservedXid); /* = running->nextXid - 1 */
@@ -1511,6 +1532,7 @@ GetSnapshotData(Snapshot snapshot)
int count = 0;
int subcount = 0;
bool suboverflowed = false;
+ XidCSN xid_csn = FrozenXidCSN;
TransactionId replication_slot_xmin = InvalidTransactionId;
TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
@@ -1708,6 +1730,13 @@ GetSnapshotData(Snapshot snapshot)
if (!TransactionIdIsValid(MyPgXact->xmin))
MyPgXact->xmin = TransactionXmin = xmin;
+ /*
+ * Take XidCSN under ProcArrayLock so the snapshot stays
+ * synchronized.
+ */
+ if (enable_csn_snapshot)
+ xid_csn = GenerateCSN(false);
+
LWLockRelease(ProcArrayLock);
/*
@@ -1778,6 +1807,8 @@ GetSnapshotData(Snapshot snapshot)
MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
}
+ snapshot->snapshot_csn = xid_csn;
+
return snapshot;
}
@@ -3335,6 +3366,7 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
while (TransactionIdPrecedes(next_expected_xid, xid))
{
TransactionIdAdvance(next_expected_xid);
+ ExtendCSNLog(next_expected_xid);
ExtendSUBTRANS(next_expected_xid);
}
Assert(next_expected_xid == xid);
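
The GenerateCSN() calls added in procarray.c depend on the value never going backwards, even if the clock does. A minimal single-threaded sketch of that clamp is below. next_csn is a hypothetical helper; the clock reading is passed in as a parameter so the behaviour can be checked, and the spinlock that the patch takes around this step is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return a CSN that is strictly greater than every CSN returned before.
 * now_ns is the current clock reading in nanoseconds; *last_max remembers
 * the largest value handed out so far (csnState->last_max_csn in the patch).
 */
static uint64_t
next_csn(uint64_t now_ns, uint64_t *last_max)
{
	uint64_t	csn = now_ns;

	if (csn <= *last_max)
		csn = ++(*last_max);	/* clock stalled or went back: bump by one */
	else
		*last_max = csn;		/* clock advanced: adopt the new reading */

	return csn;
}
```

If the clock repeats a value or steps back, the result simply counts up from the last value until real time catches up again.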
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..77b8426e71 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -134,6 +134,8 @@ static const char *const BuiltinTrancheNames[] = {
"CommitTSBuffer",
/* LWTRANCHE_SUBTRANS_BUFFER: */
"SubtransBuffer",
+ /* LWTRANCHE_CSN_LOG_BUFFERS: */
+ "CsnLogBuffer",
/* LWTRANCHE_MULTIXACTOFFSET_BUFFER: */
"MultiXactOffsetBuffer",
/* LWTRANCHE_MULTIXACTMEMBER_BUFFER: */
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6985e8eed..3c95ce4aac 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ MultiXactTruncationLock 41
OldSnapshotTimeMapLock 42
LogicalRepWorkerLock 43
XactTruncationLock 44
+CSNLogControlLock 45
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e57fcd2538..a6b8625ce5 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -37,6 +37,7 @@
#include "access/transam.h"
#include "access/twophase.h"
+#include "access/csn_snapshot.h"
#include "access/xact.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -441,6 +442,8 @@ InitProcess(void)
MyProc->clogGroupMemberLsn = InvalidXLogRecPtr;
Assert(pg_atomic_read_u32(&MyProc->clogGroupNext) == INVALID_PGPROCNO);
+ pg_atomic_init_u64(&MyProc->assignedXidCsn, InProgressXidCSN);
+
/*
* Acquire ownership of the PGPROC's latch, so that we can use WaitLatch
* on it. That allows us to repoint the process latch, which so far
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 031ca0327f..1e9bcc7aee 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/csn_snapshot.h"
#include "access/rmgr.h"
#include "access/tableam.h"
#include "access/transam.h"
@@ -1153,6 +1154,15 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+ {
+ {"enable_csn_snapshot", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Enable CSN-based snapshots."),
+ gettext_noop("Used to achieve REPEATABLE READ isolation level for postgres_fdw transactions.")
+ },
+ &enable_csn_snapshot,
+ true, /* XXX: set true to simplify testing. XXX2: Seems that RESOURCES_MEM isn't the best category */
+ NULL, NULL, NULL
+ },
{
{"ssl", PGC_SIGHUP, CONN_AUTH_SSL,
gettext_noop("Enables SSL connections."),
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index a0b0458108..679c531622 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -77,6 +77,8 @@ provider postgresql {
probe clog__checkpoint__done(bool);
probe subtrans__checkpoint__start(bool);
probe subtrans__checkpoint__done(bool);
+ probe csnlog__checkpoint__start(bool);
+ probe csnlog__checkpoint__done(bool);
probe multixact__checkpoint__start(bool);
probe multixact__checkpoint__done(bool);
probe twophase__checkpoint__start();
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 6b6c8571e2..45fe574620 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -229,6 +229,7 @@ static TimestampTz AlignTimestampToMinuteBoundary(TimestampTz ts);
static Snapshot CopySnapshot(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
+static bool XidInLocalMVCCSnapshot(TransactionId xid, Snapshot snapshot);
/*
* Snapshot fields to be serialized.
@@ -247,6 +248,7 @@ typedef struct SerializedSnapshotData
CommandId curcid;
TimestampTz whenTaken;
XLogRecPtr lsn;
+ XidCSN xid_csn;
} SerializedSnapshotData;
Size
@@ -2115,6 +2117,7 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
serialized_snapshot.curcid = snapshot->curcid;
serialized_snapshot.whenTaken = snapshot->whenTaken;
serialized_snapshot.lsn = snapshot->lsn;
+ serialized_snapshot.xid_csn = snapshot->snapshot_csn;
/*
* Ignore the SubXID array if it has overflowed, unless the snapshot was
@@ -2189,6 +2192,7 @@ RestoreSnapshot(char *start_address)
snapshot->curcid = serialized_snapshot.curcid;
snapshot->whenTaken = serialized_snapshot.whenTaken;
snapshot->lsn = serialized_snapshot.lsn;
+ snapshot->snapshot_csn = serialized_snapshot.xid_csn;
/* Copy XIDs, if present. */
if (serialized_snapshot.xcnt > 0)
@@ -2229,6 +2233,47 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
/*
* XidInMVCCSnapshot
+ *
+ * Check whether this xid is in the snapshot. When enable_csn_snapshot is
+ * switched off, just call XidInLocalMVCCSnapshot().
+ */
+bool
+XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
+{
+ bool in_snapshot;
+
+ in_snapshot = XidInLocalMVCCSnapshot(xid, snapshot);
+
+ if (!enable_csn_snapshot)
+ {
+ Assert(XidCSNIsFrozen(snapshot->snapshot_csn));
+ return in_snapshot;
+ }
+
+ if (in_snapshot)
+ {
+ /*
+ * This xid may already be in an unknown (InDoubt) state; in that case
+ * we must wait and recheck.
+ */
+ return XidInvisibleInCSNSnapshot(xid, snapshot);
+ }
+ else
+ {
+#ifdef USE_ASSERT_CHECKING
+ /* Check that the csn snapshot gives the same results as the local one */
+ if (XidInvisibleInCSNSnapshot(xid, snapshot))
+ {
+ XidCSN gcsn = TransactionIdGetXidCSN(xid);
+ Assert(XidCSNIsAborted(gcsn));
+ }
+#endif
+ return false;
+ }
+}
+
+/*
+ * XidInLocalMVCCSnapshot
* Is the given XID still-in-progress according to the snapshot?
*
* Note: GetSnapshotData never stores either top xid or subxids of our own
@@ -2237,8 +2282,8 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
* TransactionIdIsCurrentTransactionId first, except when it's known the
* XID could not be ours anyway.
*/
-bool
-XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
+static bool
+XidInLocalMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
uint32 i;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 786672b1b6..a52c01889d 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -220,7 +220,8 @@ static const char *const subdirs[] = {
"pg_xact",
"pg_logical",
"pg_logical/snapshots",
- "pg_logical/mappings"
+ "pg_logical/mappings",
+ "pg_csn"
};
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
new file mode 100644
index 0000000000..9b9611127d
--- /dev/null
+++ b/src/include/access/csn_log.h
@@ -0,0 +1,30 @@
+/*
+ * csn_log.h
+ *
+ * Commit-Sequence-Number log.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_log.h
+ */
+#ifndef CSNLOG_H
+#define CSNLOG_H
+
+#include "access/xlog.h"
+#include "utils/snapshot.h"
+
+extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn);
+extern XidCSN CSNLogGetCSNByXid(TransactionId xid);
+
+extern Size CSNLogShmemSize(void);
+extern void CSNLogShmemInit(void);
+extern void BootStrapCSNLog(void);
+extern void StartupCSNLog(TransactionId oldestActiveXID);
+extern void ShutdownCSNLog(void);
+extern void CheckPointCSNLog(void);
+extern void ExtendCSNLog(TransactionId newestXact);
+extern void TruncateCSNLog(TransactionId oldestXact);
+
+#endif /* CSNLOG_H */
\ No newline at end of file
diff --git a/src/include/access/csn_snapshot.h b/src/include/access/csn_snapshot.h
new file mode 100644
index 0000000000..1894586204
--- /dev/null
+++ b/src/include/access/csn_snapshot.h
@@ -0,0 +1,58 @@
+/*-------------------------------------------------------------------------
+ *
+ * csn_snapshot.h
+ * Support for cross-node snapshot isolation.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/csn_snapshot.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CSN_SNAPSHOT_H
+#define CSN_SNAPSHOT_H
+
+#include "port/atomics.h"
+#include "storage/lock.h"
+#include "utils/snapshot.h"
+#include "utils/guc.h"
+
+/*
+ * snapshot.h is used in frontend code, so the atomic variant of the
+ * SnapshotCSN type is defined here.
+ */
+typedef pg_atomic_uint64 CSN_atomic;
+
+#define InProgressXidCSN UINT64CONST(0x0)
+#define AbortedXidCSN UINT64CONST(0x1)
+#define FrozenXidCSN UINT64CONST(0x2)
+#define InDoubtXidCSN UINT64CONST(0x3)
+#define FirstNormalXidCSN UINT64CONST(0x4)
+
+#define XidCSNIsInProgress(csn) ((csn) == InProgressXidCSN)
+#define XidCSNIsAborted(csn) ((csn) == AbortedXidCSN)
+#define XidCSNIsFrozen(csn) ((csn) == FrozenXidCSN)
+#define XidCSNIsInDoubt(csn) ((csn) == InDoubtXidCSN)
+#define XidCSNIsNormal(csn) ((csn) >= FirstNormalXidCSN)
+
+
+
+
+extern Size CSNSnapshotShmemSize(void);
+extern void CSNSnapshotShmemInit(void);
+
+extern SnapshotCSN GenerateCSN(bool locked);
+
+extern bool XidInvisibleInCSNSnapshot(TransactionId xid, Snapshot snapshot);
+
+extern XidCSN TransactionIdGetXidCSN(TransactionId xid);
+
+extern void CSNSnapshotAbort(PGPROC *proc, TransactionId xid, int nsubxids,
+ TransactionId *subxids);
+extern void CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid, int nsubxids,
+ TransactionId *subxids);
+extern void CSNSnapshotCommit(PGPROC *proc, TransactionId xid, int nsubxids,
+ TransactionId *subxids);
+
+#endif /* CSN_SNAPSHOT_H */
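
The special values in this header fully determine the visibility decision, which can be sketched as one small function. This is an illustration rather than the patch's XidInvisibleInCSNSnapshot: the TOY_* names and toy_invisible are hypothetical, and the InDoubt case is assumed to have been resolved by waiting before the function is reached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same encoding as the XidCSN constants in csn_snapshot.h. */
#define TOY_IN_PROGRESS  UINT64_C(0)
#define TOY_ABORTED      UINT64_C(1)
#define TOY_FROZEN       UINT64_C(2)
#define TOY_IN_DOUBT     UINT64_C(3)
#define TOY_FIRST_NORMAL UINT64_C(4)

/*
 * Return true if a transaction with the given CSN must be treated as
 * invisible to a snapshot taken at snapshot_csn.
 */
static bool
toy_invisible(uint64_t xid_csn, uint64_t snapshot_csn)
{
	if (xid_csn >= TOY_FIRST_NORMAL)
		return xid_csn >= snapshot_csn; /* committed at or after the snapshot */
	if (xid_csn == TOY_FROZEN)
		return false;					/* frozen or bootstrap: always visible */
	return true;						/* aborted or still in progress */
}
```

A transaction is visible only if its commit CSN is strictly less than the snapshot's CSN, so a commit stamped with the same value as the snapshot stays invisible.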
diff --git a/src/include/datatype/timestamp.h b/src/include/datatype/timestamp.h
index 6be6d35d1e..583b1beea5 100644
--- a/src/include/datatype/timestamp.h
+++ b/src/include/datatype/timestamp.h
@@ -93,6 +93,9 @@ typedef struct
#define USECS_PER_MINUTE INT64CONST(60000000)
#define USECS_PER_SEC INT64CONST(1000000)
+#define NSECS_PER_SEC INT64CONST(1000000000)
+#define NSECS_PER_USEC INT64CONST(1000)
+
/*
* We allow numeric timezone offsets up to 15:59:59 either way from Greenwich.
* Currently, the record holders for wackiest offsets in actual use are zones
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index f25068fae2..6c3f2c7655 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -280,6 +280,7 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
#define PG_GETARG_FLOAT4(n) DatumGetFloat4(PG_GETARG_DATUM(n))
#define PG_GETARG_FLOAT8(n) DatumGetFloat8(PG_GETARG_DATUM(n))
#define PG_GETARG_INT64(n) DatumGetInt64(PG_GETARG_DATUM(n))
+#define PG_GETARG_UINT64(n) DatumGetUInt64(PG_GETARG_DATUM(n))
/* use this if you want the raw, possibly-toasted input datum: */
#define PG_GETARG_RAW_VARLENA_P(n) ((struct varlena *) PG_GETARG_POINTER(n))
/* use this if you want the input datum de-toasted: */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327cc..4ac23da654 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -141,6 +141,9 @@ typedef struct timespec instr_time;
#define INSTR_TIME_GET_MICROSEC(t) \
(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (((uint64) (t).tv_sec * (uint64) 1000000000) + (uint64) ((t).tv_nsec))
+
#else /* !HAVE_CLOCK_GETTIME */
/* Use gettimeofday() */
@@ -205,6 +208,10 @@ typedef struct timeval instr_time;
#define INSTR_TIME_GET_MICROSEC(t) \
(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) (t).tv_usec)
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (((uint64) (t).tv_sec * (uint64) 1000000000) + \
+ (uint64) (t).tv_usec * (uint64) 1000)
+
#endif /* HAVE_CLOCK_GETTIME */
#else /* WIN32 */
@@ -237,6 +244,9 @@ typedef LARGE_INTEGER instr_time;
#define INSTR_TIME_GET_MICROSEC(t) \
((uint64) (((double) (t).QuadPart * 1000000.0) / GetTimerFrequency()))
+#define INSTR_TIME_GET_NANOSEC(t) \
+ ((uint64) (((double) (t).QuadPart * 1000000000.0) / GetTimerFrequency()))
+
static inline double
GetTimerFrequency(void)
{
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d..6188691fb2 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -196,6 +196,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_XACT_BUFFER = NUM_INDIVIDUAL_LWLOCKS,
LWTRANCHE_COMMITTS_BUFFER,
LWTRANCHE_SUBTRANS_BUFFER,
+ LWTRANCHE_CSN_LOG_BUFFERS,
LWTRANCHE_MULTIXACTOFFSET_BUFFER,
LWTRANCHE_MULTIXACTMEMBER_BUFFER,
LWTRANCHE_NOTIFY_BUFFER,
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index b20e2ad4f6..3ff7ea4fce 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -15,8 +15,10 @@
#define _PROC_H_
#include "access/clog.h"
+#include "access/csn_snapshot.h"
#include "access/xlogdefs.h"
#include "lib/ilist.h"
+#include "utils/snapshot.h"
#include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
@@ -210,6 +212,16 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * assignedXidCsn holds the XidCSN for this transaction. It is generated
+ * under the ProcArray lock and later written to the CSNLog. This
+ * variable is atomic only for the sake of group commit; in all other
+ * scenarios only the backend that owns this proc entry works with
+ * this variable.
+ */
+ CSN_atomic assignedXidCsn;
+
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 4796edb63a..9f622c76a7 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -121,6 +121,9 @@ typedef enum SnapshotType
typedef struct SnapshotData *Snapshot;
#define InvalidSnapshot ((Snapshot) NULL)
+typedef uint64 XidCSN;
+typedef uint64 SnapshotCSN;
+extern bool enable_csn_snapshot;
/*
* Struct representing all kind of possible snapshots.
@@ -201,6 +204,12 @@ typedef struct SnapshotData
TimestampTz whenTaken; /* timestamp when snapshot was taken */
XLogRecPtr lsn; /* position in the WAL stream when taken */
+
+ /*
+ * SnapshotCSN for snapshot isolation support.
+ * Will be used only if enable_csn_snapshot is enabled.
+ */
+ SnapshotCSN snapshot_csn;
} SnapshotData;
#endif /* SNAPSHOT_H */
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 06c4c3e476..da2e5aa38b 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -73,6 +73,7 @@ select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
enable_bitmapscan | on
+ enable_csn_snapshot | on
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
@@ -90,7 +91,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(18 rows)
+(19 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
Attachment: 0002-Wal-for-csn.patch (application/octet-stream)
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index f88d72fd86..15fc36f7b4 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -11,6 +11,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
brindesc.o \
clogdesc.o \
+ csnlogdesc.o \
committsdesc.o \
dbasedesc.o \
genericdesc.o \
diff --git a/src/backend/access/rmgrdesc/csnlogdesc.c b/src/backend/access/rmgrdesc/csnlogdesc.c
new file mode 100644
index 0000000000..e96b056325
--- /dev/null
+++ b/src/backend/access/rmgrdesc/csnlogdesc.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * csnlogdesc.c
+ * rmgr descriptor routines for access/transam/csn_log.c
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/csnlogdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/csn_log.h"
+
+
+void
+csnlog_desc(StringInfo buf, XLogReaderState *record)
+{
+ char *rec = XLogRecGetData(record);
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CSN_ZEROPAGE)
+ {
+ int pageno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ appendStringInfo(buf, "pageno %d", pageno);
+ }
+ else if (info == XLOG_CSN_TRUNCATE)
+ {
+ int pageno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ appendStringInfo(buf, "pageno %d", pageno);
+ }
+ else if (info == XLOG_CSN_ASSIGNMENT)
+ {
+ XidCSN csn;
+
+ memcpy(&csn, XLogRecGetData(record), sizeof(XidCSN));
+ appendStringInfo(buf, "assign " UINT64_FORMAT, csn);
+ }
+ else if (info == XLOG_CSN_SETXIDCSN)
+ {
+ xl_xidcsn_set *xlrec = (xl_xidcsn_set *) rec;
+ int nsubxids;
+
+ appendStringInfo(buf, "set " UINT64_FORMAT " for: %u",
+ xlrec->xidcsn,
+ xlrec->xtop);
+ nsubxids = ((XLogRecGetDataLen(record) - MinSizeOfXidCSNSet) /
+ sizeof(TransactionId));
+ if (nsubxids > 0)
+ {
+ int i;
+ TransactionId *subxids;
+
+ subxids = palloc(sizeof(TransactionId) * nsubxids);
+ memcpy(subxids,
+ XLogRecGetData(record) + MinSizeOfXidCSNSet,
+ sizeof(TransactionId) * nsubxids);
+ for (i = 0; i < nsubxids; i++)
+ appendStringInfo(buf, ", %u", subxids[i]);
+ pfree(subxids);
+ }
+ }
+}
+
+const char *
+csnlog_identify(uint8 info)
+{
+ const char *id = NULL;
+
+ switch (info & ~XLR_INFO_MASK)
+ {
+ case XLOG_CSN_ASSIGNMENT:
+ id = "ASSIGNMENT";
+ break;
+ case XLOG_CSN_SETXIDCSN:
+ id = "SETXIDCSN";
+ break;
+ case XLOG_CSN_ZEROPAGE:
+ id = "ZEROPAGE";
+ break;
+ case XLOG_CSN_TRUNCATE:
+ id = "TRUNCATE";
+ break;
+ }
+
+ return id;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 1cd97852e8..44e2e8ecec 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -114,7 +114,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
"max_wal_senders=%d max_prepared_xacts=%d "
"max_locks_per_xact=%d wal_level=%s "
- "wal_log_hints=%s track_commit_timestamp=%s",
+ "wal_log_hints=%s track_commit_timestamp=%s "
+ "enable_csn_snapshot=%s",
xlrec.MaxConnections,
xlrec.max_worker_processes,
xlrec.max_wal_senders,
@@ -122,7 +123,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
xlrec.max_locks_per_xact,
wal_level_str,
xlrec.wal_log_hints ? "on" : "off",
- xlrec.track_commit_timestamp ? "on" : "off");
+ xlrec.track_commit_timestamp ? "on" : "off",
+ xlrec.enable_csn_snapshot ? "on" : "off");
}
else if (info == XLOG_FPW_CHANGE)
{
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
index 4e0b8d64e4..22a95cb5d3 100644
--- a/src/backend/access/transam/csn_log.c
+++ b/src/backend/access/transam/csn_log.c
@@ -9,6 +9,11 @@
* transactions. Because of same lifetime and persistancy requirements
* this module is quite similar to subtrans.c
*
+ * Switching a database from CSN-based snapshots to xid-based snapshots
+ * needs no special handling. Switching from xid-based to CSN-based
+ * snapshots requires choosing the xid at which CSN-based checks begin;
+ * it cannot be oldestActiveXID because of prepared transactions.
+ *
* Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -52,7 +57,8 @@ bool enable_csn_snapshot;
static SlruCtlData CSNLogCtlData;
#define CsnlogCtl (&CSNLogCtlData)
-static int ZeroCSNLogPage(int pageno);
+static int ZeroCSNLogPage(int pageno, bool write_xlog);
+static void ZeroTruncateCSNLogPage(int pageno, bool write_xlog);
static bool CSNLogPagePrecedes(int page1, int page2);
static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
TransactionId *subxids,
@@ -60,6 +66,11 @@ static void CSNLogSetPageStatus(TransactionId xid, int nsubxids,
static void CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn,
int slotno);
+static void WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn);
+static void WriteZeroCSNPageXlogRec(int pageno);
+static void WriteTruncateCSNXlogRec(int pageno);
+
/*
* CSNLogSetCSN
*
@@ -77,7 +88,7 @@ static void CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn,
*/
void
CSNLogSetCSN(TransactionId xid, int nsubxids,
- TransactionId *subxids, XidCSN csn)
+ TransactionId *subxids, XidCSN csn, bool write_xlog)
{
int pageno;
int i = 0;
@@ -89,6 +100,10 @@ CSNLogSetCSN(TransactionId xid, int nsubxids,
Assert(TransactionIdIsValid(xid));
pageno = TransactionIdToPage(xid); /* get page of parent */
+
+ if (write_xlog)
+ WriteXidCsnXlogRec(xid, nsubxids, subxids, csn);
+
for (;;)
{
int num_on_page = 0;
@@ -151,12 +166,12 @@ CSNLogSetPageStatus(TransactionId xid, int nsubxids,
static void
CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn, int slotno)
{
- int entryno = TransactionIdToPgIndex(xid);
- XidCSN *ptr;
+ int entryno = TransactionIdToPgIndex(xid);
+ XidCSN *ptr;
Assert(LWLockHeldByMe(CSNLogControlLock));
- ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
+ ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XidCSN));
*ptr = csn;
}
@@ -171,27 +186,21 @@ CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn, int slotno)
XidCSN
CSNLogGetCSNByXid(TransactionId xid)
{
- int pageno = TransactionIdToPage(xid);
- int entryno = TransactionIdToPgIndex(xid);
- int slotno;
- XidCSN *ptr;
- XidCSN xid_csn;
+ int pageno = TransactionIdToPage(xid);
+ int entryno = TransactionIdToPgIndex(xid);
+ int slotno;
+ XidCSN csn;
/* Callers of CSNLogGetCSNByXid() must check GUC params */
Assert(enable_csn_snapshot);
- /* Can't ask about stuff that might not be around anymore */
- Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
-
/* lock is acquired by SimpleLruReadPage_ReadOnly */
-
slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
- ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
- xid_csn = *ptr;
+ csn = *(XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XidCSN));
LWLockRelease(CSNLogControlLock);
- return xid_csn;
+ return csn;
}
/*
@@ -245,7 +254,7 @@ BootStrapCSNLog(void)
LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
/* Create and zero the first page of the commit log */
- slotno = ZeroCSNLogPage(0);
+ slotno = ZeroCSNLogPage(0, false);
/* Make sure it's written out */
SimpleLruWritePage(CsnlogCtl, slotno);
@@ -263,50 +272,20 @@ BootStrapCSNLog(void)
* Control lock must be held at entry, and will be held at exit.
*/
static int
-ZeroCSNLogPage(int pageno)
+ZeroCSNLogPage(int pageno, bool write_xlog)
{
Assert(LWLockHeldByMe(CSNLogControlLock));
+ if (write_xlog)
+ WriteZeroCSNPageXlogRec(pageno);
return SimpleLruZeroPage(CsnlogCtl, pageno);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend startup,
- * after StartupXLOG has initialized ShmemVariableCache->nextXid.
- *
- * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
- * if there are none.
- */
-void
-StartupCSNLog(TransactionId oldestActiveXID)
+static void
+ZeroTruncateCSNLogPage(int pageno, bool write_xlog)
{
- int startPage;
- int endPage;
-
- if (!enable_csn_snapshot)
- return;
-
- /*
- * Since we don't expect pg_csn to be valid across crashes, we
- * initialize the currently-active page(s) to zeroes during startup.
- * Whenever we advance into a new page, ExtendCSNLog will likewise
- * zero the new page without regard to whatever was previously on disk.
- */
- LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
-
- startPage = TransactionIdToPage(oldestActiveXID);
- endPage = TransactionIdToPage(XidFromFullTransactionId(ShmemVariableCache->nextFullXid));
-
- while (startPage != endPage)
- {
- (void) ZeroCSNLogPage(startPage);
- startPage++;
- /* must account for wraparound */
- if (startPage > TransactionIdToPage(MaxTransactionId))
- startPage = 0;
- }
- (void) ZeroCSNLogPage(startPage);
-
- LWLockRelease(CSNLogControlLock);
+ if (write_xlog)
+ WriteTruncateCSNXlogRec(pageno);
+ SimpleLruTruncate(CsnlogCtl, pageno);
}
/*
@@ -379,7 +358,7 @@ ExtendCSNLog(TransactionId newestXact)
LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
- ZeroCSNLogPage(pageno);
+ ZeroCSNLogPage(pageno, !InRecovery);
LWLockRelease(CSNLogControlLock);
}
@@ -410,7 +389,7 @@ TruncateCSNLog(TransactionId oldestXact)
TransactionIdRetreat(oldestXact);
cutoffPage = TransactionIdToPage(oldestXact);
- SimpleLruTruncate(CsnlogCtl, cutoffPage);
+ ZeroTruncateCSNLogPage(cutoffPage, true);
}
/*
@@ -436,3 +415,121 @@ CSNLogPagePrecedes(int page1, int page2)
return TransactionIdPrecedes(xid1, xid2);
}
+
+void
+WriteAssignCSNXlogRec(XidCSN xidcsn)
+{
+ XidCSN log_csn = 0;
+
+ if (xidcsn <= get_last_log_wal_csn())
+ {
+ /*
+ * A concurrent backend has already WAL-logged a CSN at least as large
+ * as this one, so there is no need to log this value again.
+ */
+ return;
+ }
+
+ /*
+ * Log a CSN that is CSN_ASSIGN_TIME_INTERVAL (5s) ahead of the generated
+ * one; see the comments on the CSN_ASSIGN_TIME_INTERVAL define.
+ */
+ log_csn = CSNAddByNanosec(xidcsn, CSN_ASSIGN_TIME_INTERVAL);
+ set_last_log_wal_csn(log_csn);
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&log_csn), sizeof(XidCSN));
+ XLogInsert(RM_CSNLOG_ID, XLOG_CSN_ASSIGNMENT);
+}
+
+static void
+WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
+ TransactionId *subxids, XidCSN csn)
+{
+ xl_xidcsn_set xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.xtop = xid;
+ xlrec.nsubxacts = nsubxids;
+ xlrec.xidcsn = csn;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, MinSizeOfXidCSNSet);
+ XLogRegisterData((char *) subxids, nsubxids * sizeof(TransactionId));
+ recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_SETXIDCSN);
+ XLogFlush(recptr);
+}
+
+/*
+ * Write a ZEROPAGE xlog record
+ */
+static void
+WriteZeroCSNPageXlogRec(int pageno)
+{
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&pageno), sizeof(int));
+ (void) XLogInsert(RM_CSNLOG_ID, XLOG_CSN_ZEROPAGE);
+}
+
+/*
+ * Write a TRUNCATE xlog record
+ */
+static void
+WriteTruncateCSNXlogRec(int pageno)
+{
+ XLogRecPtr recptr;
+ return; /* XXX: early return leaves the code below unreachable, so no TRUNCATE record is written */
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&pageno), sizeof(int));
+ recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_TRUNCATE);
+ XLogFlush(recptr);
+}
+
+
+void
+csnlog_redo(XLogReaderState *record)
+{
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ /* Backup blocks are not used in csnlog records */
+ Assert(!XLogRecHasAnyBlockRefs(record));
+
+ if (info == XLOG_CSN_ASSIGNMENT)
+ {
+ XidCSN csn;
+
+ memcpy(&csn, XLogRecGetData(record), sizeof(XidCSN));
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ set_last_max_csn(csn);
+ LWLockRelease(CSNLogControlLock);
+
+ }
+ else if (info == XLOG_CSN_SETXIDCSN)
+ {
+ xl_xidcsn_set *xlrec = (xl_xidcsn_set *) XLogRecGetData(record);
+ CSNLogSetCSN(xlrec->xtop, xlrec->nsubxacts, xlrec->xsub, xlrec->xidcsn, false);
+ }
+ else if (info == XLOG_CSN_ZEROPAGE)
+ {
+ int pageno;
+ int slotno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ slotno = ZeroCSNLogPage(pageno, false);
+ SimpleLruWritePage(CsnlogCtl, slotno);
+ LWLockRelease(CSNLogControlLock);
+ Assert(!CsnlogCtl->shared->page_dirty[slotno]);
+
+ }
+ else if (info == XLOG_CSN_TRUNCATE)
+ {
+ int pageno;
+
+ memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ CsnlogCtl->shared->latest_page_number = pageno;
+ ZeroTruncateCSNLogPage(pageno, false);
+ }
+ else
+ elog(PANIC, "csnlog_redo: unknown op code %u", info);
+}
diff --git a/src/backend/access/transam/csn_snapshot.c b/src/backend/access/transam/csn_snapshot.c
index bcc5bac757..99e4a2f1ed 100644
--- a/src/backend/access/transam/csn_snapshot.c
+++ b/src/backend/access/transam/csn_snapshot.c
@@ -31,6 +31,8 @@
/* Raise a warning if imported snapshot_csn exceeds ours by this value. */
#define SNAP_DESYNC_COMPLAIN (1*NSECS_PER_SEC) /* 1 second */
+TransactionId xmin_for_csn = InvalidTransactionId;
+
/*
* CSNSnapshotState
*
@@ -40,7 +42,9 @@
*/
typedef struct
{
- SnapshotCSN last_max_csn;
+ SnapshotCSN last_max_csn; /* largest CSN generated so far */
+ XidCSN last_csn_log_wal; /* last CSN written to WAL as an assignment record */
+ TransactionId xmin_for_csn; /* first xid checked by CSN after an xid->CSN snapshot switch */
volatile slock_t lock;
} CSNSnapshotState;
@@ -80,6 +84,7 @@ CSNSnapshotShmemInit()
if (!found)
{
csnState->last_max_csn = 0;
+ csnState->last_csn_log_wal = 0;
SpinLockInit(&csnState->lock);
}
}
@@ -119,6 +124,8 @@ GenerateCSN(bool locked)
if (!locked)
SpinLockRelease(&csnState->lock);
+ WriteAssignCSNXlogRec(csn);
+
return csn;
}
@@ -131,7 +138,7 @@ GenerateCSN(bool locked)
XidCSN
TransactionIdGetXidCSN(TransactionId xid)
{
- XidCSN xid_csn;
+ XidCSN xid_csn;
Assert(enable_csn_snapshot);
@@ -145,13 +152,35 @@ TransactionIdGetXidCSN(TransactionId xid)
Assert(false); /* Should not happend */
}
+ /*
+ * If we have just switched from xid-based to CSN-based snapshots, we need
+ * a starting xid for CSN-based checks, in case a prepared transaction
+ * holds back TransactionXmin but has no CSN.
+ */
+ if (InvalidTransactionId == xmin_for_csn)
+ {
+ SpinLockAcquire(&csnState->lock);
+ if (InvalidTransactionId != csnState->xmin_for_csn)
+ xmin_for_csn = csnState->xmin_for_csn;
+ else
+ xmin_for_csn = FrozenTransactionId;
+
+ SpinLockRelease(&csnState->lock);
+ }
+
+ if (FrozenTransactionId == xmin_for_csn ||
+ TransactionIdPrecedes(xmin_for_csn, TransactionXmin))
+ {
+ xmin_for_csn = TransactionXmin;
+ }
+
/*
* For xids which less then TransactionXmin CSNLog can be already
* trimmed but we know that such transaction is definetly not concurrently
* running according to any snapshot including timetravel ones. Callers
* should check TransactionDidCommit after.
*/
- if (TransactionIdPrecedes(xid, TransactionXmin))
+ if (TransactionIdPrecedes(xid, xmin_for_csn))
return FrozenXidCSN;
/* Read XidCSN from SLRU */
@@ -251,7 +280,7 @@ CSNSnapshotAbort(PGPROC *proc, TransactionId xid,
if (!enable_csn_snapshot)
return;
- CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN);
+ CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN, true);
/*
* Clean assignedXidCsn anyway, as it was possibly set in
@@ -292,7 +321,7 @@ CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid,
{
Assert(XidCSNIsInProgress(oldassignedXidCsn));
CSNLogSetCSN(xid, nsubxids,
- subxids, InDoubtXidCSN);
+ subxids, InDoubtXidCSN, true);
}
else
{
@@ -333,8 +362,39 @@ CSNSnapshotCommit(PGPROC *proc, TransactionId xid,
assigned_xid_csn = pg_atomic_read_u64(&proc->assignedXidCsn);
Assert(XidCSNIsNormal(assigned_xid_csn));
CSNLogSetCSN(xid, nsubxids,
- subxids, assigned_xid_csn);
+ subxids, assigned_xid_csn, true);
/* Reset for next transaction */
pg_atomic_write_u64(&proc->assignedXidCsn, InProgressXidCSN);
}
+
+void
+set_last_max_csn(XidCSN xidcsn)
+{
+ csnState->last_max_csn = xidcsn;
+}
+
+void
+set_last_log_wal_csn(XidCSN xidcsn)
+{
+ csnState->last_csn_log_wal = xidcsn;
+}
+
+XidCSN
+get_last_log_wal_csn(void)
+{
+ XidCSN last_csn_log_wal;
+
+ last_csn_log_wal = csnState->last_csn_log_wal;
+
+ return last_csn_log_wal;
+}
+
+/*
+ * Set 'xmin_for_csn', the first xid checked by CSN after an xid->CSN snapshot switch.
+ */
+void
+set_xmin_for_csn(void)
+{
+ csnState->xmin_for_csn = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 58091f6b52..b1e5ec350e 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -28,6 +28,7 @@
#include "replication/origin.h"
#include "storage/standby.h"
#include "utils/relmapper.h"
+#include "access/csn_log.h"
/* must be kept in sync with RmgrData definition in xlog_internal.h */
#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f21e09a03..dc2e9ae874 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4607,6 +4607,7 @@ InitControlFile(uint64 sysidentifier)
ControlFile->wal_level = wal_level;
ControlFile->wal_log_hints = wal_log_hints;
ControlFile->track_commit_timestamp = track_commit_timestamp;
+ ControlFile->enable_csn_snapshot = enable_csn_snapshot;
ControlFile->data_checksum_version = bootstrap_data_checksum_version;
}
@@ -7064,7 +7065,6 @@ StartupXLOG(void)
* maintained during recovery and need not be started yet.
*/
StartupCLOG();
- StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
/*
@@ -7882,7 +7882,6 @@ StartupXLOG(void)
if (standbyState == STANDBY_DISABLED)
{
StartupCLOG();
- StartupCSNLog(oldestActiveXID);
StartupSUBTRANS(oldestActiveXID);
}
@@ -9106,7 +9105,6 @@ CreateCheckPoint(int flags)
if (!RecoveryInProgress())
{
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
- TruncateCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
}
/* Real work is done, but log and update stats before releasing lock. */
@@ -9736,7 +9734,8 @@ XLogReportParameters(void)
max_wal_senders != ControlFile->max_wal_senders ||
max_prepared_xacts != ControlFile->max_prepared_xacts ||
max_locks_per_xact != ControlFile->max_locks_per_xact ||
- track_commit_timestamp != ControlFile->track_commit_timestamp)
+ track_commit_timestamp != ControlFile->track_commit_timestamp ||
+ enable_csn_snapshot != ControlFile->enable_csn_snapshot)
{
/*
* The change in number of backend slots doesn't need to be WAL-logged
@@ -9758,6 +9757,7 @@ XLogReportParameters(void)
xlrec.wal_level = wal_level;
xlrec.wal_log_hints = wal_log_hints;
xlrec.track_commit_timestamp = track_commit_timestamp;
+ xlrec.enable_csn_snapshot = enable_csn_snapshot;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -9768,6 +9768,9 @@ XLogReportParameters(void)
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ if (enable_csn_snapshot != ControlFile->enable_csn_snapshot)
+ set_xmin_for_csn();
+
ControlFile->MaxConnections = MaxConnections;
ControlFile->max_worker_processes = max_worker_processes;
ControlFile->max_wal_senders = max_wal_senders;
@@ -9776,6 +9779,7 @@ XLogReportParameters(void)
ControlFile->wal_level = wal_level;
ControlFile->wal_log_hints = wal_log_hints;
ControlFile->track_commit_timestamp = track_commit_timestamp;
+ ControlFile->enable_csn_snapshot = enable_csn_snapshot;
UpdateControlFile();
LWLockRelease(ControlFileLock);
@@ -10208,6 +10212,7 @@ xlog_redo(XLogReaderState *record)
CommitTsParameterChange(xlrec.track_commit_timestamp,
ControlFile->track_commit_timestamp);
ControlFile->track_commit_timestamp = xlrec.track_commit_timestamp;
+ ControlFile->enable_csn_snapshot = xlrec.enable_csn_snapshot;
UpdateControlFile();
LWLockRelease(ControlFileLock);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 576c7e63e9..083a226dce 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -53,7 +53,7 @@
#include "utils/memutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
-
+#include "access/csn_log.h"
/*
* GUC parameters
@@ -1632,6 +1632,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
TruncateCLOG(frozenXID, oldestxid_datoid);
TruncateCommitTs(frozenXID);
+ TruncateCSNLog(frozenXID);
TruncateMultiXact(minMulti, minmulti_datoid);
/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index d715750437..9283021c7b 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1734,7 +1734,7 @@ GetSnapshotData(Snapshot snapshot)
* Take XidCSN under ProcArrayLock so the snapshot stays
* synchronized.
*/
- if (enable_csn_snapshot)
+ if (!snapshot->takenDuringRecovery && enable_csn_snapshot)
xid_csn = GenerateCSN(false);
LWLockRelease(ProcArrayLock);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 45fe574620..5fa195b913 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -2265,7 +2265,7 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
if (XidInvisibleInCSNSnapshot(xid, snapshot))
{
XidCSN gcsn = TransactionIdGetXidCSN(xid);
- Assert(XidCSNIsAborted(gcsn));
+ Assert(XidCSNIsAborted(gcsn) || XidCSNIsInProgress(gcsn));
}
#endif
return false;
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df74..e7194124c7 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -306,6 +306,8 @@ main(int argc, char *argv[])
ControlFile->max_locks_per_xact);
printf(_("track_commit_timestamp setting: %s\n"),
ControlFile->track_commit_timestamp ? _("on") : _("off"));
+ printf(_("enable_csn_snapshot setting: %s\n"),
+ ControlFile->enable_csn_snapshot ? _("on") : _("off"));
printf(_("Maximum data alignment: %u\n"),
ControlFile->maxAlign);
/* we don't print floatFormat since can't say much useful about it */
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 70194eb096..863ee73d24 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -545,6 +545,11 @@ copy_xact_xlog_xid(void)
check_ok();
}
+ if (old_cluster.controldata.cat_ver > CSN_BASE_SNAPSHOT_ADD_VER)
+ {
+ copy_subdir_files("pg_csn", "pg_csn");
+ }
+
/* now reset the wal archives in the new cluster */
prep_status("Resetting WAL archives");
exec_prog(UTILITY_LOG_FILE, NULL, true, true,
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 8b90cefbe0..f35860dfc5 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -123,6 +123,8 @@ extern char *output_files[];
*/
#define JSONB_FORMAT_CHANGE_CAT_VER 201409291
+#define CSN_BASE_SNAPSHOT_ADD_VER 202002010
+
/*
* Each relation is represented by a relinfo structure.
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca4b1..282bae882a 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -31,6 +31,7 @@
#include "rmgrdesc.h"
#include "storage/standbydefs.h"
#include "utils/relmapper.h"
+#include "access/csn_log.h"
#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask) \
{ name, desc, identify},
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
index 9b9611127d..cc5c51c53f 100644
--- a/src/include/access/csn_log.h
+++ b/src/include/access/csn_log.h
@@ -14,17 +14,59 @@
#include "access/xlog.h"
#include "utils/snapshot.h"
+/* XLOG stuff */
+#define XLOG_CSN_ASSIGNMENT 0x00
+#define XLOG_CSN_SETXIDCSN 0x10
+#define XLOG_CSN_ZEROPAGE 0x20
+#define XLOG_CSN_TRUNCATE 0x30
+
+/*
+ * The maximum generated CSN must be logged to WAL so that the database does
+ * not hand out an older CSN after a restart. That could otherwise happen if
+ * the system clock is set back.
+ *
+ * Logging the maximum CSN every time one is generated would produce too much
+ * WAL, so instead we log a value 5s in the future.
+ *
+ * The trade-off is that after a restart there may be up to 5s of degraded
+ * performance while time is synchronized among sharding nodes.
+ *
+ * This could be turned into a configuration parameter so that users can
+ * choose the trade-off they prefer.
+ *
+ */
+#define CSN_ASSIGN_TIME_INTERVAL 5
+
+typedef struct xl_xidcsn_set
+{
+ XidCSN xidcsn;
+ TransactionId xtop; /* XID's top-level XID */
+ int nsubxacts; /* number of subtransaction XIDs */
+ TransactionId xsub[FLEXIBLE_ARRAY_MEMBER]; /* assigned subxids */
+} xl_xidcsn_set;
+
+#define MinSizeOfXidCSNSet offsetof(xl_xidcsn_set, xsub)
+#define CSNAddByNanosec(csn, second) ((csn) + (second) * UINT64CONST(1000000000))
+
extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
- TransactionId *subxids, XidCSN csn);
+ TransactionId *subxids, XidCSN csn, bool write_xlog);
extern XidCSN CSNLogGetCSNByXid(TransactionId xid);
extern Size CSNLogShmemSize(void);
extern void CSNLogShmemInit(void);
extern void BootStrapCSNLog(void);
-extern void StartupCSNLog(TransactionId oldestActiveXID);
extern void ShutdownCSNLog(void);
extern void CheckPointCSNLog(void);
extern void ExtendCSNLog(TransactionId newestXact);
extern void TruncateCSNLog(TransactionId oldestXact);
+extern void csnlog_redo(XLogReaderState *record);
+extern void csnlog_desc(StringInfo buf, XLogReaderState *record);
+extern const char *csnlog_identify(uint8 info);
+extern void WriteAssignCSNXlogRec(XidCSN xidcsn);
+extern void set_last_max_csn(XidCSN xidcsn);
+extern void set_last_log_wal_csn(XidCSN xidcsn);
+extern XidCSN get_last_log_wal_csn(void);
+extern void set_xmin_for_csn(void);
+
#endif /* CSNLOG_H */
\ No newline at end of file
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 6c15df7e70..b2d12bfb27 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CSNLOG_ID, "CSN", csnlog_redo, csnlog_desc, csnlog_identify, NULL, NULL, NULL)
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 88f3d76700..02be3087ac 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -243,6 +243,7 @@ typedef struct xl_parameter_change
int wal_level;
bool wal_log_hints;
bool track_commit_timestamp;
+ bool enable_csn_snapshot;
} xl_parameter_change;
/* logs restore point */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..9e5d4b0fc0 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -181,6 +181,7 @@ typedef struct ControlFileData
int max_prepared_xacts;
int max_locks_per_xact;
bool track_commit_timestamp;
+ bool enable_csn_snapshot;
/*
* This data is used to check for hardware-architecture compatibility of
Attachment: 0003-snapshot-switch.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b353c61683..7cd0d5e49f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9068,8 +9068,56 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</varlistentry>
</variablelist>
- </sect1>
+ <sect2 id="runtime-config-CSN-base-snapshot">
+ <title>CSN Based Snapshot</title>
+
+ <para>
+ By default, snapshots in <productname>PostgreSQL</productname> use the
+ XID (transaction ID) to identify committed transactions, in-progress
+ transactions, and future transactions in all visibility calculations.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</productname> also provides a CSN (commit sequence number)
+ based mechanism to distinguish past transactions from those that have not yet
+ started or committed.
+ </para>
+
+ <variablelist>
+ <varlistentry id="guc-enable-csn-snapshot" xreflabel="enable_csn_snapshot">
+ <term><varname>enable_csn_snapshot</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_csn_snapshot</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+
+ <para>
+ Enables or disables CSN-based transaction visibility tracking for snapshots.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</productname> uses the clock timestamp as the CSN,
+ so enabling CSN-based snapshots can be useful for implementing global
+ snapshots and global transaction visibility.
+ </para>
+
+ <para>
+ When enabled, <productname>PostgreSQL</productname> creates the
+ <filename>pg_csn</filename> directory under <envar>PGDATA</envar> to keep
+ track of the XID-to-CSN mappings.
+ </para>
+
+ <para>
+ The default value is off.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect2>
+ </sect1>
<sect1 id="runtime-config-compatible">
<title>Version and Platform Compatibility</title>
diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
index 9a42c5ba60..16c6856758 100644
--- a/src/backend/access/transam/csn_log.c
+++ b/src/backend/access/transam/csn_log.c
@@ -30,9 +30,28 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "utils/snapmgr.h"
+#include "storage/shmem.h"
bool enable_csn_snapshot;
+/*
+ * We use csnSnapshotActive, rather than enable_csn_snapshot, to judge whether
+ * CSN snapshots are enabled; this design is similar to 'track_commit_timestamp'.
+ *
+ * During replication, if the master changes 'enable_csn_snapshot' across a
+ * database restart, the standby applies the WAL record for the changed GUC,
+ * and it is difficult to notify all backends about that. Instead, backends
+ * read 'csnSnapshotActive', which lives in shared memory. Reading it does
+ * not acquire a lock, so it adds no locking overhead.
+ *
+ */
+typedef struct CSNshapshotShared
+{
+ bool csnSnapshotActive;
+} CSNshapshotShared;
+
+CSNshapshotShared *csnShared = NULL;
+
/*
* Defines for CSNLog page sizes. A page is the same BLCKSZ as is used
* everywhere else in Postgres.
@@ -94,9 +113,6 @@ CSNLogSetCSN(TransactionId xid, int nsubxids,
int i = 0;
int offset = 0;
- /* Callers of CSNLogSetCSN() must check GUC params */
- Assert(enable_csn_snapshot);
-
Assert(TransactionIdIsValid(xid));
pageno = TransactionIdToPage(xid); /* get page of parent */
@@ -191,9 +207,6 @@ CSNLogGetCSNByXid(TransactionId xid)
int slotno;
XidCSN csn;
- /* Callers of CSNLogGetCSNByXid() must check GUC params */
- Assert(enable_csn_snapshot);
-
/* lock is acquired by SimpleLruReadPage_ReadOnly */
slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
csn = *(XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
@@ -218,9 +231,6 @@ CSNLogShmemBuffers(void)
Size
CSNLogShmemSize(void)
{
- if (!enable_csn_snapshot)
- return 0;
-
return SimpleLruShmemSize(CSNLogShmemBuffers(), 0);
}
@@ -230,37 +240,25 @@ CSNLogShmemSize(void)
void
CSNLogShmemInit(void)
{
- if (!enable_csn_snapshot)
- return;
+ bool found;
+
CsnlogCtl->PagePrecedes = CSNLogPagePrecedes;
SimpleLruInit(CsnlogCtl, "CSNLog Ctl", CSNLogShmemBuffers(), 0,
CSNLogControlLock, "pg_csn", LWTRANCHE_CSN_LOG_BUFFERS);
+
+ csnShared = ShmemInitStruct("CSNlog shared",
+ sizeof(CSNshapshotShared),
+ &found);
}
/*
- * This func must be called ONCE on system install. It creates the initial
- * CSNLog segment. The pg_csn directory is assumed to have been
- * created by initdb, and CSNLogShmemInit must have been called already.
+ * See ActivateCSNlog
*/
void
BootStrapCSNLog(void)
{
- int slotno;
-
- if (!enable_csn_snapshot)
- return;
-
- LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
-
- /* Create and zero the first page of the commit log */
- slotno = ZeroCSNLogPage(0, false);
-
- /* Make sure it's written out */
- SimpleLruWritePage(CsnlogCtl, slotno);
- Assert(!CsnlogCtl->shared->page_dirty[slotno]);
-
- LWLockRelease(CSNLogControlLock);
+ return;
}
/*
@@ -288,13 +286,94 @@ ZeroTruncateCSNLogPage(int pageno, bool write_xlog)
SimpleLruTruncate(CsnlogCtl, pageno);
}
+void
+ActivateCSNlog(void)
+{
+ int startPage;
+ TransactionId nextXid = InvalidTransactionId;
+
+ if (csnShared->csnSnapshotActive)
+ return;
+
+
+ nextXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+ startPage = TransactionIdToPage(nextXid);
+
+ /* Create the current segment file, if necessary */
+ if (!SimpleLruDoesPhysicalPageExist(CsnlogCtl, startPage))
+ {
+ int slotno;
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ slotno = ZeroCSNLogPage(startPage, false);
+ SimpleLruWritePage(CsnlogCtl, slotno);
+ LWLockRelease(CSNLogControlLock);
+ }
+ csnShared->csnSnapshotActive = true;
+}
+
+bool
+get_csnlog_status(void)
+{
+ if(!csnShared)
+ {
+ /* Should not be reached */
+ elog(ERROR, "csnShared pointer is not initialized");
+ }
+ return csnShared->csnSnapshotActive;
+}
+
+void
+DeactivateCSNlog(void)
+{
+ csnShared->csnSnapshotActive = false;
+ LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
+ (void) SlruScanDirectory(CsnlogCtl, SlruScanDirCbDeleteAll, NULL);
+ LWLockRelease(CSNLogControlLock);
+}
+
+void
+StartupCSN(void)
+{
+ ActivateCSNlog();
+}
+
+void
+CompleteCSNInitialization(void)
+{
+ /*
+ * If the feature is not enabled, turn it off for good. This also removes
+ * any leftover data.
+ *
+ * Conversely, we activate the module if the feature is enabled. This is
+ * necessary for primary and standby as the activation depends on the
+ * control file contents at the beginning of recovery or when a
+ * XLOG_PARAMETER_CHANGE is replayed.
+ */
+ if (!get_csnlog_status())
+ DeactivateCSNlog();
+ else
+ ActivateCSNlog();
+}
+
+void
+CSNlogParameterChange(bool newvalue, bool oldvalue)
+{
+ if (newvalue)
+ {
+ if (!csnShared->csnSnapshotActive)
+ ActivateCSNlog();
+ }
+ else if (csnShared->csnSnapshotActive)
+ DeactivateCSNlog();
+}
+
/*
* This must be called ONCE during postmaster or standalone-backend shutdown
*/
void
ShutdownCSNLog(void)
{
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/*
@@ -314,7 +393,7 @@ ShutdownCSNLog(void)
void
CheckPointCSNLog(void)
{
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/*
@@ -342,7 +421,7 @@ ExtendCSNLog(TransactionId newestXact)
{
int pageno;
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/*
@@ -373,9 +452,9 @@ ExtendCSNLog(TransactionId newestXact)
void
TruncateCSNLog(TransactionId oldestXact)
{
- int cutoffPage;
+ int cutoffPage;
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/*
@@ -388,7 +467,6 @@ TruncateCSNLog(TransactionId oldestXact)
*/
TransactionIdRetreat(oldestXact);
cutoffPage = TransactionIdToPage(oldestXact);
-
ZeroTruncateCSNLogPage(cutoffPage, true);
}
@@ -447,7 +525,6 @@ WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
TransactionId *subxids, XidCSN csn)
{
xl_xidcsn_set xlrec;
- XLogRecPtr recptr;
xlrec.xtop = xid;
xlrec.nsubxacts = nsubxids;
@@ -456,8 +533,7 @@ WriteXidCsnXlogRec(TransactionId xid, int nsubxids,
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, MinSizeOfXidCSNSet);
XLogRegisterData((char *) subxids, nsubxids * sizeof(TransactionId));
- recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_SETXIDCSN);
- XLogFlush(recptr);
+ XLogInsert(RM_CSNLOG_ID, XLOG_CSN_SETXIDCSN);
}
/*
@@ -477,12 +553,9 @@ WriteZeroCSNPageXlogRec(int pageno)
static void
WriteTruncateCSNXlogRec(int pageno)
{
- XLogRecPtr recptr;
- return;
XLogBeginInsert();
XLogRegisterData((char *) (&pageno), sizeof(int));
- recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_TRUNCATE);
- XLogFlush(recptr);
+ XLogInsert(RM_CSNLOG_ID, XLOG_CSN_TRUNCATE);
}
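The activation logic in the csn_log.c changes above can be read as a small state machine around one shared flag. The following is a minimal standalone sketch of that reading, not code from the patch: `CsnLogModel`, its `segments` counter, and the `model_*` functions are hypothetical stand-ins for `csnShared->csnSnapshotActive`, the files under pg_csn, and `ActivateCSNlog()`, `DeactivateCSNlog()` and `CSNlogParameterChange()`. Page creation is simplified to "create one segment if none exists".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state touched by the patch. */
typedef struct CsnLogModel
{
    bool active;    /* models csnShared->csnSnapshotActive */
    int  segments;  /* models the number of segment files in pg_csn */
} CsnLogModel;

/* Models ActivateCSNlog(): a no-op when already active. */
static void
model_activate(CsnLogModel *log)
{
    assert(log != NULL);
    if (log->active)
        return;
    if (log->segments == 0)
        log->segments = 1;  /* create the page that covers nextXid */
    log->active = true;
}

/* Models DeactivateCSNlog(): clear the flag, then drop all segments. */
static void
model_deactivate(CsnLogModel *log)
{
    assert(log != NULL);
    log->active = false;
    log->segments = 0;
}

/* Models CSNlogParameterChange() as replayed on a standby. */
static void
model_parameter_change(CsnLogModel *log, bool newvalue)
{
    if (newvalue)
    {
        if (!log->active)
            model_activate(log);
    }
    else if (log->active)
        model_deactivate(log);
}
```

In this model, replaying the same parameter value twice changes nothing, which matches the early returns in the patch.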
diff --git a/src/backend/access/transam/csn_snapshot.c b/src/backend/access/transam/csn_snapshot.c
index 99e4a2f1ed..cedce60a6f 100644
--- a/src/backend/access/transam/csn_snapshot.c
+++ b/src/backend/access/transam/csn_snapshot.c
@@ -62,10 +62,7 @@ CSNSnapshotShmemSize(void)
{
Size size = 0;
- if (enable_csn_snapshot)
- {
- size += MAXALIGN(sizeof(CSNSnapshotState));
- }
+ size += MAXALIGN(sizeof(CSNSnapshotState));
return size;
}
@@ -76,17 +73,15 @@ CSNSnapshotShmemInit()
{
bool found;
- if (enable_csn_snapshot)
+ csnState = ShmemInitStruct("csnState",
+ sizeof(CSNSnapshotState),
+ &found);
+ if (!found)
{
- csnState = ShmemInitStruct("csnState",
- sizeof(CSNSnapshotState),
- &found);
- if (!found)
- {
- csnState->last_max_csn = 0;
- csnState->last_csn_log_wal = 0;
- SpinLockInit(&csnState->lock);
- }
+ csnState->last_max_csn = 0;
+ csnState->last_csn_log_wal = 0;
+ csnState->xmin_for_csn = InvalidTransactionId;
+ SpinLockInit(&csnState->lock);
}
}
@@ -104,7 +99,7 @@ GenerateCSN(bool locked)
instr_time current_time;
SnapshotCSN csn;
- Assert(enable_csn_snapshot);
+ Assert(get_csnlog_status());
/*
* TODO: create some macro that add small random shift to current time.
@@ -140,7 +135,7 @@ TransactionIdGetXidCSN(TransactionId xid)
{
XidCSN xid_csn;
- Assert(enable_csn_snapshot);
+ Assert(get_csnlog_status());
/* Handle permanent TransactionId's for which we don't have mapping */
if (!TransactionIdIsNormal(xid))
@@ -168,12 +163,20 @@ TransactionIdGetXidCSN(TransactionId xid)
SpinLockRelease(&csnState->lock);
}
- if ( FrozenTransactionId != xmin_for_csn ||
+ if (FrozenTransactionId == xmin_for_csn ||
TransactionIdPrecedes(xmin_for_csn, TransactionXmin))
{
xmin_for_csn = TransactionXmin;
}
+ /*
+ * An xid with 'xid >= TransactionXmin and xid < xmin_for_csn' has an
+ * unclear CSN, so its visibility follows the xid-snapshot result.
+ */
+ if(!TransactionIdPrecedes(xid, TransactionXmin) &&
+ TransactionIdPrecedes(xid, xmin_for_csn))
+ return UnclearCSN;
+
/*
* For xids which less then TransactionXmin CSNLog can be already
* trimmed but we know that such transaction is definetly not concurrently
@@ -222,7 +225,7 @@ XidInvisibleInCSNSnapshot(TransactionId xid, Snapshot snapshot)
{
XidCSN csn;
- Assert(enable_csn_snapshot);
+ Assert(get_csnlog_status());
csn = TransactionIdGetXidCSN(xid);
@@ -238,6 +241,14 @@ XidInvisibleInCSNSnapshot(TransactionId xid, Snapshot snapshot)
/* It is bootstrap or frozen transaction */
return false;
}
+ else if(CSNIsUnclear(csn))
+ {
+ /*
+ * The CSN of some xids cannot be determined because of a snapshot
+ * switch, so we follow the xid-based result for them.
+ */
+ return true;
+ }
else
{
/* It is aborted or in-progress */
@@ -277,7 +288,7 @@ void
CSNSnapshotAbort(PGPROC *proc, TransactionId xid,
int nsubxids, TransactionId *subxids)
{
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
CSNLogSetCSN(xid, nsubxids, subxids, AbortedXidCSN, true);
@@ -310,7 +321,7 @@ CSNSnapshotPrecommit(PGPROC *proc, TransactionId xid,
XidCSN oldassignedXidCsn = InProgressXidCSN;
bool in_progress;
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
/* Set InDoubt status if it is local transaction */
@@ -348,7 +359,7 @@ CSNSnapshotCommit(PGPROC *proc, TransactionId xid,
{
volatile XidCSN assigned_xid_csn;
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
return;
if (!TransactionIdIsValid(xid))
@@ -391,10 +402,42 @@ get_last_log_wal_csn(void)
}
/*
- * 'xmin_for_csn' for when turn xid-snapshot to csn-snapshot
+ * Take different actions depending on the values of 'enable' and 'same'.
*/
void
-set_xmin_for_csn(void)
+prepare_csn_env(bool enable, bool same, TransactionId *xmin_for_csn_in_control)
{
- csnState->xmin_for_csn = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+ TransactionId nextxid = InvalidTransactionId;
+
+ if(enable)
+ {
+ if(same)
+ {
+ /*
+ * Database startup with enable_csn_snapshot unchanged and set to true:
+ * just copy xmin_for_csn from pg_control to csnState->xmin_for_csn.
+ */
+ csnState->xmin_for_csn = *xmin_for_csn_in_control;
+ }
+ else
+ {
+ /*
+ * The database last ran with xid-based snapshots and now starts up with
+ * CSN-based snapshots, so we must redefine xmin_for_csn and store it in
+ * both pg_control and csnState->xmin_for_csn.
+ */
+ nextxid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+ csnState->xmin_for_csn = nextxid;
+ *xmin_for_csn_in_control = nextxid;
+ /* Create the csnlog segment we need now and seek to the current page */
+ ActivateCSNlog();
+ }
+ }
+ else
+ {
+ /* Try to drop all csnlog segments */
+ DeactivateCSNlog();
+ /* Clear xmin_for_csn in pg_control because we now use xid-based snapshots. */
+ *xmin_for_csn_in_control = InvalidTransactionId;
+ }
}
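The three branches of prepare_csn_env() above can be summarized in a standalone model. This is only a sketch under stated assumptions: `CsnEnvModel` and `model_prepare_csn_env` are hypothetical names, the struct fields stand in for `csnState->xmin_for_csn`, `ControlFile->xmin_for_csn` and `csnShared->csnSnapshotActive`, and `next_xid` replaces the read of `ShmemVariableCache->nextFullXid`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;     /* same width as a PostgreSQL xid */
#define InvalidTransactionId ((TransactionId) 0)

/* Hypothetical stand-in for the state the patch reads and writes. */
typedef struct CsnEnvModel
{
    TransactionId shared_xmin_for_csn;   /* models csnState->xmin_for_csn */
    TransactionId control_xmin_for_csn;  /* models ControlFile->xmin_for_csn */
    bool          csnlog_active;         /* models csnShared->csnSnapshotActive */
} CsnEnvModel;

static void
model_prepare_csn_env(CsnEnvModel *env, bool enable, bool same,
                      TransactionId next_xid)
{
    assert(env != NULL);

    if (enable && same)
    {
        /* csn -> csn restart: reload the boundary saved in pg_control */
        env->shared_xmin_for_csn = env->control_xmin_for_csn;
    }
    else if (enable)
    {
        /* xid -> csn switch: the boundary becomes the next xid */
        env->shared_xmin_for_csn = next_xid;
        env->control_xmin_for_csn = next_xid;
        env->csnlog_active = true;
    }
    else
    {
        /* csn disabled: drop the log and forget the saved boundary */
        env->csnlog_active = false;
        env->control_xmin_for_csn = InvalidTransactionId;
    }
}
```

The point of the `same` branch is that the boundary chosen at the switch survives later restarts, which is what the 'xid-snapshot start -> csn-snapshot start -> csn-snapshot start' case needs when a prepared transaction is still open.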
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dc2e9ae874..32f1e614b4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -79,6 +79,7 @@
#include "utils/relmapper.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"
+#include "access/csn_log.h"
extern uint32 bootstrap_data_checksum_version;
@@ -4609,6 +4610,7 @@ InitControlFile(uint64 sysidentifier)
ControlFile->track_commit_timestamp = track_commit_timestamp;
ControlFile->enable_csn_snapshot = enable_csn_snapshot;
ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+ ControlFile->xmin_for_csn = InvalidTransactionId;
}
static void
@@ -6805,6 +6807,9 @@ StartupXLOG(void)
if (ControlFile->track_commit_timestamp)
StartupCommitTs();
+ if(ControlFile->enable_csn_snapshot)
+ StartupCSN();
+
/*
* Recover knowledge about replay progress of known replication partners.
*/
@@ -7921,6 +7926,7 @@ StartupXLOG(void)
* commit timestamp.
*/
CompleteCommitTsInitialization();
+ CompleteCSNInitialization();
/*
* All done with end-of-recovery actions.
@@ -9727,6 +9733,9 @@ XLogRestorePoint(const char *rpName)
static void
XLogReportParameters(void)
{
+ TransactionId xmin_for_csn = InvalidTransactionId;
+
+ xmin_for_csn = ControlFile->xmin_for_csn;
if (wal_level != ControlFile->wal_level ||
wal_log_hints != ControlFile->wal_log_hints ||
MaxConnections != ControlFile->MaxConnections ||
@@ -9766,11 +9775,12 @@ XLogReportParameters(void)
XLogFlush(recptr);
}
+ prepare_csn_env(enable_csn_snapshot,
+ enable_csn_snapshot == ControlFile->enable_csn_snapshot,
+ &xmin_for_csn);
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
- if (enable_csn_snapshot != ControlFile->enable_csn_snapshot)
- set_xmin_for_csn();
-
+ ControlFile->xmin_for_csn = xmin_for_csn;
ControlFile->MaxConnections = MaxConnections;
ControlFile->max_worker_processes = max_worker_processes;
ControlFile->max_wal_senders = max_wal_senders;
@@ -9784,6 +9794,16 @@ XLogReportParameters(void)
LWLockRelease(ControlFileLock);
}
+ else
+ {
+ /*
+ * Even when no GUC has changed, xmin_for_csn must be copied from
+ * pg_control to csnState->xmin_for_csn. Otherwise, problems arise when a
+ * prepared transaction exists and the server goes through the sequence
+ * 'xid-snapshot start -> csn-snapshot start -> csn-snapshot start'.
+ */
+ prepare_csn_env(enable_csn_snapshot, true, &xmin_for_csn);
+ }
}
/*
@@ -10212,6 +10232,8 @@ xlog_redo(XLogReaderState *record)
CommitTsParameterChange(xlrec.track_commit_timestamp,
ControlFile->track_commit_timestamp);
ControlFile->track_commit_timestamp = xlrec.track_commit_timestamp;
+ CSNlogParameterChange(xlrec.enable_csn_snapshot,
+ ControlFile->enable_csn_snapshot);
ControlFile->enable_csn_snapshot = xlrec.enable_csn_snapshot;
UpdateControlFile();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 9283021c7b..e326b431c2 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1734,7 +1734,7 @@ GetSnapshotData(Snapshot snapshot)
* Take XidCSN under ProcArrayLock so the snapshot stays
* synchronized.
*/
- if (!snapshot->takenDuringRecovery && enable_csn_snapshot)
+ if (!snapshot->takenDuringRecovery && get_csnlog_status())
xid_csn = GenerateCSN(false);
LWLockRelease(ProcArrayLock);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1e9bcc7aee..4527dda0ee 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1160,7 +1160,7 @@ static struct config_bool ConfigureNamesBool[] =
gettext_noop("Used to achieve REPEATEBLE READ isolation level for postgres_fdw transactions.")
},
&enable_csn_snapshot,
- true, /* XXX: set true to simplify tesing. XXX2: Seems that RESOURCES_MEM isn't the best catagory */
+ false,
NULL, NULL, NULL
},
{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e430e33c7b..8b02ec8200 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -296,6 +296,8 @@
# (change requires restart)
#track_commit_timestamp = off # collect timestamp of transaction commit
# (change requires restart)
+#enable_csn_snapshot = off # enable CSN-based snapshots
+ # (change requires restart)
# - Primary Server -
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5fa195b913..2a31366930 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -52,6 +52,7 @@
#include "access/transam.h"
#include "access/xact.h"
#include "access/xlog.h"
+#include "access/csn_log.h"
#include "catalog/catalog.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
@@ -2244,7 +2245,7 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
in_snapshot = XidInLocalMVCCSnapshot(xid, snapshot);
- if (!enable_csn_snapshot)
+ if (!get_csnlog_status())
{
Assert(XidCSNIsFrozen(snapshot->snapshot_csn));
return in_snapshot;
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
index d7feca0b38..5d508caef5 100644
--- a/src/include/access/csn_log.h
+++ b/src/include/access/csn_log.h
@@ -67,6 +67,13 @@ extern void WriteAssignCSNXlogRec(XidCSN xidcsn);
extern void set_last_max_csn(XidCSN xidcsn);
extern void set_last_log_wal_csn(XidCSN xidcsn);
extern XidCSN get_last_log_wal_csn(void);
-extern void set_xmin_for_csn(void);
+extern void prepare_csn_env(bool enable, bool same, TransactionId *xmin_for_csn_in_control);
+extern void CatchCSNLog(void);
+extern void ActivateCSNlog(void);
+extern void DeactivateCSNlog(void);
+extern void StartupCSN(void);
+extern void CompleteCSNInitialization(void);
+extern void CSNlogParameterChange(bool newvalue, bool oldvalue);
+extern bool get_csnlog_status(void);
#endif /* CSNLOG_H */
\ No newline at end of file
diff --git a/src/include/access/csn_snapshot.h b/src/include/access/csn_snapshot.h
index 1894586204..a768f054f5 100644
--- a/src/include/access/csn_snapshot.h
+++ b/src/include/access/csn_snapshot.h
@@ -24,17 +24,19 @@
*/
typedef pg_atomic_uint64 CSN_atomic;
-#define InProgressXidCSN UINT64CONST(0x0)
-#define AbortedXidCSN UINT64CONST(0x1)
+#define InProgressXidCSN UINT64CONST(0x0)
+#define AbortedXidCSN UINT64CONST(0x1)
#define FrozenXidCSN UINT64CONST(0x2)
-#define InDoubtXidCSN UINT64CONST(0x3)
-#define FirstNormalXidCSN UINT64CONST(0x4)
+#define InDoubtXidCSN UINT64CONST(0x3)
+#define UnclearCSN UINT64CONST(0x4)
+#define FirstNormalXidCSN UINT64CONST(0x5)
-#define XidCSNIsInProgress(csn) ((csn) == InProgressXidCSN)
+#define XidCSNIsInProgress(csn) ((csn) == InProgressXidCSN)
#define XidCSNIsAborted(csn) ((csn) == AbortedXidCSN)
-#define XidCSNIsFrozen(csn) ((csn) == FrozenXidCSN)
+#define XidCSNIsFrozen(csn) ((csn) == FrozenXidCSN)
#define XidCSNIsInDoubt(csn) ((csn) == InDoubtXidCSN)
-#define XidCSNIsNormal(csn) ((csn) >= FirstNormalXidCSN)
+#define CSNIsUnclear(csn) ((csn) == UnclearCSN)
+#define XidCSNIsNormal(csn) ((csn) >= FirstNormalXidCSN)
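To illustrate how the new UnclearCSN value fits with xmin_for_csn, here is a simplified standalone sketch of the xid range check added to TransactionIdGetXidCSN(). It is an assumption-laden model, not the patch's code: `CsnLookup` and `model_classify_xid` are hypothetical, plain `<` comparisons replace `TransactionIdPrecedes()` (so xid wraparound is ignored), an unset boundary is modeled as 0, and the "frozen" result for xids older than TransactionXmin is inferred from the surrounding comment in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t XidCSN;

/* Reserved CSN values as renumbered by the patch. */
#define InProgressXidCSN  UINT64_C(0x0)
#define AbortedXidCSN     UINT64_C(0x1)
#define FrozenXidCSN      UINT64_C(0x2)
#define InDoubtXidCSN     UINT64_C(0x3)
#define UnclearCSN        UINT64_C(0x4)
#define FirstNormalXidCSN UINT64_C(0x5)

/* Hypothetical result type: where the answer for an xid comes from. */
typedef enum
{
    LOOKUP_FROZEN,   /* older than TransactionXmin: treated as frozen */
    LOOKUP_UNCLEAR,  /* predates the switch to CSN: fall back to xid snapshot */
    LOOKUP_CSNLOG    /* read the CSN from the csnlog */
} CsnLookup;

static bool
model_csn_is_normal(XidCSN csn)
{
    return csn >= FirstNormalXidCSN;
}

static CsnLookup
model_classify_xid(TransactionId xid, TransactionId transaction_xmin,
                   TransactionId xmin_for_csn)
{
    /* An unset or too-old boundary is raised to TransactionXmin. */
    if (xmin_for_csn < transaction_xmin)
        xmin_for_csn = transaction_xmin;

    /* TransactionXmin <= xid < xmin_for_csn: no CSN was ever recorded. */
    if (xid >= transaction_xmin && xid < xmin_for_csn)
        return LOOKUP_UNCLEAR;

    if (xid < transaction_xmin)
        return LOOKUP_FROZEN;

    return LOOKUP_CSNLOG;
}
```

Because UnclearCSN takes the value 0x4, FirstNormalXidCSN moves to 0x5, so a reserved value is never mistaken for a real commit timestamp.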
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 9e5d4b0fc0..3ff7371a92 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -183,6 +183,12 @@ typedef struct ControlFileData
bool track_commit_timestamp;
bool enable_csn_snapshot;
+ /*
+ * Records an xmin when the database starts up with a switch to CSN
+ * snapshots; the value is kept until it switches back to xid snapshots.
+ */
+ TransactionId xmin_for_csn;
+
/*
* This data is used to check for hardware-architecture compatibility of
* the database and the backend executable. We need not check endianness
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 29de73c060..86e114e934 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
SUBDIRS = \
brin \
commit_ts \
+ csnsnapshot \
dummy_index_am \
dummy_seclabel \
snapshot_too_old \
diff --git a/src/test/modules/csnsnapshot/Makefile b/src/test/modules/csnsnapshot/Makefile
new file mode 100644
index 0000000000..45c4221cd0
--- /dev/null
+++ b/src/test/modules/csnsnapshot/Makefile
@@ -0,0 +1,18 @@
+# src/test/modules/csnsnapshot/Makefile
+
+REGRESS = csnsnapshot
+REGRESS_OPTS = --temp-config=$(top_srcdir)/src/test/modules/csnsnapshot/csn_snapshot.conf
+NO_INSTALLCHECK = 1
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/csnsnapshot
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/csnsnapshot/csn_snapshot.conf b/src/test/modules/csnsnapshot/csn_snapshot.conf
new file mode 100644
index 0000000000..e9d3c35756
--- /dev/null
+++ b/src/test/modules/csnsnapshot/csn_snapshot.conf
@@ -0,0 +1 @@
+track_commit_timestamp = on
diff --git a/src/test/modules/csnsnapshot/expected/csnsnapshot.out b/src/test/modules/csnsnapshot/expected/csnsnapshot.out
new file mode 100644
index 0000000000..ac28e417b6
--- /dev/null
+++ b/src/test/modules/csnsnapshot/expected/csnsnapshot.out
@@ -0,0 +1 @@
+create table t1(i int, j int, k varchar);
diff --git a/src/test/modules/csnsnapshot/sql/csnsnapshot.sql b/src/test/modules/csnsnapshot/sql/csnsnapshot.sql
new file mode 100644
index 0000000000..91539b8c30
--- /dev/null
+++ b/src/test/modules/csnsnapshot/sql/csnsnapshot.sql
@@ -0,0 +1 @@
+create table t1(i int, j int, k varchar);
\ No newline at end of file
diff --git a/src/test/modules/csnsnapshot/t/001_base.pl b/src/test/modules/csnsnapshot/t/001_base.pl
new file mode 100644
index 0000000000..1c91f4d9f7
--- /dev/null
+++ b/src/test/modules/csnsnapshot/t/001_base.pl
@@ -0,0 +1,102 @@
+# Single-node test: value can be set, and is still present after recovery
+
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 5;
+use PostgresNode;
+
+my $node = get_new_node('csntest');
+$node->init;
+$node->append_conf('postgresql.conf', qq{
+ enable_csn_snapshot = on
+ csn_snapshot_defer_time = 10
+ max_prepared_transactions = 10
+ });
+$node->start;
+
+my $test_1 = 1;
+
+# Create a table
+$node->safe_psql('postgres', 'create table t1(i int, j int)');
+
+# insert test record
+$node->safe_psql('postgres', 'insert into t1 values(1,1)');
+# export csn snapshot
+my $test_snapshot = $node->safe_psql('postgres', 'select pg_csn_snapshot_export()');
+# insert test record
+$node->safe_psql('postgres', 'insert into t1 values(2,1)');
+
+my $count1 = $node->safe_psql('postgres', "select count(*) from t1");
+is($count1, '2', 'Get right number in normal query');
+my $count2 = $node->safe_psql('postgres', "
+ begin transaction isolation level repeatable read;
+ select pg_csn_snapshot_import($test_snapshot);
+ select count(*) from t1;
+ commit;"
+ );
+
+is($count2, '
+1', 'Get right number in csn import query');
+
+#prepare transaction test
+$node->safe_psql('postgres', "
+ begin;
+ insert into t1 values(3,1);
+ insert into t1 values(3,2);
+ prepare transaction 'pt3';
+ ");
+$node->safe_psql('postgres', "
+ begin;
+ insert into t1 values(4,1);
+ insert into t1 values(4,2);
+ prepare transaction 'pt4';
+ ");
+$node->safe_psql('postgres', "
+ begin;
+ insert into t1 values(5,1);
+ insert into t1 values(5,2);
+ prepare transaction 'pt5';
+ ");
+$node->safe_psql('postgres', "
+ begin;
+ insert into t1 values(6,1);
+ insert into t1 values(6,2);
+ prepare transaction 'pt6';
+ ");
+$node->safe_psql('postgres', "commit prepared 'pt4';");
+
+# restart with enable_csn_snapshot off
+$node->append_conf('postgresql.conf', "enable_csn_snapshot = off");
+$node->restart;
+$node->safe_psql('postgres', "
+ insert into t1 values(7,1);
+ insert into t1 values(7,2);
+ ");
+$node->safe_psql('postgres', "commit prepared 'pt3';");
+$count1 = $node->safe_psql('postgres', "select count(*) from t1");
+is($count1, '8', 'Get right number in normal query');
+
+
+# restart with enable_csn_snapshot on
+$node->append_conf('postgresql.conf', "enable_csn_snapshot = on");
+$node->restart;
+$node->safe_psql('postgres', "
+ insert into t1 values(8,1);
+ insert into t1 values(8,2);
+ ");
+$node->safe_psql('postgres', "commit prepared 'pt5';");
+$count1 = $node->safe_psql('postgres', "select count(*) from t1");
+is($count1, '12', 'Get right number in normal query');
+
+# restart again with enable_csn_snapshot still on (no snapshot switch)
+$node->append_conf('postgresql.conf', "enable_csn_snapshot = on");
+$node->restart;
+$node->safe_psql('postgres', "
+ insert into t1 values(9,1);
+ insert into t1 values(9,2);
+ ");
+$node->safe_psql('postgres', "commit prepared 'pt6';");
+$count1 = $node->safe_psql('postgres', "select count(*) from t1");
+is($count1, '16', 'Get right number in normal query');
diff --git a/src/test/modules/csnsnapshot/t/002_standby.pl b/src/test/modules/csnsnapshot/t/002_standby.pl
new file mode 100644
index 0000000000..b7c4ea93b2
--- /dev/null
+++ b/src/test/modules/csnsnapshot/t/002_standby.pl
@@ -0,0 +1,66 @@
+# Test simple scenario involving a standby
+
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 6;
+use PostgresNode;
+
+my $bkplabel = 'backup';
+my $master = get_new_node('master');
+$master->init(allows_streaming => 1);
+
+$master->append_conf(
+ 'postgresql.conf', qq{
+ enable_csn_snapshot = on
+ max_wal_senders = 5
+ });
+$master->start;
+$master->backup($bkplabel);
+
+my $standby = get_new_node('standby');
+$standby->init_from_backup($master, $bkplabel, has_streaming => 1);
+$standby->start;
+
+$master->safe_psql('postgres', "create table t1(i int, j int)");
+
+my $guc_on_master = $master->safe_psql('postgres', 'show enable_csn_snapshot');
+is($guc_on_master, 'on', "GUC on master");
+
+my $guc_on_standby = $standby->safe_psql('postgres', 'show enable_csn_snapshot');
+is($guc_on_standby, 'on', "GUC on standby");
+
+$master->append_conf('postgresql.conf', 'enable_csn_snapshot = off');
+$master->restart;
+
+$guc_on_master = $master->safe_psql('postgres', 'show enable_csn_snapshot');
+is($guc_on_master, 'off', "GUC off master");
+
+$guc_on_standby = $standby->safe_psql('postgres', 'show enable_csn_snapshot');
+is($guc_on_standby, 'on', "GUC on standby");
+
+# Consume a large number of transactions so that xids skip past csnlog pages
+for my $i (1 .. 4096) #4096
+{
+ $master->safe_psql('postgres', "insert into t1 values(1,$i)");
+}
+$master->safe_psql('postgres', "select pg_sleep(2)");
+$master->append_conf('postgresql.conf', 'enable_csn_snapshot = on');
+$master->restart;
+
+my $count_standby = $standby->safe_psql('postgres', 'select count(*) from t1');
+is($count_standby, '4096', "Ok for switch xid-base > csn-base"); #4096
+
+# Consume a large number of transactions so that xids skip past csnlog pages
+for my $i (1 .. 4096) #4096
+{
+ $master->safe_psql('postgres', "insert into t1 values(1,$i)");
+}
+$master->safe_psql('postgres', "select pg_sleep(2)");
+
+$master->append_conf('postgresql.conf', 'enable_csn_snapshot = off');
+$master->restart;
+
+$count_standby = $standby->safe_psql('postgres', 'select count(*) from t1');
+is($count_standby, '8192', "Ok for switch csn-base > xid-base"); #8192
\ No newline at end of file
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index da2e5aa38b..cc169a1999 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -73,7 +73,7 @@ select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
enable_bitmapscan | on
- enable_csn_snapshot | on
+ enable_csn_snapshot | off
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on