tracking commit timestamps

Started by Alvaro Herreraabout 12 years ago157 messages
#1Alvaro Herrera
alvherre@2ndquadrant.com
1 attachment(s)

Hi,

There has been some interest in keeping track of timestamp of
transaction commits. This patch implements that.

There are some seemingly curious choices here. First, this module can
be disabled, and in fact it's turned off by default. At startup, we
verify whether it's enabled, and create the necessary SLRU segments if
so. And if the server is started with this disabled, we set the oldest
value we know about to avoid trying to read the commit TS of
transactions of which we didn't keep record. The ability to turn this
off is there to avoid imposing the overhead on systems that don't need
this feature.

Another thing of note is that we allow for some extra data alongside the
timestamp proper. This might be useful for a replication system that
wants to keep track of the origin node ID of a committed transaction,
for example. Exactly what will we do with the bit space we have is
unclear, so I have kept it generic and called it "commit extra data".

This offers the chance for outside modules to set the commit TS of a
transaction; there is support for WAL-logging such values. But the core
user of the feature (RecordTransactionCommit) doesn't use it, because
xact.c's WAL logging itself is enough. For systems that are replicating
transactions from remote nodes, it is useful.

We also keep track of the latest committed transaction. This is
supposed to be useful to calculate replication lag.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

committs.patchtext/x-diff; charset=us-asciiDownload
*** a/contrib/pg_xlogdump/rmgrdesc.c
--- b/contrib/pg_xlogdump/rmgrdesc.c
***************
*** 9,14 ****
--- 9,15 ----
  #include "postgres.h"
  
  #include "access/clog.h"
+ #include "access/committs.h"
  #include "access/gin.h"
  #include "access/gist_private.h"
  #include "access/hash.h"
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2257,2262 **** include 'filename'
--- 2257,2277 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
+       <term><varname>track_commit_timestamp</varname> (<type>bool</type>)</term>
+       <indexterm>
+        <primary><varname>track_commit_timestamp</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Record commit time of transactions.  This parameter
+         can only be set in
+         the <filename>postgresql.conf</> file or on the server command line.
+         The default value is off.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       </variablelist>
      </sect2>
  
*** a/src/backend/access/rmgrdesc/Makefile
--- b/src/backend/access/rmgrdesc/Makefile
***************
*** 8,14 **** subdir = src/backend/access/rmgrdesc
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
  	   mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
  	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
  
--- 8,15 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o \
!        heapdesc.o \
  	   mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
  	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
  
*** /dev/null
--- b/src/backend/access/rmgrdesc/committsdesc.c
***************
*** 0 ****
--- 1,53 ----
+ /*-------------------------------------------------------------------------
+  *
+  * committsdesc.c
+  *    rmgr descriptor routines for access/transam/committs.c
+  *
+  * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *    src/backend/access/rmgrdesc/committsdesc.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+ 
+ #include "access/committs.h"
+ #include "utils/timestamp.h"
+ 
+ 
+ void
+ committs_desc(StringInfo buf, uint8 xl_info, char *rec)
+ {
+ 	uint8		info = xl_info & ~XLR_INFO_MASK;
+ 
+ 	if (info == COMMITTS_ZEROPAGE)
+ 	{
+ 		int			pageno;
+ 
+ 		memcpy(&pageno, rec, sizeof(int));
+ 		appendStringInfo(buf, "zeropage: %d", pageno);
+ 	}
+ 	else if (info == COMMITTS_TRUNCATE)
+ 	{
+ 		int			pageno;
+ 
+ 		memcpy(&pageno, rec, sizeof(int));
+ 		appendStringInfo(buf, "truncate before: %d", pageno);
+ 	}
+ 	else if (info == COMMITTS_SETTS)
+ 	{
+ 		xl_committs_set *xlrec = (xl_committs_set *) rec;
+ 		int		i;
+ 
+ 		appendStringInfo(buf, "set committs %s for: %u",
+ 						 timestamptz_to_str(xlrec->timestamp),
+ 						 xlrec->mainxid);
+ 		for (i = 0; i < xlrec->nsubxids; i++)
+ 			appendStringInfo(buf, ", %u", xlrec->subxids[i]);
+ 	}
+ 	else
+ 		appendStringInfo(buf, "UNKNOWN");
+ }
*** a/src/backend/access/rmgrdesc/xlogdesc.c
--- b/src/backend/access/rmgrdesc/xlogdesc.c
***************
*** 44,50 **** xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
  		appendStringInfo(buf, "checkpoint: redo %X/%X; "
  						 "tli %u; prev tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
  						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
! 						 "oldest running xid %u; %s",
  				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
  						 checkpoint->ThisTimeLineID,
  						 checkpoint->PrevTimeLineID,
--- 44,50 ----
  		appendStringInfo(buf, "checkpoint: redo %X/%X; "
  						 "tli %u; prev tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
  						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
! 						 "oldest CommitTs xid: %u; oldest running xid %u; %s",
  				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
  						 checkpoint->ThisTimeLineID,
  						 checkpoint->PrevTimeLineID,
***************
*** 57,62 **** xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
--- 57,63 ----
  						 checkpoint->oldestXidDB,
  						 checkpoint->oldestMulti,
  						 checkpoint->oldestMultiDB,
+ 						 checkpoint->oldestCommitTs,
  						 checkpoint->oldestActiveXid,
  				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
  	}
*** a/src/backend/access/transam/Makefile
--- b/src/backend/access/transam/Makefile
***************
*** 14,20 **** include $(top_builddir)/src/Makefile.global
  
  OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
  	timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
! 	xlogreader.o xlogutils.o
  
  include $(top_srcdir)/src/backend/common.mk
  
--- 14,20 ----
  
  OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
  	timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
! 	xlogreader.o xlogutils.o committs.o
  
  include $(top_srcdir)/src/backend/common.mk
  
*** a/src/backend/access/transam/clog.c
--- b/src/backend/access/transam/clog.c
***************
*** 152,159 **** TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
  		   status == TRANSACTION_STATUS_ABORTED);
  
  	/*
! 	 * See how many subxids, if any, are on the same page as the parent, if
! 	 * any.
  	 */
  	for (i = 0; i < nsubxids; i++)
  	{
--- 152,158 ----
  		   status == TRANSACTION_STATUS_ABORTED);
  
  	/*
! 	 * See how many subxids, if any, are on the same page as the parent.
  	 */
  	for (i = 0; i < nsubxids; i++)
  	{
*** /dev/null
--- b/src/backend/access/transam/committs.c
***************
*** 0 ****
--- 1,819 ----
+ /*-------------------------------------------------------------------------
+  *
+  * committs.c
+  *		PostgreSQL commit timestamp manager
+  *
+  * This module is a pg_clog-like system that stores the commit timestamp
+  * for each transaction.
+  *
+  * XLOG interactions: this module generates an XLOG record whenever a new
+  * CommitTs page is initialized to zeroes.  Also, one XLOG record is
+  * generated for setting of values when the caller requests it; this allows
+  * us to support values coming from places other than transaction commit.
+  * Other writes of CommitTS come from recording of transaction commit in
+  * xact.c, which generates its own XLOG records for these events and will
+  * re-perform the status update on redo; so we need make no additional XLOG
+  * entry here.
+  *
+  * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/backend/access/transam/committs.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+ 
+ #include "access/committs.h"
+ #include "access/htup_details.h"
+ #include "access/slru.h"
+ #include "access/transam.h"
+ #include "catalog/pg_type.h"
+ #include "funcapi.h"
+ #include "miscadmin.h"
+ #include "pg_trace.h"
+ #include "utils/builtins.h"
+ #include "utils/snapmgr.h"
+ #include "utils/timestamp.h"
+ 
+ /*
+  * Defines for CommitTs page sizes.  A page is the same BLCKSZ as is used
+  * everywhere else in Postgres.
+  *
+  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+  * CommitTs page numbering also wraps around at
+  * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE, and CommitTs segment numbering at
+  * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+  * explicit notice of that fact in this module, except when comparing segment
+  * and page numbers in TruncateCommitTs (see CommitTsPagePrecedes).
+  */
+ 
+ /* We need 8+4 bytes per xact */
+ typedef struct CommitTimestampEntry
+ {
+ 	TimestampTz		time;
+ 	CommitExtraData	extra;
+ } CommitTimestampEntry;
+ 
+ #define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, extra) + \
+ 									sizeof(CommitExtraData))
+ 
+ #define COMMITTS_XACTS_PER_PAGE \
+ 	(BLCKSZ / SizeOfCommitTimestampEntry)
+ 
+ #define TransactionIdToCTsPage(xid)	\
+ 	((xid) / (TransactionId) COMMITTS_XACTS_PER_PAGE)
+ #define TransactionIdToCTsEntry(xid)	\
+ 	((xid) % (TransactionId) COMMITTS_XACTS_PER_PAGE)
+ 
+ /*
+  * Link to shared-memory data structures for CLOG control
+  */
+ static SlruCtlData CommitTsCtlData;
+ 
+ #define CommitTsCtl (&CommitTsCtlData)
+ 
+ /*
+  * We keep a cache of the last value set in shared memory.  This is protected
+  * by CommitTsLock.
+  */
+ typedef struct CommitTimestampShared
+ {
+ 	TransactionId	xidLastCommit;
+ 	CommitTimestampEntry dataLastCommit;
+ } CommitTimestampShared;
+ 
+ CommitTimestampShared	*commitTsShared;
+ 
+ 
+ /* GUC variables */
+ bool	commit_ts_enabled;
+ 
+ static void SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+ 					 TransactionId *subxids, TimestampTz committs,
+ 					 CommitExtraData extra, int pageno);
+ static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz committs,
+ 						  CommitExtraData extra, int slotno);
+ static int	ZeroCommitTsPage(int pageno, bool writeXlog);
+ static bool CommitTsPagePrecedes(int page1, int page2);
+ static void WriteZeroPageXlogRec(int pageno);
+ static void WriteTruncateXlogRec(int pageno);
+ static void WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+ 						 TransactionId *subxids, TimestampTz timestamp,
+ 						 CommitExtraData data);
+ 
+ 
+ /*
+  * TransactionTreeSetCommitTimestamp
+  *
+  * Record the final commit timestamp of transaction entries in the commit log
+  * for a transaction and its subtransaction tree, as efficiently as possible.
+  *
+  * xid is the top level transaction id.
+  *
+  * subxids is an array of xids of length nsubxids, representing subtransactions
+  * in the tree of xid. In various cases nsubxids may be zero.
+  *
+  * The do_xlog parameter tells us whether to include a XLog record of this
+  * or not.  Normal path through RecordTransactionCommit() will be related
+  * to a transaction commit XLog record, and so should pass "false" here.
+  * Other callers probably want to pass true, so that the given values persist
+  * in case of crashes.
+  */
+ void
+ TransactionTreeSetCommitTimestamp(TransactionId xid, int nsubxids,
+ 								  TransactionId *subxids, TimestampTz timestamp,
+ 								  CommitExtraData extra, bool do_xlog)
+ {
+ 	int			i;
+ 	TransactionId headxid;
+ 
+ 	if (!commit_ts_enabled)
+ 		return;
+ 
+ 	/*
+ 	 * Comply with the WAL-before-data rule: if caller specified it wants
+ 	 * this value to be recorded in WAL, do so before touching the data.
+ 	 */
+ 	if (do_xlog)
+ 		WriteSetTimestampXlogRec(xid, nsubxids, subxids, timestamp, extra);
+ 
+ 	/*
+ 	 * We split the xids to set the timestamp to in groups belonging to the
+ 	 * same SLRU page; the first element in each such set is its head.  The
+ 	 * first group has the main XID as the head; subsequent sets use the
+ 	 * first subxid not on the previous page as head.  This way, we only have
+ 	 * to lock/modify each SLRU page once.
+ 	 */
+ 	for (i = 0, headxid = xid;;)
+ 	{
+ 		int			pageno = TransactionIdToCTsPage(headxid);
+ 		int			j;
+ 
+ 		for (j = i; j < nsubxids; j++)
+ 		{
+ 			if (TransactionIdToCTsPage(subxids[j]) != pageno)
+ 				break;
+ 		}
+ 		/* subxids[i..j] are on the same page as the head */
+ 
+ 		SetXidCommitTsInPage(headxid, j - i, subxids + i, timestamp, extra,
+ 							 pageno);
+ 
+ 		/* if we wrote out all subxids, we're done. */
+ 		if (j + 1 >= nsubxids)
+ 			break;
+ 
+ 		/*
+ 		 * Set the new head and skip over it, as well as over the subxids
+ 		 * we just wrote.
+ 		 */
+ 		headxid = subxids[j];
+ 		i += j - i + 1;
+ 	}
+ 
+ 	/*
+ 	 * Update the cached value in shared memory
+ 	 */
+ 	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+ 	commitTsShared->xidLastCommit = xid;
+ 	commitTsShared->dataLastCommit.time = timestamp;
+ 	commitTsShared->dataLastCommit.extra = extra;
+ 	LWLockRelease(CommitTsLock);
+ }
+ 
+ /*
+  * Record the commit timestamp of transaction entries in the commit log for all
+  * entries on a single page.  Atomic only on this page.
+  */
+ static void
+ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+ 					 TransactionId *subxids, TimestampTz committs,
+ 					 CommitExtraData extra, int pageno)
+ {
+ 	int			slotno;
+ 	int			i;
+ 
+ 	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+ 
+ 	slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
+ 
+ 	TransactionIdSetCommitTs(xid, committs, extra, slotno);
+ 	for (i = 0; i < nsubxids; i++)
+ 		TransactionIdSetCommitTs(subxids[i], committs, extra, slotno);
+ 
+ 	CommitTsCtl->shared->page_dirty[slotno] = true;
+ 
+ 	LWLockRelease(CommitTsControlLock);
+ }
+ 
+ /*
+  * Sets the commit timestamp of a single transaction.
+  *
+  * Must be called with CommitTsControlLock held
+  */
+ static void
+ TransactionIdSetCommitTs(TransactionId xid, TimestampTz committs,
+ 						 CommitExtraData extra, int slotno)
+ {
+ 	int			entryno = TransactionIdToCTsEntry(xid);
+ 	CommitTimestampEntry *entry;
+ 
+ 	entry = (CommitTimestampEntry *)
+ 		(CommitTsCtl->shared->page_buffer[slotno] +
+ 		 SizeOfCommitTimestampEntry * entryno);
+ 
+ 	entry->time = committs;
+ 	entry->extra = extra;
+ }
+ 
+ /*
+  * Interrogate the commit timestamp of a transaction.
+  */
+ void
+ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
+ 							 CommitExtraData *data)
+ {
+ 	int			pageno = TransactionIdToCTsPage(xid);
+ 	int			entryno = TransactionIdToCTsEntry(xid);
+ 	int			slotno;
+ 	CommitTimestampEntry *entry;
+ 	TransactionId oldestCommitTs;
+ 
+ 	/* Return empty if module not enabled */
+ 	if (!commit_ts_enabled)
+ 	{
+ 		if (ts)
+ 			*ts = InvalidTransactionId;
+ 		if (data)
+ 			*data = (CommitExtraData) 0;
+ 		return;
+ 	}
+ 
+ 	/* Also return empty if the requested value is older than what we have */
+ 	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+ 	oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+ 	LWLockRelease(CommitTsControlLock);
+ 
+ 	if (!TransactionIdIsValid(oldestCommitTs) ||
+ 		TransactionIdPrecedes(xid, oldestCommitTs))
+ 	{
+ 		if (ts)
+ 			*ts = InvalidTransactionId;
+ 		if (data)
+ 			*data = (CommitExtraData) 0;
+ 		return;
+ 	}
+ 
+ 	/*
+ 	 * Use an unlocked atomic read on our cached value in shared memory;
+ 	 * if it's a hit, acquire a lock and read the data, after verifying
+ 	 * that it's still what we initially read.  Otherwise, fall through
+ 	 * to read from SLRU.
+ 	 */
+ 	if (commitTsShared->xidLastCommit == xid)
+ 	{
+ 		LWLockAcquire(CommitTsLock, LW_SHARED);
+ 		if (commitTsShared->xidLastCommit == xid)
+ 		{
+ 			if (ts)
+ 				*ts = commitTsShared->dataLastCommit.time;
+ 			if (data)
+ 				*data = commitTsShared->dataLastCommit.extra;
+ 			LWLockRelease(CommitTsLock);
+ 			return;
+ 		}
+ 		LWLockRelease(CommitTsLock);
+ 	}
+ 
+ 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+ 	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+ 	entry = (CommitTimestampEntry *)
+ 		(CommitTsCtl->shared->page_buffer[slotno] +
+ 		 SizeOfCommitTimestampEntry * entryno);
+ 
+ 	if (ts)
+ 		*ts = entry->time;
+ 
+ 	if (data)
+ 		*data = entry->extra;
+ 
+ 	LWLockRelease(CommitTsControlLock);
+ }
+ 
+ /*
+  * Return the Xid of the latest committed transaction.  (As far as this module
+  * is concerned, anyway; it's up to the caller to ensure the value is useful
+  * for its purposes.)
+  *
+  * ts and extra are filled with the corresponding data; they can be passed
+  * as NULL if not wanted.
+  */
+ TransactionId
+ GetLatestCommitTimestampData(TimestampTz *ts, CommitExtraData *extra)
+ {
+ 	TransactionId	xid;
+ 
+ 	/* Return empty if module not enabled */
+ 	if (!commit_ts_enabled)
+ 	{
+ 		if (ts)
+ 			*ts = InvalidTransactionId;
+ 		if (extra)
+ 			*extra = (CommitExtraData) 0;
+ 		return InvalidTransactionId;
+ 	}
+ 
+ 	LWLockAcquire(CommitTsLock, LW_SHARED);
+ 	xid = commitTsShared->xidLastCommit;
+ 	if (ts)
+ 		*ts = commitTsShared->dataLastCommit.time;
+ 	if (extra)
+ 		*extra = commitTsShared->dataLastCommit.extra;
+ 	LWLockRelease(CommitTsLock);
+ 
+ 	return xid;
+ }
+ 
+ /*
+  * SQL-callable wrapper to obtain commit time of a transaction
+  */
+ PG_FUNCTION_INFO_V1(pg_get_transaction_committime);
+ Datum
+ pg_get_transaction_committime(PG_FUNCTION_ARGS)
+ {
+ 	TransactionId	xid = PG_GETARG_UINT32(0);
+ 	TimestampTz		committs;
+ 
+ 	TransactionIdGetCommitTsData(xid, &committs, NULL);
+ 
+ 	PG_RETURN_TIMESTAMPTZ(committs);
+ }
+ 
+ PG_FUNCTION_INFO_V1(pg_get_transaction_extradata);
+ Datum
+ pg_get_transaction_extradata(PG_FUNCTION_ARGS)
+ {
+ 	TransactionId	xid = PG_GETARG_UINT32(0);
+ 	CommitExtraData	data;
+ 
+ 	TransactionIdGetCommitTsData(xid, NULL, &data);
+ 
+ 	PG_RETURN_INT32(data);
+ }
+ 
+ PG_FUNCTION_INFO_V1(pg_get_transaction_committime_data);
+ Datum
+ pg_get_transaction_committime_data(PG_FUNCTION_ARGS)
+ {
+ 	TransactionId	xid = PG_GETARG_UINT32(0);
+ 	TimestampTz		committs;
+ 	CommitExtraData	data;
+ 	Datum       values[2];
+ 	bool        nulls[2];
+ 	TupleDesc   tupdesc;
+ 	HeapTuple	htup;
+ 
+ 	/*
+ 	 * Construct a tuple descriptor for the result row.  This must match this
+ 	 * function's pg_proc entry!
+ 	 */
+ 	tupdesc = CreateTemplateTupleDesc(2, false);
+ 	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "timestamp",
+ 					   TIMESTAMPTZOID, -1, 0);
+ 	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "extra",
+ 					   INT4OID, -1, 0);
+ 	tupdesc = BlessTupleDesc(tupdesc);
+ 
+ 	/* and construct a tuple with our data */
+ 	TransactionIdGetCommitTsData(xid, &committs, &data);
+ 
+ 	values[0] = TimestampTzGetDatum(committs);
+ 	nulls[0] = false;
+ 
+ 	values[1] = Int32GetDatum(data);
+ 	nulls[1] = false;
+ 
+ 	htup = heap_form_tuple(tupdesc, values, nulls);
+ 
+ 	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
+ 
+ PG_FUNCTION_INFO_V1(pg_get_latest_transaction_committime_data);
+ Datum
+ pg_get_latest_transaction_committime_data(PG_FUNCTION_ARGS)
+ {
+ 	TransactionId	xid;
+ 	TimestampTz		committs;
+ 	CommitExtraData	data;
+ 	Datum       values[3];
+ 	bool        nulls[3];
+ 	TupleDesc   tupdesc;
+ 	HeapTuple	htup;
+ 
+ 	/*
+ 	 * Construct a tuple descriptor for the result row.  This must match this
+ 	 * function's pg_proc entry!
+ 	 */
+ 	tupdesc = CreateTemplateTupleDesc(3, false);
+ 	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "xid",
+ 					   XIDOID, -1, 0);
+ 	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "timestamp",
+ 					   TIMESTAMPTZOID, -1, 0);
+ 	TupleDescInitEntry(tupdesc, (AttrNumber) 3, "extra",
+ 					   INT4OID, -1, 0);
+ 	tupdesc = BlessTupleDesc(tupdesc);
+ 
+ 	/* and construct a tuple with our data */
+ 	xid = GetLatestCommitTimestampData(&committs, &data);
+ 
+ 	values[0] = TransactionIdGetDatum(xid);
+ 	nulls[0] = false;
+ 
+ 	values[1] = TimestampTzGetDatum(committs);
+ 	nulls[1] = false;
+ 
+ 	values[2] = Int32GetDatum(data);
+ 	nulls[2] = false;
+ 
+ 	htup = heap_form_tuple(tupdesc, values, nulls);
+ 
+ 	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
+ 
+ /*
+  * Number of shared CommitTS buffers.
+  *
+  * We use a very similar logic as for the number of CLOG buffers; see comments
+  * in CLOGShmemBuffers.
+  */
+ Size
+ CommitTsShmemBuffers(void)
+ {
+ 	return Min(16, Max(4, NBuffers / 1024));
+ }
+ 
+ /*
+  * Initialization of shared memory for CommitTs
+  */
+ Size
+ CommitTsShmemSize(void)
+ {
+ 	return SimpleLruShmemSize(CommitTsShmemBuffers(), 0) +
+ 		sizeof(CommitTimestampShared);
+ }
+ 
+ void
+ CommitTsShmemInit(void)
+ {
+ 	bool	found;
+ 
+ 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
+ 	SimpleLruInit(CommitTsCtl, "CommitTs Ctl", CommitTsShmemBuffers(), 0,
+ 				  CommitTsControlLock, "pg_committs");
+ 
+ 	commitTsShared = ShmemInitStruct("CommitTs shared",
+ 									 sizeof(CommitTimestampShared),
+ 									 &found);
+ 
+ 	if (!IsUnderPostmaster)
+ 	{
+ 		Assert(!found);
+ 
+ 		commitTsShared->xidLastCommit = InvalidTransactionId;
+ 		commitTsShared->dataLastCommit.time = 0;
+ 		commitTsShared->dataLastCommit.extra = 0;
+ 	}
+ 	else
+ 		Assert(found);
+ }
+ 
+ /*
+  * This function must be called ONCE on system install.
+  *
+  * (The CommitTs directory is assumed to have been created by initdb, and
+  * CommitTsShmemInit must have been called already.)
+  */
+ void
+ BootStrapCommitTs(void)
+ {
+ 	/*
+ 	 * Nothing to do here at present, unlike most other SLRU modules; segments
+ 	 * are created when the server is started with this module enabled.
+ 	 * See StartupCommitTs.
+ 	 */
+ }
+ 
+ /*
+  * Initialize (or reinitialize) a page of CommitTs to zeroes.
+  * If writeXlog is TRUE, also emit an XLOG record saying we did this.
+  *
+  * The page is not actually written, just set up in shared memory.
+  * The slot number of the new page is returned.
+  *
+  * Control lock must be held at entry, and will be held at exit.
+  */
+ static int
+ ZeroCommitTsPage(int pageno, bool writeXlog)
+ {
+ 	int			slotno;
+ 
+ 	slotno = SimpleLruZeroPage(CommitTsCtl, pageno);
+ 
+ 	if (writeXlog)
+ 		WriteZeroPageXlogRec(pageno);
+ 
+ 	return slotno;
+ }
+ 
+ /*
+  * This must be called ONCE during postmaster or standalone-backend startup,
+  * after StartupXLOG has initialized ShmemVariableCache->nextXid.
+  *
+  * This is in charge of creating the currently active segment, if it's not
+  * already there.  The reason for this is that the server might have been
+  * running with this module disabled for a while and thus might have skipped
+  * the normal creation point.
+  */
+ void
+ StartupCommitTs(void)
+ {
+ 	TransactionId xid = ShmemVariableCache->nextXid;
+ 	int			pageno = TransactionIdToCTsPage(xid);
+ 	SlruCtl		ctl = CommitTsCtl;
+ 
+ 	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+ 
+ 	/*
+ 	 * Initialize our idea of the latest page number.
+ 	 */
+ 	CommitTsCtl->shared->latest_page_number = pageno;
+ 
+ 	/*
+ 	 * If this module is not currently enabled, make sure we don't hand back
+ 	 * possibly-invalid data; also remove segments of old data.
+ 	 */
+ 	if (!commit_ts_enabled)
+ 	{
+ 		ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
+ 		LWLockRelease(CommitTsControlLock);
+ 
+ 		TruncateCommitTs(ReadNewTransactionId());
+ 
+ 		return;
+ 	}
+ 
+ 	/*
+ 	 * If CommitTs is enabled, but it wasn't in the previous server run, we
+ 	 * need to set the oldest value to the next Xid; that way, we will not try
+ 	 * to read data that might not have been set.
+ 	 *
+ 	 * XXX does this have a problem if a server is started with commitTs
+ 	 * enabled, then started with commitTs disabled, then restarted with it
+ 	 * enabled again?  It doesn't look like it does, because there should be a
+ 	 * checkpoint that sets the value to InvalidTransactionId at end of
+ 	 * recovery; and so any chance of injecting new transactions without
+ 	 * CommitTs values would occur after the oldestCommitTs has been set to
+ 	 * Invalid temporarily.
+ 	 */
+ 	if (ShmemVariableCache->oldestCommitTs == InvalidTransactionId)
+ 		ShmemVariableCache->oldestCommitTs = ReadNewTransactionId();
+ 
+ 	/* Finally, create the current segment file, if necessary */
+ 	if (!SimpleLruDoesPhysicalPageExist(ctl, pageno))
+ 	{
+ 		int		slotno;
+ 
+ 		slotno = ZeroCommitTsPage(pageno, false);
+ 		SimpleLruWritePage(CommitTsCtl, slotno);
+ 		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+ 	}
+ 
+ 	LWLockRelease(CommitTsControlLock);
+ }
+ 
+ /*
+  * This must be called ONCE during postmaster or standalone-backend shutdown
+  */
+ void
+ ShutdownCommitTs(void)
+ {
+ 	/* Flush dirty CommitTs pages to disk */
+ 	SimpleLruFlush(CommitTsCtl, false);
+ }
+ 
+ /*
+  * Perform a checkpoint --- either during shutdown, or on-the-fly
+  */
+ void
+ CheckPointCommitTs(void)
+ {
+ 	/* Flush dirty CommitTs pages to disk */
+ 	SimpleLruFlush(CommitTsCtl, true);
+ }
+ 
+ /*
+  * Make sure that CommitTs has room for a newly-allocated XID.
+  *
+  * NB: this is called while holding XidGenLock.  We want it to be very fast
+  * most of the time; even when it's not so fast, no actual I/O need happen
+  * unless we're forced to write out a dirty CommitTs or xlog page to make room
+  * in shared memory.
+  */
+ void
+ ExtendCommitTs(TransactionId newestXact)
+ {
+ 	int			pageno;
+ 
+ 	/* nothing to do if module not enabled */
+ 	if (!commit_ts_enabled)
+ 		return;
+ 
+ 	/*
+ 	 * No work except at first XID of a page.  But beware: just after
+ 	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+ 	 */
+ 	if (TransactionIdToCTsEntry(newestXact) != 0 &&
+ 		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+ 		return;
+ 
+ 	pageno = TransactionIdToCTsPage(newestXact);
+ 
+ 	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+ 
+ 	/* Zero the page and make an XLOG entry about it */
+ 	ZeroCommitTsPage(pageno, !InRecovery);
+ 
+ 	LWLockRelease(CommitTsControlLock);
+ }
+ 
+ /*
+  * Remove all CommitTs segments before the one holding the passed
+  * transaction ID
+  *
+  * Note that we don't need to flush XLOG here.
+  */
+ void
+ TruncateCommitTs(TransactionId oldestXact)
+ {
+ 	int			cutoffPage;
+ 
+ 	/*
+ 	 * The cutoff point is the start of the segment containing oldestXact. We
+ 	 * pass the *page* containing oldestXact to SimpleLruTruncate.
+ 	 */
+ 	cutoffPage = TransactionIdToCTsPage(oldestXact);
+ 
+ 	/* Check to see if there's any files that could be removed */
+ 	if (!SlruScanDirectory(CommitTsCtl, SlruScanDirCbReportPresence, &cutoffPage))
+ 		return;					/* nothing to remove */
+ 
+ 	/* Write XLOG record */
+ 	WriteTruncateXlogRec(cutoffPage);
+ 
+ 	/* Now we can remove the old CommitTs segment(s) */
+ 	SimpleLruTruncate(CommitTsCtl, cutoffPage);
+ }
+ 
+ /*
+  * Set the earliest value for which commit TS can be consulted.
+  */
+ void
+ SetCommitTsLimit(TransactionId oldestXact)
+ {
+ 	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+ 	ShmemVariableCache->oldestCommitTs = oldestXact;
+ 	LWLockRelease(CommitTsControlLock);
+ }
+ 
+ /*
+  * Decide which of two CLOG page numbers is "older" for truncation purposes.
+  *
+  * We need to use comparison of TransactionIds here in order to do the right
+  * thing with wraparound XID arithmetic.  However, if we are asked about
+  * page number zero, we don't want to hand InvalidTransactionId to
+  * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+  * offset both xids by FirstNormalTransactionId to avoid that.
+  */
+ static bool
+ CommitTsPagePrecedes(int page1, int page2)
+ {
+ 	TransactionId xid1;
+ 	TransactionId xid2;
+ 
+ 	xid1 = ((TransactionId) page1) * COMMITTS_XACTS_PER_PAGE;
+ 	xid1 += FirstNormalTransactionId;
+ 	xid2 = ((TransactionId) page2) * COMMITTS_XACTS_PER_PAGE;
+ 	xid2 += FirstNormalTransactionId;
+ 
+ 	return TransactionIdPrecedes(xid1, xid2);
+ }
+ 
+ 
+ /*
+  * Write a ZEROPAGE xlog record
+  */
+ static void
+ WriteZeroPageXlogRec(int pageno)
+ {
+ 	XLogRecData rdata;
+ 
+ 	rdata.data = (char *) (&pageno);
+ 	rdata.len = sizeof(int);
+ 	rdata.buffer = InvalidBuffer;
+ 	rdata.next = NULL;
+ 	(void) XLogInsert(RM_COMMITTS_ID, COMMITTS_ZEROPAGE, &rdata);
+ }
+ 
+ /*
+  * Write a TRUNCATE xlog record
+  */
+ static void
+ WriteTruncateXlogRec(int pageno)
+ {
+ 	XLogRecData rdata;
+ 
+ 	rdata.data = (char *) (&pageno);
+ 	rdata.len = sizeof(int);
+ 	rdata.buffer = InvalidBuffer;
+ 	rdata.next = NULL;
+ 	XLogInsert(RM_COMMITTS_ID, COMMITTS_TRUNCATE, &rdata);
+ }
+ 
+ /*
+  * Write a SETTS xlog record
+  */
+ static void
+ WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+ 						 TransactionId *subxids, TimestampTz timestamp,
+ 						 CommitExtraData data)
+ {
+ 	XLogRecData	rdata;
+ 	xl_committs_set	record;
+ 
+ 	record.timestamp = timestamp;
+ 	record.data = data;
+ 	record.mainxid = mainxid;
+ 	record.nsubxids = nsubxids;
+ 	memcpy(record.subxids, subxids, sizeof(TransactionId) * nsubxids);
+ 
+ 	rdata.data = (char *) &record;
+ 	rdata.len = offsetof(xl_committs_set, subxids) +
+ 		nsubxids * sizeof(TransactionId);
+ 	rdata.buffer = InvalidBuffer;
+ 	rdata.next = NULL;
+ 	XLogInsert(RM_COMMITTS_ID, COMMITTS_SETTS, &rdata);
+ }
+ 
+ 
+ /*
+  * CommitTS resource manager's routines
+  */
+ void
+ committs_redo(XLogRecPtr lsn, XLogRecord *record)
+ {
+ 	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+ 
+ 	/* Backup blocks are not used in committs records */
+ 	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+ 
+ 	if (info == COMMITTS_ZEROPAGE)
+ 	{
+ 		int			pageno;
+ 		int			slotno;
+ 
+ 		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ 
+ 		LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+ 
+ 		slotno = ZeroCommitTsPage(pageno, false);
+ 		SimpleLruWritePage(CommitTsCtl, slotno);
+ 		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+ 
+ 		LWLockRelease(CommitTsControlLock);
+ 	}
+ 	else if (info == COMMITTS_TRUNCATE)
+ 	{
+ 		int			pageno;
+ 
+ 		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ 
+ 		/*
+ 		 * During XLOG replay, latest_page_number isn't set up yet; insert a
+ 		 * suitable value to bypass the sanity test in SimpleLruTruncate.
+ 		 */
+ 		CommitTsCtl->shared->latest_page_number = pageno;
+ 
+ 		SimpleLruTruncate(CommitTsCtl, pageno);
+ 	}
+ 	else if (info == COMMITTS_SETTS)
+ 	{
+ 		xl_committs_set *setts = (xl_committs_set *) XLogRecGetData(record);
+ 
+ 		TransactionTreeSetCommitTimestamp(setts->mainxid, setts->nsubxids,
+ 										  setts->subxids, setts->timestamp,
+ 										  setts->data, false);
+ 	}
+ 	else
+ 		elog(PANIC, "committs_redo: unknown op code %u", info);
+ }
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***************
*** 8,13 ****
--- 8,14 ----
  #include "postgres.h"
  
  #include "access/clog.h"
+ #include "access/committs.h"
  #include "access/gin.h"
  #include "access/gist_private.h"
  #include "access/hash.h"
*** a/src/backend/access/transam/varsup.c
--- b/src/backend/access/transam/varsup.c
***************
*** 14,19 ****
--- 14,20 ----
  #include "postgres.h"
  
  #include "access/clog.h"
+ #include "access/committs.h"
  #include "access/subtrans.h"
  #include "access/transam.h"
  #include "access/xact.h"
***************
*** 157,165 **** GetNewTransactionId(bool isSubXact)
  	 * XID before we zero the page.  Fortunately, a page of the commit log
  	 * holds 32K or more transactions, so we don't have to do this very often.
  	 *
! 	 * Extend pg_subtrans too.
  	 */
  	ExtendCLOG(xid);
  	ExtendSUBTRANS(xid);
  
  	/*
--- 158,167 ----
  	 * XID before we zero the page.  Fortunately, a page of the commit log
  	 * holds 32K or more transactions, so we don't have to do this very often.
  	 *
! 	 * Extend pg_subtrans and pg_committs too.
  	 */
  	ExtendCLOG(xid);
+ 	ExtendCommitTs(xid);
  	ExtendSUBTRANS(xid);
  
  	/*
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 20,25 ****
--- 20,26 ----
  #include <time.h>
  #include <unistd.h>
  
+ #include "access/committs.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
  #include "access/transam.h"
***************
*** 1118,1123 **** RecordTransactionCommit(void)
--- 1119,1132 ----
  	}
  
  	/*
+ 	 * We don't need to log the commit timestamp separately since the commit
+ 	 * record logged above has all the necessary action to set the timestamp
+ 	 * again.
+ 	 */
+ 	TransactionTreeSetCommitTimestamp(xid, nchildren, children,
+ 									  xactStopTimestamp, 0, false);
+ 
+ 	/*
  	 * Check if we want to commit asynchronously.  We can allow the XLOG flush
  	 * to happen asynchronously if synchronous_commit=off, or if the current
  	 * transaction has not performed any WAL-logged operation.	The latter
***************
*** 4563,4568 **** xactGetCommittedChildren(TransactionId **ptr)
--- 4572,4578 ----
   */
  static void
  xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
+ 						  TimestampTz commit_time,
  						  TransactionId *sub_xids, int nsubxacts,
  						  SharedInvalidationMessage *inval_msgs, int nmsgs,
  						  RelFileNode *xnodes, int nrels,
***************
*** 4590,4595 **** xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
--- 4600,4609 ----
  		LWLockRelease(XidGenLock);
  	}
  
+ 	/* Set the transaction commit time */
+ 	TransactionTreeSetCommitTimestamp(xid, nsubxacts, sub_xids,
+ 									  commit_time, 0, false);
+ 
  	if (standbyState == STANDBY_DISABLED)
  	{
  		/*
***************
*** 4709,4715 **** xact_redo_commit(xl_xact_commit *xlrec,
  	/* invalidation messages array follows subxids */
  	inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
  
! 	xact_redo_commit_internal(xid, lsn, subxacts, xlrec->nsubxacts,
  							  inval_msgs, xlrec->nmsgs,
  							  xlrec->xnodes, xlrec->nrels,
  							  xlrec->dbId,
--- 4723,4730 ----
  	/* invalidation messages array follows subxids */
  	inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
  
! 	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
! 							  subxacts, xlrec->nsubxacts,
  							  inval_msgs, xlrec->nmsgs,
  							  xlrec->xnodes, xlrec->nrels,
  							  xlrec->dbId,
***************
*** 4724,4730 **** static void
  xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
  						 TransactionId xid, XLogRecPtr lsn)
  {
! 	xact_redo_commit_internal(xid, lsn, xlrec->subxacts, xlrec->nsubxacts,
  							  NULL, 0,	/* inval msgs */
  							  NULL, 0,	/* relfilenodes */
  							  InvalidOid,		/* dbId */
--- 4739,4746 ----
  xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
  						 TransactionId xid, XLogRecPtr lsn)
  {
! 	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
! 							  xlrec->subxacts, xlrec->nsubxacts,
  							  NULL, 0,	/* inval msgs */
  							  NULL, 0,	/* relfilenodes */
  							  InvalidOid,		/* dbId */
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 22,27 ****
--- 22,28 ----
  #include <unistd.h>
  
  #include "access/clog.h"
+ #include "access/committs.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
  #include "access/timeline.h"
***************
*** 5183,5188 **** BootStrapXLOG(void)
--- 5184,5190 ----
  	checkPoint.oldestXidDB = TemplateDbOid;
  	checkPoint.oldestMulti = FirstMultiXactId;
  	checkPoint.oldestMultiDB = TemplateDbOid;
+ 	checkPoint.oldestCommitTs = InvalidTransactionId;
  	checkPoint.time = (pg_time_t) time(NULL);
  	checkPoint.oldestActiveXid = InvalidTransactionId;
  
***************
*** 5192,5197 **** BootStrapXLOG(void)
--- 5194,5200 ----
  	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
  	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+ 	SetCommitTsLimit(InvalidTransactionId);
  
  	/* Set up the XLOG page header */
  	page->xlp_magic = XLOG_PAGE_MAGIC;
***************
*** 5272,5277 **** BootStrapXLOG(void)
--- 5275,5281 ----
  
  	/* Bootstrap the commit log, too */
  	BootStrapCLOG();
+ 	BootStrapCommitTs();
  	BootStrapSUBTRANS();
  	BootStrapMultiXact();
  
***************
*** 6318,6323 **** StartupXLOG(void)
--- 6322,6330 ----
  	ereport(DEBUG1,
  			(errmsg("oldest MultiXactId: %u, in database %u",
  					checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
+ 	ereport(DEBUG1,
+ 			(errmsg("oldest CommitTs Xid: %u",
+ 					checkPoint.oldestCommitTs)));
  	if (!TransactionIdIsNormal(checkPoint.nextXid))
  		ereport(PANIC,
  				(errmsg("invalid next transaction ID")));
***************
*** 6329,6334 **** StartupXLOG(void)
--- 6336,6342 ----
  	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
  	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+ 	SetCommitTsLimit(checkPoint.oldestCommitTs);
  	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
  	XLogCtl->ckptXid = checkPoint.nextXid;
  
***************
*** 6532,6541 **** StartupXLOG(void)
  			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
  
  			/*
! 			 * Startup commit log and subtrans only. Other SLRUs are not
! 			 * maintained during recovery and need not be started yet.
  			 */
  			StartupCLOG();
  			StartupSUBTRANS(oldestActiveXID);
  
  			/*
--- 6540,6551 ----
  			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
  
  			/*
! 			 * Startup commit log, commit timestamp, and subtrans only. Other
! 			 * SLRUs are not maintained during recovery and need not be started
! 			 * yet.
  			 */
  			StartupCLOG();
+ 			StartupCommitTs();
  			StartupSUBTRANS(oldestActiveXID);
  
  			/*
***************
*** 7191,7196 **** StartupXLOG(void)
--- 7201,7207 ----
  	if (standbyState == STANDBY_DISABLED)
  	{
  		StartupCLOG();
+ 		StartupCommitTs();
  		StartupSUBTRANS(oldestActiveXID);
  	}
  
***************
*** 7759,7764 **** ShutdownXLOG(int code, Datum arg)
--- 7770,7776 ----
  		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
  	}
  	ShutdownCLOG();
+ 	ShutdownCommitTs();
  	ShutdownSUBTRANS();
  	ShutdownMultiXact();
  
***************
*** 8152,8157 **** CreateCheckPoint(int flags)
--- 8164,8173 ----
  	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
  	LWLockRelease(XidGenLock);
  
+ 	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+ 	checkPoint.oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+ 	LWLockRelease(CommitTsControlLock);
+ 
  	/* Increase XID epoch if we've wrapped around since last checkpoint */
  	checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
  	if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
***************
*** 8392,8397 **** static void
--- 8408,8414 ----
  CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
  {
  	CheckPointCLOG();
+ 	CheckPointCommitTs();
  	CheckPointSUBTRANS();
  	CheckPointMultiXact();
  	CheckPointPredicate();
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
***************
*** 23,28 ****
--- 23,29 ----
  #include <math.h>
  
  #include "access/clog.h"
+ #include "access/committs.h"
  #include "access/genam.h"
  #include "access/heapam.h"
  #include "access/htup_details.h"
***************
*** 894,901 **** vac_truncate_clog(TransactionId frozenXID, MultiXactId minMulti)
  		return;
  	}
  
! 	/* Truncate CLOG and Multi to the oldest computed value */
  	TruncateCLOG(frozenXID);
  	TruncateMultiXact(minMulti);
  
  	/*
--- 895,903 ----
  		return;
  	}
  
! 	/* Truncate CLOG, CommitTS and Multi to the oldest computed values */
  	TruncateCLOG(frozenXID);
+ 	TruncateCommitTs(frozenXID);
  	TruncateMultiXact(minMulti);
  
  	/*
***************
*** 906,911 **** vac_truncate_clog(TransactionId frozenXID, MultiXactId minMulti)
--- 908,914 ----
  	 */
  	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
  	MultiXactAdvanceOldest(minMulti, minmulti_datoid);
+ 	SetCommitTsLimit(frozenXID);
  }
  
  
*** a/src/backend/storage/ipc/ipci.c
--- b/src/backend/storage/ipc/ipci.c
***************
*** 15,20 ****
--- 15,21 ----
  #include "postgres.h"
  
  #include "access/clog.h"
+ #include "access/committs.h"
  #include "access/heapam.h"
  #include "access/multixact.h"
  #include "access/nbtree.h"
***************
*** 113,118 **** CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
--- 114,120 ----
  		size = add_size(size, ProcGlobalShmemSize());
  		size = add_size(size, XLOGShmemSize());
  		size = add_size(size, CLOGShmemSize());
+ 		size = add_size(size, CommitTsShmemSize());
  		size = add_size(size, SUBTRANSShmemSize());
  		size = add_size(size, TwoPhaseShmemSize());
  		size = add_size(size, BackgroundWorkerShmemSize());
***************
*** 195,200 **** CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
--- 197,203 ----
  	 */
  	XLOGShmemInit();
  	CLOGShmemInit();
+ 	CommitTsShmemInit();
  	SUBTRANSShmemInit();
  	MultiXactShmemInit();
  	InitBufferPool();
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 46,51 ****
--- 46,52 ----
  #include <signal.h>
  
  #include "access/clog.h"
+ #include "access/committs.h"
  #include "access/subtrans.h"
  #include "access/transam.h"
  #include "access/xact.h"
***************
*** 2692,2697 **** RecordKnownAssignedTransactionIds(TransactionId xid)
--- 2693,2699 ----
  		while (TransactionIdPrecedesOrEquals(next_expected_xid, xid))
  		{
  			ExtendCLOG(next_expected_xid);
+ 			ExtendCommitTs(next_expected_xid);
  			ExtendSUBTRANS(next_expected_xid);
  
  			TransactionIdAdvance(next_expected_xid);
*** a/src/backend/storage/lmgr/lwlock.c
--- b/src/backend/storage/lmgr/lwlock.c
***************
*** 22,27 ****
--- 22,28 ----
  #include "postgres.h"
  
  #include "access/clog.h"
+ #include "access/committs.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
  #include "commands/async.h"
***************
*** 191,196 **** NumLWLocks(void)
--- 192,200 ----
  	/* clog.c needs one per CLOG buffer */
  	numLocks += CLOGShmemBuffers();
  
+ 	/* committs.c needs one per CommitTs buffer */
+ 	numLocks += CommitTsShmemBuffers();
+ 
  	/* subtrans.c needs one per SubTrans buffer */
  	numLocks += NUM_SUBTRANS_BUFFERS;
  
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 26,31 ****
--- 26,32 ----
  #include <syslog.h>
  #endif
  
+ #include "access/committs.h"
  #include "access/gin.h"
  #include "access/transam.h"
  #include "access/twophase.h"
***************
*** 792,797 **** static struct config_bool ConfigureNamesBool[] =
--- 793,807 ----
  		check_bonjour, NULL, NULL
  	},
  	{
+ 		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
+ 			gettext_noop("Collects transaction commit time."),
+ 			NULL
+ 		},
+ 		&commit_ts_enabled,
+ 		false,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"ssl", PGC_POSTMASTER, CONN_AUTH_SECURITY,
  			gettext_noop("Enables SSL connections."),
  			NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 220,225 ****
--- 220,228 ----
  #wal_keep_segments = 0		# in logfile segments, 16MB each; 0 disables
  #wal_sender_timeout = 60s	# in milliseconds; 0 disables
  
+ #track_commit_timestamp = off	# collect timestamp of transaction commit
+ 				# (change requires restart)
+ 
  # - Master Server -
  
  # These settings are ignored on a standby server.
*** a/src/bin/initdb/initdb.c
--- b/src/bin/initdb/initdb.c
***************
*** 187,192 **** const char *subdirs[] = {
--- 187,193 ----
  	"pg_xlog",
  	"pg_xlog/archive_status",
  	"pg_clog",
+ 	"pg_committs",
  	"pg_dynshmem",
  	"pg_notify",
  	"pg_serial",
*** a/src/bin/pg_controldata/pg_controldata.c
--- b/src/bin/pg_controldata/pg_controldata.c
***************
*** 238,243 **** main(int argc, char *argv[])
--- 238,245 ----
  		   ControlFile.checkPointCopy.oldestMulti);
  	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
  		   ControlFile.checkPointCopy.oldestMultiDB);
+ 	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+ 		   ControlFile.checkPointCopy.oldestCommitTs);
  	printf(_("Time of latest checkpoint:            %s\n"),
  		   ckpttime_str);
  	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
*** /dev/null
--- b/src/include/access/committs.h
***************
*** 0 ****
--- 1,61 ----
+ /*
+  * committs.h
+  *
+  * PostgreSQL commit timestamp manager
+  *
+  * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/access/committs.h
+  */
+ #ifndef COMMITTS_H
+ #define COMMITTS_H
+ 
+ #include "access/xlog.h"
+ #include "datatype/timestamp.h"
+ 
+ 
+ extern PGDLLIMPORT bool	commit_ts_enabled;
+ 
+ typedef uint32 CommitExtraData;
+ 
+ extern void TransactionTreeSetCommitTimestamp(TransactionId xid, int nsubxids,
+ 								  TransactionId *subxids,
+ 								  TimestampTz timestamp,
+ 								  CommitExtraData data,
+ 								  bool do_xlog);
+ extern void TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
+ 							 CommitExtraData *data);
+ extern TransactionId GetLatestCommitTimestampData(TimestampTz *ts,
+ 							 CommitExtraData *extra);
+ 
+ extern Size CommitTsShmemBuffers(void);
+ extern Size CommitTsShmemSize(void);
+ extern void CommitTsShmemInit(void);
+ extern void BootStrapCommitTs(void);
+ extern void StartupCommitTs(void);
+ extern void ShutdownCommitTs(void);
+ extern void CheckPointCommitTs(void);
+ extern void ExtendCommitTs(TransactionId newestXact);
+ extern void TruncateCommitTs(TransactionId oldestXact);
+ extern void SetCommitTsLimit(TransactionId oldestXact);
+ 
+ /* XLOG stuff */
+ #define COMMITTS_ZEROPAGE		0x00
+ #define COMMITTS_TRUNCATE		0x10
+ #define COMMITTS_SETTS			0x20
+ 
+ typedef struct xl_committs_set
+ {
+ 	TimestampTz		timestamp;
+ 	CommitExtraData	data;
+ 	TransactionId	mainxid;
+ 	int				nsubxids;
+ 	TransactionId	subxids[FLEXIBLE_ARRAY_MEMBER];
+ } xl_committs_set;
+ 
+ 
+ extern void committs_redo(XLogRecPtr lsn, XLogRecord *record);
+ extern void committs_desc(StringInfo buf, uint8 xl_info, char *rec);
+ 
+ #endif   /* COMMITTS_H */
*** a/src/include/access/rmgrlist.h
--- b/src/include/access/rmgrlist.h
***************
*** 42,44 **** PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup
--- 42,45 ----
  PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, NULL)
  PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, NULL, NULL, NULL)
  PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_xlog_startup, spg_xlog_cleanup, NULL)
+ PG_RMGR(RM_COMMITTS_ID, "CommitTs", committs_redo, committs_desc, NULL, NULL, NULL)
*** a/src/include/access/transam.h
--- b/src/include/access/transam.h
***************
*** 119,124 **** typedef struct VariableCacheData
--- 119,129 ----
  	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
  
  	/*
+ 	 * These fields are protected by CommitTsControlLock
+ 	 */
+ 	TransactionId oldestCommitTs;
+ 
+ 	/*
  	 * These fields are protected by ProcArrayLock.
  	 */
  	TransactionId latestCompletedXid;	/* newest XID that has committed or
*** a/src/include/catalog/pg_control.h
--- b/src/include/catalog/pg_control.h
***************
*** 46,51 **** typedef struct CheckPoint
--- 46,52 ----
  	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
  	Oid			oldestMultiDB;	/* database with minimum datminmxid */
  	pg_time_t	time;			/* time stamp of checkpoint */
+ 	TransactionId oldestCommitTs; /* oldest Xid with valid commit timestamp */
  
  	/*
  	 * Oldest XID still running. This is only needed to initialize hot standby
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 2914,2919 **** DESCR("view two-phase transactions");
--- 2914,2931 ----
  DATA(insert OID = 3819 (  pg_get_multixact_members PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 2249 "28" "{28,28,25}" "{i,o,o}" "{multixid,xid,mode}" _null_ pg_get_multixact_members _null_ _null_ _null_ ));
  DESCR("view members of a multixactid");
  
+ DATA(insert OID = 3461 ( pg_get_transaction_committime PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 1184 "28" _null_ _null_ _null_ _null_ pg_get_transaction_committime _null_ _null_ _null_ ));
+ DESCR("get commit time of transaction");
+ 
+ DATA(insert OID = 3462 ( pg_get_transaction_extradata PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 23 "28" _null_ _null_ _null_ _null_ pg_get_transaction_extradata _null_ _null_ _null_ ));
+ DESCR("get additional data from transaction commit timestamp record");
+ 
+ DATA(insert OID = 3463 ( pg_get_transaction_committime_data PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 2249 "28" "{28,1184,23}" "{i,o,o}" "{xid,committime,extradata}" _null_ pg_get_transaction_committime_data _null_ _null_ _null_ ));
+ DESCR("get commit time and additional data from transaction commit timestamp record");
+ 
+ DATA(insert OID = 3464 ( pg_get_latest_transaction_committime_data PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 2249 "" "{28,1184,23}" "{o,o,o}" "{xid,committime,extradata}" _null_ pg_get_latest_transaction_committime_data _null_ _null_ _null_ ));
+ DESCR("get transaction Id, commit timestamp and additional data of latest transaction commit");
+ 
  DATA(insert OID = 3537 (  pg_describe_object		PGNSP PGUID 12 1 0 0 0 f f f f t f s 3 0 25 "26 26 23" _null_ _null_ _null_ _null_ pg_describe_object _null_ _null_ _null_ ));
  DESCR("get identification of SQL object");
  
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 59,64 **** typedef enum LWLockId
--- 59,66 ----
  	CheckpointLock,
  	CLogControlLock,
  	SubtransControlLock,
+ 	CommitTsControlLock,
+ 	CommitTsLock,
  	MultiXactGenLock,
  	MultiXactOffsetControlLock,
  	MultiXactMemberControlLock,
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
***************
*** 1151,1156 **** extern Datum pg_prepared_xact(PG_FUNCTION_ARGS);
--- 1151,1162 ----
  /* access/transam/multixact.c */
  extern Datum pg_get_multixact_members(PG_FUNCTION_ARGS);
  
+ /* access/transam/committs.c */
+ extern Datum pg_get_transaction_committime(PG_FUNCTION_ARGS);
+ extern Datum pg_get_transaction_extradata(PG_FUNCTION_ARGS);
+ extern Datum pg_get_transaction_committime_data(PG_FUNCTION_ARGS);
+ extern Datum pg_get_latest_transaction_committime_data(PG_FUNCTION_ARGS);
+ 
  /* catalogs/dependency.c */
  extern Datum pg_describe_object(PG_FUNCTION_ARGS);
  extern Datum pg_identify_object(PG_FUNCTION_ARGS);
#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#1)
Re: tracking commit timestamps

On Wed, Oct 23, 2013 at 3:46 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Hi,

There has been some interest in keeping track of timestamp of
transaction commits. This patch implements that.

Some of the use cases, I could think of are
1. Is it for usecases such that if user want to read all data of table
where transaction commit_ts <= '2012-04-04 09:30:00'?
2. for replication systems, may be the middleware can use it to replay
transactions in some remote system.
3. Is there any use of this feature in logical-rep/change data extraction?

There are some seemingly curious choices here. First, this module can
be disabled, and in fact it's turned off by default. At startup, we
verify whether it's enabled, and create the necessary SLRU segments if
so. And if the server is started with this disabled, we set the oldest
value we know about to avoid trying to read the commit TS of
transactions of which we didn't keep record. The ability to turn this
off is there to avoid imposing the overhead on systems that don't need
this feature.

Another thing of note is that we allow for some extra data alongside the
timestamp proper. This might be useful for a replication system that
wants to keep track of the origin node ID of a committed transaction,
for example. Exactly what will we do with the bit space we have is
unclear, so I have kept it generic and called it "commit extra data".

"commit extra data" can be LSN of commit log record, but I think it
will also depend on how someone wants to use this feature.
To suggest for storing LSN, I had referred information at below page
which describes about similar information for each transaction.
http://technet.microsoft.com/en-us/library/cc645959.aspx

This offers the chance for outside modules to set the commit TS of a
transaction; there is support for WAL-logging such values. But the core
user of the feature (RecordTransactionCommit) doesn't use it, because
xact.c's WAL logging itself is enough.

I have one question for the case when commits is set from
RecordTransactionCommit().

*** 1118,1123 **** RecordTransactionCommit(void)
--- 1119,1132 ----
  }
  /*
+ * We don't need to log the commit timestamp separately since the commit
+ * record logged above has all the necessary action to set the timestamp
+ * again.
+ */
+ TransactionTreeSetCommitTimestamp(xid, nchildren, children,
+  xactStopTimestamp, 0, false);
+

Here for CLOG, we are doing Xlogflush before writing to Clog page, but
Committs writes timestamp before XlogFlush().
Won't that create problem for synchronous commit as Checkpoint can
takecare of flushing Xlog for relation pages before flush of page,
but for Clog/Committs RecordTransactionCommit() should take care of doing it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Jaime Casanova
jaime@2ndquadrant.com
In reply to: Alvaro Herrera (#1)
Re: tracking commit timestamps

On Tue, Oct 22, 2013 at 5:16 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Hi,

There has been some interest in keeping track of timestamp of
transaction commits. This patch implements that.

Hi,

Sorry for the delay on the review.

First, because of the recent fixes the patch doesn't apply cleanly
anymore but the changes seems to be easy.

=== performance ===

i expected a regression on performance with the module turned on
because of the new XLOG records and wrote of files in pg_committs but
the performance drop is excessive.

Master 437.835674 tps
Patch, guc off 436.4340982 tps
Patch, guc on 0.370524 tps

This is in a pgbench's db initialized with scale=50 and run with
"pgbench -c 64 -j 16 -n -T 300" 5 times (values above are the average
of runs)

postgresql changes:

shared_buffers = 1GB
full_page_writes = off
checkpoint_segments = 30
checkpoint_timeout = 15min
random_page_cost = 2.0

== functionality ==

I started the server with the module off, then after some work turned
the module on and restarted the server and run a few benchs then
turned it off again and restart the server and get a message like:

"""
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 0/3112AE58
LOG: redo is not required
FATAL: cannot make new WAL entries during recovery
LOG: startup process (PID 24876) exited with exit code 1
LOG: aborting startup due to startup process failure
"""

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Andres Freund
andres@2ndquadrant.com
In reply to: Jaime Casanova (#3)
Re: tracking commit timestamps

On 2013-12-02 02:39:55 -0500, Jaime Casanova wrote:

=== performance ===

i expected a regression on performance with the module turned on
because of the new XLOG records and wrote of files in pg_committs but
the performance drop is excessive.
Master 437.835674 tps
Patch, guc off 436.4340982 tps
Patch, guc on 0.370524 tps

There clearly is something wrong. The additional amount of xlog records
should be nearly unnoticeable because committs piggybacks on commit
records.

I started the server with the module off, then after some work turned
the module on and restarted the server and run a few benchs then
turned it off again and restart the server and get a message like:

"""
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 0/3112AE58
LOG: redo is not required
FATAL: cannot make new WAL entries during recovery
LOG: startup process (PID 24876) exited with exit code 1
LOG: aborting startup due to startup process failure
"""

Alvaro: That's because of the location you call StartupCommitts - a)
it's called at the beginning of recovery if HS is enabled b) it's called
before StartupXLOG() does LocalSetXLogInsertAllowed().

So I think you need to split StartupCommitts into StartupCommitts() and
TrimCommitts() where only the latter does the trimming if committs is
disabled.
I also wonder if track_commit_timestamp should be tracked by the the
XLOG_PARAMETER_CHANGE stuff? So it's not disabled on the standby when
it's enabled on the primary?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Alvaro Herrera (#1)
Re: tracking commit timestamps

On 10/23/2013 01:16 AM, Alvaro Herrera wrote:

There has been some interest in keeping track of timestamp of
transaction commits. This patch implements that.

There are some seemingly curious choices here. First, this module can
be disabled, and in fact it's turned off by default. At startup, we
verify whether it's enabled, and create the necessary SLRU segments if
so. And if the server is started with this disabled, we set the oldest
value we know about to avoid trying to read the commit TS of
transactions of which we didn't keep record. The ability to turn this
off is there to avoid imposing the overhead on systems that don't need
this feature.

Another thing of note is that we allow for some extra data alongside the
timestamp proper. This might be useful for a replication system that
wants to keep track of the origin node ID of a committed transaction,
for example. Exactly what will we do with the bit space we have is
unclear, so I have kept it generic and called it "commit extra data".

This offers the chance for outside modules to set the commit TS of a
transaction; there is support for WAL-logging such values. But the core
user of the feature (RecordTransactionCommit) doesn't use it, because
xact.c's WAL logging itself is enough. For systems that are replicating
transactions from remote nodes, it is useful.

We also keep track of the latest committed transaction. This is
supposed to be useful to calculate replication lag.

Generally speaking, I'm not in favor of adding dead code, even if it
might be useful to someone in the future. For one, it's going to get
zero testing. Once someone comes up with an actual use case, let's add
that stuff at that point. Otherwise there's a good chance that we build
something that's almost but not quite useful.

Speaking of the functionality this does offer, it seems pretty limited.
A commit timestamp is nice, but it isn't very interesting on its own.
You really also want to know what the transaction did, who ran it, etc.
ISTM some kind of a auditing or log-parsing system that could tell you
all that would be much more useful, but this patch doesn't get us any
closer to that.

Does this handle XID wraparound correctly? SLRU has a maximum of 64k
segments with 32 SLRU pages each. With 12 bytes per each commit entry,
that's not enough to hold the timestamp and "commit extra data" of the
whole 2^31 XID range: (8192 * 32 * 65536) / 12 = 1431655765. And that's
with the default page size, with smaller pages you run into the limit
quicker.

It would be nice to teach SLRU machinery how to deal with more than 64k
segments. SSI code in twophase.c ran into the same limit, and all you
get is a warning there.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#5)
Re: tracking commit timestamps

On 2013-12-10 11:56:45 +0200, Heikki Linnakangas wrote:

Speaking of the functionality this does offer, it seems pretty limited. A
commit timestamp is nice, but it isn't very interesting on its own. You
really also want to know what the transaction did, who ran it, etc. ISTM
some kind of a auditing or log-parsing system that could tell you all that
would be much more useful, but this patch doesn't get us any closer to
that.

It's useful for last-update-wins for async multimaster. Currently
several userspace solutions try to approximate it by inserting a
timestamps into a column when a row is inserted or updated, but that is
quite limiting because either the number is out of date wrt. the commit
and/or it will differ between the rows.

I don't see how you could get an accurate timestamp in a significantly
different way?

Does this handle XID wraparound correctly? SLRU has a maximum of 64k
segments with 32 SLRU pages each. With 12 bytes per each commit entry,
that's not enough to hold the timestamp and "commit extra data" of the whole
2^31 XID range: (8192 * 32 * 65536) / 12 = 1431655765. And that's with the
default page size, with smaller pages you run into the limit quicker.

It would be nice to teach SLRU machinery how to deal with more than 64k
segments. SSI code in twophase.c ran into the same limit, and all you get is
a warning there.

Yea, 9.3 is already running afoul of that, because of the changed size
for the multixact member pages. Came up just yesterday in the course of
#8673.

(gdb) p/x (1L<<32)/(MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT)
$10 = 0x14078

Is this limitation actually documented anywhere? And is there anything
that needs to be changed but SlruScanDirectory()?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: tracking commit timestamps

On Tue, Dec 10, 2013 at 4:56 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Generally speaking, I'm not in favor of adding dead code, even if it might
be useful to someone in the future. For one, it's going to get zero testing.
Once someone comes up with an actual use case, let's add that stuff at that
point. Otherwise there's a good chance that we build something that's almost
but not quite useful.

Fair.

Speaking of the functionality this does offer, it seems pretty limited. A
commit timestamp is nice, but it isn't very interesting on its own. You
really also want to know what the transaction did, who ran it, etc. ISTM
some kind of a auditing or log-parsing system that could tell you all that
would be much more useful, but this patch doesn't get us any closer to that.

For what it's worth, I think that this has been requested numerous
times over the years by numerous developers of replication solutions.
My main question (apart from whether or not it may have bugs) is
whether it makes a noticeable performance difference. If it does,
that sucks. If it does not, maybe we ought to enable it by default.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#7)
Re: tracking commit timestamps

Robert Haas escribi�:

On Tue, Dec 10, 2013 at 4:56 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Speaking of the functionality this does offer, it seems pretty limited. A
commit timestamp is nice, but it isn't very interesting on its own. You
really also want to know what the transaction did, who ran it, etc. ISTM
some kind of a auditing or log-parsing system that could tell you all that
would be much more useful, but this patch doesn't get us any closer to that.

For what it's worth, I think that this has been requested numerous
times over the years by numerous developers of replication solutions.
My main question (apart from whether or not it may have bugs) is
whether it makes a noticeable performance difference. If it does,
that sucks. If it does not, maybe we ought to enable it by default.

I expect it will have some performance impact -- this is why we made it
disable-able in the first place, and why I went to the trouble of
ensuring it can be turned on after initdb. Normal pg_clog entries are 2
bits per transaction, whereas the commit timestamp stuff adds 12 *bytes*
per transaction. Not something to be taken lightly, hence it's off by
default. Presumably people who is using one of those replication
systems is okay with taking some (reasonable) performance hit.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#8)
Re: tracking commit timestamps

On Tue, Dec 10, 2013 at 4:04 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Robert Haas escribió:

On Tue, Dec 10, 2013 at 4:56 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Speaking of the functionality this does offer, it seems pretty limited. A
commit timestamp is nice, but it isn't very interesting on its own. You
really also want to know what the transaction did, who ran it, etc. ISTM
some kind of a auditing or log-parsing system that could tell you all that
would be much more useful, but this patch doesn't get us any closer to that.

For what it's worth, I think that this has been requested numerous
times over the years by numerous developers of replication solutions.
My main question (apart from whether or not it may have bugs) is
whether it makes a noticeable performance difference. If it does,
that sucks. If it does not, maybe we ought to enable it by default.

I expect it will have some performance impact -- this is why we made it
disable-able in the first place, and why I went to the trouble of
ensuring it can be turned on after initdb. Normal pg_clog entries are 2
bits per transaction, whereas the commit timestamp stuff adds 12 *bytes*
per transaction. Not something to be taken lightly, hence it's off by
default. Presumably people who is using one of those replication
systems is okay with taking some (reasonable) performance hit.

Well, writing 12 extra bytes (why not 8?) on each commit is not
intrinsically that expensive. The (poor) design of SLRU might make it
expensive, though, because since it has no fsync absorption queue, so
sometimes you end up waiting for an fsync, and doing that 48x more
often will indeed have some cost. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#9)
1 attachment(s)
Re: tracking commit timestamps

Hi,

I worked bit on this patch to make it closer to committable state.

There are several bugs fixed, including ones mentioned by Jamie (writing
WAL during recovery).

Also support for pg_resetxlog/pg_upgrade has been implemented by Andres.

I added simple regression test and regression contrib module to cover
both off and on settings.

The SLRU issue Heikki mentioned should be also gone mainly thanks to
638cf09e7 (I did test it too).

One notable thing missing is documentation for the three SQL level
interfaces provided, I plan to add that soon.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

committs-v5.patchtext/x-diff; name=committs-v5.patchDownload
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
index 3b8241b..f0a023f 100644
--- a/contrib/pg_upgrade/pg_upgrade.c
+++ b/contrib/pg_upgrade/pg_upgrade.c
@@ -423,8 +423,10 @@ copy_clog_xlog_xid(void)
 	/* set the next transaction id and epoch of the new cluster */
 	prep_status("Setting next transaction ID and epoch for new cluster");
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
-			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
-			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
+			  "\"%s/pg_resetxlog\" -f -x %u -c %u \"%s\"",
+			  new_cluster.bindir,
+			  old_cluster.controldata.chkpnt_nxtxid,
+			  old_cluster.controldata.chkpnt_nxtxid,
 			  new_cluster.pgdata);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
 			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index cbcaaa6..81cbcaf 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -9,6 +9,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/contrib/test_committs/.gitignore b/contrib/test_committs/.gitignore
new file mode 100644
index 0000000..1f95503
--- /dev/null
+++ b/contrib/test_committs/.gitignore
@@ -0,0 +1,5 @@
+# Generated subdirectories
+/log/
+/isolation_output/
+/regression_output/
+/tmp_check/
diff --git a/contrib/test_committs/Makefile b/contrib/test_committs/Makefile
new file mode 100644
index 0000000..2240749
--- /dev/null
+++ b/contrib/test_committs/Makefile
@@ -0,0 +1,45 @@
+# Note: because we don't tell the Makefile there are any regression tests,
+# we have to clean those result files explicitly
+EXTRA_CLEAN = $(pg_regress_clean_files) ./regression_output ./isolation_output
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_committs
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# We can't support installcheck because normally installcheck users don't have
+# the required track_commit_timestamp on
+installcheck:;
+
+check: regresscheck
+
+submake-regress:
+	$(MAKE) -C $(top_builddir)/src/test/regress all
+
+submake-test_committs:
+	$(MAKE) -C $(top_builddir)/contrib/test_committs
+
+REGRESSCHECKS=committs_on
+
+regresscheck: all | submake-regress submake-test_committs
+	$(MKDIR_P) regression_output
+	$(pg_regress_check) \
+	    --temp-config $(top_srcdir)/contrib/test_committs/committs.conf \
+	    --temp-install=./tmp_check \
+	    --extra-install=contrib/test_committs \
+	    --outputdir=./regression_output \
+	    $(REGRESSCHECKS)
+
+regresscheck-install-force: | submake-regress submake-test_committs
+	$(pg_regress_installcheck) \
+	    --extra-install=contrib/test_committs \
+	    $(REGRESSCHECKS)
+
+PHONY: submake-test_committs submake-regress check \
+	regresscheck regresscheck-install-force
\ No newline at end of file
diff --git a/contrib/test_committs/committs.conf b/contrib/test_committs/committs.conf
new file mode 100644
index 0000000..d221a60
--- /dev/null
+++ b/contrib/test_committs/committs.conf
@@ -0,0 +1 @@
+track_commit_timestamp = on
\ No newline at end of file
diff --git a/contrib/test_committs/expected/committs_on.out b/contrib/test_committs/expected/committs_on.out
new file mode 100644
index 0000000..9920343
--- /dev/null
+++ b/contrib/test_committs/expected/committs_on.out
@@ -0,0 +1,21 @@
+--
+-- Commit Timestamp (on)
+--
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id, pg_get_transaction_extradata(xmin),
+       pg_get_transaction_committime(xmin) >= ts,
+       pg_get_transaction_committime(xmin) < now(),
+       pg_get_transaction_committime(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | pg_get_transaction_extradata | ?column? | ?column? | ?column? 
+----+------------------------------+----------+----------+----------
+  1 |                            0 | t        | t        | t
+  2 |                            0 | t        | t        | t
+  3 |                            0 | t        | t        | t
+(3 rows)
+
+DROP TABLE committs_test;
diff --git a/contrib/test_committs/sql/committs_on.sql b/contrib/test_committs/sql/committs_on.sql
new file mode 100644
index 0000000..aec6438
--- /dev/null
+++ b/contrib/test_committs/sql/committs_on.sql
@@ -0,0 +1,18 @@
+--
+-- Commit Timestamp (on)
+--
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+
+SELECT id, pg_get_transaction_extradata(xmin),
+       pg_get_transaction_committime(xmin) >= ts,
+       pg_get_transaction_committime(xmin) < now(),
+       pg_get_transaction_committime(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+
+DROP TABLE committs_test;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 49547ee..8516f72 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2555,6 +2555,21 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
+      <term><varname>track_commit_timestamp</varname> (<type>bool</type>)</term>
+      <indexterm>
+       <primary><varname>track_commit_timestamp</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Record commit time of transactions.  This parameter
+        can only be set in
+        the <filename>postgresql.conf</> file or on the server command line.
+        The default value is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 7d092d2..20c88a8 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,7 +8,8 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+OBJS = clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o \
+       heapdesc.o \
 	   mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
diff --git a/src/backend/access/rmgrdesc/committsdesc.c b/src/backend/access/rmgrdesc/committsdesc.c
new file mode 100644
index 0000000..320bec3
--- /dev/null
+++ b/src/backend/access/rmgrdesc/committsdesc.c
@@ -0,0 +1,53 @@
+/*-------------------------------------------------------------------------
+ *
+ * committsdesc.c
+ *    rmgr descriptor routines for access/transam/committs.c
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *    src/backend/access/rmgrdesc/committsdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/committs.h"
+#include "utils/timestamp.h"
+
+
+void
+committs_desc(StringInfo buf, uint8 xl_info, char *rec)
+{
+	uint8		info = xl_info & ~XLR_INFO_MASK;
+
+	if (info == COMMITTS_ZEROPAGE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "zeropage: %d", pageno);
+	}
+	else if (info == COMMITTS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "truncate before: %d", pageno);
+	}
+	else if (info == COMMITTS_SETTS)
+	{
+		xl_committs_set *xlrec = (xl_committs_set *) rec;
+		int		i;
+
+		appendStringInfo(buf, "set committs %s for: %u",
+						 timestamptz_to_str(xlrec->timestamp),
+						 xlrec->mainxid);
+		for (i = 0; i < xlrec->nsubxids; i++)
+			appendStringInfo(buf, ", %u", xlrec->subxids[i]);
+	}
+	else
+		appendStringInfo(buf, "UNKNOWN");
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 2224da1..65fd5dd 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -45,7 +45,7 @@ xlog_desc(StringInfo buf, XLogRecord *record)
 		appendStringInfo(buf, "checkpoint: redo %X/%X; "
 						 "tli %u; prev tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
-						 "oldest running xid %u; %s",
+						 "oldest CommitTs xid: %u; oldest running xid %u; %s",
 				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->PrevTimeLineID,
@@ -58,6 +58,7 @@ xlog_desc(StringInfo buf, XLogRecord *record)
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
 						 checkpoint->oldestMultiDB,
+						 checkpoint->oldestCommitTs,
 						 checkpoint->oldestActiveXid,
 				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index eb6cfc5..ace913e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -14,7 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
 	timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
-	xlogreader.o xlogutils.o
+	xlogreader.o xlogutils.o committs.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 27ca4c6..3300f84 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -152,8 +152,7 @@ TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
 		   status == TRANSACTION_STATUS_ABORTED);
 
 	/*
-	 * See how many subxids, if any, are on the same page as the parent, if
-	 * any.
+	 * See how many subxids, if any, are on the same page as the parent.
 	 */
 	for (i = 0; i < nsubxids; i++)
 	{
diff --git a/src/backend/access/transam/committs.c b/src/backend/access/transam/committs.c
new file mode 100644
index 0000000..e7298a5
--- /dev/null
+++ b/src/backend/access/transam/committs.c
@@ -0,0 +1,846 @@
+/*-------------------------------------------------------------------------
+ *
+ * committs.c
+ *		PostgreSQL commit timestamp manager
+ *
+ * This module is a pg_clog-like system that stores the commit timestamp
+ * for each transaction.
+ *
+ * XLOG interactions: this module generates an XLOG record whenever a new
+ * CommitTs page is initialized to zeroes.  Also, one XLOG record is
+ * generated for setting of values when the caller requests it; this allows
+ * us to support values coming from places other than transaction commit.
+ * Other writes of CommitTS come from recording of transaction commit in
+ * xact.c, which generates its own XLOG records for these events and will
+ * re-perform the status update on redo; so we need make no additional XLOG
+ * entry here.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/committs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/committs.h"
+#include "access/htup_details.h"
+#include "access/slru.h"
+#include "access/transam.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+
+/*
+ * Defines for CommitTs page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CommitTs page numbering also wraps around at
+ * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE, and CommitTs segment numbering at
+ * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCommitTs (see CommitTsPagePrecedes).
+ */
+
+/* We need 8+4 bytes per xact */
+typedef struct CommitTimestampEntry
+{
+	TimestampTz		time;
+	CommitExtraData	extra;
+} CommitTimestampEntry;
+
+#define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, extra) + \
+									sizeof(CommitExtraData))
+
+#define COMMITTS_XACTS_PER_PAGE \
+	(BLCKSZ / SizeOfCommitTimestampEntry)
+
+#define TransactionIdToCTsPage(xid)	\
+	((xid) / (TransactionId) COMMITTS_XACTS_PER_PAGE)
+#define TransactionIdToCTsEntry(xid)	\
+	((xid) % (TransactionId) COMMITTS_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CLOG control
+ */
+static SlruCtlData CommitTsCtlData;
+
+#define CommitTsCtl (&CommitTsCtlData)
+
+/*
+ * We keep a cache of the last value set in shared memory.  This is protected
+ * by CommitTsLock.
+ */
+typedef struct CommitTimestampShared
+{
+	TransactionId	xidLastCommit;
+	CommitTimestampEntry dataLastCommit;
+} CommitTimestampShared;
+
+CommitTimestampShared	*commitTsShared;
+
+
+/* GUC variables */
+bool	commit_ts_enabled;
+
+static void SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz committs,
+					 CommitExtraData extra, int pageno);
+static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz committs,
+						  CommitExtraData extra, int slotno);
+static int	ZeroCommitTsPage(int pageno, bool writeXlog);
+static bool CommitTsPagePrecedes(int page1, int page2);
+static void WriteZeroPageXlogRec(int pageno);
+static void WriteTruncateXlogRec(int pageno);
+static void WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 CommitExtraData data);
+
+
+/*
+ * TransactionTreeSetCommitTimestamp
+ *
+ * Record the final commit timestamp of transaction entries in the commit log
+ * for a transaction and its subtransaction tree, as efficiently as possible.
+ *
+ * xid is the top level transaction id.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ *
+ * The do_xlog parameter tells us whether to include a XLog record of this
+ * or not.  Normal path through RecordTransactionCommit() will be related
+ * to a transaction commit XLog record, and so should pass "false" here.
+ * Other callers probably want to pass true, so that the given values persist
+ * in case of crashes.
+ */
+void
+TransactionTreeSetCommitTimestamp(TransactionId xid, int nsubxids,
+								  TransactionId *subxids, TimestampTz timestamp,
+								  CommitExtraData extra, bool do_xlog)
+{
+	int			i;
+	TransactionId headxid;
+
+	Assert(xid != InvalidTransactionId);
+
+	if (!commit_ts_enabled)
+		return;
+
+	/*
+	 * Comply with the WAL-before-data rule: if caller specified it wants
+	 * this value to be recorded in WAL, do so before touching the data.
+	 */
+	if (do_xlog)
+		WriteSetTimestampXlogRec(xid, nsubxids, subxids, timestamp, extra);
+
+	/*
+	 * We split the xids to set the timestamp to in groups belonging to the
+	 * same SLRU page; the first element in each such set is its head.  The
+	 * first group has the main XID as the head; subsequent sets use the
+	 * first subxid not on the previous page as head.  This way, we only have
+	 * to lock/modify each SLRU page once.
+	 */
+	for (i = 0, headxid = xid;;)
+	{
+		int			pageno = TransactionIdToCTsPage(headxid);
+		int			j;
+
+		for (j = i; j < nsubxids; j++)
+		{
+			if (TransactionIdToCTsPage(subxids[j]) != pageno)
+				break;
+		}
+		/* subxids[i..j] are on the same page as the head */
+
+		SetXidCommitTsInPage(headxid, j - i, subxids + i, timestamp, extra,
+							 pageno);
+
+		/* if we wrote out all subxids, we're done. */
+		if (j + 1 >= nsubxids)
+			break;
+
+		/*
+		 * Set the new head and skip over it, as well as over the subxids
+		 * we just wrote.
+		 */
+		headxid = subxids[j];
+		i += j - i + 1;
+	}
+
+	/*
+	 * Update the cached value in shared memory
+	 */
+	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+	commitTsShared->xidLastCommit = xid;
+	commitTsShared->dataLastCommit.time = timestamp;
+	commitTsShared->dataLastCommit.extra = extra;
+	LWLockRelease(CommitTsLock);
+}
+
+/*
+ * Record the commit timestamp of transaction entries in the commit log for all
+ * entries on a single page.  Atomic only on this page.
+ */
+static void
+SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz committs,
+					 CommitExtraData extra, int pageno)
+{
+	int			slotno;
+	int			i;
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
+
+	TransactionIdSetCommitTs(xid, committs, extra, slotno);
+	for (i = 0; i < nsubxids; i++)
+		TransactionIdSetCommitTs(subxids[i], committs, extra, slotno);
+
+	CommitTsCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Sets the commit timestamp of a single transaction.
+ *
+ * Must be called with CommitTsControlLock held
+ */
+static void
+TransactionIdSetCommitTs(TransactionId xid, TimestampTz committs,
+						 CommitExtraData extra, int slotno)
+{
+	int			entryno = TransactionIdToCTsEntry(xid);
+	CommitTimestampEntry *entry;
+
+	entry = (CommitTimestampEntry *)
+		(CommitTsCtl->shared->page_buffer[slotno] +
+		 SizeOfCommitTimestampEntry * entryno);
+
+	entry->time = committs;
+	entry->extra = extra;
+}
+
+/*
+ * Interrogate the commit timestamp of a transaction.
+ */
+void
+TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
+							 CommitExtraData *data)
+{
+	int			pageno = TransactionIdToCTsPage(xid);
+	int			entryno = TransactionIdToCTsEntry(xid);
+	int			slotno;
+	CommitTimestampEntry *entry;
+	TransactionId oldestCommitTs;
+
+	/* Return empty if module not enabled */
+	if (!commit_ts_enabled)
+	{
+		if (ts)
+			*ts = InvalidTransactionId;
+		if (data)
+			*data = (CommitExtraData) 0;
+		return;
+	}
+
+	/* Also return empty if the requested value is older than what we have */
+	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+	oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	LWLockRelease(CommitTsControlLock);
+
+	if (!TransactionIdIsValid(oldestCommitTs) ||
+		TransactionIdPrecedes(xid, oldestCommitTs))
+	{
+		if (ts)
+			*ts = InvalidTransactionId;
+		if (data)
+			*data = (CommitExtraData) 0;
+		return;
+	}
+
+	/*
+	 * Use an unlocked atomic read on our cached value in shared memory;
+	 * if it's a hit, acquire a lock and read the data, after verifying
+	 * that it's still what we initially read.  Otherwise, fall through
+	 * to read from SLRU.
+	 */
+	if (commitTsShared->xidLastCommit == xid)
+	{
+		LWLockAcquire(CommitTsLock, LW_SHARED);
+		if (commitTsShared->xidLastCommit == xid)
+		{
+			if (ts)
+				*ts = commitTsShared->dataLastCommit.time;
+			if (data)
+				*data = commitTsShared->dataLastCommit.extra;
+			LWLockRelease(CommitTsLock);
+			return;
+		}
+		LWLockRelease(CommitTsLock);
+	}
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+	entry = (CommitTimestampEntry *)
+		(CommitTsCtl->shared->page_buffer[slotno] +
+		 SizeOfCommitTimestampEntry * entryno);
+
+	if (ts)
+		*ts = entry->time;
+
+	if (data)
+		*data = entry->extra;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Return the Xid of the latest committed transaction.  (As far as this module
+ * is concerned, anyway; it's up to the caller to ensure the value is useful
+ * for its purposes.)
+ *
+ * ts and extra are filled with the corresponding data; they can be passed
+ * as NULL if not wanted.
+ */
+TransactionId
+GetLatestCommitTimestampData(TimestampTz *ts, CommitExtraData *extra)
+{
+	TransactionId	xid;
+
+	/* Return empty if module not enabled */
+	if (!commit_ts_enabled)
+	{
+		if (ts)
+			*ts = InvalidTransactionId;
+		if (extra)
+			*extra = (CommitExtraData) 0;
+		return InvalidTransactionId;
+	}
+
+	LWLockAcquire(CommitTsLock, LW_SHARED);
+	xid = commitTsShared->xidLastCommit;
+	if (ts)
+		*ts = commitTsShared->dataLastCommit.time;
+	if (extra)
+		*extra = commitTsShared->dataLastCommit.extra;
+	LWLockRelease(CommitTsLock);
+
+	return xid;
+}
+
+/*
+ * SQL-callable wrapper to obtain commit time of a transaction
+ */
+PG_FUNCTION_INFO_V1(pg_get_transaction_committime);
+Datum
+pg_get_transaction_committime(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	TimestampTz		committs;
+
+	TransactionIdGetCommitTsData(xid, &committs, NULL);
+
+	PG_RETURN_TIMESTAMPTZ(committs);
+}
+
+PG_FUNCTION_INFO_V1(pg_get_transaction_extradata);
+Datum
+pg_get_transaction_extradata(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	CommitExtraData	data;
+
+	TransactionIdGetCommitTsData(xid, NULL, &data);
+
+	PG_RETURN_INT32(data);
+}
+
+PG_FUNCTION_INFO_V1(pg_get_transaction_committime_data);
+Datum
+pg_get_transaction_committime_data(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	TimestampTz		committs;
+	CommitExtraData	data;
+	Datum       values[2];
+	bool        nulls[2];
+	TupleDesc   tupdesc;
+	HeapTuple	htup;
+
+	/*
+	 * Construct a tuple descriptor for the result row.  This must match this
+	 * function's pg_proc entry!
+	 */
+	tupdesc = CreateTemplateTupleDesc(2, false);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "timestamp",
+					   TIMESTAMPTZOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "extra",
+					   INT4OID, -1, 0);
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	/* and construct a tuple with our data */
+	TransactionIdGetCommitTsData(xid, &committs, &data);
+
+	values[0] = TimestampTzGetDatum(committs);
+	nulls[0] = false;
+
+	values[1] = Int32GetDatum(data);
+	nulls[1] = false;
+
+	htup = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+PG_FUNCTION_INFO_V1(pg_get_latest_transaction_committime_data);
+Datum
+pg_get_latest_transaction_committime_data(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid;
+	TimestampTz		committs;
+	CommitExtraData	data;
+	Datum       values[3];
+	bool        nulls[3];
+	TupleDesc   tupdesc;
+	HeapTuple	htup;
+
+	/*
+	 * Construct a tuple descriptor for the result row.  This must match this
+	 * function's pg_proc entry!
+	 */
+	tupdesc = CreateTemplateTupleDesc(3, false);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "xid",
+					   XIDOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "timestamp",
+					   TIMESTAMPTZOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 3, "extra",
+					   INT4OID, -1, 0);
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	/* and construct a tuple with our data */
+	xid = GetLatestCommitTimestampData(&committs, &data);
+
+	values[0] = TransactionIdGetDatum(xid);
+	nulls[0] = false;
+
+	values[1] = TimestampTzGetDatum(committs);
+	nulls[1] = false;
+
+	values[2] = Int32GetDatum(data);
+	nulls[2] = false;
+
+	htup = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+/*
+ * Number of shared CommitTS buffers.
+ *
+ * We use a very similar logic as for the number of CLOG buffers; see comments
+ * in CLOGShmemBuffers.
+ */
+Size
+CommitTsShmemBuffers(void)
+{
+	return Min(16, Max(4, NBuffers / 1024));
+}
+
+/*
+ * Initialization of shared memory for CommitTs
+ */
+Size
+CommitTsShmemSize(void)
+{
+	return SimpleLruShmemSize(CommitTsShmemBuffers(), 0) +
+		sizeof(CommitTimestampShared);
+}
+
+void
+CommitTsShmemInit(void)
+{
+	bool	found;
+
+	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
+	SimpleLruInit(CommitTsCtl, "CommitTs Ctl", CommitTsShmemBuffers(), 0,
+				  CommitTsControlLock, "pg_committs");
+
+	commitTsShared = ShmemInitStruct("CommitTs shared",
+									 sizeof(CommitTimestampShared),
+									 &found);
+
+	if (!IsUnderPostmaster)
+	{
+		Assert(!found);
+
+		commitTsShared->xidLastCommit = InvalidTransactionId;
+		commitTsShared->dataLastCommit.time = 0;
+		commitTsShared->dataLastCommit.extra = 0;
+	}
+	else
+		Assert(found);
+}
+
+/*
+ * This function must be called ONCE on system install.
+ *
+ * (The CommitTs directory is assumed to have been created by initdb, and
+ * CommitTsShmemInit must have been called already.)
+ */
+void
+BootStrapCommitTs(void)
+{
+	/*
+	 * Nothing to do here at present, unlike most other SLRU modules; segments
+	 * are created when the server is started with this module enabled.
+	 * See StartupCommitTs.
+	 */
+}
+
+/*
+ * Initialize (or reinitialize) a page of CommitTs to zeroes.
+ * If writeXlog is TRUE, also emit an XLOG record saying we did this.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCommitTsPage(int pageno, bool writeXlog)
+{
+	int			slotno;
+
+	slotno = SimpleLruZeroPage(CommitTsCtl, pageno);
+
+	if (writeXlog)
+		WriteZeroPageXlogRec(pageno);
+
+	return slotno;
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ */
+void
+StartupCommitTs(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/*
+	 * Initialize our idea of the latest page number.
+	 */
+	CommitTsCtl->shared->latest_page_number = pageno;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * when commit timestamp is enabled.
+ * Must be called after recovery has finished.
+ *
+ * This is in charge of creating the currently active segment, if it's not
+ * already there.  The reason for this is that the server might have been
+ * running with this module disabled for a while and thus might have skipped
+ * the normal creation point.
+ */
+void
+InitCommitTs(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/*
+	 * Re-Initialize our idea of the latest page number.
+	 */
+	CommitTsCtl->shared->latest_page_number = pageno;
+
+	/*
+	 * If this module is not currently enabled, make sure we don't hand back
+	 * possibly-invalid data; also remove segments of old data.
+	 */
+	if (!commit_ts_enabled)
+	{
+		ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
+		LWLockRelease(CommitTsControlLock);
+
+		TruncateCommitTs(ReadNewTransactionId());
+
+		return;
+	}
+
+	/*
+	 * If CommitTs is enabled, but it wasn't in the previous server run, we
+	 * need to set the oldest value to the next Xid; that way, we will not try
+	 * to read data that might not have been set.
+	 *
+	 * XXX does this have a problem if a server is started with commitTs
+	 * enabled, then started with commitTs disabled, then restarted with it
+	 * enabled again?  It doesn't look like it does, because there should be a
+	 * checkpoint that sets the value to InvalidTransactionId at end of
+	 * recovery; and so any chance of injecting new transactions without
+	 * CommitTs values would occur after the oldestCommitTs has been set to
+	 * Invalid temporarily.
+	 */
+	if (ShmemVariableCache->oldestCommitTs == InvalidTransactionId)
+		ShmemVariableCache->oldestCommitTs = ReadNewTransactionId();
+
+	/* Finally, create the current segment file, if necessary */
+	if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
+	{
+		int		slotno;
+
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+	}
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, true);
+}
+
+/*
+ * Make sure that CommitTs has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CommitTs or xlog page to make room
+ * in shared memory.
+ */
+void
+ExtendCommitTs(TransactionId newestXact)
+{
+	int			pageno;
+
+	/* nothing to do if module not enabled */
+	if (!commit_ts_enabled)
+		return;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToCTsEntry(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToCTsPage(newestXact);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/* Zero the page and make an XLOG entry about it */
+	ZeroCommitTsPage(pageno, !InRecovery);
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Remove all CommitTs segments before the one holding the passed
+ * transaction ID
+ *
+ * Note that we don't need to flush XLOG here.
+ */
+void
+TruncateCommitTs(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate.
+	 */
+	cutoffPage = TransactionIdToCTsPage(oldestXact);
+
+	/* Check to see if there's any files that could be removed */
+	if (!SlruScanDirectory(CommitTsCtl, SlruScanDirCbReportPresence, &cutoffPage))
+		return;					/* nothing to remove */
+
+	/* Write XLOG record */
+	WriteTruncateXlogRec(cutoffPage);
+
+	/* Now we can remove the old CommitTs segment(s) */
+	SimpleLruTruncate(CommitTsCtl, cutoffPage);
+}
+
+/*
+ * Set the earliest value for which commit TS can be consulted.
+ */
+void
+SetCommitTsLimit(TransactionId oldestXact)
+{
+	/*
+	 * Be careful not to overwrite values that are either further into the
+	 * "future" or signal a disabled committs.
+	 */
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+	if (ShmemVariableCache->oldestCommitTs != InvalidTransactionId &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCommitTs, oldestXact))
+		ShmemVariableCache->oldestCommitTs = oldestXact;
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Decide which of two CLOG page numbers is "older" for truncation purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic.  However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CommitTsPagePrecedes(int page1, int page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * COMMITTS_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId;
+	xid2 = ((TransactionId) page2) * COMMITTS_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId;
+
+	return TransactionIdPrecedes(xid1, xid2);
+}
+
+
+/*
+ * Write a ZEROPAGE xlog record
+ */
+static void
+WriteZeroPageXlogRec(int pageno)
+{
+	XLogRecData rdata;
+
+	rdata.data = (char *) (&pageno);
+	rdata.len = sizeof(int);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	(void) XLogInsert(RM_COMMITTS_ID, COMMITTS_ZEROPAGE, &rdata);
+}
+
+/*
+ * Write a TRUNCATE xlog record
+ */
+static void
+WriteTruncateXlogRec(int pageno)
+{
+	XLogRecData rdata;
+
+	rdata.data = (char *) (&pageno);
+	rdata.len = sizeof(int);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	XLogInsert(RM_COMMITTS_ID, COMMITTS_TRUNCATE, &rdata);
+}
+
+/*
+ * Write a SETTS xlog record
+ */
+static void
+WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 CommitExtraData data)
+{
+	XLogRecData	rdata;
+	xl_committs_set	record;
+
+	record.timestamp = timestamp;
+	record.data = data;
+	record.mainxid = mainxid;
+	record.nsubxids = nsubxids;
+	memcpy(record.subxids, subxids, sizeof(TransactionId) * nsubxids);
+
+	rdata.data = (char *) &record;
+	rdata.len = offsetof(xl_committs_set, subxids) +
+		nsubxids * sizeof(TransactionId);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	XLogInsert(RM_COMMITTS_ID, COMMITTS_SETTS, &rdata);
+}
+
+
+/*
+ * CommitTS resource manager's routines
+ */
+void
+committs_redo(XLogRecPtr lsn, XLogRecord *record)
+{
+	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+
+	/* Backup blocks are not used in committs records */
+	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+	if (info == COMMITTS_ZEROPAGE)
+	{
+		int			pageno;
+		int			slotno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+
+		LWLockRelease(CommitTsControlLock);
+	}
+	else if (info == COMMITTS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		/*
+		 * During XLOG replay, latest_page_number isn't set up yet; insert a
+		 * suitable value to bypass the sanity test in SimpleLruTruncate.
+		 */
+		CommitTsCtl->shared->latest_page_number = pageno;
+
+		SimpleLruTruncate(CommitTsCtl, pageno);
+	}
+	else if (info == COMMITTS_SETTS)
+	{
+		xl_committs_set *setts = (xl_committs_set *) XLogRecGetData(record);
+
+		TransactionTreeSetCommitTimestamp(setts->mainxid, setts->nsubxids,
+										  setts->subxids, setts->timestamp,
+										  setts->data, false);
+	}
+	else
+		elog(PANIC, "committs_redo: unknown op code %u", info);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index c0a7a6f..4f993e4 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -8,6 +8,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 7013fb8..c70bebe 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -14,6 +14,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -157,9 +158,10 @@ GetNewTransactionId(bool isSubXact)
 	 * XID before we zero the page.  Fortunately, a page of the commit log
 	 * holds 32K or more transactions, so we don't have to do this very often.
 	 *
-	 * Extend pg_subtrans too.
+	 * Extend pg_subtrans and pg_committs too.
 	 */
 	ExtendCLOG(xid);
+	ExtendCommitTs(xid);
 	ExtendSUBTRANS(xid);
 
 	/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5b5d31b..ca5d28f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -20,6 +20,7 @@
 #include <time.h>
 #include <unistd.h>
 
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1166,6 +1167,17 @@ RecordTransactionCommit(void)
 	}
 
 	/*
+	 * We don't need to log the commit timestamp separately since the commit
+	 * record logged above has all the necessary action to set the timestamp
+	 * again.
+	 */
+	if (markXidCommitted)
+	{
+		TransactionTreeSetCommitTimestamp(xid, nchildren, children,
+										  xactStopTimestamp, 0, false);
+	}
+
+	/*
 	 * Check if we want to commit asynchronously.  We can allow the XLOG flush
 	 * to happen asynchronously if synchronous_commit=off, or if the current
 	 * transaction has not performed any WAL-logged operation.  The latter
@@ -4683,6 +4695,7 @@ xactGetCommittedChildren(TransactionId **ptr)
  */
 static void
 xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
+						  TimestampTz commit_time,
 						  TransactionId *sub_xids, int nsubxacts,
 						  SharedInvalidationMessage *inval_msgs, int nmsgs,
 						  RelFileNode *xnodes, int nrels,
@@ -4710,6 +4723,10 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
 		LWLockRelease(XidGenLock);
 	}
 
+	/* Set the transaction commit time */
+	TransactionTreeSetCommitTimestamp(xid, nsubxacts, sub_xids,
+									  commit_time, 0, false);
+
 	if (standbyState == STANDBY_DISABLED)
 	{
 		/*
@@ -4829,7 +4846,8 @@ xact_redo_commit(xl_xact_commit *xlrec,
 	/* invalidation messages array follows subxids */
 	inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
 
-	xact_redo_commit_internal(xid, lsn, subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  subxacts, xlrec->nsubxacts,
 							  inval_msgs, xlrec->nmsgs,
 							  xlrec->xnodes, xlrec->nrels,
 							  xlrec->dbId,
@@ -4844,7 +4862,8 @@ static void
 xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
 						 TransactionId xid, XLogRecPtr lsn)
 {
-	xact_redo_commit_internal(xid, lsn, xlrec->subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  xlrec->subxacts, xlrec->nsubxacts,
 							  NULL, 0,	/* inval msgs */
 							  NULL, 0,	/* relfilenodes */
 							  InvalidOid,		/* dbId */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 34f2fc0..2e5d1c1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -22,6 +22,7 @@
 #include <unistd.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -4983,6 +4984,7 @@ BootStrapXLOG(void)
 	checkPoint.oldestXidDB = TemplateDbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
+	checkPoint.oldestCommitTs = InvalidTransactionId;
 	checkPoint.time = (pg_time_t) time(NULL);
 	checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -4992,6 +4994,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(InvalidTransactionId);
 
 	/* Set up the XLOG page header */
 	page->xlp_magic = XLOG_PAGE_MAGIC;
@@ -5073,6 +5076,7 @@ BootStrapXLOG(void)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
 
@@ -6334,6 +6338,9 @@ StartupXLOG(void)
 	ereport(DEBUG1,
 			(errmsg("oldest MultiXactId: %u, in database %u",
 					checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
+	ereport(DEBUG1,
+			(errmsg("oldest CommitTs Xid: %u",
+					checkPoint.oldestCommitTs)));
 	if (!TransactionIdIsNormal(checkPoint.nextXid))
 		ereport(PANIC,
 				(errmsg("invalid next transaction ID")));
@@ -6345,6 +6352,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(checkPoint.oldestCommitTs);
 	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
 	XLogCtl->ckptXid = checkPoint.nextXid;
@@ -6569,11 +6577,12 @@ StartupXLOG(void)
 			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
 
 			/*
-			 * Startup commit log and subtrans only. MultiXact has already
-			 * been started up and other SLRUs are not maintained during
-			 * recovery and need not be started yet.
+			 * Startup commit log, commit timestamp and subtrans
+			 * only. MultiXact has already been started up and other SLRUs are
+			 * not maintained during recovery and need not be started yet.
 			 */
 			StartupCLOG();
+			StartupCommitTs();
 			StartupSUBTRANS(oldestActiveXID);
 
 			/*
@@ -7220,12 +7229,13 @@ StartupXLOG(void)
 	LWLockRelease(ProcArrayLock);
 
 	/*
-	 * Start up the commit log and subtrans, if not already done for hot
-	 * standby.
+	 * Start up the commit log, commit timestamp and subtrans, if not already
+	 * done for hot standby.
 	 */
 	if (standbyState == STANDBY_DISABLED)
 	{
 		StartupCLOG();
+		StartupCommitTs();
 		StartupSUBTRANS(oldestActiveXID);
 	}
 
@@ -7261,6 +7271,12 @@ StartupXLOG(void)
 	XLogReportParameters();
 
 	/*
+	 * Local WAL inserts enables, so it's time to finish initialization
+	 * of commit timestamp.
+	 */
+	InitCommitTs();
+
+	/*
 	 * All done.  Allow backends to write WAL.  (Although the bool flag is
 	 * probably atomic in itself, we use the info_lck here to ensure that
 	 * there are no race conditions concerning visibility of other recent
@@ -7828,6 +7844,7 @@ ShutdownXLOG(int code, Datum arg)
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
 	ShutdownCLOG();
+	ShutdownCommitTs();
 	ShutdownSUBTRANS();
 	ShutdownMultiXact();
 
@@ -8181,6 +8198,10 @@ CreateCheckPoint(int flags)
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
 	LWLockRelease(XidGenLock);
 
+	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+	checkPoint.oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	LWLockRelease(CommitTsControlLock);
+
 	/* Increase XID epoch if we've wrapped around since last checkpoint */
 	checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
 	if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
@@ -8471,6 +8492,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointCLOG();
+	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
 	CheckPointPredicate();
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e5fefa3..f5e7ddc 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -23,6 +23,7 @@
 #include <math.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -1055,6 +1056,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 * multixacts; that will be done by the next checkpoint.
 	 */
 	TruncateCLOG(frozenXID);
+	TruncateCommitTs(frozenXID);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
@@ -1064,6 +1066,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
 	SetMultiXactIdLimit(minMulti, minmulti_datoid);
+	SetCommitTsLimit(frozenXID);
 }
 
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 9f1b20e..f9b49c4 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -132,6 +132,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogRecord *record)
 		case RM_GIST_ID:
 		case RM_SEQ_ID:
 		case RM_SPGIST_ID:
+		case RM_COMMITTS_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) buf.record.xl_rmid);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 1d04c55..9025601 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -117,6 +118,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
 		size = add_size(size, CLOGShmemSize());
+		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
 		size = add_size(size, TwoPhaseShmemSize());
 		size = add_size(size, BackgroundWorkerShmemSize());
@@ -198,6 +200,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	 */
 	XLOGShmemInit();
 	CLOGShmemInit();
+	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
 	InitBufferPool();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ea82882..fb0e20d 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -46,6 +46,7 @@
 #include <signal.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 7c96da5..c4df0d5 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "commands/async.h"
@@ -259,6 +260,9 @@ NumLWLocks(void)
 	/* clog.c needs one per CLOG buffer */
 	numLocks += CLOGShmemBuffers();
 
+	/* committs.c needs one per CommitTs buffer */
+	numLocks += CommitTsShmemBuffers();
+
 	/* subtrans.c needs one per SubTrans buffer */
 	numLocks += NUM_SUBTRANS_BUFFERS;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index af667f5..139bebb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -26,6 +26,7 @@
 #include <syslog.h>
 #endif
 
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -815,6 +816,15 @@ static struct config_bool ConfigureNamesBool[] =
 		check_bonjour, NULL, NULL
 	},
 	{
+		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
+			gettext_noop("Collects transaction commit time."),
+			NULL
+		},
+		&commit_ts_enabled,
+		false,
+		NULL, NULL, NULL
+	},
+	{
 		{"ssl", PGC_POSTMASTER, CONN_AUTH_SECURITY,
 			gettext_noop("Enables SSL connections."),
 			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index df98b02..4dae3ad 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -226,6 +226,7 @@
 #wal_sender_timeout = 60s	# in milliseconds; 0 disables
 
 #max_replication_slots = 0	# max number of replication slots
+#track_commit_timestamp = off	# collect timestamp of transaction commit
 				# (change requires restart)
 
 # - Master Server -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index c8ff2cb..3935bab 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -185,6 +185,7 @@ static const char *subdirs[] = {
 	"pg_xlog",
 	"pg_xlog/archive_status",
 	"pg_clog",
+	"pg_committs",
 	"pg_dynshmem",
 	"pg_notify",
 	"pg_serial",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f815024..c164732 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -240,6 +240,8 @@ main(int argc, char *argv[])
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 302d005..4f4ef3c 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -62,6 +62,7 @@ static bool guessed = false;	/* T if we had to guess at any values */
 static const char *progname;
 static uint32 set_xid_epoch = (uint32) -1;
 static TransactionId set_xid = 0;
+static TransactionId set_committs = 0;
 static Oid	set_oid = 0;
 static MultiXactId set_mxid = 0;
 static MultiXactOffset set_mxoff = (MultiXactOffset) -1;
@@ -111,7 +112,7 @@ main(int argc, char *argv[])
 	}
 
 
-	while ((c = getopt(argc, argv, "fl:m:no:O:x:e:")) != -1)
+	while ((c = getopt(argc, argv, "fl:m:no:O:x:e:c:")) != -1)
 	{
 		switch (c)
 		{
@@ -153,6 +154,21 @@ main(int argc, char *argv[])
 				}
 				break;
 
+			case 'c':
+				set_committs = strtoul(optarg, &endptr, 0);
+				if (endptr == optarg || *endptr != '\0')
+				{
+					fprintf(stderr, _("%s: invalid argument for option -c\n"), progname);
+					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+					exit(1);
+				}
+				if (set_committs == 0)
+				{
+					fprintf(stderr, _("%s: transaction ID (-c) must not be 0\n"), progname);
+					exit(1);
+				}
+				break;
+
 			case 'o':
 				set_oid = strtoul(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0')
@@ -329,6 +345,9 @@ main(int argc, char *argv[])
 		ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
 	}
 
+	if (set_committs != 0)
+		ControlFile.checkPointCopy.oldestCommitTs = set_committs;
+
 	if (set_oid != 0)
 		ControlFile.checkPointCopy.nextOid = set_oid;
 
@@ -605,6 +624,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Maximum data alignment:               %u\n"),
 		   ControlFile.maxAlign);
 	/* we don't print floatFormat since can't say much useful about it */
@@ -686,6 +707,12 @@ PrintNewControlValues()
 		printf(_("NextXID epoch:                        %u\n"),
 			   ControlFile.checkPointCopy.nextXidEpoch);
 	}
+
+	if (set_committs != 0)
+	{
+		printf(_("oldestCommitTs:                       %u\n"),
+			   ControlFile.checkPointCopy.oldestCommitTs);
+	}
 }
 
 
@@ -1088,6 +1115,7 @@ usage(void)
 	printf(_("  -O OFFSET        set next multitransaction offset\n"));
 	printf(_("  -V, --version    output version information, then exit\n"));
 	printf(_("  -x XID           set next transaction ID\n"));
+	printf(_("  -c XID           set the oldest retrievable commit timestamp\n"));
 	printf(_("  -?, --help       show this help, then exit\n"));
 	printf(_("\nReport bugs to <pgsql-bugs@postgresql.org>.\n"));
 }
diff --git a/src/include/access/committs.h b/src/include/access/committs.h
new file mode 100644
index 0000000..c51c149
--- /dev/null
+++ b/src/include/access/committs.h
@@ -0,0 +1,62 @@
+/*
+ * committs.h
+ *
+ * PostgreSQL commit timestamp manager
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/committs.h
+ */
+#ifndef COMMITTS_H
+#define COMMITTS_H
+
+#include "access/xlog.h"
+#include "datatype/timestamp.h"
+
+
+extern PGDLLIMPORT bool	commit_ts_enabled;
+
+typedef uint32 CommitExtraData;
+
+extern void TransactionTreeSetCommitTimestamp(TransactionId xid, int nsubxids,
+								  TransactionId *subxids,
+								  TimestampTz timestamp,
+								  CommitExtraData data,
+								  bool do_xlog);
+extern void TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
+							 CommitExtraData *data);
+extern TransactionId GetLatestCommitTimestampData(TimestampTz *ts,
+							 CommitExtraData *extra);
+
+extern Size CommitTsShmemBuffers(void);
+extern Size CommitTsShmemSize(void);
+extern void CommitTsShmemInit(void);
+extern void BootStrapCommitTs(void);
+extern void StartupCommitTs(void);
+extern void InitCommitTs(void);
+extern void ShutdownCommitTs(void);
+extern void CheckPointCommitTs(void);
+extern void ExtendCommitTs(TransactionId newestXact);
+extern void TruncateCommitTs(TransactionId oldestXact);
+extern void SetCommitTsLimit(TransactionId oldestXact);
+
+/* XLOG stuff */
+#define COMMITTS_ZEROPAGE		0x00
+#define COMMITTS_TRUNCATE		0x10
+#define COMMITTS_SETTS			0x20
+
+typedef struct xl_committs_set
+{
+	TimestampTz		timestamp;
+	CommitExtraData	data;
+	TransactionId	mainxid;
+	int				nsubxids;
+	TransactionId	subxids[FLEXIBLE_ARRAY_MEMBER];
+} xl_committs_set;
+
+
+extern void committs_redo(XLogRecPtr lsn, XLogRecord *record);
+extern void committs_desc(StringInfo buf, uint8 xl_info, char *rec);
+
+#endif   /* COMMITTS_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 662fb77..2b53267 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -42,3 +42,4 @@ PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup)
 PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_xlog_startup, spg_xlog_cleanup)
+PG_RMGR(RM_COMMITTS_ID, "CommitTs", committs_redo, committs_desc, NULL, NULL)
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 32d1b29..b59fd98 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -124,6 +124,11 @@ typedef struct VariableCacheData
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
 
 	/*
+	 * These fields are protected by CommitTsControlLock
+	 */
+	TransactionId oldestCommitTs;
+
+	/*
 	 * These fields are protected by ProcArrayLock.
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ba79d25..9e048ea 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -46,6 +46,7 @@ typedef struct CheckPoint
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
+	TransactionId oldestCommitTs; /* oldest Xid with valid commit timestamp */
 
 	/*
 	 * Oldest XID still running. This is only needed to initialize hot standby
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 5176ed0..e418b14 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2984,6 +2984,18 @@ DESCR("view two-phase transactions");
 DATA(insert OID = 3819 (  pg_get_multixact_members PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 2249 "28" "{28,28,25}" "{i,o,o}" "{multixid,xid,mode}" _null_ pg_get_multixact_members _null_ _null_ _null_ ));
 DESCR("view members of a multixactid");
 
+DATA(insert OID = 3787 ( pg_get_transaction_committime PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 1184 "28" _null_ _null_ _null_ _null_ pg_get_transaction_committime _null_ _null_ _null_ ));
+DESCR("get commit time of transaction");
+
+DATA(insert OID = 3788 ( pg_get_transaction_extradata PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 23 "28" _null_ _null_ _null_ _null_ pg_get_transaction_extradata _null_ _null_ _null_ ));
+DESCR("get additional data from transaction commit timestamp record");
+
+DATA(insert OID = 3789 ( pg_get_transaction_committime_data PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 2249 "28" "{28,1184,23}" "{i,o,o}" "{xid,committime,extradata}" _null_ pg_get_transaction_committime_data _null_ _null_ _null_ ));
+DESCR("get commit time and additional data from transaction commit timestamp record");
+
+DATA(insert OID = 3790 ( pg_get_latest_transaction_committime_data PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 2249 "" "{28,1184,23}" "{o,o,o}" "{xid,committime,extradata}" _null_ pg_get_latest_transaction_committime_data _null_ _null_ _null_ ));
+DESCR("get transaction Id, commit timestamp and additional data of latest transaction commit");
+
 DATA(insert OID = 3537 (  pg_describe_object		PGNSP PGUID 12 1 0 0 0 f f f f t f s 3 0 25 "26 26 23" _null_ _null_ _null_ _null_ pg_describe_object _null_ _null_ _null_ ));
 DESCR("get identification of SQL object");
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 1d90b9f..5575f16 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -127,7 +127,10 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define AutoFileLock				(&MainLWLockArray[35].lock)
 #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
 #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
-#define NUM_INDIVIDUAL_LWLOCKS		38
+#define CommitTsControlLock			(&MainLWLockArray[38].lock)
+#define CommitTsLock				(&MainLWLockArray[39].lock)
+
+#define NUM_INDIVIDUAL_LWLOCKS		40
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 78cc0a0..c6a67b0 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1190,6 +1190,12 @@ extern Datum pg_prepared_xact(PG_FUNCTION_ARGS);
 /* access/transam/multixact.c */
 extern Datum pg_get_multixact_members(PG_FUNCTION_ARGS);
 
+/* access/transam/committs.c */
+extern Datum pg_get_transaction_committime(PG_FUNCTION_ARGS);
+extern Datum pg_get_transaction_extradata(PG_FUNCTION_ARGS);
+extern Datum pg_get_transaction_committime_data(PG_FUNCTION_ARGS);
+extern Datum pg_get_latest_transaction_committime_data(PG_FUNCTION_ARGS);
+
 /* catalogs/dependency.c */
 extern Datum pg_describe_object(PG_FUNCTION_ARGS);
 extern Datum pg_identify_object(PG_FUNCTION_ARGS);
diff --git a/src/test/regress/expected/committs_off.out b/src/test/regress/expected/committs_off.out
new file mode 100644
index 0000000..0a94f9d
--- /dev/null
+++ b/src/test/regress/expected/committs_off.out
@@ -0,0 +1,21 @@
+--
+-- Commit Timestamp (off)
+--
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id, pg_get_transaction_extradata(xmin),
+       pg_get_transaction_committime(xmin) >= ts,
+       pg_get_transaction_committime(xmin) < now(),
+       pg_get_transaction_committime(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | pg_get_transaction_extradata | ?column? | ?column? | ?column? 
+----+------------------------------+----------+----------+----------
+  1 |                            0 | f        | t        | t
+  2 |                            0 | f        | t        | t
+  3 |                            0 | f        | t        | t
+(3 rows)
+
+DROP TABLE committs_test;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index c0416f4..5ce1e0f 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -88,7 +88,7 @@ test: privileges security_label collate matview lock replica_identity
 # ----------
 # Another group of parallel tests
 # ----------
-test: alter_generic misc psql async
+test: alter_generic misc psql async committs_off
 
 # rules cannot run concurrently with any test that creates a view
 test: rules
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 16a1905..7a6cae9 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -145,3 +145,4 @@ test: largeobject
 test: with
 test: xml
 test: stats
+test: committs_off
diff --git a/src/test/regress/sql/committs_off.sql b/src/test/regress/sql/committs_off.sql
new file mode 100644
index 0000000..0f97666
--- /dev/null
+++ b/src/test/regress/sql/committs_off.sql
@@ -0,0 +1,18 @@
+--
+-- Commit Timestamp (off)
+--
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+
+SELECT id, pg_get_transaction_extradata(xmin),
+       pg_get_transaction_committime(xmin) >= ts,
+       pg_get_transaction_committime(xmin) < now(),
+       pg_get_transaction_committime(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+
+DROP TABLE committs_test;
#11Petr Jelinek
petr@2ndquadrant.com
In reply to: Petr Jelinek (#10)
1 attachment(s)
Re: tracking commit timestamps

On 09/09/14 19:05, Petr Jelinek wrote:

Hi,

I worked bit on this patch to make it closer to committable state.

There are several bugs fixed, including ones mentioned by Jamie (writing
WAL during recovery).

Also support for pg_resetxlog/pg_upgrade has been implemented by Andres.

I added simple regression test and regression contrib module to cover
both off and on settings.

The SLRU issue Heikki mentioned should be also gone mainly thanks to
638cf09e7 (I did test it too).

Here is updated version that works with current HEAD for the October
committfest.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

committs-v6.patchtext/x-diff; name=committs-v6.patchDownload
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
index 3b8241b..f0a023f 100644
--- a/contrib/pg_upgrade/pg_upgrade.c
+++ b/contrib/pg_upgrade/pg_upgrade.c
@@ -423,8 +423,10 @@ copy_clog_xlog_xid(void)
 	/* set the next transaction id and epoch of the new cluster */
 	prep_status("Setting next transaction ID and epoch for new cluster");
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
-			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
-			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
+			  "\"%s/pg_resetxlog\" -f -x %u -c %u \"%s\"",
+			  new_cluster.bindir,
+			  old_cluster.controldata.chkpnt_nxtxid,
+			  old_cluster.controldata.chkpnt_nxtxid,
 			  new_cluster.pgdata);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
 			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index bfb3573..c0a0409 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -9,6 +9,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/contrib/test_committs/.gitignore b/contrib/test_committs/.gitignore
new file mode 100644
index 0000000..1f95503
--- /dev/null
+++ b/contrib/test_committs/.gitignore
@@ -0,0 +1,5 @@
+# Generated subdirectories
+/log/
+/isolation_output/
+/regression_output/
+/tmp_check/
diff --git a/contrib/test_committs/Makefile b/contrib/test_committs/Makefile
new file mode 100644
index 0000000..2240749
--- /dev/null
+++ b/contrib/test_committs/Makefile
@@ -0,0 +1,45 @@
+# Note: because we don't tell the Makefile there are any regression tests,
+# we have to clean those result files explicitly
+EXTRA_CLEAN = $(pg_regress_clean_files) ./regression_output ./isolation_output
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_committs
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# We can't support installcheck because normally installcheck users don't have
+# the required track_commit_timestamp on
+installcheck:;
+
+check: regresscheck
+
+submake-regress:
+	$(MAKE) -C $(top_builddir)/src/test/regress all
+
+submake-test_committs:
+	$(MAKE) -C $(top_builddir)/contrib/test_committs
+
+REGRESSCHECKS=committs_on
+
+regresscheck: all | submake-regress submake-test_committs
+	$(MKDIR_P) regression_output
+	$(pg_regress_check) \
+	    --temp-config $(top_srcdir)/contrib/test_committs/committs.conf \
+	    --temp-install=./tmp_check \
+	    --extra-install=contrib/test_committs \
+	    --outputdir=./regression_output \
+	    $(REGRESSCHECKS)
+
+regresscheck-install-force: | submake-regress submake-test_committs
+	$(pg_regress_installcheck) \
+	    --extra-install=contrib/test_committs \
+	    $(REGRESSCHECKS)
+
+PHONY: submake-test_committs submake-regress check \
+	regresscheck regresscheck-install-force
\ No newline at end of file
diff --git a/contrib/test_committs/committs.conf b/contrib/test_committs/committs.conf
new file mode 100644
index 0000000..d221a60
--- /dev/null
+++ b/contrib/test_committs/committs.conf
@@ -0,0 +1 @@
+track_commit_timestamp = on
\ No newline at end of file
diff --git a/contrib/test_committs/expected/committs_on.out b/contrib/test_committs/expected/committs_on.out
new file mode 100644
index 0000000..9920343
--- /dev/null
+++ b/contrib/test_committs/expected/committs_on.out
@@ -0,0 +1,21 @@
+--
+-- Commit Timestamp (on)
+--
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id, pg_get_transaction_extradata(xmin),
+       pg_get_transaction_committime(xmin) >= ts,
+       pg_get_transaction_committime(xmin) < now(),
+       pg_get_transaction_committime(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | pg_get_transaction_extradata | ?column? | ?column? | ?column? 
+----+------------------------------+----------+----------+----------
+  1 |                            0 | t        | t        | t
+  2 |                            0 | t        | t        | t
+  3 |                            0 | t        | t        | t
+(3 rows)
+
+DROP TABLE committs_test;
diff --git a/contrib/test_committs/sql/committs_on.sql b/contrib/test_committs/sql/committs_on.sql
new file mode 100644
index 0000000..aec6438
--- /dev/null
+++ b/contrib/test_committs/sql/committs_on.sql
@@ -0,0 +1,18 @@
+--
+-- Commit Timestamp (on)
+--
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+
+SELECT id, pg_get_transaction_extradata(xmin),
+       pg_get_transaction_committime(xmin) >= ts,
+       pg_get_transaction_committime(xmin) < now(),
+       pg_get_transaction_committime(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+
+DROP TABLE committs_test;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9494439..ef4c41e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2669,6 +2669,21 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
+      <term><varname>track_commit_timestamp</varname> (<type>bool</type>)</term>
+      <indexterm>
+       <primary><varname>track_commit_timestamp</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Record commit time of transactions.  This parameter
+        can only be set in
+        the <filename>postgresql.conf</> file or on the server command line.
+        The default value is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3a7cfa9..fa69c94 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15908,6 +15908,48 @@ SELECT collation for ('foo' COLLATE "de_DE");
     For example <literal>10:20:10,14,15</literal> means
     <literal>xmin=10, xmax=20, xip_list=10, 14, 15</literal>.
    </para>
+
+   <para>
+    The functions shown in <xref linkend="functions-committs">
+    provide information about transactions that have been already committed.
+    These functions mainly provide information about when the transactions
+    were committed. They only provide useful data when
+    <xref linkend="guc-track-commit-timestamp"> configuration option is enabled
+    and only for transactions that were committed after it was enabled.
+   </para>
+
+   <table id="functions-committs">
+    <title>Committed transaction information</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry></row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry><literal><function>pg_get_transaction_committime(<parameter>xid</parameter>)</function></literal></entry>
+       <entry><type>timestamp with time zone</type></entry>
+       <entry>get commit time of transaction</entry>
+      </row>
+      <row>
+       <entry><literal><function>pg_get_transaction_extradata(<parameter>xid</parameter>)</function></literal></entry>
+       <entry><type>integer</type></entry>
+       <entry>get additional data from transaction commit timestamp record</entry>
+      </row>
+      <row>
+       <entry><literal><function>pg_get_transaction_committime_data(<parameter>xid</parameter>)</function></literal></entry>
+       <entry><parameter>committime</> <type>timestamp with time zone</>, <parameter>extradata</> <type>integer</></entry>
+       <entry>get commit time and additional data from transaction commit timestamp</entry>
+      </row>
+      <row>
+       <entry><literal><function>pg_get_latest_transaction_committime_data()</function></literal></entry>
+       <entry><parameter>xid</> <type>xid</>, <parameter>committime</> <type>timestamp with time zone</>, <parameter>extradata</> <type>integer</></entry>
+       <entry>get transaction Id, commit timestamp and additional data of latest transaction commit</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
   </sect1>
 
   <sect1 id="functions-admin">
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 7d092d2..20c88a8 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,7 +8,8 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+OBJS = clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o \
+       heapdesc.o \
 	   mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
diff --git a/src/backend/access/rmgrdesc/committsdesc.c b/src/backend/access/rmgrdesc/committsdesc.c
new file mode 100644
index 0000000..2bf7fed
--- /dev/null
+++ b/src/backend/access/rmgrdesc/committsdesc.c
@@ -0,0 +1,75 @@
+/*-------------------------------------------------------------------------
+ *
+ * committsdesc.c
+ *    rmgr descriptor routines for access/transam/committs.c
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *    src/backend/access/rmgrdesc/committsdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/committs.h"
+#include "utils/timestamp.h"
+
+
+void
+committs_desc(StringInfo buf, XLogRecord *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+
+	if (info == COMMITTS_ZEROPAGE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "zeropage: %d", pageno);
+	}
+	else if (info == COMMITTS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "truncate before: %d", pageno);
+	}
+	else if (info == COMMITTS_SETTS)
+	{
+		xl_committs_set *xlrec = (xl_committs_set *) rec;
+		int		i;
+
+		appendStringInfo(buf, "set committs %s for: %u",
+						 timestamptz_to_str(xlrec->timestamp),
+						 xlrec->mainxid);
+		for (i = 0; i < xlrec->nsubxids; i++)
+			appendStringInfo(buf, ", %u", xlrec->subxids[i]);
+	}
+	else
+		appendStringInfo(buf, "UNKNOWN");
+}
+
+const char *
+committs_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info)
+	{
+		case COMMITTS_ZEROPAGE:
+			id = "ZEROPAGE";
+			break;
+		case COMMITTS_TRUNCATE:
+			id = "TRUNCATE";
+			break;
+		case COMMITTS_SETTS:
+			id = "SETTS";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index e0957ff..1333244 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -45,7 +45,7 @@ xlog_desc(StringInfo buf, XLogRecord *record)
 		appendStringInfo(buf, "redo %X/%X; "
 						 "tli %u; prev tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
-						 "oldest running xid %u; %s",
+						 "oldest CommitTs xid: %u; oldest running xid %u; %s",
 				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->PrevTimeLineID,
@@ -58,6 +58,7 @@ xlog_desc(StringInfo buf, XLogRecord *record)
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
 						 checkpoint->oldestMultiDB,
+						 checkpoint->oldestCommitTs,
 						 checkpoint->oldestActiveXid,
 				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index eb6cfc5..ace913e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -14,7 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
 	timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
-	xlogreader.o xlogutils.o
+	xlogreader.o xlogutils.o committs.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 27ca4c6..3300f84 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -152,8 +152,7 @@ TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
 		   status == TRANSACTION_STATUS_ABORTED);
 
 	/*
-	 * See how many subxids, if any, are on the same page as the parent, if
-	 * any.
+	 * See how many subxids, if any, are on the same page as the parent.
 	 */
 	for (i = 0; i < nsubxids; i++)
 	{
diff --git a/src/backend/access/transam/committs.c b/src/backend/access/transam/committs.c
new file mode 100644
index 0000000..e7298a5
--- /dev/null
+++ b/src/backend/access/transam/committs.c
@@ -0,0 +1,846 @@
+/*-------------------------------------------------------------------------
+ *
+ * committs.c
+ *		PostgreSQL commit timestamp manager
+ *
+ * This module is a pg_clog-like system that stores the commit timestamp
+ * for each transaction.
+ *
+ * XLOG interactions: this module generates an XLOG record whenever a new
+ * CommitTs page is initialized to zeroes.  Also, one XLOG record is
+ * generated for setting of values when the caller requests it; this allows
+ * us to support values coming from places other than transaction commit.
+ * Other writes of CommitTS come from recording of transaction commit in
+ * xact.c, which generates its own XLOG records for these events and will
+ * re-perform the status update on redo; so we need make no additional XLOG
+ * entry here.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/committs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/committs.h"
+#include "access/htup_details.h"
+#include "access/slru.h"
+#include "access/transam.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+
+/*
+ * Defines for CommitTs page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CommitTs page numbering also wraps around at
+ * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE, and CommitTs segment numbering at
+ * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCommitTs (see CommitTsPagePrecedes).
+ */
+
+/* We need 8+4 bytes per xact */
+typedef struct CommitTimestampEntry
+{
+	TimestampTz		time;
+	CommitExtraData	extra;
+} CommitTimestampEntry;
+
+#define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, extra) + \
+									sizeof(CommitExtraData))
+
+#define COMMITTS_XACTS_PER_PAGE \
+	(BLCKSZ / SizeOfCommitTimestampEntry)
+
+#define TransactionIdToCTsPage(xid)	\
+	((xid) / (TransactionId) COMMITTS_XACTS_PER_PAGE)
+#define TransactionIdToCTsEntry(xid)	\
+	((xid) % (TransactionId) COMMITTS_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CLOG control
+ */
+static SlruCtlData CommitTsCtlData;
+
+#define CommitTsCtl (&CommitTsCtlData)
+
+/*
+ * We keep a cache of the last value set in shared memory.  This is protected
+ * by CommitTsLock.
+ */
+typedef struct CommitTimestampShared
+{
+	TransactionId	xidLastCommit;
+	CommitTimestampEntry dataLastCommit;
+} CommitTimestampShared;
+
+CommitTimestampShared	*commitTsShared;
+
+
+/* GUC variables */
+bool	commit_ts_enabled;
+
+static void SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz committs,
+					 CommitExtraData extra, int pageno);
+static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz committs,
+						  CommitExtraData extra, int slotno);
+static int	ZeroCommitTsPage(int pageno, bool writeXlog);
+static bool CommitTsPagePrecedes(int page1, int page2);
+static void WriteZeroPageXlogRec(int pageno);
+static void WriteTruncateXlogRec(int pageno);
+static void WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 CommitExtraData data);
+
+
+/*
+ * TransactionTreeSetCommitTimestamp
+ *
+ * Record the final commit timestamp of transaction entries in the commit log
+ * for a transaction and its subtransaction tree, as efficiently as possible.
+ *
+ * xid is the top level transaction id.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ *
+ * The do_xlog parameter tells us whether to include a XLog record of this
+ * or not.  Normal path through RecordTransactionCommit() will be related
+ * to a transaction commit XLog record, and so should pass "false" here.
+ * Other callers probably want to pass true, so that the given values persist
+ * in case of crashes.
+ */
+void
+TransactionTreeSetCommitTimestamp(TransactionId xid, int nsubxids,
+								  TransactionId *subxids, TimestampTz timestamp,
+								  CommitExtraData extra, bool do_xlog)
+{
+	int			i;
+	TransactionId headxid;
+
+	Assert(xid != InvalidTransactionId);
+
+	if (!commit_ts_enabled)
+		return;
+
+	/*
+	 * Comply with the WAL-before-data rule: if caller specified it wants
+	 * this value to be recorded in WAL, do so before touching the data.
+	 */
+	if (do_xlog)
+		WriteSetTimestampXlogRec(xid, nsubxids, subxids, timestamp, extra);
+
+	/*
+	 * We split the xids to set the timestamp to in groups belonging to the
+	 * same SLRU page; the first element in each such set is its head.  The
+	 * first group has the main XID as the head; subsequent sets use the
+	 * first subxid not on the previous page as head.  This way, we only have
+	 * to lock/modify each SLRU page once.
+	 */
+	for (i = 0, headxid = xid;;)
+	{
+		int			pageno = TransactionIdToCTsPage(headxid);
+		int			j;
+
+		for (j = i; j < nsubxids; j++)
+		{
+			if (TransactionIdToCTsPage(subxids[j]) != pageno)
+				break;
+		}
+		/* subxids[i..j] are on the same page as the head */
+
+		SetXidCommitTsInPage(headxid, j - i, subxids + i, timestamp, extra,
+							 pageno);
+
+		/* if we wrote out all subxids, we're done. */
+		if (j + 1 >= nsubxids)
+			break;
+
+		/*
+		 * Set the new head and skip over it, as well as over the subxids
+		 * we just wrote.
+		 */
+		headxid = subxids[j];
+		i += j - i + 1;
+	}
+
+	/*
+	 * Update the cached value in shared memory
+	 */
+	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+	commitTsShared->xidLastCommit = xid;
+	commitTsShared->dataLastCommit.time = timestamp;
+	commitTsShared->dataLastCommit.extra = extra;
+	LWLockRelease(CommitTsLock);
+}
+
+/*
+ * Record the commit timestamp of transaction entries in the commit log for all
+ * entries on a single page.  Atomic only on this page.
+ */
+static void
+SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz committs,
+					 CommitExtraData extra, int pageno)
+{
+	int			slotno;
+	int			i;
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
+
+	TransactionIdSetCommitTs(xid, committs, extra, slotno);
+	for (i = 0; i < nsubxids; i++)
+		TransactionIdSetCommitTs(subxids[i], committs, extra, slotno);
+
+	CommitTsCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Sets the commit timestamp of a single transaction.
+ *
+ * Must be called with CommitTsControlLock held
+ */
+static void
+TransactionIdSetCommitTs(TransactionId xid, TimestampTz committs,
+						 CommitExtraData extra, int slotno)
+{
+	int			entryno = TransactionIdToCTsEntry(xid);
+	CommitTimestampEntry *entry;
+
+	entry = (CommitTimestampEntry *)
+		(CommitTsCtl->shared->page_buffer[slotno] +
+		 SizeOfCommitTimestampEntry * entryno);
+
+	entry->time = committs;
+	entry->extra = extra;
+}
+
+/*
+ * Interrogate the commit timestamp of a transaction.
+ */
+void
+TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
+							 CommitExtraData *data)
+{
+	int			pageno = TransactionIdToCTsPage(xid);
+	int			entryno = TransactionIdToCTsEntry(xid);
+	int			slotno;
+	CommitTimestampEntry *entry;
+	TransactionId oldestCommitTs;
+
+	/* Return empty if module not enabled */
+	if (!commit_ts_enabled)
+	{
+		if (ts)
+			*ts = InvalidTransactionId;
+		if (data)
+			*data = (CommitExtraData) 0;
+		return;
+	}
+
+	/* Also return empty if the requested value is older than what we have */
+	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+	oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	LWLockRelease(CommitTsControlLock);
+
+	if (!TransactionIdIsValid(oldestCommitTs) ||
+		TransactionIdPrecedes(xid, oldestCommitTs))
+	{
+		if (ts)
+			*ts = InvalidTransactionId;
+		if (data)
+			*data = (CommitExtraData) 0;
+		return;
+	}
+
+	/*
+	 * Use an unlocked atomic read on our cached value in shared memory;
+	 * if it's a hit, acquire a lock and read the data, after verifying
+	 * that it's still what we initially read.  Otherwise, fall through
+	 * to read from SLRU.
+	 */
+	if (commitTsShared->xidLastCommit == xid)
+	{
+		LWLockAcquire(CommitTsLock, LW_SHARED);
+		if (commitTsShared->xidLastCommit == xid)
+		{
+			if (ts)
+				*ts = commitTsShared->dataLastCommit.time;
+			if (data)
+				*data = commitTsShared->dataLastCommit.extra;
+			LWLockRelease(CommitTsLock);
+			return;
+		}
+		LWLockRelease(CommitTsLock);
+	}
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+	entry = (CommitTimestampEntry *)
+		(CommitTsCtl->shared->page_buffer[slotno] +
+		 SizeOfCommitTimestampEntry * entryno);
+
+	if (ts)
+		*ts = entry->time;
+
+	if (data)
+		*data = entry->extra;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Return the Xid of the latest committed transaction.  (As far as this module
+ * is concerned, anyway; it's up to the caller to ensure the value is useful
+ * for its purposes.)
+ *
+ * ts and extra are filled with the corresponding data; they can be passed
+ * as NULL if not wanted.
+ */
+TransactionId
+GetLatestCommitTimestampData(TimestampTz *ts, CommitExtraData *extra)
+{
+	TransactionId	xid;
+
+	/* Return empty if module not enabled */
+	if (!commit_ts_enabled)
+	{
+		if (ts)
+			*ts = InvalidTransactionId;
+		if (extra)
+			*extra = (CommitExtraData) 0;
+		return InvalidTransactionId;
+	}
+
+	LWLockAcquire(CommitTsLock, LW_SHARED);
+	xid = commitTsShared->xidLastCommit;
+	if (ts)
+		*ts = commitTsShared->dataLastCommit.time;
+	if (extra)
+		*extra = commitTsShared->dataLastCommit.extra;
+	LWLockRelease(CommitTsLock);
+
+	return xid;
+}
+
+/*
+ * SQL-callable wrapper to obtain commit time of a transaction
+ */
+PG_FUNCTION_INFO_V1(pg_get_transaction_committime);
+Datum
+pg_get_transaction_committime(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	TimestampTz		committs;
+
+	TransactionIdGetCommitTsData(xid, &committs, NULL);
+
+	PG_RETURN_TIMESTAMPTZ(committs);
+}
+
+PG_FUNCTION_INFO_V1(pg_get_transaction_extradata);
+Datum
+pg_get_transaction_extradata(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	CommitExtraData	data;
+
+	TransactionIdGetCommitTsData(xid, NULL, &data);
+
+	PG_RETURN_INT32(data);
+}
+
+PG_FUNCTION_INFO_V1(pg_get_transaction_committime_data);
+Datum
+pg_get_transaction_committime_data(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	TimestampTz		committs;
+	CommitExtraData	data;
+	Datum       values[2];
+	bool        nulls[2];
+	TupleDesc   tupdesc;
+	HeapTuple	htup;
+
+	/*
+	 * Construct a tuple descriptor for the result row.  This must match this
+	 * function's pg_proc entry!
+	 */
+	tupdesc = CreateTemplateTupleDesc(2, false);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "timestamp",
+					   TIMESTAMPTZOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "extra",
+					   INT4OID, -1, 0);
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	/* and construct a tuple with our data */
+	TransactionIdGetCommitTsData(xid, &committs, &data);
+
+	values[0] = TimestampTzGetDatum(committs);
+	nulls[0] = false;
+
+	values[1] = Int32GetDatum(data);
+	nulls[1] = false;
+
+	htup = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+PG_FUNCTION_INFO_V1(pg_get_latest_transaction_committime_data);
+Datum
+pg_get_latest_transaction_committime_data(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid;
+	TimestampTz		committs;
+	CommitExtraData	data;
+	Datum       values[3];
+	bool        nulls[3];
+	TupleDesc   tupdesc;
+	HeapTuple	htup;
+
+	/*
+	 * Construct a tuple descriptor for the result row.  This must match this
+	 * function's pg_proc entry!
+	 */
+	tupdesc = CreateTemplateTupleDesc(3, false);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "xid",
+					   XIDOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "timestamp",
+					   TIMESTAMPTZOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 3, "extra",
+					   INT4OID, -1, 0);
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	/* and construct a tuple with our data */
+	xid = GetLatestCommitTimestampData(&committs, &data);
+
+	values[0] = TransactionIdGetDatum(xid);
+	nulls[0] = false;
+
+	values[1] = TimestampTzGetDatum(committs);
+	nulls[1] = false;
+
+	values[2] = Int32GetDatum(data);
+	nulls[2] = false;
+
+	htup = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+/*
+ * Number of shared CommitTS buffers.
+ *
+ * We use a very similar logic as for the number of CLOG buffers; see comments
+ * in CLOGShmemBuffers.
+ */
+Size
+CommitTsShmemBuffers(void)
+{
+	return Min(16, Max(4, NBuffers / 1024));
+}
+
+/*
+ * Initialization of shared memory for CommitTs
+ */
+Size
+CommitTsShmemSize(void)
+{
+	return SimpleLruShmemSize(CommitTsShmemBuffers(), 0) +
+		sizeof(CommitTimestampShared);
+}
+
+void
+CommitTsShmemInit(void)
+{
+	bool	found;
+
+	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
+	SimpleLruInit(CommitTsCtl, "CommitTs Ctl", CommitTsShmemBuffers(), 0,
+				  CommitTsControlLock, "pg_committs");
+
+	commitTsShared = ShmemInitStruct("CommitTs shared",
+									 sizeof(CommitTimestampShared),
+									 &found);
+
+	if (!IsUnderPostmaster)
+	{
+		Assert(!found);
+
+		commitTsShared->xidLastCommit = InvalidTransactionId;
+		commitTsShared->dataLastCommit.time = 0;
+		commitTsShared->dataLastCommit.extra = 0;
+	}
+	else
+		Assert(found);
+}
+
+/*
+ * This function must be called ONCE on system install.
+ *
+ * (The CommitTs directory is assumed to have been created by initdb, and
+ * CommitTsShmemInit must have been called already.)
+ */
+void
+BootStrapCommitTs(void)
+{
+	/*
+	 * Nothing to do here at present, unlike most other SLRU modules; segments
+	 * are created when the server is started with this module enabled.
+	 * See StartupCommitTs.
+	 */
+}
+
+/*
+ * Initialize (or reinitialize) a page of CommitTs to zeroes.
+ * If writeXlog is TRUE, also emit an XLOG record saying we did this.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCommitTsPage(int pageno, bool writeXlog)
+{
+	int			slotno;
+
+	slotno = SimpleLruZeroPage(CommitTsCtl, pageno);
+
+	if (writeXlog)
+		WriteZeroPageXlogRec(pageno);
+
+	return slotno;
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ */
+void
+StartupCommitTs(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/*
+	 * Initialize our idea of the latest page number.
+	 */
+	CommitTsCtl->shared->latest_page_number = pageno;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * when commit timestamp is enabled.
+ * Must be called after recovery has finished.
+ *
+ * This is in charge of creating the currently active segment, if it's not
+ * already there.  The reason for this is that the server might have been
+ * running with this module disabled for a while and thus might have skipped
+ * the normal creation point.
+ */
+void
+InitCommitTs(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/*
+	 * Re-Initialize our idea of the latest page number.
+	 */
+	CommitTsCtl->shared->latest_page_number = pageno;
+
+	/*
+	 * If this module is not currently enabled, make sure we don't hand back
+	 * possibly-invalid data; also remove segments of old data.
+	 */
+	if (!commit_ts_enabled)
+	{
+		ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
+		LWLockRelease(CommitTsControlLock);
+
+		TruncateCommitTs(ReadNewTransactionId());
+
+		return;
+	}
+
+	/*
+	 * If CommitTs is enabled, but it wasn't in the previous server run, we
+	 * need to set the oldest value to the next Xid; that way, we will not try
+	 * to read data that might not have been set.
+	 *
+	 * XXX does this have a problem if a server is started with commitTs
+	 * enabled, then started with commitTs disabled, then restarted with it
+	 * enabled again?  It doesn't look like it does, because there should be a
+	 * checkpoint that sets the value to InvalidTransactionId at end of
+	 * recovery; and so any chance of injecting new transactions without
+	 * CommitTs values would occur after the oldestCommitTs has been set to
+	 * Invalid temporarily.
+	 */
+	if (ShmemVariableCache->oldestCommitTs == InvalidTransactionId)
+		ShmemVariableCache->oldestCommitTs = ReadNewTransactionId();
+
+	/* Finally, create the current segment file, if necessary */
+	if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
+	{
+		int		slotno;
+
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+	}
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, true);
+}
+
+/*
+ * Make sure that CommitTs has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CommitTs or xlog page to make room
+ * in shared memory.
+ */
+void
+ExtendCommitTs(TransactionId newestXact)
+{
+	int			pageno;
+
+	/* nothing to do if module not enabled */
+	if (!commit_ts_enabled)
+		return;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToCTsEntry(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToCTsPage(newestXact);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/* Zero the page and make an XLOG entry about it */
+	ZeroCommitTsPage(pageno, !InRecovery);
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Remove all CommitTs segments before the one holding the passed
+ * transaction ID
+ *
+ * Note that we don't need to flush XLOG here.
+ */
+void
+TruncateCommitTs(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate.
+	 */
+	cutoffPage = TransactionIdToCTsPage(oldestXact);
+
+	/* Check to see if there's any files that could be removed */
+	if (!SlruScanDirectory(CommitTsCtl, SlruScanDirCbReportPresence, &cutoffPage))
+		return;					/* nothing to remove */
+
+	/* Write XLOG record */
+	WriteTruncateXlogRec(cutoffPage);
+
+	/* Now we can remove the old CommitTs segment(s) */
+	SimpleLruTruncate(CommitTsCtl, cutoffPage);
+}
+
+/*
+ * Set the earliest value for which commit TS can be consulted.
+ */
+void
+SetCommitTsLimit(TransactionId oldestXact)
+{
+	/*
+	 * Be careful not to overwrite values that are either further into the
+	 * "future" or signal a disabled committs.
+	 */
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+	if (ShmemVariableCache->oldestCommitTs != InvalidTransactionId &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCommitTs, oldestXact))
+		ShmemVariableCache->oldestCommitTs = oldestXact;
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Decide which of two CLOG page numbers is "older" for truncation purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic.  However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CommitTsPagePrecedes(int page1, int page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * COMMITTS_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId;
+	xid2 = ((TransactionId) page2) * COMMITTS_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId;
+
+	return TransactionIdPrecedes(xid1, xid2);
+}
+
+
+/*
+ * Write a ZEROPAGE xlog record
+ */
+static void
+WriteZeroPageXlogRec(int pageno)
+{
+	XLogRecData rdata;
+
+	rdata.data = (char *) (&pageno);
+	rdata.len = sizeof(int);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	(void) XLogInsert(RM_COMMITTS_ID, COMMITTS_ZEROPAGE, &rdata);
+}
+
+/*
+ * Write a TRUNCATE xlog record
+ */
+static void
+WriteTruncateXlogRec(int pageno)
+{
+	XLogRecData rdata;
+
+	rdata.data = (char *) (&pageno);
+	rdata.len = sizeof(int);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	XLogInsert(RM_COMMITTS_ID, COMMITTS_TRUNCATE, &rdata);
+}
+
+/*
+ * Write a SETTS xlog record
+ */
+static void
+WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 CommitExtraData data)
+{
+	XLogRecData	rdata;
+	xl_committs_set	record;
+
+	record.timestamp = timestamp;
+	record.data = data;
+	record.mainxid = mainxid;
+	record.nsubxids = nsubxids;
+	memcpy(record.subxids, subxids, sizeof(TransactionId) * nsubxids);
+
+	rdata.data = (char *) &record;
+	rdata.len = offsetof(xl_committs_set, subxids) +
+		nsubxids * sizeof(TransactionId);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	XLogInsert(RM_COMMITTS_ID, COMMITTS_SETTS, &rdata);
+}
+
+
+/*
+ * CommitTS resource manager's routines
+ */
+void
+committs_redo(XLogRecPtr lsn, XLogRecord *record)
+{
+	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+
+	/* Backup blocks are not used in committs records */
+	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+	if (info == COMMITTS_ZEROPAGE)
+	{
+		int			pageno;
+		int			slotno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+
+		LWLockRelease(CommitTsControlLock);
+	}
+	else if (info == COMMITTS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		/*
+		 * During XLOG replay, latest_page_number isn't set up yet; insert a
+		 * suitable value to bypass the sanity test in SimpleLruTruncate.
+		 */
+		CommitTsCtl->shared->latest_page_number = pageno;
+
+		SimpleLruTruncate(CommitTsCtl, pageno);
+	}
+	else if (info == COMMITTS_SETTS)
+	{
+		xl_committs_set *setts = (xl_committs_set *) XLogRecGetData(record);
+
+		TransactionTreeSetCommitTimestamp(setts->mainxid, setts->nsubxids,
+										  setts->subxids, setts->timestamp,
+										  setts->data, false);
+	}
+	else
+		elog(PANIC, "committs_redo: unknown op code %u", info);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 2645a7a..53116f6 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -8,6 +8,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 7013fb8..c70bebe 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -14,6 +14,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -157,9 +158,10 @@ GetNewTransactionId(bool isSubXact)
 	 * XID before we zero the page.  Fortunately, a page of the commit log
 	 * holds 32K or more transactions, so we don't have to do this very often.
 	 *
-	 * Extend pg_subtrans too.
+	 * Extend pg_subtrans and pg_committs too.
 	 */
 	ExtendCLOG(xid);
+	ExtendCommitTs(xid);
 	ExtendSUBTRANS(xid);
 
 	/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5b5d31b..ca5d28f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -20,6 +20,7 @@
 #include <time.h>
 #include <unistd.h>
 
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1166,6 +1167,17 @@ RecordTransactionCommit(void)
 	}
 
 	/*
+	 * We don't need to log the commit timestamp separately since the commit
+	 * record logged above has all the necessary action to set the timestamp
+	 * again.
+	 */
+	if (markXidCommitted)
+	{
+		TransactionTreeSetCommitTimestamp(xid, nchildren, children,
+										  xactStopTimestamp, 0, false);
+	}
+
+	/*
 	 * Check if we want to commit asynchronously.  We can allow the XLOG flush
 	 * to happen asynchronously if synchronous_commit=off, or if the current
 	 * transaction has not performed any WAL-logged operation.  The latter
@@ -4683,6 +4695,7 @@ xactGetCommittedChildren(TransactionId **ptr)
  */
 static void
 xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
+						  TimestampTz commit_time,
 						  TransactionId *sub_xids, int nsubxacts,
 						  SharedInvalidationMessage *inval_msgs, int nmsgs,
 						  RelFileNode *xnodes, int nrels,
@@ -4710,6 +4723,10 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
 		LWLockRelease(XidGenLock);
 	}
 
+	/* Set the transaction commit time */
+	TransactionTreeSetCommitTimestamp(xid, nsubxacts, sub_xids,
+									  commit_time, 0, false);
+
 	if (standbyState == STANDBY_DISABLED)
 	{
 		/*
@@ -4829,7 +4846,8 @@ xact_redo_commit(xl_xact_commit *xlrec,
 	/* invalidation messages array follows subxids */
 	inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
 
-	xact_redo_commit_internal(xid, lsn, subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  subxacts, xlrec->nsubxacts,
 							  inval_msgs, xlrec->nmsgs,
 							  xlrec->xnodes, xlrec->nrels,
 							  xlrec->dbId,
@@ -4844,7 +4862,8 @@ static void
 xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
 						 TransactionId xid, XLogRecPtr lsn)
 {
-	xact_redo_commit_internal(xid, lsn, xlrec->subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  xlrec->subxacts, xlrec->nsubxacts,
 							  NULL, 0,	/* inval msgs */
 							  NULL, 0,	/* relfilenodes */
 							  InvalidOid,		/* dbId */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 235b442..2901d26 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -22,6 +22,7 @@
 #include <unistd.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -4945,6 +4946,7 @@ BootStrapXLOG(void)
 	checkPoint.oldestXidDB = TemplateDbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
+	checkPoint.oldestCommitTs = InvalidTransactionId;
 	checkPoint.time = (pg_time_t) time(NULL);
 	checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -4954,6 +4956,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(InvalidTransactionId);
 
 	/* Set up the XLOG page header */
 	page->xlp_magic = XLOG_PAGE_MAGIC;
@@ -5035,6 +5038,7 @@ BootStrapXLOG(void)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
 
@@ -6281,6 +6285,9 @@ StartupXLOG(void)
 	ereport(DEBUG1,
 			(errmsg("oldest MultiXactId: %u, in database %u",
 					checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
+	ereport(DEBUG1,
+			(errmsg("oldest CommitTs Xid: %u",
+					checkPoint.oldestCommitTs)));
 	if (!TransactionIdIsNormal(checkPoint.nextXid))
 		ereport(PANIC,
 				(errmsg("invalid next transaction ID")));
@@ -6292,6 +6299,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(checkPoint.oldestCommitTs);
 	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
 	XLogCtl->ckptXid = checkPoint.nextXid;
@@ -6513,11 +6521,12 @@ StartupXLOG(void)
 			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
 
 			/*
-			 * Startup commit log and subtrans only. MultiXact has already
-			 * been started up and other SLRUs are not maintained during
-			 * recovery and need not be started yet.
+			 * Startup commit log, commit timestamp and subtrans
+			 * only. MultiXact has already been started up and other SLRUs are
+			 * not maintained during recovery and need not be started yet.
 			 */
 			StartupCLOG();
+			StartupCommitTs();
 			StartupSUBTRANS(oldestActiveXID);
 
 			/*
@@ -7164,12 +7173,13 @@ StartupXLOG(void)
 	LWLockRelease(ProcArrayLock);
 
 	/*
-	 * Start up the commit log and subtrans, if not already done for hot
-	 * standby.
+	 * Start up the commit log, commit timestamp and subtrans, if not already
+	 * done for hot standby.
 	 */
 	if (standbyState == STANDBY_DISABLED)
 	{
 		StartupCLOG();
+		StartupCommitTs();
 		StartupSUBTRANS(oldestActiveXID);
 	}
 
@@ -7205,6 +7215,12 @@ StartupXLOG(void)
 	XLogReportParameters();
 
 	/*
+	 * Local WAL inserts enables, so it's time to finish initialization
+	 * of commit timestamp.
+	 */
+	InitCommitTs();
+
+	/*
 	 * All done.  Allow backends to write WAL.  (Although the bool flag is
 	 * probably atomic in itself, we use the info_lck here to ensure that
 	 * there are no race conditions concerning visibility of other recent
@@ -7750,6 +7766,7 @@ ShutdownXLOG(int code, Datum arg)
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
 	ShutdownCLOG();
+	ShutdownCommitTs();
 	ShutdownSUBTRANS();
 	ShutdownMultiXact();
 
@@ -8101,6 +8118,10 @@ CreateCheckPoint(int flags)
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
 	LWLockRelease(XidGenLock);
 
+	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+	checkPoint.oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	LWLockRelease(CommitTsControlLock);
+
 	/* Increase XID epoch if we've wrapped around since last checkpoint */
 	checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
 	if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
@@ -8386,6 +8407,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointCLOG();
+	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
 	CheckPointPredicate();
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e5fefa3..f5e7ddc 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -23,6 +23,7 @@
 #include <math.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -1055,6 +1056,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 * multixacts; that will be done by the next checkpoint.
 	 */
 	TruncateCLOG(frozenXID);
+	TruncateCommitTs(frozenXID);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
@@ -1064,6 +1066,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
 	SetMultiXactIdLimit(minMulti, minmulti_datoid);
+	SetCommitTsLimit(frozenXID);
 }
 
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 9f1b20e..f9b49c4 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -132,6 +132,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogRecord *record)
 		case RM_GIST_ID:
 		case RM_SEQ_ID:
 		case RM_SPGIST_ID:
+		case RM_COMMITTS_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) buf.record.xl_rmid);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 1d04c55..9025601 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -117,6 +118,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
 		size = add_size(size, CLOGShmemSize());
+		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
 		size = add_size(size, TwoPhaseShmemSize());
 		size = add_size(size, BackgroundWorkerShmemSize());
@@ -198,6 +200,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	 */
 	XLOGShmemInit();
 	CLOGShmemInit();
+	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
 	InitBufferPool();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ea82882..fb0e20d 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -46,6 +46,7 @@
 #include <signal.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 9fe6855..6794eed 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "commands/async.h"
@@ -259,6 +260,9 @@ NumLWLocks(void)
 	/* clog.c needs one per CLOG buffer */
 	numLocks += CLOGShmemBuffers();
 
+	/* committs.c needs one per CommitTs buffer */
+	numLocks += CommitTsShmemBuffers();
+
 	/* subtrans.c needs one per SubTrans buffer */
 	numLocks += NUM_SUBTRANS_BUFFERS;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8111b93..94081a2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -26,6 +26,7 @@
 #include <syslog.h>
 #endif
 
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -836,6 +837,15 @@ static struct config_bool ConfigureNamesBool[] =
 		check_bonjour, NULL, NULL
 	},
 	{
+		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
+			gettext_noop("Collects transaction commit time."),
+			NULL
+		},
+		&commit_ts_enabled,
+		false,
+		NULL, NULL, NULL
+	},
+	{
 		{"ssl", PGC_POSTMASTER, CONN_AUTH_SECURITY,
 			gettext_noop("Enables SSL connections."),
 			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dac6776..5e3e776 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -227,6 +227,7 @@
 #wal_sender_timeout = 60s	# in milliseconds; 0 disables
 
 #max_replication_slots = 0	# max number of replication slots
+#track_commit_timestamp = off	# collect timestamp of transaction commit
 				# (change requires restart)
 
 # - Master Server -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index c8ff2cb..3935bab 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -185,6 +185,7 @@ static const char *subdirs[] = {
 	"pg_xlog",
 	"pg_xlog/archive_status",
 	"pg_clog",
+	"pg_committs",
 	"pg_dynshmem",
 	"pg_notify",
 	"pg_serial",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 118e653..8dc3e00 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -240,6 +240,8 @@ main(int argc, char *argv[])
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 028a1f0..8744d04 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -62,6 +62,7 @@ static bool guessed = false;	/* T if we had to guess at any values */
 static const char *progname;
 static uint32 set_xid_epoch = (uint32) -1;
 static TransactionId set_xid = 0;
+static TransactionId set_committs = 0;
 static Oid	set_oid = 0;
 static MultiXactId set_mxid = 0;
 static MultiXactOffset set_mxoff = (MultiXactOffset) -1;
@@ -111,7 +112,7 @@ main(int argc, char *argv[])
 	}
 
 
-	while ((c = getopt(argc, argv, "D:fl:m:no:O:x:e:")) != -1)
+	while ((c = getopt(argc, argv, "D:fl:m:no:O:x:e:c:")) != -1)
 	{
 		switch (c)
 		{
@@ -157,6 +158,21 @@ main(int argc, char *argv[])
 				}
 				break;
 
+			case 'c':
+				set_committs = strtoul(optarg, &endptr, 0);
+				if (endptr == optarg || *endptr != '\0')
+				{
+					fprintf(stderr, _("%s: invalid argument for option -c\n"), progname);
+					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+					exit(1);
+				}
+				if (set_committs == 0)
+				{
+					fprintf(stderr, _("%s: transaction ID (-c) must not be 0\n"), progname);
+					exit(1);
+				}
+				break;
+
 			case 'o':
 				set_oid = strtoul(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0')
@@ -333,6 +349,9 @@ main(int argc, char *argv[])
 		ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
 	}
 
+	if (set_committs != 0)
+		ControlFile.checkPointCopy.oldestCommitTs = set_committs;
+
 	if (set_oid != 0)
 		ControlFile.checkPointCopy.nextOid = set_oid;
 
@@ -609,6 +628,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Maximum data alignment:               %u\n"),
 		   ControlFile.maxAlign);
 	/* we don't print floatFormat since can't say much useful about it */
@@ -690,6 +711,12 @@ PrintNewControlValues()
 		printf(_("NextXID epoch:                        %u\n"),
 			   ControlFile.checkPointCopy.nextXidEpoch);
 	}
+
+	if (set_committs != 0)
+	{
+		printf(_("oldestCommitTs:                       %u\n"),
+			   ControlFile.checkPointCopy.oldestCommitTs);
+	}
 }
 
 
@@ -1092,6 +1119,7 @@ usage(void)
 	printf(_("  -O OFFSET        set next multitransaction offset\n"));
 	printf(_("  -V, --version    output version information, then exit\n"));
 	printf(_("  -x XID           set next transaction ID\n"));
+	printf(_("  -c XID           set the oldest retrievable commit timestamp\n"));
 	printf(_("  -?, --help       show this help, then exit\n"));
 	printf(_("\nReport bugs to <pgsql-bugs@postgresql.org>.\n"));
 }
diff --git a/src/include/access/committs.h b/src/include/access/committs.h
new file mode 100644
index 0000000..0f96185
--- /dev/null
+++ b/src/include/access/committs.h
@@ -0,0 +1,63 @@
+/*
+ * committs.h
+ *
+ * PostgreSQL commit timestamp manager
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/committs.h
+ */
+#ifndef COMMITTS_H
+#define COMMITTS_H
+
+#include "access/xlog.h"
+#include "datatype/timestamp.h"
+
+
+extern PGDLLIMPORT bool	commit_ts_enabled;
+
+typedef uint32 CommitExtraData;
+
+extern void TransactionTreeSetCommitTimestamp(TransactionId xid, int nsubxids,
+								  TransactionId *subxids,
+								  TimestampTz timestamp,
+								  CommitExtraData data,
+								  bool do_xlog);
+extern void TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
+							 CommitExtraData *data);
+extern TransactionId GetLatestCommitTimestampData(TimestampTz *ts,
+							 CommitExtraData *extra);
+
+extern Size CommitTsShmemBuffers(void);
+extern Size CommitTsShmemSize(void);
+extern void CommitTsShmemInit(void);
+extern void BootStrapCommitTs(void);
+extern void StartupCommitTs(void);
+extern void InitCommitTs(void);
+extern void ShutdownCommitTs(void);
+extern void CheckPointCommitTs(void);
+extern void ExtendCommitTs(TransactionId newestXact);
+extern void TruncateCommitTs(TransactionId oldestXact);
+extern void SetCommitTsLimit(TransactionId oldestXact);
+
+/* XLOG stuff */
+#define COMMITTS_ZEROPAGE		0x00
+#define COMMITTS_TRUNCATE		0x10
+#define COMMITTS_SETTS			0x20
+
+typedef struct xl_committs_set
+{
+	TimestampTz		timestamp;
+	CommitExtraData	data;
+	TransactionId	mainxid;
+	int				nsubxids;
+	TransactionId	subxids[FLEXIBLE_ARRAY_MEMBER];
+} xl_committs_set;
+
+
+extern void committs_redo(XLogRecPtr lsn, XLogRecord *record);
+extern void committs_desc(StringInfo buf, XLogRecord *record);
+extern const char *committs_identify(uint8 info);
+
+#endif   /* COMMITTS_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 77d4574..c648a6a 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -24,7 +24,7 @@
  * Changes to this list possibly need a XLOG_PAGE_MAGIC bump.
  */
 
-/* symbol name, textual name, redo, desc, startup, cleanup */
+/* symbol name, textual name, redo, desc, identify, startup, cleanup */
 PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
 PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
 PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
@@ -42,3 +42,4 @@ PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gi
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
 PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
+PG_RMGR(RM_COMMITTS_ID, "CommitTs", committs_redo, committs_desc, committs_identify, NULL, NULL)
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 32d1b29..b59fd98 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -124,6 +124,11 @@ typedef struct VariableCacheData
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
 
 	/*
+	 * These fields are protected by CommitTsControlLock
+	 */
+	TransactionId oldestCommitTs;
+
+	/*
 	 * These fields are protected by ProcArrayLock.
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ba79d25..9e048ea 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -46,6 +46,7 @@ typedef struct CheckPoint
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
+	TransactionId oldestCommitTs; /* oldest Xid with valid commit timestamp */
 
 	/*
 	 * Oldest XID still running. This is only needed to initialize hot standby
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 4736532..36dd72f 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2988,6 +2988,18 @@ DESCR("view two-phase transactions");
 DATA(insert OID = 3819 (  pg_get_multixact_members PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 2249 "28" "{28,28,25}" "{i,o,o}" "{multixid,xid,mode}" _null_ pg_get_multixact_members _null_ _null_ _null_ ));
 DESCR("view members of a multixactid");
 
+DATA(insert OID = 3787 ( pg_get_transaction_committime PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 1184 "28" _null_ _null_ _null_ _null_ pg_get_transaction_committime _null_ _null_ _null_ ));
+DESCR("get commit time of transaction");
+
+DATA(insert OID = 3788 ( pg_get_transaction_extradata PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 23 "28" _null_ _null_ _null_ _null_ pg_get_transaction_extradata _null_ _null_ _null_ ));
+DESCR("get additional data from transaction commit timestamp record");
+
+DATA(insert OID = 3789 ( pg_get_transaction_committime_data PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 2249 "28" "{28,1184,23}" "{i,o,o}" "{xid,committime,extradata}" _null_ pg_get_transaction_committime_data _null_ _null_ _null_ ));
+DESCR("get commit time and additional data from transaction commit timestamp record");
+
+DATA(insert OID = 3790 ( pg_get_latest_transaction_committime_data PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 2249 "" "{28,1184,23}" "{o,o,o}" "{xid,committime,extradata}" _null_ pg_get_latest_transaction_committime_data _null_ _null_ _null_ ));
+DESCR("get transaction Id, commit timestamp and additional data of latest transaction commit");
+
 DATA(insert OID = 3537 (  pg_describe_object		PGNSP PGUID 12 1 0 0 0 f f f f t f s 3 0 25 "26 26 23" _null_ _null_ _null_ _null_ pg_describe_object _null_ _null_ _null_ ));
 DESCR("get identification of SQL object");
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 02c8f1a..20d79a4 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -127,7 +127,10 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define AutoFileLock				(&MainLWLockArray[35].lock)
 #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
 #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
-#define NUM_INDIVIDUAL_LWLOCKS		38
+#define CommitTsControlLock			(&MainLWLockArray[38].lock)
+#define CommitTsLock				(&MainLWLockArray[39].lock)
+
+#define NUM_INDIVIDUAL_LWLOCKS		40
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index fb1b4a4..5be1631 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1180,6 +1180,12 @@ extern Datum pg_prepared_xact(PG_FUNCTION_ARGS);
 /* access/transam/multixact.c */
 extern Datum pg_get_multixact_members(PG_FUNCTION_ARGS);
 
+/* access/transam/committs.c */
+extern Datum pg_get_transaction_committime(PG_FUNCTION_ARGS);
+extern Datum pg_get_transaction_extradata(PG_FUNCTION_ARGS);
+extern Datum pg_get_transaction_committime_data(PG_FUNCTION_ARGS);
+extern Datum pg_get_latest_transaction_committime_data(PG_FUNCTION_ARGS);
+
 /* catalogs/dependency.c */
 extern Datum pg_describe_object(PG_FUNCTION_ARGS);
 extern Datum pg_identify_object(PG_FUNCTION_ARGS);
diff --git a/src/test/regress/expected/committs_off.out b/src/test/regress/expected/committs_off.out
new file mode 100644
index 0000000..0a94f9d
--- /dev/null
+++ b/src/test/regress/expected/committs_off.out
@@ -0,0 +1,21 @@
+--
+-- Commit Timestamp (off)
+--
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id, pg_get_transaction_extradata(xmin),
+       pg_get_transaction_committime(xmin) >= ts,
+       pg_get_transaction_committime(xmin) < now(),
+       pg_get_transaction_committime(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | pg_get_transaction_extradata | ?column? | ?column? | ?column? 
+----+------------------------------+----------+----------+----------
+  1 |                            0 | f        | t        | t
+  2 |                            0 | f        | t        | t
+  3 |                            0 | f        | t        | t
+(3 rows)
+
+DROP TABLE committs_test;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 9902dbe..abc6800 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -88,7 +88,7 @@ test: privileges security_label collate matview lock replica_identity rowsecurit
 # ----------
 # Another group of parallel tests
 # ----------
-test: alter_generic misc psql async
+test: alter_generic misc psql async committs_off
 
 # rules cannot run concurrently with any test that creates a view
 test: rules
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 2902a05..d190ad2 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -147,3 +147,4 @@ test: largeobject
 test: with
 test: xml
 test: stats
+test: committs_off
diff --git a/src/test/regress/sql/committs_off.sql b/src/test/regress/sql/committs_off.sql
new file mode 100644
index 0000000..0f97666
--- /dev/null
+++ b/src/test/regress/sql/committs_off.sql
@@ -0,0 +1,18 @@
+--
+-- Commit Timestamp (off)
+--
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+
+SELECT id, pg_get_transaction_extradata(xmin),
+       pg_get_transaction_committime(xmin) >= ts,
+       pg_get_transaction_committime(xmin) < now(),
+       pg_get_transaction_committime(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+
+DROP TABLE committs_test;
#12Simon Riggs
simon@2ndQuadrant.com
In reply to: Petr Jelinek (#11)
Re: tracking commit timestamps

On 13 October 2014 10:05, Petr Jelinek <petr@2ndquadrant.com> wrote:

I worked bit on this patch to make it closer to committable state.

Here is updated version that works with current HEAD for the October
committfest.

I've reviewed this and it looks good to me. Clean, follows existing
code neatly, documented and tested.

Please could you document a few things

* ExtendCommitTS() works only because commit_ts_enabled can only be
set at server start.
We need that documented so somebody doesn't make it more easily
enabled and break something.
(Could we make it enabled at next checkpoint or similar?)

* The SLRU tracks timestamps of both xids and subxids. We need to
document that it does this because Subtrans SLRU is not persistent. If
we made Subtrans persistent we might need to store only the top level
xid's commitTS, but that's very useful for typical use cases and
wouldn't save much time at commit.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Michael Paquier
michael.paquier@gmail.com
In reply to: Simon Riggs (#12)
Re: tracking commit timestamps

On Tue, Oct 28, 2014 at 9:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 13 October 2014 10:05, Petr Jelinek <petr@2ndquadrant.com> wrote:

I worked bit on this patch to make it closer to committable state.

Here is updated version that works with current HEAD for the October
committfest.

I've reviewed this and it looks good to me. Clean, follows existing
code neatly, documented and tested.

Please could you document a few things

* ExtendCommitTS() works only because commit_ts_enabled can only be
set at server start.
We need that documented so somebody doesn't make it more easily
enabled and break something.
(Could we make it enabled at next checkpoint or similar?)

* The SLRU tracks timestamps of both xids and subxids. We need to
document that it does this because Subtrans SLRU is not persistent. If
we made Subtrans persistent we might need to store only the top level
xid's commitTS, but that's very useful for typical use cases and
wouldn't save much time at commit.

Hm. What is the performance impact of this feature using the latest version
of this patch? I imagine that the penalty of the additional operations this
feature introduces is not zero, particularly for loads with lots of short
transactions.
--
Michael

#14Andres Freund
andres@2ndquadrant.com
In reply to: Michael Paquier (#13)
Re: tracking commit timestamps

On 2014-10-31 14:55:11 +0900, Michael Paquier wrote:

On Tue, Oct 28, 2014 at 9:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 13 October 2014 10:05, Petr Jelinek <petr@2ndquadrant.com> wrote:

I worked bit on this patch to make it closer to committable state.

Here is updated version that works with current HEAD for the October
committfest.

I've reviewed this and it looks good to me. Clean, follows existing
code neatly, documented and tested.

Please could you document a few things

* ExtendCommitTS() works only because commit_ts_enabled can only be
set at server start.
We need that documented so somebody doesn't make it more easily
enabled and break something.
(Could we make it enabled at next checkpoint or similar?)

* The SLRU tracks timestamps of both xids and subxids. We need to
document that it does this because Subtrans SLRU is not persistent. If
we made Subtrans persistent we might need to store only the top level
xid's commitTS, but that's very useful for typical use cases and
wouldn't save much time at commit.

Hm. What is the performance impact of this feature using the latest version
of this patch?

I haven't measured it recently, but it wasn't large, but measureable.

I imagine that the penalty of the additional operations this
feature introduces is not zero, particularly for loads with lots of short
transactions.

Which is why you can disable it...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Merlin Moncure
mmoncure@gmail.com
In reply to: Robert Haas (#7)
Re: tracking commit timestamps

On Tue, Dec 10, 2013 at 2:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Speaking of the functionality this does offer, it seems pretty limited. A
commit timestamp is nice, but it isn't very interesting on its own. You
really also want to know what the transaction did, who ran it, etc. ISTM
some kind of a auditing or log-parsing system that could tell you all that
would be much more useful, but this patch doesn't get us any closer to that.

For what it's worth, I think that this has been requested numerous
times over the years by numerous developers of replication solutions.
My main question (apart from whether or not it may have bugs) is
whether it makes a noticeable performance difference. If it does,
that sucks. If it does not, maybe we ought to enable it by default.

+1

It's also requested now and then in the context of auditing and
forensic analysis of application problems. But I also agree that the
tolerance for performance overhead is got to be quite low. If a GUC
is introduced to manage the tradeoff, it should be defaulted to 'on'.

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#15)
Re: tracking commit timestamps

Merlin Moncure <mmoncure@gmail.com> writes:

It's also requested now and then in the context of auditing and
forensic analysis of application problems. But I also agree that the
tolerance for performance overhead is got to be quite low. If a GUC
is introduced to manage the tradeoff, it should be defaulted to 'on'.

Alvaro's original submission specified that the feature defaults to "off".
Since there's no use-case for it unless you've installed some third-party
code (eg an external replication solution), I think that should stay true.
The feature's overhead might be small, but it is most certainly not zero,
and people shouldn't be paying for it unless they need it.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Petr Jelinek
petr@2ndquadrant.com
In reply to: Tom Lane (#16)
Re: tracking commit timestamps

On 31/10/14 15:07, Tom Lane wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

It's also requested now and then in the context of auditing and
forensic analysis of application problems. But I also agree that the
tolerance for performance overhead is got to be quite low. If a GUC
is introduced to manage the tradeoff, it should be defaulted to 'on'.

Alvaro's original submission specified that the feature defaults to "off".
Since there's no use-case for it unless you've installed some third-party
code (eg an external replication solution), I think that should stay true.
The feature's overhead might be small, but it is most certainly not zero,
and people shouldn't be paying for it unless they need it.

Agreed, that's why it stayed 'off' in my version too.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Petr Jelinek
petr@2ndquadrant.com
In reply to: Simon Riggs (#12)
1 attachment(s)
Re: tracking commit timestamps

Hi,

On 28/10/14 13:25, Simon Riggs wrote:

On 13 October 2014 10:05, Petr Jelinek <petr@2ndquadrant.com> wrote:

I worked bit on this patch to make it closer to committable state.

Here is updated version that works with current HEAD for the October
committfest.

I've reviewed this and it looks good to me. Clean, follows existing
code neatly, documented and tested.

Thanks for looking at this.

Please could you document a few things

* ExtendCommitTS() works only because commit_ts_enabled can only be
set at server start.
We need that documented so somebody doesn't make it more easily
enabled and break something.
(Could we make it enabled at next checkpoint or similar?)

Maybe we could, but that means some kind of two step enabling facility
and I don't want to write that as part of the initial patch since that
will need separate discussion (i.e. do we want to have new GUC flag for
this, or hack solution just for committs?). So maybe later in a
follow-up patch.

* The SLRU tracks timestamps of both xids and subxids. We need to
document that it does this because Subtrans SLRU is not persistent. If
we made Subtrans persistent we might need to store only the top level
xid's commitTS, but that's very useful for typical use cases and
wouldn't save much time at commit.

Attached version with the above comments near the relevant code.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

committs-v7.patchtext/x-diff; name=committs-v7.patchDownload
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
index 3b8241b..f0a023f 100644
--- a/contrib/pg_upgrade/pg_upgrade.c
+++ b/contrib/pg_upgrade/pg_upgrade.c
@@ -423,8 +423,10 @@ copy_clog_xlog_xid(void)
 	/* set the next transaction id and epoch of the new cluster */
 	prep_status("Setting next transaction ID and epoch for new cluster");
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
-			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
-			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
+			  "\"%s/pg_resetxlog\" -f -x %u -c %u \"%s\"",
+			  new_cluster.bindir,
+			  old_cluster.controldata.chkpnt_nxtxid,
+			  old_cluster.controldata.chkpnt_nxtxid,
 			  new_cluster.pgdata);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
 			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index bfb3573..c0a0409 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -9,6 +9,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/contrib/test_committs/.gitignore b/contrib/test_committs/.gitignore
new file mode 100644
index 0000000..1f95503
--- /dev/null
+++ b/contrib/test_committs/.gitignore
@@ -0,0 +1,5 @@
+# Generated subdirectories
+/log/
+/isolation_output/
+/regression_output/
+/tmp_check/
diff --git a/contrib/test_committs/Makefile b/contrib/test_committs/Makefile
new file mode 100644
index 0000000..2240749
--- /dev/null
+++ b/contrib/test_committs/Makefile
@@ -0,0 +1,45 @@
+# Note: because we don't tell the Makefile there are any regression tests,
+# we have to clean those result files explicitly
+EXTRA_CLEAN = $(pg_regress_clean_files) ./regression_output ./isolation_output
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_committs
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# We can't support installcheck because normally installcheck users don't have
+# the required track_commit_timestamp on
+installcheck:;
+
+check: regresscheck
+
+submake-regress:
+	$(MAKE) -C $(top_builddir)/src/test/regress all
+
+submake-test_committs:
+	$(MAKE) -C $(top_builddir)/contrib/test_committs
+
+REGRESSCHECKS=committs_on
+
+regresscheck: all | submake-regress submake-test_committs
+	$(MKDIR_P) regression_output
+	$(pg_regress_check) \
+	    --temp-config $(top_srcdir)/contrib/test_committs/committs.conf \
+	    --temp-install=./tmp_check \
+	    --extra-install=contrib/test_committs \
+	    --outputdir=./regression_output \
+	    $(REGRESSCHECKS)
+
+regresscheck-install-force: | submake-regress submake-test_committs
+	$(pg_regress_installcheck) \
+	    --extra-install=contrib/test_committs \
+	    $(REGRESSCHECKS)
+
+PHONY: submake-test_committs submake-regress check \
+	regresscheck regresscheck-install-force
\ No newline at end of file
diff --git a/contrib/test_committs/committs.conf b/contrib/test_committs/committs.conf
new file mode 100644
index 0000000..d221a60
--- /dev/null
+++ b/contrib/test_committs/committs.conf
@@ -0,0 +1 @@
+track_commit_timestamp = on
\ No newline at end of file
diff --git a/contrib/test_committs/expected/committs_on.out b/contrib/test_committs/expected/committs_on.out
new file mode 100644
index 0000000..9920343
--- /dev/null
+++ b/contrib/test_committs/expected/committs_on.out
@@ -0,0 +1,21 @@
+--
+-- Commit Timestamp (on)
+--
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id, pg_get_transaction_extradata(xmin),
+       pg_get_transaction_committime(xmin) >= ts,
+       pg_get_transaction_committime(xmin) < now(),
+       pg_get_transaction_committime(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | pg_get_transaction_extradata | ?column? | ?column? | ?column? 
+----+------------------------------+----------+----------+----------
+  1 |                            0 | t        | t        | t
+  2 |                            0 | t        | t        | t
+  3 |                            0 | t        | t        | t
+(3 rows)
+
+DROP TABLE committs_test;
diff --git a/contrib/test_committs/sql/committs_on.sql b/contrib/test_committs/sql/committs_on.sql
new file mode 100644
index 0000000..aec6438
--- /dev/null
+++ b/contrib/test_committs/sql/committs_on.sql
@@ -0,0 +1,18 @@
+--
+-- Commit Timestamp (on)
+--
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+
+SELECT id, pg_get_transaction_extradata(xmin),
+       pg_get_transaction_committime(xmin) >= ts,
+       pg_get_transaction_committime(xmin) < now(),
+       pg_get_transaction_committime(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+
+DROP TABLE committs_test;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 47b1192..96fb720 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2673,6 +2673,21 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
+      <term><varname>track_commit_timestamp</varname> (<type>bool</type>)</term>
+      <indexterm>
+       <primary><varname>track_commit_timestamp</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Record commit time of transactions.  This parameter
+        can only be set in
+        the <filename>postgresql.conf</> file or on the server command line.
+        The default value is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7e5bcd9..13d6fc5 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15910,6 +15910,48 @@ SELECT collation for ('foo' COLLATE "de_DE");
     For example <literal>10:20:10,14,15</literal> means
     <literal>xmin=10, xmax=20, xip_list=10, 14, 15</literal>.
    </para>
+
+   <para>
+    The functions shown in <xref linkend="functions-committs">
+    provide information about transactions that have been already committed.
+    These functions mainly provide information about when the transactions
+    were committed. They only provide useful data when
+    <xref linkend="guc-track-commit-timestamp"> configuration option is enabled
+    and only for transactions that were committed after it was enabled.
+   </para>
+
+   <table id="functions-committs">
+    <title>Committed transaction information</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry></row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry><literal><function>pg_get_transaction_committime(<parameter>xid</parameter>)</function></literal></entry>
+       <entry><type>timestamp with time zone</type></entry>
+       <entry>get commit time of transaction</entry>
+      </row>
+      <row>
+       <entry><literal><function>pg_get_transaction_extradata(<parameter>xid</parameter>)</function></literal></entry>
+       <entry><type>integer</type></entry>
+       <entry>get additional data from transaction commit timestamp record</entry>
+      </row>
+      <row>
+       <entry><literal><function>pg_get_transaction_committime_data(<parameter>xid</parameter>)</function></literal></entry>
+       <entry><parameter>committime</> <type>timestamp with time zone</>, <parameter>extradata</> <type>integer</></entry>
+       <entry>get commit time and additional data from transaction commit timestamp</entry>
+      </row>
+      <row>
+       <entry><literal><function>pg_get_latest_transaction_committime_data()</function></literal></entry>
+       <entry><parameter>xid</> <type>xid</>, <parameter>committime</> <type>timestamp with time zone</>, <parameter>extradata</> <type>integer</></entry>
+       <entry>get transaction Id, commit timestamp and additional data of latest transaction commit</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
   </sect1>
 
   <sect1 id="functions-admin">
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 7d092d2..20c88a8 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,7 +8,8 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+OBJS = clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o \
+       heapdesc.o \
 	   mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
diff --git a/src/backend/access/rmgrdesc/committsdesc.c b/src/backend/access/rmgrdesc/committsdesc.c
new file mode 100644
index 0000000..2bf7fed
--- /dev/null
+++ b/src/backend/access/rmgrdesc/committsdesc.c
@@ -0,0 +1,75 @@
+/*-------------------------------------------------------------------------
+ *
+ * committsdesc.c
+ *    rmgr descriptor routines for access/transam/committs.c
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *    src/backend/access/rmgrdesc/committsdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/committs.h"
+#include "utils/timestamp.h"
+
+
+void
+committs_desc(StringInfo buf, XLogRecord *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+
+	if (info == COMMITTS_ZEROPAGE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "zeropage: %d", pageno);
+	}
+	else if (info == COMMITTS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "truncate before: %d", pageno);
+	}
+	else if (info == COMMITTS_SETTS)
+	{
+		xl_committs_set *xlrec = (xl_committs_set *) rec;
+		int		i;
+
+		appendStringInfo(buf, "set committs %s for: %u",
+						 timestamptz_to_str(xlrec->timestamp),
+						 xlrec->mainxid);
+		for (i = 0; i < xlrec->nsubxids; i++)
+			appendStringInfo(buf, ", %u", xlrec->subxids[i]);
+	}
+	else
+		appendStringInfo(buf, "UNKNOWN");
+}
+
+const char *
+committs_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info)
+	{
+		case COMMITTS_ZEROPAGE:
+			id = "ZEROPAGE";
+			break;
+		case COMMITTS_TRUNCATE:
+			id = "TRUNCATE";
+			break;
+		case COMMITTS_SETTS:
+			id = "SETTS";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index e0957ff..1333244 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -45,7 +45,7 @@ xlog_desc(StringInfo buf, XLogRecord *record)
 		appendStringInfo(buf, "redo %X/%X; "
 						 "tli %u; prev tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
-						 "oldest running xid %u; %s",
+						 "oldest CommitTs xid: %u; oldest running xid %u; %s",
 				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->PrevTimeLineID,
@@ -58,6 +58,7 @@ xlog_desc(StringInfo buf, XLogRecord *record)
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
 						 checkpoint->oldestMultiDB,
+						 checkpoint->oldestCommitTs,
 						 checkpoint->oldestActiveXid,
 				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index eb6cfc5..ace913e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -14,7 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
 	timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
-	xlogreader.o xlogutils.o
+	xlogreader.o xlogutils.o committs.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 27ca4c6..3300f84 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -152,8 +152,7 @@ TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
 		   status == TRANSACTION_STATUS_ABORTED);
 
 	/*
-	 * See how many subxids, if any, are on the same page as the parent, if
-	 * any.
+	 * See how many subxids, if any, are on the same page as the parent.
 	 */
 	for (i = 0; i < nsubxids; i++)
 	{
diff --git a/src/backend/access/transam/committs.c b/src/backend/access/transam/committs.c
new file mode 100644
index 0000000..e17dd4f
--- /dev/null
+++ b/src/backend/access/transam/committs.c
@@ -0,0 +1,855 @@
+/*-------------------------------------------------------------------------
+ *
+ * committs.c
+ *		PostgreSQL commit timestamp manager
+ *
+ * This module is a pg_clog-like system that stores the commit timestamp
+ * for each transaction.
+ *
+ * XLOG interactions: this module generates an XLOG record whenever a new
+ * CommitTs page is initialized to zeroes.  Also, one XLOG record is
+ * generated for setting of values when the caller requests it; this allows
+ * us to support values coming from places other than transaction commit.
+ * Other writes of CommitTS come from recording of transaction commit in
+ * xact.c, which generates its own XLOG records for these events and will
+ * re-perform the status update on redo; so we need make no additional XLOG
+ * entry here.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/committs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/committs.h"
+#include "access/htup_details.h"
+#include "access/slru.h"
+#include "access/transam.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+
+/*
+ * Defines for CommitTs page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CommitTs page numbering also wraps around at
+ * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE, and CommitTs segment numbering at
+ * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCommitTs (see CommitTsPagePrecedes).
+ */
+
+/* We need 8+4 bytes per xact */
+typedef struct CommitTimestampEntry
+{
+	TimestampTz		time;
+	CommitExtraData	extra;
+} CommitTimestampEntry;
+
+#define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, extra) + \
+									sizeof(CommitExtraData))
+
+#define COMMITTS_XACTS_PER_PAGE \
+	(BLCKSZ / SizeOfCommitTimestampEntry)
+
+#define TransactionIdToCTsPage(xid)	\
+	((xid) / (TransactionId) COMMITTS_XACTS_PER_PAGE)
+#define TransactionIdToCTsEntry(xid)	\
+	((xid) % (TransactionId) COMMITTS_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CLOG control
+ */
+static SlruCtlData CommitTsCtlData;
+
+#define CommitTsCtl (&CommitTsCtlData)
+
+/*
+ * We keep a cache of the last value set in shared memory.  This is protected
+ * by CommitTsLock.
+ */
+typedef struct CommitTimestampShared
+{
+	TransactionId	xidLastCommit;
+	CommitTimestampEntry dataLastCommit;
+} CommitTimestampShared;
+
+CommitTimestampShared	*commitTsShared;
+
+
+/* GUC variables */
+bool	commit_ts_enabled;
+
+static void SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz committs,
+					 CommitExtraData extra, int pageno);
+static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz committs,
+						  CommitExtraData extra, int slotno);
+static int	ZeroCommitTsPage(int pageno, bool writeXlog);
+static bool CommitTsPagePrecedes(int page1, int page2);
+static void WriteZeroPageXlogRec(int pageno);
+static void WriteTruncateXlogRec(int pageno);
+static void WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 CommitExtraData data);
+
+
+/*
+ * TransactionTreeSetCommitTimestamp
+ *
+ * Record the final commit timestamp of transaction entries in the commit log
+ * for a transaction and its subtransaction tree, as efficiently as possible.
+ *
+ * xid is the top level transaction id.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ * The reason why tracking just the parent xid committs is not enough is that
+ * the subtrans SLRU does not stay valid across crashes (is not permanent) so we
+ * need to keep the information about them here. If the subtrans implementation
+ * changes in the future, we might want to revisit the decision of storing
+ * committs for each subxid.
+ *
+ * The do_xlog parameter tells us whether to include a XLog record of this
+ * or not.  Normal path through RecordTransactionCommit() will be related
+ * to a transaction commit XLog record, and so should pass "false" here.
+ * Other callers probably want to pass true, so that the given values persist
+ * in case of crashes.
+ */
+void
+TransactionTreeSetCommitTimestamp(TransactionId xid, int nsubxids,
+								  TransactionId *subxids, TimestampTz timestamp,
+								  CommitExtraData extra, bool do_xlog)
+{
+	int			i;
+	TransactionId headxid;
+
+	Assert(xid != InvalidTransactionId);
+
+	if (!commit_ts_enabled)
+		return;
+
+	/*
+	 * Comply with the WAL-before-data rule: if caller specified it wants
+	 * this value to be recorded in WAL, do so before touching the data.
+	 */
+	if (do_xlog)
+		WriteSetTimestampXlogRec(xid, nsubxids, subxids, timestamp, extra);
+
+	/*
+	 * We split the xids to set the timestamp to in groups belonging to the
+	 * same SLRU page; the first element in each such set is its head.  The
+	 * first group has the main XID as the head; subsequent sets use the
+	 * first subxid not on the previous page as head.  This way, we only have
+	 * to lock/modify each SLRU page once.
+	 */
+	for (i = 0, headxid = xid;;)
+	{
+		int			pageno = TransactionIdToCTsPage(headxid);
+		int			j;
+
+		for (j = i; j < nsubxids; j++)
+		{
+			if (TransactionIdToCTsPage(subxids[j]) != pageno)
+				break;
+		}
+		/* subxids[i..j] are on the same page as the head */
+
+		SetXidCommitTsInPage(headxid, j - i, subxids + i, timestamp, extra,
+							 pageno);
+
+		/* if we wrote out all subxids, we're done. */
+		if (j + 1 >= nsubxids)
+			break;
+
+		/*
+		 * Set the new head and skip over it, as well as over the subxids
+		 * we just wrote.
+		 */
+		headxid = subxids[j];
+		i += j - i + 1;
+	}
+
+	/*
+	 * Update the cached value in shared memory
+	 */
+	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+	commitTsShared->xidLastCommit = xid;
+	commitTsShared->dataLastCommit.time = timestamp;
+	commitTsShared->dataLastCommit.extra = extra;
+	LWLockRelease(CommitTsLock);
+}
+
+/*
+ * Record the commit timestamp of transaction entries in the commit log for all
+ * entries on a single page.  Atomic only on this page.
+ */
+static void
+SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz committs,
+					 CommitExtraData extra, int pageno)
+{
+	int			slotno;
+	int			i;
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
+
+	TransactionIdSetCommitTs(xid, committs, extra, slotno);
+	for (i = 0; i < nsubxids; i++)
+		TransactionIdSetCommitTs(subxids[i], committs, extra, slotno);
+
+	CommitTsCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Sets the commit timestamp of a single transaction.
+ *
+ * Must be called with CommitTsControlLock held
+ */
+static void
+TransactionIdSetCommitTs(TransactionId xid, TimestampTz committs,
+						 CommitExtraData extra, int slotno)
+{
+	int			entryno = TransactionIdToCTsEntry(xid);
+	CommitTimestampEntry *entry;
+
+	entry = (CommitTimestampEntry *)
+		(CommitTsCtl->shared->page_buffer[slotno] +
+		 SizeOfCommitTimestampEntry * entryno);
+
+	entry->time = committs;
+	entry->extra = extra;
+}
+
+/*
+ * Interrogate the commit timestamp of a transaction.
+ */
+void
+TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
+							 CommitExtraData *data)
+{
+	int			pageno = TransactionIdToCTsPage(xid);
+	int			entryno = TransactionIdToCTsEntry(xid);
+	int			slotno;
+	CommitTimestampEntry *entry;
+	TransactionId oldestCommitTs;
+
+	/* Return empty if module not enabled */
+	if (!commit_ts_enabled)
+	{
+		if (ts)
+			*ts = InvalidTransactionId;
+		if (data)
+			*data = (CommitExtraData) 0;
+		return;
+	}
+
+	/* Also return empty if the requested value is older than what we have */
+	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+	oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	LWLockRelease(CommitTsControlLock);
+
+	if (!TransactionIdIsValid(oldestCommitTs) ||
+		TransactionIdPrecedes(xid, oldestCommitTs))
+	{
+		if (ts)
+			*ts = InvalidTransactionId;
+		if (data)
+			*data = (CommitExtraData) 0;
+		return;
+	}
+
+	/*
+	 * Use an unlocked atomic read on our cached value in shared memory;
+	 * if it's a hit, acquire a lock and read the data, after verifying
+	 * that it's still what we initially read.  Otherwise, fall through
+	 * to read from SLRU.
+	 */
+	if (commitTsShared->xidLastCommit == xid)
+	{
+		LWLockAcquire(CommitTsLock, LW_SHARED);
+		if (commitTsShared->xidLastCommit == xid)
+		{
+			if (ts)
+				*ts = commitTsShared->dataLastCommit.time;
+			if (data)
+				*data = commitTsShared->dataLastCommit.extra;
+			LWLockRelease(CommitTsLock);
+			return;
+		}
+		LWLockRelease(CommitTsLock);
+	}
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+	entry = (CommitTimestampEntry *)
+		(CommitTsCtl->shared->page_buffer[slotno] +
+		 SizeOfCommitTimestampEntry * entryno);
+
+	if (ts)
+		*ts = entry->time;
+
+	if (data)
+		*data = entry->extra;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Return the Xid of the latest committed transaction.  (As far as this module
+ * is concerned, anyway; it's up to the caller to ensure the value is useful
+ * for its purposes.)
+ *
+ * ts and extra are filled with the corresponding data; they can be passed
+ * as NULL if not wanted.
+ */
+TransactionId
+GetLatestCommitTimestampData(TimestampTz *ts, CommitExtraData *extra)
+{
+	TransactionId	xid;
+
+	/* Return empty if module not enabled */
+	if (!commit_ts_enabled)
+	{
+		if (ts)
+			*ts = InvalidTransactionId;
+		if (extra)
+			*extra = (CommitExtraData) 0;
+		return InvalidTransactionId;
+	}
+
+	LWLockAcquire(CommitTsLock, LW_SHARED);
+	xid = commitTsShared->xidLastCommit;
+	if (ts)
+		*ts = commitTsShared->dataLastCommit.time;
+	if (extra)
+		*extra = commitTsShared->dataLastCommit.extra;
+	LWLockRelease(CommitTsLock);
+
+	return xid;
+}
+
+/*
+ * SQL-callable wrapper to obtain commit time of a transaction
+ */
+PG_FUNCTION_INFO_V1(pg_get_transaction_committime);
+Datum
+pg_get_transaction_committime(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	TimestampTz		committs;
+
+	TransactionIdGetCommitTsData(xid, &committs, NULL);
+
+	PG_RETURN_TIMESTAMPTZ(committs);
+}
+
+PG_FUNCTION_INFO_V1(pg_get_transaction_extradata);
+Datum
+pg_get_transaction_extradata(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	CommitExtraData	data;
+
+	TransactionIdGetCommitTsData(xid, NULL, &data);
+
+	PG_RETURN_INT32(data);
+}
+
+PG_FUNCTION_INFO_V1(pg_get_transaction_committime_data);
+Datum
+pg_get_transaction_committime_data(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	TimestampTz		committs;
+	CommitExtraData	data;
+	Datum       values[2];
+	bool        nulls[2];
+	TupleDesc   tupdesc;
+	HeapTuple	htup;
+
+	/*
+	 * Construct a tuple descriptor for the result row.  This must match this
+	 * function's pg_proc entry!
+	 */
+	tupdesc = CreateTemplateTupleDesc(2, false);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "timestamp",
+					   TIMESTAMPTZOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "extra",
+					   INT4OID, -1, 0);
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	/* and construct a tuple with our data */
+	TransactionIdGetCommitTsData(xid, &committs, &data);
+
+	values[0] = TimestampTzGetDatum(committs);
+	nulls[0] = false;
+
+	values[1] = Int32GetDatum(data);
+	nulls[1] = false;
+
+	htup = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+PG_FUNCTION_INFO_V1(pg_get_latest_transaction_committime_data);
+Datum
+pg_get_latest_transaction_committime_data(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid;
+	TimestampTz		committs;
+	CommitExtraData	data;
+	Datum       values[3];
+	bool        nulls[3];
+	TupleDesc   tupdesc;
+	HeapTuple	htup;
+
+	/*
+	 * Construct a tuple descriptor for the result row.  This must match this
+	 * function's pg_proc entry!
+	 */
+	tupdesc = CreateTemplateTupleDesc(3, false);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "xid",
+					   XIDOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "timestamp",
+					   TIMESTAMPTZOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 3, "extra",
+					   INT4OID, -1, 0);
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	/* and construct a tuple with our data */
+	xid = GetLatestCommitTimestampData(&committs, &data);
+
+	values[0] = TransactionIdGetDatum(xid);
+	nulls[0] = false;
+
+	values[1] = TimestampTzGetDatum(committs);
+	nulls[1] = false;
+
+	values[2] = Int32GetDatum(data);
+	nulls[2] = false;
+
+	htup = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+/*
+ * Number of shared CommitTS buffers.
+ *
+ * We use a very similar logic as for the number of CLOG buffers; see comments
+ * in CLOGShmemBuffers.
+ */
+Size
+CommitTsShmemBuffers(void)
+{
+	return Min(16, Max(4, NBuffers / 1024));
+}
+
+/*
+ * Initialization of shared memory for CommitTs
+ */
+Size
+CommitTsShmemSize(void)
+{
+	return SimpleLruShmemSize(CommitTsShmemBuffers(), 0) +
+		sizeof(CommitTimestampShared);
+}
+
+void
+CommitTsShmemInit(void)
+{
+	bool	found;
+
+	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
+	SimpleLruInit(CommitTsCtl, "CommitTs Ctl", CommitTsShmemBuffers(), 0,
+				  CommitTsControlLock, "pg_committs");
+
+	commitTsShared = ShmemInitStruct("CommitTs shared",
+									 sizeof(CommitTimestampShared),
+									 &found);
+
+	if (!IsUnderPostmaster)
+	{
+		Assert(!found);
+
+		commitTsShared->xidLastCommit = InvalidTransactionId;
+		commitTsShared->dataLastCommit.time = 0;
+		commitTsShared->dataLastCommit.extra = 0;
+	}
+	else
+		Assert(found);
+}
+
+/*
+ * This function must be called ONCE on system install.
+ *
+ * (The CommitTs directory is assumed to have been created by initdb, and
+ * CommitTsShmemInit must have been called already.)
+ */
+void
+BootStrapCommitTs(void)
+{
+	/*
+	 * Nothing to do here at present, unlike most other SLRU modules; segments
+	 * are created when the server is started with this module enabled.
+	 * See StartupCommitTs.
+	 */
+}
+
+/*
+ * Initialize (or reinitialize) a page of CommitTs to zeroes.
+ * If writeXlog is TRUE, also emit an XLOG record saying we did this.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCommitTsPage(int pageno, bool writeXlog)
+{
+	int			slotno;
+
+	slotno = SimpleLruZeroPage(CommitTsCtl, pageno);
+
+	if (writeXlog)
+		WriteZeroPageXlogRec(pageno);
+
+	return slotno;
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ */
+void
+StartupCommitTs(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/*
+	 * Initialize our idea of the latest page number.
+	 */
+	CommitTsCtl->shared->latest_page_number = pageno;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * when commit timestamp is enabled.
+ * Must be called after recovery has finished.
+ *
+ * This is in charge of creating the currently active segment, if it's not
+ * already there.  The reason for this is that the server might have been
+ * running with this module disabled for a while and thus might have skipped
+ * the normal creation point.
+ */
+void
+InitCommitTs(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/*
+	 * Re-Initialize our idea of the latest page number.
+	 */
+	CommitTsCtl->shared->latest_page_number = pageno;
+
+	/*
+	 * If this module is not currently enabled, make sure we don't hand back
+	 * possibly-invalid data; also remove segments of old data.
+	 */
+	if (!commit_ts_enabled)
+	{
+		ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
+		LWLockRelease(CommitTsControlLock);
+
+		TruncateCommitTs(ReadNewTransactionId());
+
+		return;
+	}
+
+	/*
+	 * If CommitTs is enabled, but it wasn't in the previous server run, we
+	 * need to set the oldest value to the next Xid; that way, we will not try
+	 * to read data that might not have been set.
+	 *
+	 * XXX does this have a problem if a server is started with commitTs
+	 * enabled, then started with commitTs disabled, then restarted with it
+	 * enabled again?  It doesn't look like it does, because there should be a
+	 * checkpoint that sets the value to InvalidTransactionId at end of
+	 * recovery; and so any chance of injecting new transactions without
+	 * CommitTs values would occur after the oldestCommitTs has been set to
+	 * Invalid temporarily.
+	 */
+	if (ShmemVariableCache->oldestCommitTs == InvalidTransactionId)
+		ShmemVariableCache->oldestCommitTs = ReadNewTransactionId();
+
+	/* Finally, create the current segment file, if necessary */
+	if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
+	{
+		int		slotno;
+
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+	}
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, true);
+}
+
+/*
+ * Make sure that CommitTs has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CommitTs or xlog page to make room
+ * in shared memory.
+ *
+ * NB2: the current implementation relies on the fact that
+ * track_commit_timestamp is flagged as PGC_POSTMASTER
+ * (only possible to be set at server start).
+ */
+void
+ExtendCommitTs(TransactionId newestXact)
+{
+	int			pageno;
+
+	/* nothing to do if module not enabled */
+	if (!commit_ts_enabled)
+		return;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToCTsEntry(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToCTsPage(newestXact);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/* Zero the page and make an XLOG entry about it */
+	ZeroCommitTsPage(pageno, !InRecovery);
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Remove all CommitTs segments before the one holding the passed
+ * transaction ID
+ *
+ * Note that we don't need to flush XLOG here.
+ */
+void
+TruncateCommitTs(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate.
+	 */
+	cutoffPage = TransactionIdToCTsPage(oldestXact);
+
+	/* Check to see if there's any files that could be removed */
+	if (!SlruScanDirectory(CommitTsCtl, SlruScanDirCbReportPresence, &cutoffPage))
+		return;					/* nothing to remove */
+
+	/* Write XLOG record */
+	WriteTruncateXlogRec(cutoffPage);
+
+	/* Now we can remove the old CommitTs segment(s) */
+	SimpleLruTruncate(CommitTsCtl, cutoffPage);
+}
+
+/*
+ * Set the earliest value for which commit TS can be consulted.
+ */
+void
+SetCommitTsLimit(TransactionId oldestXact)
+{
+	/*
+	 * Be careful not to overwrite values that are either further into the
+	 * "future" or signal a disabled committs.
+	 */
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+	if (ShmemVariableCache->oldestCommitTs != InvalidTransactionId &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCommitTs, oldestXact))
+		ShmemVariableCache->oldestCommitTs = oldestXact;
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Decide which of two CLOG page numbers is "older" for truncation purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic.  However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CommitTsPagePrecedes(int page1, int page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * COMMITTS_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId;
+	xid2 = ((TransactionId) page2) * COMMITTS_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId;
+
+	return TransactionIdPrecedes(xid1, xid2);
+}
+
+
+/*
+ * Write a ZEROPAGE xlog record
+ */
+static void
+WriteZeroPageXlogRec(int pageno)
+{
+	XLogRecData rdata;
+
+	rdata.data = (char *) (&pageno);
+	rdata.len = sizeof(int);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	(void) XLogInsert(RM_COMMITTS_ID, COMMITTS_ZEROPAGE, &rdata);
+}
+
+/*
+ * Write a TRUNCATE xlog record
+ */
+static void
+WriteTruncateXlogRec(int pageno)
+{
+	XLogRecData rdata;
+
+	rdata.data = (char *) (&pageno);
+	rdata.len = sizeof(int);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	XLogInsert(RM_COMMITTS_ID, COMMITTS_TRUNCATE, &rdata);
+}
+
+/*
+ * Write a SETTS xlog record
+ */
+static void
+WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 CommitExtraData data)
+{
+	XLogRecData	rdata;
+	xl_committs_set	record;
+
+	record.timestamp = timestamp;
+	record.data = data;
+	record.mainxid = mainxid;
+	record.nsubxids = nsubxids;
+	memcpy(record.subxids, subxids, sizeof(TransactionId) * nsubxids);
+
+	rdata.data = (char *) &record;
+	rdata.len = offsetof(xl_committs_set, subxids) +
+		nsubxids * sizeof(TransactionId);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	XLogInsert(RM_COMMITTS_ID, COMMITTS_SETTS, &rdata);
+}
+
+
+/*
+ * CommitTS resource manager's routines
+ */
+void
+committs_redo(XLogRecPtr lsn, XLogRecord *record)
+{
+	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+
+	/* Backup blocks are not used in committs records */
+	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+	if (info == COMMITTS_ZEROPAGE)
+	{
+		int			pageno;
+		int			slotno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+
+		LWLockRelease(CommitTsControlLock);
+	}
+	else if (info == COMMITTS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		/*
+		 * During XLOG replay, latest_page_number isn't set up yet; insert a
+		 * suitable value to bypass the sanity test in SimpleLruTruncate.
+		 */
+		CommitTsCtl->shared->latest_page_number = pageno;
+
+		SimpleLruTruncate(CommitTsCtl, pageno);
+	}
+	else if (info == COMMITTS_SETTS)
+	{
+		xl_committs_set *setts = (xl_committs_set *) XLogRecGetData(record);
+
+		TransactionTreeSetCommitTimestamp(setts->mainxid, setts->nsubxids,
+										  setts->subxids, setts->timestamp,
+										  setts->data, false);
+	}
+	else
+		elog(PANIC, "committs_redo: unknown op code %u", info);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 2645a7a..53116f6 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -8,6 +8,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 7013fb8..c70bebe 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -14,6 +14,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -157,9 +158,10 @@ GetNewTransactionId(bool isSubXact)
 	 * XID before we zero the page.  Fortunately, a page of the commit log
 	 * holds 32K or more transactions, so we don't have to do this very often.
 	 *
-	 * Extend pg_subtrans too.
+	 * Extend pg_subtrans and pg_committs too.
 	 */
 	ExtendCLOG(xid);
+	ExtendCommitTs(xid);
 	ExtendSUBTRANS(xid);
 
 	/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 651a5c4..3cc2330 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -20,6 +20,7 @@
 #include <time.h>
 #include <unistd.h>
 
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1166,6 +1167,17 @@ RecordTransactionCommit(void)
 	}
 
 	/*
+	 * We don't need to log the commit timestamp separately since the commit
+	 * record logged above has all the necessary action to set the timestamp
+	 * again.
+	 */
+	if (markXidCommitted)
+	{
+		TransactionTreeSetCommitTimestamp(xid, nchildren, children,
+										  xactStopTimestamp, 0, false);
+	}
+
+	/*
 	 * Check if we want to commit asynchronously.  We can allow the XLOG flush
 	 * to happen asynchronously if synchronous_commit=off, or if the current
 	 * transaction has not performed any WAL-logged operation.  The latter
@@ -4681,6 +4693,7 @@ xactGetCommittedChildren(TransactionId **ptr)
  */
 static void
 xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
+						  TimestampTz commit_time,
 						  TransactionId *sub_xids, int nsubxacts,
 						  SharedInvalidationMessage *inval_msgs, int nmsgs,
 						  RelFileNode *xnodes, int nrels,
@@ -4708,6 +4721,10 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
 		LWLockRelease(XidGenLock);
 	}
 
+	/* Set the transaction commit time */
+	TransactionTreeSetCommitTimestamp(xid, nsubxacts, sub_xids,
+									  commit_time, 0, false);
+
 	if (standbyState == STANDBY_DISABLED)
 	{
 		/*
@@ -4827,7 +4844,8 @@ xact_redo_commit(xl_xact_commit *xlrec,
 	/* invalidation messages array follows subxids */
 	inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
 
-	xact_redo_commit_internal(xid, lsn, subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  subxacts, xlrec->nsubxacts,
 							  inval_msgs, xlrec->nmsgs,
 							  xlrec->xnodes, xlrec->nrels,
 							  xlrec->dbId,
@@ -4842,7 +4860,8 @@ static void
 xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
 						 TransactionId xid, XLogRecPtr lsn)
 {
-	xact_redo_commit_internal(xid, lsn, xlrec->subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  xlrec->subxacts, xlrec->nsubxacts,
 							  NULL, 0,	/* inval msgs */
 							  NULL, 0,	/* relfilenodes */
 							  InvalidOid,		/* dbId */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3c9aeae..03dfeb2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -22,6 +22,7 @@
 #include <unistd.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -4945,6 +4946,7 @@ BootStrapXLOG(void)
 	checkPoint.oldestXidDB = TemplateDbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
+	checkPoint.oldestCommitTs = InvalidTransactionId;
 	checkPoint.time = (pg_time_t) time(NULL);
 	checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -4954,6 +4956,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(InvalidTransactionId);
 
 	/* Set up the XLOG page header */
 	page->xlp_magic = XLOG_PAGE_MAGIC;
@@ -5035,6 +5038,7 @@ BootStrapXLOG(void)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
 
@@ -6283,6 +6287,9 @@ StartupXLOG(void)
 	ereport(DEBUG1,
 			(errmsg("oldest MultiXactId: %u, in database %u",
 					checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
+	ereport(DEBUG1,
+			(errmsg("oldest CommitTs Xid: %u",
+					checkPoint.oldestCommitTs)));
 	if (!TransactionIdIsNormal(checkPoint.nextXid))
 		ereport(PANIC,
 				(errmsg("invalid next transaction ID")));
@@ -6294,6 +6301,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(checkPoint.oldestCommitTs);
 	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
 	XLogCtl->ckptXid = checkPoint.nextXid;
@@ -6515,11 +6523,12 @@ StartupXLOG(void)
 			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
 
 			/*
-			 * Startup commit log and subtrans only. MultiXact has already
-			 * been started up and other SLRUs are not maintained during
-			 * recovery and need not be started yet.
+			 * Startup commit log, commit timestamp and subtrans
+			 * only. MultiXact has already been started up and other SLRUs are
+			 * not maintained during recovery and need not be started yet.
 			 */
 			StartupCLOG();
+			StartupCommitTs();
 			StartupSUBTRANS(oldestActiveXID);
 
 			/*
@@ -7166,12 +7175,13 @@ StartupXLOG(void)
 	LWLockRelease(ProcArrayLock);
 
 	/*
-	 * Start up the commit log and subtrans, if not already done for hot
-	 * standby.
+	 * Start up the commit log, commit timestamp and subtrans, if not already
+	 * done for hot standby.
 	 */
 	if (standbyState == STANDBY_DISABLED)
 	{
 		StartupCLOG();
+		StartupCommitTs();
 		StartupSUBTRANS(oldestActiveXID);
 	}
 
@@ -7207,6 +7217,12 @@ StartupXLOG(void)
 	XLogReportParameters();
 
 	/*
+	 * Local WAL inserts enables, so it's time to finish initialization
+	 * of commit timestamp.
+	 */
+	InitCommitTs();
+
+	/*
 	 * All done.  Allow backends to write WAL.  (Although the bool flag is
 	 * probably atomic in itself, we use the info_lck here to ensure that
 	 * there are no race conditions concerning visibility of other recent
@@ -7752,6 +7768,7 @@ ShutdownXLOG(int code, Datum arg)
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
 	ShutdownCLOG();
+	ShutdownCommitTs();
 	ShutdownSUBTRANS();
 	ShutdownMultiXact();
 
@@ -8079,6 +8096,10 @@ CreateCheckPoint(int flags)
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
 	LWLockRelease(XidGenLock);
 
+	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+	checkPoint.oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	LWLockRelease(CommitTsControlLock);
+
 	/* Increase XID epoch if we've wrapped around since last checkpoint */
 	checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
 	if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
@@ -8364,6 +8385,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointCLOG();
+	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
 	CheckPointPredicate();
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 0dc92ba..b10d44d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -23,6 +23,7 @@
 #include <math.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -1072,6 +1073,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 * multixacts; that will be done by the next checkpoint.
 	 */
 	TruncateCLOG(frozenXID);
+	TruncateCommitTs(frozenXID);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
@@ -1081,6 +1083,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
 	SetMultiXactIdLimit(minMulti, minmulti_datoid);
+	SetCommitTsLimit(frozenXID);
 }
 
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 9f1b20e..f9b49c4 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -132,6 +132,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogRecord *record)
 		case RM_GIST_ID:
 		case RM_SEQ_ID:
 		case RM_SPGIST_ID:
+		case RM_COMMITTS_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) buf.record.xl_rmid);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 1d04c55..9025601 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -117,6 +118,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
 		size = add_size(size, CLOGShmemSize());
+		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
 		size = add_size(size, TwoPhaseShmemSize());
 		size = add_size(size, BackgroundWorkerShmemSize());
@@ -198,6 +200,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	 */
 	XLOGShmemInit();
 	CLOGShmemInit();
+	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
 	InitBufferPool();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ea82882..fb0e20d 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -46,6 +46,7 @@
 #include <signal.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 719181c..4b4b4bf 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "commands/async.h"
@@ -259,6 +260,9 @@ NumLWLocks(void)
 	/* clog.c needs one per CLOG buffer */
 	numLocks += CLOGShmemBuffers();
 
+	/* committs.c needs one per CommitTs buffer */
+	numLocks += CommitTsShmemBuffers();
+
 	/* subtrans.c needs one per SubTrans buffer */
 	numLocks += NUM_SUBTRANS_BUFFERS;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d7142d2..f61e152 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -26,6 +26,7 @@
 #include <syslog.h>
 #endif
 
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -836,6 +837,15 @@ static struct config_bool ConfigureNamesBool[] =
 		check_bonjour, NULL, NULL
 	},
 	{
+		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
+			gettext_noop("Collects transaction commit time."),
+			NULL
+		},
+		&commit_ts_enabled,
+		false,
+		NULL, NULL, NULL
+	},
+	{
 		{"ssl", PGC_POSTMASTER, CONN_AUTH_SECURITY,
 			gettext_noop("Enables SSL connections."),
 			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dac6776..5e3e776 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -227,6 +227,7 @@
 #wal_sender_timeout = 60s	# in milliseconds; 0 disables
 
 #max_replication_slots = 0	# max number of replication slots
+#track_commit_timestamp = off	# collect timestamp of transaction commit
 				# (change requires restart)
 
 # - Master Server -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index dc1f1df..e577c12 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -185,6 +185,7 @@ static const char *subdirs[] = {
 	"pg_xlog",
 	"pg_xlog/archive_status",
 	"pg_clog",
+	"pg_committs",
 	"pg_dynshmem",
 	"pg_notify",
 	"pg_serial",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 32cc100..ff162fd 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -270,6 +270,8 @@ main(int argc, char *argv[])
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index f4c1eaf..8abd5c2 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -62,6 +62,7 @@ static bool guessed = false;	/* T if we had to guess at any values */
 static const char *progname;
 static uint32 set_xid_epoch = (uint32) -1;
 static TransactionId set_xid = 0;
+static TransactionId set_committs = 0;
 static Oid	set_oid = 0;
 static MultiXactId set_mxid = 0;
 static MultiXactOffset set_mxoff = (MultiXactOffset) -1;
@@ -111,7 +112,7 @@ main(int argc, char *argv[])
 	}
 
 
-	while ((c = getopt(argc, argv, "D:fl:m:no:O:x:e:")) != -1)
+	while ((c = getopt(argc, argv, "D:fl:m:no:O:x:e:c:")) != -1)
 	{
 		switch (c)
 		{
@@ -157,6 +158,21 @@ main(int argc, char *argv[])
 				}
 				break;
 
+			case 'c':
+				set_committs = strtoul(optarg, &endptr, 0);
+				if (endptr == optarg || *endptr != '\0')
+				{
+					fprintf(stderr, _("%s: invalid argument for option -c\n"), progname);
+					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+					exit(1);
+				}
+				if (set_committs == 0)
+				{
+					fprintf(stderr, _("%s: transaction ID (-c) must not be 0\n"), progname);
+					exit(1);
+				}
+				break;
+
 			case 'o':
 				set_oid = strtoul(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0')
@@ -344,6 +360,9 @@ main(int argc, char *argv[])
 		ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
 	}
 
+	if (set_committs != 0)
+		ControlFile.checkPointCopy.oldestCommitTs = set_committs;
+
 	if (set_oid != 0)
 		ControlFile.checkPointCopy.nextOid = set_oid;
 
@@ -620,6 +639,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Maximum data alignment:               %u\n"),
 		   ControlFile.maxAlign);
 	/* we don't print floatFormat since can't say much useful about it */
@@ -701,6 +722,12 @@ PrintNewControlValues()
 		printf(_("NextXID epoch:                        %u\n"),
 			   ControlFile.checkPointCopy.nextXidEpoch);
 	}
+
+	if (set_committs != 0)
+	{
+		printf(_("oldestCommitTs:                       %u\n"),
+			   ControlFile.checkPointCopy.oldestCommitTs);
+	}
 }
 
 
@@ -1103,6 +1130,7 @@ usage(void)
 	printf(_("  -O OFFSET        set next multitransaction offset\n"));
 	printf(_("  -V, --version    output version information, then exit\n"));
 	printf(_("  -x XID           set next transaction ID\n"));
+	printf(_("  -c XID           set the oldest retrievable commit timestamp\n"));
 	printf(_("  -?, --help       show this help, then exit\n"));
 	printf(_("\nReport bugs to <pgsql-bugs@postgresql.org>.\n"));
 }
diff --git a/src/include/access/committs.h b/src/include/access/committs.h
new file mode 100644
index 0000000..0f96185
--- /dev/null
+++ b/src/include/access/committs.h
@@ -0,0 +1,63 @@
+/*
+ * committs.h
+ *
+ * PostgreSQL commit timestamp manager
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/committs.h
+ */
+#ifndef COMMITTS_H
+#define COMMITTS_H
+
+#include "access/xlog.h"
+#include "datatype/timestamp.h"
+
+
+extern PGDLLIMPORT bool	commit_ts_enabled;
+
+typedef uint32 CommitExtraData;
+
+extern void TransactionTreeSetCommitTimestamp(TransactionId xid, int nsubxids,
+								  TransactionId *subxids,
+								  TimestampTz timestamp,
+								  CommitExtraData data,
+								  bool do_xlog);
+extern void TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
+							 CommitExtraData *data);
+extern TransactionId GetLatestCommitTimestampData(TimestampTz *ts,
+							 CommitExtraData *extra);
+
+extern Size CommitTsShmemBuffers(void);
+extern Size CommitTsShmemSize(void);
+extern void CommitTsShmemInit(void);
+extern void BootStrapCommitTs(void);
+extern void StartupCommitTs(void);
+extern void InitCommitTs(void);
+extern void ShutdownCommitTs(void);
+extern void CheckPointCommitTs(void);
+extern void ExtendCommitTs(TransactionId newestXact);
+extern void TruncateCommitTs(TransactionId oldestXact);
+extern void SetCommitTsLimit(TransactionId oldestXact);
+
+/* XLOG stuff */
+#define COMMITTS_ZEROPAGE		0x00
+#define COMMITTS_TRUNCATE		0x10
+#define COMMITTS_SETTS			0x20
+
+typedef struct xl_committs_set
+{
+	TimestampTz		timestamp;
+	CommitExtraData	data;
+	TransactionId	mainxid;
+	int				nsubxids;
+	TransactionId	subxids[FLEXIBLE_ARRAY_MEMBER];
+} xl_committs_set;
+
+
+extern void committs_redo(XLogRecPtr lsn, XLogRecord *record);
+extern void committs_desc(StringInfo buf, XLogRecord *record);
+extern const char *committs_identify(uint8 info);
+
+#endif   /* COMMITTS_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 77d4574..c648a6a 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -24,7 +24,7 @@
  * Changes to this list possibly need a XLOG_PAGE_MAGIC bump.
  */
 
-/* symbol name, textual name, redo, desc, startup, cleanup */
+/* symbol name, textual name, redo, desc, identify, startup, cleanup */
 PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
 PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
 PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
@@ -42,3 +42,4 @@ PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gi
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
 PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
+PG_RMGR(RM_COMMITTS_ID, "CommitTs", committs_redo, committs_desc, committs_identify, NULL, NULL)
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 32d1b29..b59fd98 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -124,6 +124,11 @@ typedef struct VariableCacheData
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
 
 	/*
+	 * These fields are protected by CommitTsControlLock
+	 */
+	TransactionId oldestCommitTs;
+
+	/*
 	 * These fields are protected by ProcArrayLock.
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ba79d25..9e048ea 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -46,6 +46,7 @@ typedef struct CheckPoint
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
+	TransactionId oldestCommitTs; /* oldest Xid with valid commit timestamp */
 
 	/*
 	 * Oldest XID still running. This is only needed to initialize hot standby
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index b6dc1b8..a28ac39 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2988,6 +2988,18 @@ DESCR("view two-phase transactions");
 DATA(insert OID = 3819 (  pg_get_multixact_members PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 2249 "28" "{28,28,25}" "{i,o,o}" "{multixid,xid,mode}" _null_ pg_get_multixact_members _null_ _null_ _null_ ));
 DESCR("view members of a multixactid");
 
+DATA(insert OID = 3787 ( pg_get_transaction_committime PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 1184 "28" _null_ _null_ _null_ _null_ pg_get_transaction_committime _null_ _null_ _null_ ));
+DESCR("get commit time of transaction");
+
+DATA(insert OID = 3788 ( pg_get_transaction_extradata PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 23 "28" _null_ _null_ _null_ _null_ pg_get_transaction_extradata _null_ _null_ _null_ ));
+DESCR("get additional data from transaction commit timestamp record");
+
+DATA(insert OID = 3789 ( pg_get_transaction_committime_data PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 2249 "28" "{28,1184,23}" "{i,o,o}" "{xid,committime,extradata}" _null_ pg_get_transaction_committime_data _null_ _null_ _null_ ));
+DESCR("get commit time and additional data from transaction commit timestamp record");
+
+DATA(insert OID = 3790 ( pg_get_latest_transaction_committime_data PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 2249 "" "{28,1184,23}" "{o,o,o}" "{xid,committime,extradata}" _null_ pg_get_latest_transaction_committime_data _null_ _null_ _null_ ));
+DESCR("get transaction Id, commit timestamp and additional data of latest transaction commit");
+
 DATA(insert OID = 3537 (  pg_describe_object		PGNSP PGUID 12 1 0 0 0 f f f f t f s 3 0 25 "26 26 23" _null_ _null_ _null_ _null_ pg_describe_object _null_ _null_ _null_ ));
 DESCR("get identification of SQL object");
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 91cab87..09654a8 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -127,7 +127,10 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define AutoFileLock				(&MainLWLockArray[35].lock)
 #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
 #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
-#define NUM_INDIVIDUAL_LWLOCKS		38
+#define CommitTsControlLock			(&MainLWLockArray[38].lock)
+#define CommitTsLock				(&MainLWLockArray[39].lock)
+
+#define NUM_INDIVIDUAL_LWLOCKS		40
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index fb1b4a4..5be1631 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1180,6 +1180,12 @@ extern Datum pg_prepared_xact(PG_FUNCTION_ARGS);
 /* access/transam/multixact.c */
 extern Datum pg_get_multixact_members(PG_FUNCTION_ARGS);
 
+/* access/transam/committs.c */
+extern Datum pg_get_transaction_committime(PG_FUNCTION_ARGS);
+extern Datum pg_get_transaction_extradata(PG_FUNCTION_ARGS);
+extern Datum pg_get_transaction_committime_data(PG_FUNCTION_ARGS);
+extern Datum pg_get_latest_transaction_committime_data(PG_FUNCTION_ARGS);
+
 /* catalogs/dependency.c */
 extern Datum pg_describe_object(PG_FUNCTION_ARGS);
 extern Datum pg_identify_object(PG_FUNCTION_ARGS);
diff --git a/src/test/regress/expected/committs_off.out b/src/test/regress/expected/committs_off.out
new file mode 100644
index 0000000..0a94f9d
--- /dev/null
+++ b/src/test/regress/expected/committs_off.out
@@ -0,0 +1,21 @@
+--
+-- Commit Timestamp (off)
+--
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id, pg_get_transaction_extradata(xmin),
+       pg_get_transaction_committime(xmin) >= ts,
+       pg_get_transaction_committime(xmin) < now(),
+       pg_get_transaction_committime(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | pg_get_transaction_extradata | ?column? | ?column? | ?column? 
+----+------------------------------+----------+----------+----------
+  1 |                            0 | f        | t        | t
+  2 |                            0 | f        | t        | t
+  3 |                            0 | f        | t        | t
+(3 rows)
+
+DROP TABLE committs_test;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 9902dbe..abc6800 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -88,7 +88,7 @@ test: privileges security_label collate matview lock replica_identity rowsecurit
 # ----------
 # Another group of parallel tests
 # ----------
-test: alter_generic misc psql async
+test: alter_generic misc psql async committs_off
 
 # rules cannot run concurrently with any test that creates a view
 test: rules
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 2902a05..d190ad2 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -147,3 +147,4 @@ test: largeobject
 test: with
 test: xml
 test: stats
+test: committs_off
diff --git a/src/test/regress/sql/committs_off.sql b/src/test/regress/sql/committs_off.sql
new file mode 100644
index 0000000..0f97666
--- /dev/null
+++ b/src/test/regress/sql/committs_off.sql
@@ -0,0 +1,18 @@
+--
+-- Commit Timestamp (off)
+--
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+
+SELECT id, pg_get_transaction_extradata(xmin),
+       pg_get_transaction_committime(xmin) >= ts,
+       pg_get_transaction_committime(xmin) < now(),
+       pg_get_transaction_committime(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+
+DROP TABLE committs_test;
#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Petr Jelinek (#18)
Re: tracking commit timestamps

On 31 October 2014 15:46, Petr Jelinek <petr@2ndquadrant.com> wrote:

Attached version with the above comments near the relevant code.

Looks cooked and ready to serve. Who's gonna commit this? Alvaro, or
do you want me to?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Michael Paquier
michael.paquier@gmail.com
In reply to: Simon Riggs (#19)
Re: tracking commit timestamps

On Sat, Nov 1, 2014 at 1:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 31 October 2014 15:46, Petr Jelinek <petr@2ndquadrant.com> wrote:

Attached version with the above comments near the relevant code.

Looks cooked and ready to serve. Who's gonna commit this? Alvaro, or
do you want me to?

Could you hold on a bit? I'd like to have a look at it more deeply and by
looking at quickly at the code there are a couple of things that could be
improved.
Regards,
--
Michael

#21Michael Paquier
michael.paquier@gmail.com
In reply to: Petr Jelinek (#18)
Re: tracking commit timestamps

On Sat, Nov 1, 2014 at 12:46 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Hi,

On 28/10/14 13:25, Simon Riggs wrote:

On 13 October 2014 10:05, Petr Jelinek <petr@2ndquadrant.com> wrote:

I worked bit on this patch to make it closer to committable state.

Here is updated version that works with current HEAD for the October

committfest.

I've reviewed this and it looks good to me. Clean, follows existing
code neatly, documented and tested.

Thanks for looking at this.

Please could you document a few things

* ExtendCommitTS() works only because commit_ts_enabled can only be
set at server start.
We need that documented so somebody doesn't make it more easily
enabled and break something.
(Could we make it enabled at next checkpoint or similar?)

Maybe we could, but that means some kind of two step enabling facility and
I don't want to write that as part of the initial patch since that will
need separate discussion (i.e. do we want to have new GUC flag for this, or
hack solution just for committs?). So maybe later in a follow-up patch.

* The SLRU tracks timestamps of both xids and subxids. We need to

document that it does this because Subtrans SLRU is not persistent. If
we made Subtrans persistent we might need to store only the top level
xid's commitTS, but that's very useful for typical use cases and
wouldn't save much time at commit.

Attached version with the above comments near the relevant code.

On a personal note, I think that this is a useful feature, particularly
useful for replication solutions to resolve commit conflicts by using the
method of the first-transaction-that-commits-wins, but this has already
been mentioned on this thread. So yes I am a fan of it, and yes let's keep
the GUC controlling it at off by default.

Now here are a couple of comments at code level, this code seems not enough
baked for a commit:
1) The following renaming should be done:
- pg_get_transaction_committime to pg_get_transaction_commit_time
- pg_get_transaction_extradata to pg_get_transaction_extra_data
- pg_get_transaction_committime_data to pg_get_transaction_commit_time_data
- pg_get_latest_transaction_committime_data to
pg_get_latest_transaction_commit_time_data
2) This patch adds a new option -c in pg_resetxlog to set the transaction
XID of the transaction from which can be retrieved a commit timestamp, but
the documentation of pg_resetxlog is not updated.
3) General remark: committs is not a name suited IMO (ts for
transaction??). What if the code is changed to use commit_time instead?
This remark counts as well for the file names committs.c and committs.h,
and for pg_committs.
4) Nitpicky remark in pg_resetxlog, let's try to respect the alphabetical
order (not completely related to this patch), so not:
+       while ((c = getopt(argc, argv, "D:fl:m:no:O:x:e:c:")) != -1)
but:
+       while ((c = getopt(argc, argv, "c:e:D:fl:m:no:O:x:")) != -1)
5) --help message should be reworked (alphabetical order of the entries),
this will avoid some cycles of Peter as he usually spends time revisiting
and cleaning up such things:
        printf(_("  -x XID           set next transaction ID\n"));
+       printf(_("  -c XID           set the oldest retrievable commit
timestamp\n"));
        printf(_("  -?, --help       show this help, then exit\n"));
6) To be consistent with everything, shouldn't track_commit_timestamp be
renamed to track_commit_time
7) This documentation portion should be reworked from that:
+       <para>
+        Record commit time of transactions.  This parameter
+        can only be set in
+        the <filename>postgresql.conf</> file or on the server command
line.
+        The default value is off.
+       </para>
To roughly that (not the rewording and the use of <literal>):
+       <para>
+        Record commit time of transactions.  This parameter
+        can only be set in <filename>postgresql.conf</> or on the server
command line.
+        The default value is <literal>off</literal>.
+       </para>
8) Let's update this file list more consistently:
-OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+OBJS = clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o
hashdesc.o \
+       heapdesc.o \
           mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o
spgdesc.o \
           standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
9) Hm?! "oldest commit time xid", no?
-                                                "oldest running xid %u;
%s",
+                                                "oldest CommitTs xid: %u;
oldest running xid %u; %s",
10) I don't see why this diff is in the patch:
        /*
-        * See how many subxids, if any, are on the same page as the
parent, if
-        * any.
+        * See how many subxids, if any, are on the same page as the parent.
         */
11) contrib/Makefile has not been updated with the new module test_committs
that this patch introduces.
12) In committs_desc@committsdesc.c, isn't this block overthinking a bit:
+       else
+               appendStringInfo(buf, "UNKNOWN");
It may be better to remove it, no?
13) Isn't that 2014?
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
14) I'd put the two checks in the reverse order:
+       Assert(xid != InvalidTransactionId);
+
+       if (!commit_ts_enabled)
+               return;
15) The multiple calls to PG_FUNCTION_INFO_V1 in committs.c are not
necessary. Those functions are already defined in pg_proc.h.
16) make installcheck (test committs_off) fails on a server that has
track_committs set to on. You should use an alternate output. I would
recommend removing as well the _off suffix in the test name. Let's use
commit_time. Also, it should be mentioned in parallel_schedule with a
comment that this test should always run alone and never in parallel with
other tests. Honestly, I also think that test_committs brings no additional
value and results in duplication code between src/test/regress and
contrib/test_committs. So I'd just rip it off. On top of that, I think that
"SHOW track_committs" should be added in the list of commands run in the
test. We actually want to check of commit time are really registered if the
feature switch is on or off.

I am still planning to do more extensive tests, and study a bit more
committs.c (with even more comments) as it is the core part of the feature.
For now I'd recommend to hold on commit fire for this patch.
Regards,
--
Michael

#22Petr Jelinek
petr@2ndquadrant.com
In reply to: Michael Paquier (#21)
Re: tracking commit timestamps

Hi,

thanks for review.

On 01/11/14 05:45, Michael Paquier wrote:

Now here are a couple of comments at code level, this code seems not
enough baked for a commit:
1) The following renaming should be done:
- pg_get_transaction_committime to pg_get_transaction_commit_time
- pg_get_transaction_extradata to pg_get_transaction_extra_data
- pg_get_transaction_committime_data to pg_get_transaction_commit_time_data
- pg_get_latest_transaction_committime_data to
pg_get_latest_transaction_commit_time_data

Makes sense.

3) General remark: committs is not a name suited IMO (ts for
transaction??). What if the code is changed to use commit_time instead?
This remark counts as well for the file names committs.c and committs.h,
and for pg_committs.

The ts is for timestamp, tx would be shorthand for transaction. Looking
at your remarks, it seems there is some general inconsistency with time
vs timestamp in this patch, we should pick one and stick with it.

6) To be consistent with everything, shouldn't track_commit_timestamp be
renamed to track_commit_time

(see above)

9) Hm?! "oldest commit time xid", no?
-                                                "oldest running xid %u;
%s",
+                                                "oldest CommitTs xid:
%u; oldest running xid %u; %s",

Again, timestamp vs time.

12) In committs_desc@committsdesc.c, isn't this block overthinking a bit:
+       else
+               appendStringInfo(buf, "UNKNOWN");
It may be better to remove it, no?

Should be safe, indeed.

13) Isn't that 2014?
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group

Hah, I forgot to update that (shows how long this patch has been waiting
:) )

16) make installcheck (test committs_off) fails on a server that has
track_committs set to on. You should use an alternate output. I would

Well, it is supposed to fail, that's the whole point, the output should
be different depending on the value of the GUC.

recommend removing as well the _off suffix in the test name. Let's use
commit_time. Also, it should be mentioned in parallel_schedule with a
comment that this test should always run alone and never in parallel
with other tests. Honestly, I also think that test_committs brings no
additional value and results in duplication code between
src/test/regress and contrib/test_committs. So I'd just rip it off. On

Those tests are different though, one tests that the default (off) works
as expected the contrib one tests that the feature when turned on works
as expected. Since we can only set config values for contrib tests I
don't see how else to do this, but I am open to suggestions.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23Petr Jelinek
petr@2ndquadrant.com
In reply to: Petr Jelinek (#22)
Re: tracking commit timestamps

On 01/11/14 12:19, Petr Jelinek wrote:

Hi,

thanks for review.

On 01/11/14 05:45, Michael Paquier wrote:

Now here are a couple of comments at code level, this code seems not
enough baked for a commit:
1) The following renaming should be done:
- pg_get_transaction_committime to pg_get_transaction_commit_time
- pg_get_transaction_extradata to pg_get_transaction_extra_data
- pg_get_transaction_committime_data to
pg_get_transaction_commit_time_data
- pg_get_latest_transaction_committime_data to
pg_get_latest_transaction_commit_time_data

Makes sense.

On second thought, maybe those should be pg_get_transaction_committs,
pg_get_transaction_committs_data, etc.

For me the commit time thing feels problematic in the way I perceive it
- I see commit time as a point in time, where I see commit timestamp (or
committs for short) as something that can recorded. So I would prefer to
stick with commit timestamp/committs.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Michael Paquier
michael.paquier@gmail.com
In reply to: Petr Jelinek (#23)
Re: tracking commit timestamps

On Sat, Nov 1, 2014 at 9:04 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

On 01/11/14 12:19, Petr Jelinek wrote:

Hi,

thanks for review.

On 01/11/14 05:45, Michael Paquier wrote:

Now here are a couple of comments at code level, this code seems not
enough baked for a commit:
1) The following renaming should be done:
- pg_get_transaction_committime to pg_get_transaction_commit_time
- pg_get_transaction_extradata to pg_get_transaction_extra_data
- pg_get_transaction_committime_data to
pg_get_transaction_commit_time_data
- pg_get_latest_transaction_committime_data to
pg_get_latest_transaction_commit_time_data

Makes sense.

On second thought, maybe those should be pg_get_transaction_committs,
pg_get_transaction_committs_data, etc.
For me the commit time thing feels problematic in the way I perceive it -
I see commit time as a point in time, where I see commit timestamp (or
committs for short) as something that can recorded. So I would prefer to
stick with commit timestamp/committs.

Hehe, I got exactly the opposite impression while reading the patch, but
let's rely on your judgement for the namings. I am not the one writing this
code.
--
Michael

#25Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#21)
Re: tracking commit timestamps

On Sat, Nov 1, 2014 at 1:45 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

I am still planning to do more extensive tests, and study a bit more
committs.c (with even more comments) as it is the core part of the feature.

More comments:
- Heikki already mentioned it, but after reading the code I see little
point in having the extra field implementing like that in core for many
reasons even if it is *just* 4 bytes:
1) It is untested and actually there is no direct use for it in core.
2) Pushing code that we know as dead is no good, that's a feature more or
less defined as maybe-useful-but-we-are-not-sure-yet-what-to-do-with-it.
3) If you're going to re-use this API in BDR, which is a fork of Postgres.
You'd better complete this API in BDR by yourself and not bother core with
that.
For those reasons I think that this extra field should be ripped off from
the patch.
- The API to get the commit timestamp is not that user-friendly, and I
think it could really be improved, to something like that for example:
pg_get_commit_timestamp(from_xact xid, number_of_xacts int);
pg_get_commit_timestamp(from_xact xid);
pg_get_commit_timestamp(); or pg_get_latest_commit_timestamp();
from_xact to NULL means latest. number_of_xacts to NULL means 1.
Comment in line with the fact that extra field is well, not really useful.
Regards,
--
Michael

#26Petr Jelinek
petr@2ndquadrant.com
In reply to: Michael Paquier (#25)
Re: tracking commit timestamps

On 01/11/14 14:00, Michael Paquier wrote:

More comments:
- Heikki already mentioned it, but after reading the code I see little
point in having the extra field implementing like that in core for many
reasons even if it is *just* 4 bytes:
1) It is untested and actually there is no direct use for it in core.
2) Pushing code that we know as dead is no good, that's a feature more
or less defined as maybe-useful-but-we-are-not-sure-yet-what-to-do-with-it.
3) If you're going to re-use this API in BDR, which is a fork of
Postgres. You'd better complete this API in BDR by yourself and not
bother core with that.
For those reasons I think that this extra field should be ripped off
from the patch.

Well this is not BDR specific thing, the idea is that with logical
replication, commit timestamp is not enough for conflict handling, you
also need to have additional info in order to identify some types of
conflicts conflicts (local update vs remote update for example). So the
extradata field was meant as something that could be used to add the
additional info to the xid.

But I see your point, I think solving this issue can be left to the
replication identifier patch that is discussed in separate thread.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#27Andres Freund
andres@2ndquadrant.com
In reply to: Michael Paquier (#21)
Re: tracking commit timestamps

On 2014-11-01 13:45:44 +0900, Michael Paquier wrote:

14) I'd put the two checks in the reverse order:
+       Assert(xid != InvalidTransactionId);
+
+       if (!commit_ts_enabled)
+               return;

Please don't. The order is correct right now. Why you ask? This way the
correctness of the callsites is checked even when committs is
disabled. Which it'll likely be on the majority of developer setups. And
what's the upsite of changing the order? There's no difference in the
generated code in production builds and the overhead in assert enabled
ones is neglegible.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#28Andres Freund
andres@2ndquadrant.com
In reply to: Michael Paquier (#25)
Re: tracking commit timestamps

On 2014-11-01 22:00:40 +0900, Michael Paquier wrote:

On Sat, Nov 1, 2014 at 1:45 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

I am still planning to do more extensive tests, and study a bit more
committs.c (with even more comments) as it is the core part of the feature.

More comments:
- Heikki already mentioned it, but after reading the code I see little
point in having the extra field implementing like that in core for many
reasons even if it is *just* 4 bytes:
1) It is untested and actually there is no direct use for it in core.

Meh. The whole feature is only there for extensions, not core.

2) Pushing code that we know as dead is no good, that's a feature more or
less defined as maybe-useful-but-we-are-not-sure-yet-what-to-do-with-it.

Uh. It's not more/less dead than the whole of committs.

3) If you're going to re-use this API in BDR, which is a fork of Postgres.
You'd better complete this API in BDR by yourself and not bother core with
that.

I think that's a fundamentally wrong position. The only reason BDR isn't
purely stock postgres is that some features couldn't sanely be made work
without patches. I *hate* the fact that we had to do so. And I really
hope that we don't need any of the patches we have when building against
9.5.

So, now you might argue that the additional data is useless. But I think
that's just not thought far enough. If you think about it, in which
scenarios do you want to map xids to the commit timestamp? Primarily
that's going to be replication, right? One of the most obvious usecases
is allowing to detect/analyze/resolve conflicts in a multimaster setup,
right? To make sensible decisisons you'll often want to have more
information about the involved transactions. Makes sense so far?

Now, you might argue that could just be done with some table
transaction_metadata(xid DEFAULT txid_current(), meta, data). But that
has *significant* disadvantages: For one, it'll not work correctly once
subtransactions are used. Not good. For another it has about a
magnitude higher overhead than the committs way.

And it's not like the the extra field is in any way bdr specific - even
if you actually want to store much more information about the
transaction than just the origin (which is what bdr does), you can use
it to correctly solve the subtransaction problem and refer to some
transaction metadata table.

- The API to get the commit timestamp is not that user-friendly, and I
think it could really be improved, to something like that for example:
pg_get_commit_timestamp(from_xact xid, number_of_xacts int);

What'd be the point of this?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#29Andres Freund
andres@2ndquadrant.com
In reply to: Petr Jelinek (#26)
Re: tracking commit timestamps

On 2014-11-01 14:41:02 +0100, Petr Jelinek wrote:

On 01/11/14 14:00, Michael Paquier wrote:

More comments:
- Heikki already mentioned it, but after reading the code I see little
point in having the extra field implementing like that in core for many
reasons even if it is *just* 4 bytes:
1) It is untested and actually there is no direct use for it in core.
2) Pushing code that we know as dead is no good, that's a feature more
or less defined as maybe-useful-but-we-are-not-sure-yet-what-to-do-with-it.
3) If you're going to re-use this API in BDR, which is a fork of
Postgres. You'd better complete this API in BDR by yourself and not
bother core with that.
For those reasons I think that this extra field should be ripped off
from the patch.

Well this is not BDR specific thing, the idea is that with logical
replication, commit timestamp is not enough for conflict handling, you also
need to have additional info in order to identify some types of conflicts
conflicts (local update vs remote update for example). So the extradata
field was meant as something that could be used to add the additional info
to the xid.

But I see your point, I think solving this issue can be left to the
replication identifier patch that is discussed in separate thread.

For me this really hasn't anything directly to do with replication
identifiers, so delaying this decision doesn't make sense to me.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30Petr Jelinek
petr@2ndquadrant.com
In reply to: Andres Freund (#28)
Re: tracking commit timestamps

On 01/11/14 18:44, Andres Freund wrote:

On 2014-11-01 22:00:40 +0900, Michael Paquier wrote:

On Sat, Nov 1, 2014 at 1:45 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

I am still planning to do more extensive tests, and study a bit more
committs.c (with even more comments) as it is the core part of the feature.

More comments:
- Heikki already mentioned it, but after reading the code I see little
point in having the extra field implementing like that in core for many
reasons even if it is *just* 4 bytes:
1) It is untested and actually there is no direct use for it in core.

Meh. The whole feature is only there for extensions, not core.

2) Pushing code that we know as dead is no good, that's a feature more or
less defined as maybe-useful-but-we-are-not-sure-yet-what-to-do-with-it.

Uh. It's not more/less dead than the whole of committs.

3) If you're going to re-use this API in BDR, which is a fork of Postgres.
You'd better complete this API in BDR by yourself and not bother core with
that.

I think that's a fundamentally wrong position. The only reason BDR isn't
purely stock postgres is that some features couldn't sanely be made work
without patches. I *hate* the fact that we had to do so. And I really
hope that we don't need any of the patches we have when building against
9.5.

So, now you might argue that the additional data is useless. But I think
that's just not thought far enough. If you think about it, in which
scenarios do you want to map xids to the commit timestamp? Primarily
that's going to be replication, right? One of the most obvious usecases
is allowing to detect/analyze/resolve conflicts in a multimaster setup,
right? To make sensible decisisons you'll often want to have more
information about the involved transactions. Makes sense so far?

Now, you might argue that could just be done with some table
transaction_metadata(xid DEFAULT txid_current(), meta, data). But that
has *significant* disadvantages: For one, it'll not work correctly once
subtransactions are used. Not good. For another it has about a
magnitude higher overhead than the committs way.

And it's not like the the extra field is in any way bdr specific - even
if you actually want to store much more information about the
transaction than just the origin (which is what bdr does), you can use
it to correctly solve the subtransaction problem and refer to some
transaction metadata table.

Well, Michael has point that the extradata is pretty much useless
currently, perhaps it would help to add the interface to set extradata?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Petr Jelinek (#26)
Re: tracking commit timestamps

On 11/1/14, 8:41 AM, Petr Jelinek wrote:

Well this is not BDR specific thing, the idea is that with logical replication, commit timestamp is not enough for conflict handling, you also need to have additional info in order to identify some types of conflicts conflicts (local update vs remote update for example). So the extradata field was meant as something that could be used to add the additional info to the xid.

Related to this... is there any way to deal with 2 transactions that commit in the same microsecond? It seems silly to try and handle that for every commit since it should be quite rare, but perhaps we could store the LSN as extradata if we detect a conflict?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#32Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jim Nasby (#31)
Re: tracking commit timestamps

Jim Nasby wrote:

On 11/1/14, 8:41 AM, Petr Jelinek wrote:

Well this is not BDR specific thing, the idea is that with logical replication, commit timestamp is not enough for conflict handling, you also need to have additional info in order to identify some types of conflicts conflicts (local update vs remote update for example). So the extradata field was meant as something that could be used to add the additional info to the xid.

Related to this... is there any way to deal with 2 transactions that commit in the same microsecond? It seems silly to try and handle that for every commit since it should be quite rare, but perhaps we could store the LSN as extradata if we detect a conflict?

Well, two things. One, LSN is 8 bytes and extradata (at least in this
patch when I last saw it) is only 4 bytes. But secondly and more
important is that detecting a conflict is done in node B *after* node A
has recorded the transaction's commit time; there is no way to know at
commit time that there is going to be a conflict caused by that
transaction in the future. (If there was a way to tell, you could just
as well not commit the transaction in the first place.)

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#33Peter Eisentraut
peter_e@gmx.net
In reply to: Petr Jelinek (#23)
Re: tracking commit timestamps

On 11/1/14 8:04 AM, Petr Jelinek wrote:

On second thought, maybe those should be pg_get_transaction_committs,
pg_get_transaction_committs_data, etc.

Please don't name anything "committs". That looks like a misspelling of
something.

There is nothing wrong with

pg_get_transaction_commit_timestamp()

If you want to reduce the length, lose the "get".

For me the commit time thing feels problematic in the way I perceive it
- I see commit time as a point in time, where I see commit timestamp (or
committs for short) as something that can recorded. So I would prefer to
stick with commit timestamp/committs.

In PostgreSQL, it is pretty clearly established that time is hours,
minutes, seconds, and timestamp is years, months, days, hours, minutes,
seconds. So unless this feature only records the hour, minute, and
second of a commit, it should be "timestamp".

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#34Petr Jelinek
petr@2ndquadrant.com
In reply to: Peter Eisentraut (#33)
Re: tracking commit timestamps

On 03/11/14 22:26, Peter Eisentraut wrote:

On 11/1/14 8:04 AM, Petr Jelinek wrote:

On second thought, maybe those should be pg_get_transaction_committs,
pg_get_transaction_committs_data, etc.

Please don't name anything "committs". That looks like a misspelling of
something.

There is nothing wrong with

pg_get_transaction_commit_timestamp()

If you want to reduce the length, lose the "get".

I am fine with that, I only wonder if your definition of "anything" only
concerns the SQL interfaces or also the internals.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#35Merlin Moncure
mmoncure@gmail.com
In reply to: Peter Eisentraut (#33)
Re: tracking commit timestamps

On Mon, Nov 3, 2014 at 3:26 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

On 11/1/14 8:04 AM, Petr Jelinek wrote:

On second thought, maybe those should be pg_get_transaction_committs,
pg_get_transaction_committs_data, etc.

Please don't name anything "committs". That looks like a misspelling of
something.

There is nothing wrong with

pg_get_transaction_commit_timestamp()

If you want to reduce the length, lose the "get".

+1: all non void returning functions 'get' something.

For me the commit time thing feels problematic in the way I perceive it
- I see commit time as a point in time, where I see commit timestamp (or
committs for short) as something that can recorded. So I would prefer to
stick with commit timestamp/committs.

In PostgreSQL, it is pretty clearly established that time is hours,
minutes, seconds, and timestamp is years, months, days, hours, minutes,
seconds. So unless this feature only records the hour, minute, and
second of a commit, it should be "timestamp".

Elsewhere, for example, we have: "pg_last_xact_replay_timestamp()".
So, in keeping with that, maybe,

pg_xact_commit_timestamp(xid)

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36Andres Freund
andres@2ndquadrant.com
In reply to: Petr Jelinek (#30)
Re: tracking commit timestamps

On 2014-11-02 19:27:25 +0100, Petr Jelinek wrote:

Well, Michael has point that the extradata is pretty much useless currently,
perhaps it would help to add the interface to set extradata?

Only accessible via C and useless aren't the same thing. But sure, add
it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#37Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#25)
Re: tracking commit timestamps

On Sat, Nov 1, 2014 at 10:00 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

More comments:

I have done a couple of tests on my laptop with pgbench like that to
generate a maximum of transaction commits:
$ pgbench --no-vacuum -f ~/Desktop/commit.sql -T 60 -c 24 -j 24
$ cat ~/Desktop/commit.sql
SELECT txid_current()
Here is an average of 5 runs:
- master: 49842.44
- GUC off, patched = 49688.71
- GUC on, patched = 49459.73
So there is little noise.

Here are also more comments about the code that I found while focusing on
committs.c:
1) When the GUC is not enabled, and when the transaction ID provided is not
in a correct range, a dummy value based on InvalidTransactionId (?!) is
returned, like that:
+       /* Return empty if module not enabled */
+       if (!commit_ts_enabled)
+       {
+               if (ts)
+                       *ts = InvalidTransactionId;
+               if (data)
+                       *data = (CommitExtraData) 0;
+               return;
+       }
This leads to some incorrect results:
=# select pg_get_transaction_committime('1');
 pg_get_transaction_committime
-------------------------------
 2000-01-01 09:00:00+09
(1 row)
Or for example:
=# SELECT txid_current();
 txid_current
--------------
         1006
(1 row)
=# select pg_get_transaction_committime('1006');
 pg_get_transaction_committime
-------------------------------
 2014-11-04 14:54:37.589381+09
(1 row)
=# select pg_get_transaction_committime('1007');
 pg_get_transaction_committime
-------------------------------
 2000-01-01 09:00:00+09
(1 row)
=# SELECT txid_current();
 txid_current
--------------
         1007
(1 row)
And at other times error is not that helpful for user:
=# select pg_get_transaction_committime('10000');
ERROR:  XX000: could not access status of transaction 10000
DETAIL:  Could not read from file "pg_committs/0000" at offset 114688:
Undefined error: 0.
LOCATION:  SlruReportIOError, slru.c:896
I think that you should simply return an error in
TransactionIdGetCommitTsData when xid is not in the range supported. That
will be more transparent for the user.
2) You may as well want to return a different error if the GUC
track_commit_timestamps is disabled.
3) This comment should be updated in committs.c, we are not talking about
CLOG here:
+/*
+ * Link to shared-memory data structures for CLOG control
+ */
4) Similarly, I think more comments should be put in here. It is OK to
truncate the commit timestamp stuff similarly to CLOG to have a consistent
status context available, but let's explain it.
         * multixacts; that will be done by the next checkpoint.
         */
        TruncateCLOG(frozenXID);
+       TruncateCommitTs(frozenXID)
5) Reading the code, TransactionTreeSetCommitTimestamp is always called
with do_xlog = false, making that actually no timestamps are WAL'd... Hence
WriteSetTimestampXlogRec is just dead code with the latest version of the
patch. IMO, this should be always WAL-logged when track_commit_timestamp is
on.
6) Shouldn't any value update of track_commit_timestamp be tracked in
XLogReportParameters? That's thinking about making the commit timestamp
available on standbys as well..
7) pg_xlogdump has no support for RM_COMMITTS_ID, something that would be
useful for developers.
8) The redo and xlog routines of committs should be out in a dedicated
file, like committsxlog.c or similar. The other rmgr do so, so let's be
consistent.

Regards,
--
Michael

#38Andres Freund
andres@2ndquadrant.com
In reply to: Michael Paquier (#37)
Re: tracking commit timestamps

On 2014-11-04 17:19:18 +0900, Michael Paquier wrote:

5) Reading the code, TransactionTreeSetCommitTimestamp is always called
with do_xlog = false, making that actually no timestamps are WAL'd... Hence
WriteSetTimestampXlogRec is just dead code with the latest version of the
patch. IMO, this should be always WAL-logged when track_commit_timestamp is
on.

It's callable via a 'extern' function. So, I'd not consider it dead. And
the WAL logging is provided by xact.c's own WAL logging - it always does
the corresponding committs calls.

6) Shouldn't any value update of track_commit_timestamp be tracked in
XLogReportParameters? That's thinking about making the commit timestamp
available on standbys as well..

Yes, it should.

7) pg_xlogdump has no support for RM_COMMITTS_ID, something that would be
useful for developers.

What do you mean by that? There's the corresponding rmgrdesc.c support I
think?

8) The redo and xlog routines of committs should be out in a dedicated
file, like committsxlog.c or similar. The other rmgr do so, so let's be
consistent.

Seems pointless to me. The file isn't that big and the other SLRUs don't
do it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#36)
Re: tracking commit timestamps

On Tue, Nov 4, 2014 at 5:05 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-11-02 19:27:25 +0100, Petr Jelinek wrote:

Well, Michael has point that the extradata is pretty much useless

currently,

perhaps it would help to add the interface to set extradata?

Only accessible via C and useless aren't the same thing. But sure, add
it.

I'm still on a -1 for that. You mentioned that there is perhaps no reason
to delay a decision on this matter, but IMO there is no reason to rush
either in doing something we may regret. And I am not the only one on this
thread expressing concern about this extra data thingy.

If this extra data field is going to be used to identify from which node a
commit comes from, then it is another feature than what is written on the
subject of this thread. In this case let's discuss it in the thread
dedicated to replication identifiers, or come up with an extra patch once
the feature for commit timestamps is done.
--
Michael

#40Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#38)
Re: tracking commit timestamps

On Tue, Nov 4, 2014 at 5:23 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-11-04 17:19:18 +0900, Michael Paquier wrote:

5) Reading the code, TransactionTreeSetCommitTimestamp is always called
with do_xlog = false, making that actually no timestamps are WAL'd...

Hence

WriteSetTimestampXlogRec is just dead code with the latest version of the
patch. IMO, this should be always WAL-logged when track_commit_timestamp

is

on.

It's callable via a 'extern' function. So, I'd not consider it dead. And
the WAL logging is provided by xact.c's own WAL logging - it always does
the corresponding committs calls.

The code path is unused. We'd better make the XLOG record mandatory if
tracking is enabled, as this information is useful on standbys as well.

7) pg_xlogdump has no support for RM_COMMITTS_ID, something that would be
useful for developers.

What do you mean by that? There's the corresponding rmgrdesc.c support I
think?

Oops sorry. I thought there was some big switch in pg_xlogdump when writing
this comment. Yeah that's fine.
--
Michael

#41Andres Freund
andres@2ndquadrant.com
In reply to: Michael Paquier (#40)
Re: tracking commit timestamps

On 2014-11-04 17:29:04 +0900, Michael Paquier wrote:

On Tue, Nov 4, 2014 at 5:23 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-11-04 17:19:18 +0900, Michael Paquier wrote:

5) Reading the code, TransactionTreeSetCommitTimestamp is always called
with do_xlog = false, making that actually no timestamps are WAL'd...

Hence

WriteSetTimestampXlogRec is just dead code with the latest version of the
patch. IMO, this should be always WAL-logged when track_commit_timestamp

is

on.

It's callable via a 'extern' function. So, I'd not consider it dead. And
the WAL logging is provided by xact.c's own WAL logging - it always does
the corresponding committs calls.

The code path is unused.

No. It is not. It can be called by extensions?

We'd better make the XLOG record mandatory if
tracking is enabled, as this information is useful on standbys as well.

Did you read what I wrote? To quote "And the WAL logging is provided by
xact.c's own WAL logging - it always does the corresponding committs
calls.".

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#42Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#39)
Re: tracking commit timestamps

Michael Paquier wrote:

I'm still on a -1 for that. You mentioned that there is perhaps no reason
to delay a decision on this matter, but IMO there is no reason to rush
either in doing something we may regret. And I am not the only one on this
thread expressing concern about this extra data thingy.

If this extra data field is going to be used to identify from which node a
commit comes from, then it is another feature than what is written on the
subject of this thread. In this case let's discuss it in the thread
dedicated to replication identifiers, or come up with an extra patch once
the feature for commit timestamps is done.

Introducing the extra data field in a later patch would mean an on-disk
representation change, i.e. pg_upgrade trouble.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#43Andres Freund
andres@2ndquadrant.com
In reply to: Alvaro Herrera (#42)
Re: tracking commit timestamps

On 2014-11-04 10:01:00 -0300, Alvaro Herrera wrote:

Michael Paquier wrote:

I'm still on a -1 for that. You mentioned that there is perhaps no reason
to delay a decision on this matter, but IMO there is no reason to rush
either in doing something we may regret. And I am not the only one on this
thread expressing concern about this extra data thingy.

If this extra data field is going to be used to identify from which node a
commit comes from, then it is another feature than what is written on the
subject of this thread. In this case let's discuss it in the thread
dedicated to replication identifiers, or come up with an extra patch once
the feature for commit timestamps is done.

Introducing the extra data field in a later patch would mean an on-disk
representation change, i.e. pg_upgrade trouble.

It's also simply not necessarily related to replication
identifiers. This is useful whether replication identifiers make it in
or not. It allows to implement something like replication identifiers
outside of core (albeit with a hefty overhead in OLTP workloads).

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#44Petr Jelinek
petr@2ndquadrant.com
In reply to: Andres Freund (#36)
Re: tracking commit timestamps

On 04/11/14 09:05, Andres Freund wrote:

On 2014-11-02 19:27:25 +0100, Petr Jelinek wrote:

Well, Michael has point that the extradata is pretty much useless currently,
perhaps it would help to add the interface to set extradata?

Only accessible via C and useless aren't the same thing. But sure, add
it.

I actually meant nicer C api - the one that will make it possible to say
for this transaction, use this extradata (or for all transactions from
now on done by this backend use this extradata), instead of current API
where you have to overwrite what RecordCommit already wrote.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#45Petr Jelinek
petr@2ndquadrant.com
In reply to: Michael Paquier (#37)
Re: tracking commit timestamps

On 04/11/14 09:19, Michael Paquier wrote:

On Sat, Nov 1, 2014 at 10:00 PM, Michael Paquier
<michael.paquier@gmail.com <mailto:michael.paquier@gmail.com>> wrote:

More comments:

Here are also more comments about the code that I found while focusing
on committs.c:
1) When the GUC is not enabled, and when the transaction ID provided is
not in a correct range, a dummy value based on InvalidTransactionId (?!)
is returned, like that:
+       /* Return empty if module not enabled */
+       if (!commit_ts_enabled)
+       {
+               if (ts)
+                       *ts = InvalidTransactionId;
+               if (data)
+                       *data = (CommitExtraData) 0;
+               return;
+       }
This leads to some incorrect results:
=# select pg_get_transaction_committime('1');
pg_get_transaction_committime
-------------------------------
2000-01-01 09:00:00+09
(1 row)

Oh, I had this on my TODO and somehow forgot about it, I am leaning
towards throwing an error when calling any committs "get" function while
the tracking is disabled.

Or for example:
=# SELECT txid_current();
txid_current
--------------
1006
(1 row)
=# select pg_get_transaction_committime('1006');
pg_get_transaction_committime
-------------------------------
2014-11-04 14:54:37.589381+09
(1 row)
=# select pg_get_transaction_committime('1007');
pg_get_transaction_committime
-------------------------------
2000-01-01 09:00:00+09
(1 row)
=# SELECT txid_current();
txid_current
--------------
1007
(1 row)
And at other times error is not that helpful for user:
=# select pg_get_transaction_committime('10000');
ERROR: XX000: could not access status of transaction 10000
DETAIL: Could not read from file "pg_committs/0000" at offset 114688:
Undefined error: 0.
LOCATION: SlruReportIOError, slru.c:896
I think that you should simply return an error in
TransactionIdGetCommitTsData when xid is not in the range supported.
That will be more transparent for the user.

Agreed.

5) Reading the code, TransactionTreeSetCommitTimestamp is always called
with do_xlog = false, making that actually no timestamps are WAL'd...
Hence WriteSetTimestampXlogRec is just dead code with the latest version
of the patch. IMO, this should be always WAL-logged when
track_commit_timestamp is on.

As Andres explained this is always WAL-logged when called by current
caller so we don't want it to be double logged, so that's why do_xlog =
false, but when extension will need call it, it will most likely need
do_xlog = true.

8) The redo and xlog routines of committs should be out in a dedicated
file, like committsxlog.c or similar. The other rmgr do so, so let's be
consistent.

Most actually don't AFAICS.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#46Peter Eisentraut
peter_e@gmx.net
In reply to: Petr Jelinek (#34)
Re: tracking commit timestamps

On 11/3/14 5:17 PM, Petr Jelinek wrote:

Please don't name anything "committs". That looks like a misspelling of
something.

There is nothing wrong with

pg_get_transaction_commit_timestamp()

If you want to reduce the length, lose the "get".

I am fine with that, I only wonder if your definition of "anything" only
concerns the SQL interfaces or also the internals.

I'd be fine with commit_ts for internals, but not committs.

One day, you'll need a function or data structure that works with
multiple of these, and then you'll really be in naming trouble. ;-)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#47Petr Jelinek
petr@2ndquadrant.com
In reply to: Michael Paquier (#39)
Re: tracking commit timestamps

On 04/11/14 09:25, Michael Paquier wrote:

On Tue, Nov 4, 2014 at 5:05 PM, Andres Freund <andres@2ndquadrant.com
<mailto:andres@2ndquadrant.com>> wrote:

On 2014-11-02 19:27:25 +0100, Petr Jelinek wrote:

Well, Michael has point that the extradata is pretty much useless currently,
perhaps it would help to add the interface to set extradata?

Only accessible via C and useless aren't the same thing. But sure, add
it.

I'm still on a -1 for that. You mentioned that there is perhaps no
reason to delay a decision on this matter, but IMO there is no reason to
rush either in doing something we may regret. And I am not the only one
on this thread expressing concern about this extra data thingy.

If this extra data field is going to be used to identify from which node
a commit comes from, then it is another feature than what is written on
the subject of this thread. In this case let's discuss it in the thread
dedicated to replication identifiers, or come up with an extra patch
once the feature for commit timestamps is done.

The issue as I see it is that both of those features are closely related
and just one without the other has very limited use. What I learned from
working on UDR is that for conflict handling, I was actually missing the
extradata more than the timestamp itself - in other words I have
extension for 9.4 where I have use for this API already, so the argument
about dead code or forks or whatever does not really hold.

The other problem is that if we add extradata later we will either break
upgrade-ability of will have to write essentially same code again which
will store just the extradata instead of the timestamp, I don't really
like either of those options to be honest.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#48Petr Jelinek
petr@2ndquadrant.com
In reply to: Peter Eisentraut (#46)
Re: tracking commit timestamps

On 04/11/14 22:20, Peter Eisentraut wrote:

On 11/3/14 5:17 PM, Petr Jelinek wrote:

Please don't name anything "committs". That looks like a misspelling of
something.

There is nothing wrong with

pg_get_transaction_commit_timestamp()

If you want to reduce the length, lose the "get".

I am fine with that, I only wonder if your definition of "anything" only
concerns the SQL interfaces or also the internals.

I'd be fine with commit_ts for internals, but not committs.

One day, you'll need a function or data structure that works with
multiple of these, and then you'll really be in naming trouble. ;-)

Hmm we use CommitTs in interfaces that uses CamelCase naming so I guess
commit_ts is indeed natural expansion of that.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#49Michael Paquier
michael.paquier@gmail.com
In reply to: Alvaro Herrera (#42)
Re: tracking commit timestamps

On Tue, Nov 4, 2014 at 10:01 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Michael Paquier wrote:

I'm still on a -1 for that. You mentioned that there is perhaps no reason
to delay a decision on this matter, but IMO there is no reason to rush
either in doing something we may regret. And I am not the only one on

this

thread expressing concern about this extra data thingy.

If this extra data field is going to be used to identify from which node

a

commit comes from, then it is another feature than what is written on the
subject of this thread. In this case let's discuss it in the thread
dedicated to replication identifiers, or come up with an extra patch once
the feature for commit timestamps is done.

Introducing the extra data field in a later patch would mean an on-disk
representation change, i.e. pg_upgrade trouble.

Then why especially 4 bytes for the extra field? Why not 8 or 16?
--
Michael

#50Andres Freund
andres@2ndquadrant.com
In reply to: Michael Paquier (#49)
Re: tracking commit timestamps

On 2014-11-05 08:57:07 +0900, Michael Paquier wrote:

On Tue, Nov 4, 2014 at 10:01 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Michael Paquier wrote:

I'm still on a -1 for that. You mentioned that there is perhaps no reason
to delay a decision on this matter, but IMO there is no reason to rush
either in doing something we may regret. And I am not the only one on

this

thread expressing concern about this extra data thingy.

If this extra data field is going to be used to identify from which node

a

commit comes from, then it is another feature than what is written on the
subject of this thread. In this case let's discuss it in the thread
dedicated to replication identifiers, or come up with an extra patch once
the feature for commit timestamps is done.

Introducing the extra data field in a later patch would mean an on-disk
representation change, i.e. pg_upgrade trouble.

Then why especially 4 bytes for the extra field? Why not 8 or 16?

It's sufficiently long that you can build infrastructure to storing more
transaction metadata data ontop. I could live making it 8 bytes, but I
don't see a clear advantage.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#51Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Alvaro Herrera (#32)
Re: tracking commit timestamps

On 11/3/14, 2:36 PM, Alvaro Herrera wrote:

Jim Nasby wrote:

On 11/1/14, 8:41 AM, Petr Jelinek wrote:

Well this is not BDR specific thing, the idea is that with logical replication, commit timestamp is not enough for conflict handling, you also need to have additional info in order to identify some types of conflicts conflicts (local update vs remote update for example). So the extradata field was meant as something that could be used to add the additional info to the xid.

Related to this... is there any way to deal with 2 transactions that commit in the same microsecond? It seems silly to try and handle that for every commit since it should be quite rare, but perhaps we could store the LSN as extradata if we detect a conflict?

Well, two things. One, LSN is 8 bytes and extradata (at least in this
patch when I last saw it) is only 4 bytes. But secondly and more
important is that detecting a conflict is done in node B *after* node A
has recorded the transaction's commit time; there is no way to know at
commit time that there is going to be a conflict caused by that
transaction in the future. (If there was a way to tell, you could just
as well not commit the transaction in the first place.)

I'm worried about 2 commits in the same microsecond on the same system, not on 2 different systems. Or, put another way, if we're going to expose this I think it should also provide a guaranteed unique commit ordering for a single cluster. Presumably, this shouldn't be that hard since we do know the exact order in which things committed.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#52Anssi Kääriäinen
anssi.kaariainen@thl.fi
In reply to: Jim Nasby (#51)
Re: tracking commit timestamps

On Tue, 2014-11-04 at 23:43 -0600, Jim Nasby wrote:

I'm worried about 2 commits in the same microsecond on the same
system, not on 2 different systems. Or, put another way, if we're
going to expose this I think it should also provide a guaranteed
unique commit ordering for a single cluster. Presumably, this
shouldn't be that hard since we do know the exact order in which
things committed.

Addition of LSN when the timestamps for two transactions are exactly
same isn't enough. There isn't any guarantee that a later commit gets a
later timestamp than an earlier commit.

In addition, I wonder if this feature would be misused. Record
transaction ids to a table to find out commit order (use case could be
storing historical row versions for example). Do a dump and restore on
another cluster, and all the transaction ids are completely meaningless
to the system.

Having the ability to record commit order into an audit table would be
extremely welcome, but as is, this feature doesn't provide it.

- Anssi

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#53Michael Paquier
michael.paquier@gmail.com
In reply to: Anssi Kääriäinen (#52)
Re: tracking commit timestamps

On Wed, Nov 5, 2014 at 5:24 PM, Anssi Kääriäinen <anssi.kaariainen@thl.fi>
wrote:

On Tue, 2014-11-04 at 23:43 -0600, Jim Nasby wrote:

I'm worried about 2 commits in the same microsecond on the same
system, not on 2 different systems. Or, put another way, if we're
going to expose this I think it should also provide a guaranteed
unique commit ordering for a single cluster. Presumably, this
shouldn't be that hard since we do know the exact order in which
things committed.

Addition of LSN when the timestamps for two transactions are exactly
same isn't enough. There isn't any guarantee that a later commit gets a
later timestamp than an earlier commit.

True if WAL record ID is not globally consistent. Two-level commit ordering
can be done with (timestamp or LSN, nodeID). At equal timestamp, we could
say as well that the node with the lowest systemID wins for example. That's
not something

In addition, I wonder if this feature would be misused. Record
transaction ids to a table to find out commit order (use case could be
storing historical row versions for example). Do a dump and restore on
another cluster, and all the transaction ids are completely meaningless
to the system.

I think you are forgetting the fact to be able to take a consistent dump
using an exported snapshot. In this case the commit order may not be that
meaningless..

Having the ability to record commit order into an audit table would be
extremely welcome, but as is, this feature doesn't provide it.

That's something that can actually be achieved with this feature if the SQL
interface is able to query all the timestamps in a xid range with for
example a background worker that tracks this data periodically. Now the
thing is as well: how much timestamp history do we want to keep? The patch
truncating SLRU files with frozenID may cover a sufficient range...
--
Michael

#54Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Michael Paquier (#53)
Re: tracking commit timestamps

On 11/5/14, 6:10 AM, Michael Paquier wrote:

In addition, I wonder if this feature would be misused. Record
transaction ids to a table to find out commit order (use case could be
storing historical row versions for example). Do a dump and restore on
another cluster, and all the transaction ids are completely meaningless
to the system.

I think you are forgetting the fact to be able to take a consistent dump using an exported snapshot. In this case the commit order may not be that meaningless..

Anssi's point is that you can't use xmin because it can change, but I think anyone working with this feature would understand that.

Having the ability to record commit order into an audit table would be
extremely welcome, but as is, this feature doesn't provide it.

That's something that can actually be achieved with this feature if the SQL interface is able to query all the timestamps in a xid range with for example a background worker that tracks this data periodically. Now the thing is as well: how much timestamp history do we want to keep? The patch truncating SLRU files with frozenID may cover a sufficient range...

Except that commit time is not guaranteed unique *even on a single system*. That's my whole point. If we're going to bother with all the commit time machinery it seems really silly to provide a way to uniquely order every commit.

Clearly trying to uniquely order commits across multiple systems is a far larger problem, and I'm not suggesting we attempt that. But for a single system AIUI all we need to do is expose the LSN of each commit record and that will give you the exact and unique order in which transactions committed.

This isn't a hypothetical feature either; if we had this, logical replication systems wouldn't have to try and fake this via batches. You could actually recreate exactly what data was visible at what time to all transactions, not just repeatable read ones (as long as you kept snapshot data as well, which isn't hard).

As for how much data to keep, if you have a process that's doing something to record this information permanently all it needs to do is keep an old enough snapshot around. That's not that hard to do, even from user space.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#55Andres Freund
andres@2ndquadrant.com
In reply to: Jim Nasby (#54)
Re: tracking commit timestamps

On 2014-11-05 10:23:15 -0600, Jim Nasby wrote:

On 11/5/14, 6:10 AM, Michael Paquier wrote:

In addition, I wonder if this feature would be misused. Record
transaction ids to a table to find out commit order (use case could be
storing historical row versions for example). Do a dump and restore on
another cluster, and all the transaction ids are completely meaningless
to the system.

I think you are forgetting the fact to be able to take a consistent dump using an exported snapshot. In this case the commit order may not be that meaningless..

Anssi's point is that you can't use xmin because it can change, but I think anyone working with this feature would understand that.

Having the ability to record commit order into an audit table would be
extremely welcome, but as is, this feature doesn't provide it.

That's something that can actually be achieved with this feature if
the SQL interface is able to query all the timestamps in a xid range
with for example a background worker that tracks this data
periodically. Now the thing is as well: how much timestamp history do
we want to keep? The patch truncating SLRU files with frozenID may
cover a sufficient range...

Except that commit time is not guaranteed unique *even on a single
system*. That's my whole point. If we're going to bother with all the
commit time machinery it seems really silly to provide a way to
uniquely order every commit.

Well. I think that's the misunderstanding here. That's absolutely not
what committs is supposed to be used for. For the replication stream
you'd hopefully use logical decoding. That gives you the transaction
data exactly in commit order.

Clearly trying to uniquely order commits across multiple systems is a
far larger problem, and I'm not suggesting we attempt that. But for a
single system AIUI all we need to do is expose the LSN of each commit
record and that will give you the exact and unique order in which
transactions committed.

I don't think that's something you should attempt. That's what logical
decoding is for. Hence I see little point in exposing the LSN that way.

Where I think committs is useful is a method for analyzing and resolving
conflicts between multiple systems. In that case you likely can't use
the LSN for anything as it won't be very meaningful. If you get
conflicts below the accuracy of the timestamps you better use another
deterministic method of resolving them - BDR e.g. compares the system
identifier, timeline id, database oid, and a user defined name. While
enforcing that those aren't the same between systems.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#56Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#55)
Re: tracking commit timestamps

On 11/5/14, 10:30 AM, Andres Freund wrote:

Except that commit time is not guaranteed unique *even on a single

system*. That's my whole point. If we're going to bother with all the
commit time machinery it seems really silly to provide a way to
uniquely order every commit.

Well. I think that's the misunderstanding here. That's absolutely not
what committs is supposed to be used for. For the replication stream
you'd hopefully use logical decoding. That gives you the transaction
data exactly in commit order.

So presumably you'd want to use logical decoding to insert into a table with a sequence on it, or similar?

I agree, that sounds like a better way to handle this. I think it's worth mentioning in the docs for commit_ts, because people WILL mistakenly try and use it to determine commit ordering.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#57Andres Freund
andres@2ndquadrant.com
In reply to: Jim Nasby (#56)
Re: tracking commit timestamps

On 2014-11-05 10:34:40 -0600, Jim Nasby wrote:

On 11/5/14, 10:30 AM, Andres Freund wrote:

Except that commit time is not guaranteed unique *even on a single

system*. That's my whole point. If we're going to bother with all the
commit time machinery it seems really silly to provide a way to
uniquely order every commit.

Well. I think that's the misunderstanding here. That's absolutely not
what committs is supposed to be used for. For the replication stream
you'd hopefully use logical decoding. That gives you the transaction
data exactly in commit order.

So presumably you'd want to use logical decoding to insert into a
table with a sequence on it, or similar?

I'm not following. I'd use logical decoding to replicate the data to
another system, thereby guaranteeing its done in commit order. Then,
when applying the data on the other side, I can detect/resolve some
forms of conflicts by looking at the timestamps of rows via committs.

I agree, that sounds like a better way to handle this. I think it's
worth mentioning in the docs for commit_ts, because people WILL
mistakenly try and use it to determine commit ordering.

Ok, sounds sensible.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#58Kevin Grittner
kgrittn@ymail.com
In reply to: Jim Nasby (#54)
Re: tracking commit timestamps

Jim Nasby <Jim.Nasby@BlueTreble.com> wrote:

for a single system AIUI all we need to do is expose the LSN of
each commit record and that will give you the exact and unique
order in which transactions committed.

This isn't a hypothetical feature either; if we had this,
logical replication systems wouldn't have to try and fake this
via batches. You could actually recreate exactly what data was
visible at what time to all transactions, not just repeatable
read ones (as long as you kept snapshot data as well, which isn't
hard).

Well, that not entirely true for serializable transactions; there
are points in time where reading the committed state could cause a
transaction to roll back[1]https://wiki.postgresql.org/wiki/SSI#Read_Only_Transactions -- either a writing transaction which
would make that visible state inconsistent with the later committed
state or the reading transaction if it views something which is not
(yet) consistent.

That's not to say that this feature is a bad idea; part of the
serializable implementation itself depends on being able to
accurately determine commit order, and this feature could allow
that to work more efficiently. I'm saying that, like hot standby,
a replicated database could not provide truly serializable
transactions (even read only ones) without something else in
addition to this. We've discussed various ways of doing that.
Perhaps the most promising is to include in the stream some
indication of which points in the transaction stream are safe for a
serializable transaction to read. If there's a way to implement
commit order recording such that a two-state flag could be
associated with each commit, I think it could be made to work for
serializable transactions.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1]: https://wiki.postgresql.org/wiki/SSI#Read_Only_Transactions

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#59Steve Singer
steve@ssinger.info
In reply to: Jim Nasby (#54)
Re: tracking commit timestamps

On 11/05/2014 11:23 AM, Jim Nasby wrote:

Except that commit time is not guaranteed unique *even on a single
system*. That's my whole point. If we're going to bother with all the
commit time machinery it seems really silly to provide a way to
uniquely order every commit.

Clearly trying to uniquely order commits across multiple systems is a
far larger problem, and I'm not suggesting we attempt that. But for a
single system AIUI all we need to do is expose the LSN of each commit
record and that will give you the exact and unique order in which
transactions committed.

This isn't a hypothetical feature either; if we had this, logical
replication systems wouldn't have to try and fake this via batches.
You could actually recreate exactly what data was visible at what time
to all transactions, not just repeatable read ones (as long as you
kept snapshot data as well, which isn't hard).

As for how much data to keep, if you have a process that's doing
something to record this information permanently all it needs to do is
keep an old enough snapshot around. That's not that hard to do, even
from user space.

+1 for this.

It isn't just 'replication' systems that have a need for getting the
commit order of transactions on a single system. I have a application
(not slony) where we want to query a table but order the output based on
the transaction commit order of when the insert into the table was done
(think of a queue). I'm not replicating the output but passing the data
to other applications for further processing. If I just had the commit
timestamp I would need to put in some other condition to break ties in a
consistent way. I think being able to get an ordering by commit LSN is
what I really want in this case not the timestamp.

Logical decoding is one solution to this (that I was considering) but
being able to do something like
select * FROM event_log order by commit_id would be a lot simpler.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#60Andres Freund
andres@2ndquadrant.com
In reply to: Steve Singer (#59)
Re: tracking commit timestamps

On 2014-11-05 17:17:05 -0500, Steve Singer wrote:

It isn't just 'replication' systems that have a need for getting the commit
order of transactions on a single system. I have a application (not slony)
where we want to query a table but order the output based on the transaction
commit order of when the insert into the table was done (think of a queue).
I'm not replicating the output but passing the data to other applications
for further processing. If I just had the commit timestamp I would need to
put in some other condition to break ties in a consistent way. I think
being able to get an ordering by commit LSN is what I really want in this
case not the timestamp.

Logical decoding is one solution to this (that I was considering) but being
able to do something like
select * FROM event_log order by commit_id would be a lot simpler.

Imo that's essentially a different feature. What you essentially would
need here is a 'commit sequence number' - but no timestamps. And
probably to be useful that number has to be 8 bytes in itself.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#61Steve Singer
steve@ssinger.info
In reply to: Andres Freund (#60)
Re: tracking commit timestamps

On 11/05/2014 05:43 PM, Andres Freund wrote:

On 2014-11-05 17:17:05 -0500, Steve Singer wrote:
Imo that's essentially a different feature. What you essentially would
need here is a 'commit sequence number' - but no timestamps. And
probably to be useful that number has to be 8 bytes in itself.

I think this gets to the heart of some of the differing views people
have expressed on this patch

Is this patch supposed to:

A) Add commit timestamp tracking but nothing more

B) Add infrastructure to store commit timestamps and provide a facility
for storing additional bits of data extensions might want to be
associated with the commit

C). Add commit timestamps and node identifiers to commits

If the answer is (A) then I can see why some people are objecting to
adding extradata. If the answer is (B) then it's fair to ask how well
does this patch handle various types of things people might want to
attach to the commit record (such as the LSN). I think the problem is
that once you start providing a facility extensions can use to store
data along with the commit record then being restricted to 4 or 8 bytes
is very limiting. It also doesn't allow you to load two extensions at
once on a system. You wouldn't be able to have both the
'steve_commit_order' extension and BDR installed at the same time. I
don't think this patch does a very good job at (B) but It wasn't
intended to.

If what we are really doing is C, and just calling the space 'extradata'
until we get the logical identifier stuff in and then we are going to
rename extradata to nodeid then we should say so. If we are really
concerned about the pg_upgrade impact of expanding the record later then
maybe we should add 4 bytes of padding to the CommitTimeStampEntry now
and but leave the manipulating the node id until later.

Steve

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#62Andres Freund
andres@2ndquadrant.com
In reply to: Steve Singer (#61)
Re: tracking commit timestamps

On 2014-11-05 19:31:52 -0500, Steve Singer wrote:

On 11/05/2014 05:43 PM, Andres Freund wrote:

On 2014-11-05 17:17:05 -0500, Steve Singer wrote:
Imo that's essentially a different feature. What you essentially would
need here is a 'commit sequence number' - but no timestamps. And
probably to be useful that number has to be 8 bytes in itself.

I think this gets to the heart of some of the differing views people have
expressed on this patch

I think it's actually besides the heart...

Is this patch supposed to:

A) Add commit timestamp tracking but nothing more

B) Add infrastructure to store commit timestamps and provide a facility for
storing additional bits of data extensions might want to be associated with
the commit

C). Add commit timestamps and node identifiers to commits

If the answer is (A) then I can see why some people are objecting to adding
extradata. If the answer is (B) then it's fair to ask how well does this
patch handle various types of things people might want to attach to the
commit record (such as the LSN).

I think there's a mistake exactly here. The LSN of the commit isn't just
some extra information about the commit. You can't say 'here, also
attach this piece of information'. Instead you need special case code in
xact.c to add it. Thus prohibiting that space to be used for something
else.

I think the problem is that once you
start providing a facility extensions can use to store data along with the
commit record then being restricted to 4 or 8 bytes is very limiting.

Well, you can easily use those 4/8 bytes to start adding more data to
the transaction. By referring to some table with transaction metadata
for example.

It also doesn't allow you to load two extensions at once on a system.
You wouldn't be able to have both the 'steve_commit_order' extension
and BDR installed at the same time. I don't think this patch does a
very good job at (B) but It wasn't intended to.

Well, I don't agree that steve_commit_order makes much sense in this
context. But there actually is a real problem here, namely that there's
no namespacing in those bytes. I'd be ok with saying that we split the
extradata in for bytes for the namespace and four for the actual
data. That's roughly how it works for advisory locks already.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#63Craig Ringer
craig@2ndquadrant.com
In reply to: Michael Paquier (#25)
Re: tracking commit timestamps

On 11/01/2014 09:00 PM, Michael Paquier wrote:

1) It is untested and actually there is no direct use for it in core.
2) Pushing code that we know as dead is no good, that's a feature more
or less defined as maybe-useful-but-we-are-not-sure-yet-what-to-do-with-it.
3) If you're going to re-use this API in BDR, which is a fork of
Postgres.

I would like to emphasise that BDR is not really a fork at all, no more
than any big topic branch would be a fork.

BDR is two major parts:

- A collection of patches to core (commit timestamps/extradata, sequence
AM, replication identifiers, logical decoding, DDL deparse, event
triggers, etc). These are being progressively submitted to core.
maintained as multiple feature branches plus a merged version; and

- An extension that uses core features and, where necessary, the
additions to core to implement bi-directional logical replication.

Because of the time scales involved in getting things into core it's
been necessary to *temporarily* get the 9.4-based feature branch into
wider use so that it can be used to run the BDR extension, but if we can
get the required features into core that need will go away.

Event triggers and logical decoding were already merged in 9.4.

If we can get things like commit timestamps, commit extradata / logical
replication identifiers, the sequence access method, etc merged in 9.5
then it should be possible to do away with the need for the patches to
core entirely and run BDR on top of stock 9.5. I'd be delighted if that
were possible, as doing away with the patched 9.4 would get rid of a
great deal of work and frustration on my part.

Note that the BDR extension its self is PostgreSQL-licensed. Externally
maintained extensions have been bought in-core before. It's a lot of
code though, and I can't imagine that being a quick process.

You'd better complete this API in BDR by yourself and not
bother core with that.

This argument would've prevented the inclusion of logical decoding,
which is rapidly becoming the headline feature for 9.4, or at least
shortly behind jsonb. Slony is being adapted to use it, multiple people
are working on auditing systems based on it, and AFAIK EDB's xDB is
being adapted to take advantage of it too.

As development gets more complex and people attempt bigger features, the
One Big Patch that adds a feature and an in-core user of the feature is
not going to be a viable approach all the time. In my view it's already
well past that, and some recent patches (like RLS) really should've been
split up into patch series.

If we want to avoid unreviewable monster-patches it will, IMO, be
necessary to have progressive, iterative enhancement. That may sometimes
mean that there's code in core that's only used by future
yet-to-be-merged patches and/or by extensions.

Of course its desirable to have an in-tree user of the code wherever
possible/practical - but sometimes it may *not* be possible or
practical. It seems to me like the benefits of committing work in
smaller, more reviewable chunks outweigh the benefits of merging
multiple related but separate changes just so everything can have an
immediate in-tree user.

That's not to say that extradata must remain glued to commit timestamps.
It might make more sense as a separate patch with an API to allow
extensions to manipulate it directly, plus a dummy extension showing how
it works, like we do with various hooks and with APIs like FDWs.
However, just like the various hooks that we have, it *does* make sense
to have something in-core that has no "real world" in-core users.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#64Petr Jelinek
petr@2ndquadrant.com
In reply to: Andres Freund (#62)
Re: tracking commit timestamps

On 06/11/14 08:50, Andres Freund wrote:

On 2014-11-05 19:31:52 -0500, Steve Singer wrote:

It also doesn't allow you to load two extensions at once on a system.
You wouldn't be able to have both the 'steve_commit_order' extension
and BDR installed at the same time. I don't think this patch does a
very good job at (B) but It wasn't intended to.

Well, I don't agree that steve_commit_order makes much sense in this
context. But there actually is a real problem here, namely that there's
no namespacing in those bytes. I'd be ok with saying that we split the
extradata in for bytes for the namespace and four for the actual
data. That's roughly how it works for advisory locks already.

I am not sure how would this solve problem of 2 extensions using the
extradata given that there can be only 1 record per txid anyway.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#65Petr Jelinek
petr@2ndquadrant.com
In reply to: Steve Singer (#61)
Re: tracking commit timestamps

On 06/11/14 01:31, Steve Singer wrote:

On 11/05/2014 05:43 PM, Andres Freund wrote:

Is this patch supposed to:

A) Add commit timestamp tracking but nothing more

B) Add infrastructure to store commit timestamps and provide a facility
for storing additional bits of data extensions might want to be
associated with the commit

C). Add commit timestamps and node identifiers to commits

If the answer is (A) then I can see why some people are objecting to
adding extradata. If the answer is (B) then it's fair to ask how well
does this patch handle various types of things people might want to
attach to the commit record (such as the LSN). I think the problem is
that once you start providing a facility extensions can use to store
data along with the commit record then being restricted to 4 or 8 bytes
is very limiting. It also doesn't allow you to load two extensions at
once on a system. You wouldn't be able to have both the
'steve_commit_order' extension and BDR installed at the same time. I
don't think this patch does a very good job at (B) but It wasn't
intended to.

I would love to have (B) but I don't think that's realistic, at least
not in the extent some people on this thread would like. I mean you can
already do (B) by using table, it just isn't that great when it comes to
performance of that solution.

This patch is aimed to do limited version of (B) where you don't have
dynamic record for storing whatever you might desire but on the upside
the performance is good. And yes so far this look more like we are
actually doing (C) since main purpose of the patch is enabling conflict
detection and resolving of those conflicts, which is useful in many
replication scenarios that are not limited to the classical multi-master
setup.

If what we are really doing is C, and just calling the space 'extradata'
until we get the logical identifier stuff in and then we are going to
rename extradata to nodeid then we should say so. If we are really
concerned about the pg_upgrade impact of expanding the record later then
maybe we should add 4 bytes of padding to the CommitTimeStampEntry now
and but leave the manipulating the node id until later.

This might not be bad idea. I don't see the extradata being useful for
multiple extensions at the same time given that there is single record
per txid unless we would enforce some kind of limitation that extension
can only set the extradata for txids produced by that extension.

The namespacing idea that Andres has would again work fine for various
replication solutions as it would make it easier for them to coexist but
it would still not work for your 'steve_commit_order' (which I also
think should be done differently anyway).

In general I do see this patch to be similar in purpose to what we did
with replica triggers or logical decoding, those features also didn't
really have in-core use, were optional and enabled us to take step
forward with replication and had some side uses besides replication just
like commit timestamps do.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#66Andres Freund
andres@2ndquadrant.com
In reply to: Petr Jelinek (#64)
Re: tracking commit timestamps

On 2014-11-07 17:54:32 +0100, Petr Jelinek wrote:

On 06/11/14 08:50, Andres Freund wrote:

On 2014-11-05 19:31:52 -0500, Steve Singer wrote:

It also doesn't allow you to load two extensions at once on a system.
You wouldn't be able to have both the 'steve_commit_order' extension
and BDR installed at the same time. I don't think this patch does a
very good job at (B) but It wasn't intended to.

Well, I don't agree that steve_commit_order makes much sense in this
context. But there actually is a real problem here, namely that there's
no namespacing in those bytes. I'd be ok with saying that we split the
extradata in for bytes for the namespace and four for the actual
data. That's roughly how it works for advisory locks already.

I am not sure how would this solve problem of 2 extensions using the
extradata given that there can be only 1 record per txid anyway.

It'd help you to detect problems.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#67Robert Haas
robertmhaas@gmail.com
In reply to: Steve Singer (#61)
Re: tracking commit timestamps

On Nov 5, 2014, at 7:31 PM, Steve Singer <steve@ssinger.info> wrote:

On 11/05/2014 05:43 PM, Andres Freund wrote:
On 2014-11-05 17:17:05 -0500, Steve Singer wrote:
Imo that's essentially a different feature. What you essentially would
need here is a 'commit sequence number' - but no timestamps. And
probably to be useful that number has to be 8 bytes in itself.

I think this gets to the heart of some of the differing views people have expressed on this patch

Is this patch supposed to:

A) Add commit timestamp tracking but nothing more

B) Add infrastructure to store commit timestamps and provide a facility for storing additional bits of data extensions might want to be associated with the commit

C). Add commit timestamps and node identifiers to commits

Well put.

I think the authors of this patch are suffering from a certain amount of myopia. Commit timestamps are useful, but so are commit LSNs, and it makes little sense to me to suppose that we should have two different systems for those closely-related needs.

Like Andres, I think B is impractical, so let's just be honest and admit that C is what we're really doing. But let's add LSNs so the people who want that can be happy too.

...Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#68Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#46)
Re: tracking commit timestamps

On Nov 4, 2014, at 4:20 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

On 11/3/14 5:17 PM, Petr Jelinek wrote:

Please don't name anything "committs". That looks like a misspelling of
something.

There is nothing wrong with

pg_get_transaction_commit_timestamp()

If you want to reduce the length, lose the "get".

I am fine with that, I only wonder if your definition of "anything" only
concerns the SQL interfaces or also the internals.

I'd be fine with commit_ts for internals, but not committs.

I agree that committs is poor. But I'd argue for spelling out commit_timestamp everywhere. It is more clear and easier to grep.

...Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#69Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#67)
Re: tracking commit timestamps

On 08/11/14 00:35, Robert Haas wrote:

On Nov 5, 2014, at 7:31 PM, Steve Singer <steve@ssinger.info> wrote:

On 11/05/2014 05:43 PM, Andres Freund wrote:
On 2014-11-05 17:17:05 -0500, Steve Singer wrote:
Imo that's essentially a different feature. What you essentially would
need here is a 'commit sequence number' - but no timestamps. And
probably to be useful that number has to be 8 bytes in itself.

I think this gets to the heart of some of the differing views people have expressed on this patch

Is this patch supposed to:

A) Add commit timestamp tracking but nothing more

B) Add infrastructure to store commit timestamps and provide a facility for storing additional bits of data extensions might want to be associated with the commit

C). Add commit timestamps and node identifiers to commits

Well put.

I think the authors of this patch are suffering from a certain amount of myopia. Commit timestamps are useful, but so are commit LSNs, and it makes little sense to me to suppose that we should have two different systems for those closely-related needs.

Like Andres, I think B is impractical, so let's just be honest and admit that C is what we're really doing. But let's add LSNs so the people who want that can be happy too.

The list of what is useful might be long, but we can't have everything
there as there are space constraints, and LSN is another 8 bytes and I
still want to have some bytes for storing the "origin" or whatever you
want to call it there, as that's the one I personally have biggest
use-case for.
So this would be ~24bytes per txid already, hmm I wonder if we can pull
some tricks to lower that a bit.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#70Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#69)
Re: tracking commit timestamps

On Fri, Nov 7, 2014 at 7:07 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

The list of what is useful might be long,

That's FUD. It might also be short.

but we can't have everything there
as there are space constraints, and LSN is another 8 bytes and I still want
to have some bytes for storing the "origin" or whatever you want to call it
there, as that's the one I personally have biggest use-case for.
So this would be ~24bytes per txid already, hmm I wonder if we can pull some
tricks to lower that a bit.

It won't do to say "let's do the things that I want, and foreclose
forever the things that other people want". I find it quite hard to
believe that 16 bytes per transaction is a perfectly tolerable
overhead but 24 bytes per transaction will break the bank. But if
that is really true then we ought to reject this patch altogether,
because it's unacceptable, in any arena, for a patch that only
benefits extensions to consume all of the available bit-space in,
leaving none for future core needs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#71Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#70)
Re: tracking commit timestamps

On 08/11/14 03:05, Robert Haas wrote:

On Fri, Nov 7, 2014 at 7:07 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

but we can't have everything there
as there are space constraints, and LSN is another 8 bytes and I still want
to have some bytes for storing the "origin" or whatever you want to call it
there, as that's the one I personally have biggest use-case for.
So this would be ~24bytes per txid already, hmm I wonder if we can pull some
tricks to lower that a bit.

It won't do to say "let's do the things that I want, and foreclose
forever the things that other people want". I find it quite hard to
believe that 16 bytes per transaction is a perfectly tolerable
overhead but 24 bytes per transaction will break the bank. But if
that is really true then we ought to reject this patch altogether,
because it's unacceptable, in any arena, for a patch that only
benefits extensions to consume all of the available bit-space in,
leaving none for future core needs.

That's not what I said. I am actually ok with adding the LSN if people
see it useful.
I was just wondering if we can make the record smaller somehow - 24bytes
per txid is around 96GB of data for whole txid range and won't work with
pages smaller than ~4kBs unless we add 6 char support to SLRU (which is
not hard and we could also not allow track_commit_timestamps to be
turned on with smaller pagesize...).

I remember somebody was worried about this already during the original
patch submission and it can't be completely ignored in the discussion
about adding more stuff into the record.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#72Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#71)
Re: tracking commit timestamps

On Sat, Nov 8, 2014 at 5:35 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:

That's not what I said. I am actually ok with adding the LSN if people see
it useful.
I was just wondering if we can make the record smaller somehow - 24bytes per
txid is around 96GB of data for whole txid range and won't work with pages
smaller than ~4kBs unless we add 6 char support to SLRU (which is not hard
and we could also not allow track_commit_timestamps to be turned on with
smaller pagesize...).

I remember somebody was worried about this already during the original patch
submission and it can't be completely ignored in the discussion about adding
more stuff into the record.

Fair point. Sorry I misunderstood.

I think the key question here is the time for which the data needs to
be retained. 2^32 of anything is a lot, but why keep around that
number of records rather than more (after all, we have epochs to
distinguish one use of a given txid from another) or fewer? Obvious
alternatives include:

- Keep the data for some period of time; discard the data when it's
older than some threshold.
- Keep a certain amount of total data; every time we create a new
file, discard the oldest one.
- Let consumers of the data say how much they need, and throw away
data when it's no longer needed by the oldest consumer.
- Some combination of the above.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#73Steve Singer
steve@ssinger.info
In reply to: Petr Jelinek (#69)
Re: tracking commit timestamps

On 11/07/2014 07:07 PM, Petr Jelinek wrote:

The list of what is useful might be long, but we can't have everything
there as there are space constraints, and LSN is another 8 bytes and I
still want to have some bytes for storing the "origin" or whatever you
want to call it there, as that's the one I personally have biggest
use-case for.
So this would be ~24bytes per txid already, hmm I wonder if we can
pull some tricks to lower that a bit.

The reason why Jim and myself are asking for the LSN and not just the
timestamp is that I want to be able to order the transactions. Jim
pointed out earlier in the thread that just ordering on timestamp allows
for multiple transactions with the same timestamp.

Maybe we don't need the entire LSN to solve that. If you already have
the commit timestamp maybe you only need another byte or two of
granularity to order transactions that are within the same microsecond.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#74Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#72)
Re: tracking commit timestamps

Robert Haas wrote:

I think the key question here is the time for which the data needs to
be retained. 2^32 of anything is a lot, but why keep around that
number of records rather than more (after all, we have epochs to
distinguish one use of a given txid from another) or fewer?

The problem is not how much data we retain; is about how much data we
can address. We must be able to address the data for transaction with
xid=2^32-1, even if you only retain the 1000 most recent transactions.
In fact, we already only retain data back to RecentXmin, if I recall
correctly. All slru.c users work that way.

Back when pg_multixact/members had the 5-char issue, I came up with a
patch that had each slru.c user declare how many chars maximum were the
filenames. I didn't push further with that because there was an issue
with it, I don't remember what it was offhand (but I don't think I
posted it). But this is only needed so that the filenames are all equal
width, which is mostly cosmetical; the rest of the module works fine
with 4- or 5-char filenames, and can be trivially expanded to support 6
or more.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#75Anssi Kääriäinen
anssi.kaariainen@thl.fi
In reply to: Steve Singer (#73)
Re: tracking commit timestamps

On Sun, 2014-11-09 at 11:57 -0500, Steve Singer wrote:

The reason why Jim and myself are asking for the LSN and not just the
timestamp is that I want to be able to order the transactions. Jim
pointed out earlier in the thread that just ordering on timestamp allows
for multiple transactions with the same timestamp.

Maybe we don't need the entire LSN to solve that. If you already have
the commit timestamp maybe you only need another byte or two of
granularity to order transactions that are within the same microsecond.

There is no guarantee that a commit with later LSN has a later
timestamp. There are cases where the clock could move significantly
backwards.

A robust solution to storing transaction commit information (including
commit order) in a way that can be referenced from other tables, can be
loaded to another cluster, and survives crashes would be a great
feature. But this feature doesn't have those properties.

- Anssi

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#76Petr Jelinek
petr@2ndquadrant.com
In reply to: Anssi Kääriäinen (#75)
Re: tracking commit timestamps

On 10/11/14 08:01, Anssi Kääriäinen wrote:

On Sun, 2014-11-09 at 11:57 -0500, Steve Singer wrote:

The reason why Jim and myself are asking for the LSN and not just the
timestamp is that I want to be able to order the transactions. Jim
pointed out earlier in the thread that just ordering on timestamp allows
for multiple transactions with the same timestamp.

Maybe we don't need the entire LSN to solve that. If you already have
the commit timestamp maybe you only need another byte or two of
granularity to order transactions that are within the same microsecond.

There is no guarantee that a commit with later LSN has a later
timestamp. There are cases where the clock could move significantly
backwards.

A robust solution to storing transaction commit information (including
commit order) in a way that can be referenced from other tables, can be
loaded to another cluster, and survives crashes would be a great
feature. But this feature doesn't have those properties.

It has the property of surviving crashes.

Not sure what you mean by referencing from other tables?

And about loading to another cluster, the txid does not really have any
meaning on another cluster, so the info about it does not have either?

But anyway this patch is targeting extensions not DBAs - you could write
extension that will provide that feature on top of this patch (although
given what I wrote above I don't see how it's useful).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#77Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#74)
Re: tracking commit timestamps

On Sun, Nov 9, 2014 at 8:41 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Robert Haas wrote:

I think the key question here is the time for which the data needs to
be retained. 2^32 of anything is a lot, but why keep around that
number of records rather than more (after all, we have epochs to
distinguish one use of a given txid from another) or fewer?

The problem is not how much data we retain; is about how much data we
can address.

I thought I was responding to a concern about disk space utilization.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#78Robert Haas
robertmhaas@gmail.com
In reply to: Anssi Kääriäinen (#75)
Re: tracking commit timestamps

On Mon, Nov 10, 2014 at 2:01 AM, Anssi Kääriäinen
<anssi.kaariainen@thl.fi> wrote:

There is no guarantee that a commit with later LSN has a later
timestamp. There are cases where the clock could move significantly
backwards.

Good point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#79Petr Jelinek
petr@2ndquadrant.com
In reply to: Steve Singer (#73)
Re: tracking commit timestamps

On 09/11/14 17:57, Steve Singer wrote:

On 11/07/2014 07:07 PM, Petr Jelinek wrote:

The list of what is useful might be long, but we can't have everything
there as there are space constraints, and LSN is another 8 bytes and I
still want to have some bytes for storing the "origin" or whatever you
want to call it there, as that's the one I personally have biggest
use-case for.
So this would be ~24bytes per txid already, hmm I wonder if we can
pull some tricks to lower that a bit.

The reason why Jim and myself are asking for the LSN and not just the
timestamp is that I want to be able to order the transactions. Jim
pointed out earlier in the thread that just ordering on timestamp allows
for multiple transactions with the same timestamp.

Maybe we don't need the entire LSN to solve that. If you already have
the commit timestamp maybe you only need another byte or two of
granularity to order transactions that are within the same microsecond.

Hmm maybe just one part of LSN, but I don't really like that either, if
we want to store LSN we should probably store it as is as somebody might
want to map it to txid for other reasons.

I did the calculation above wrong btw, it's actually 20 bytes not 24
bytes per record, I am inclined to just say we can live with that.

Since we agreed that the (B) case is not really feasible and we are
doing the (C), I also wonder if extradata should be renamed to nodeid
(even if it's not used at this point as nodeid). And then there is
question about the size of it, since the nodeid itself can live with 2
bytes probably ("64k of nodes ought to be enough for everybody" ;) ).
Or leave the extradata as is but use as reserved space for future use
and not expose it at this time on SQL level at all?

I guess Andres could answer what suits him better here.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#80Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#77)
Re: tracking commit timestamps

Robert Haas wrote:

On Sun, Nov 9, 2014 at 8:41 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Robert Haas wrote:

I think the key question here is the time for which the data needs to
be retained. 2^32 of anything is a lot, but why keep around that
number of records rather than more (after all, we have epochs to
distinguish one use of a given txid from another) or fewer?

The problem is not how much data we retain; is about how much data we
can address.

I thought I was responding to a concern about disk space utilization.

Ah, right. So AFAIK we don't need to keep anything older than
RecentXmin or something like that -- which is not too old. If I recall
correctly Josh Berkus was saying in a thread about pg_multixact that it
used about 128kB or so in <= 9.2 for his customers; that one was also
limited to RecentXmin AFAIR. I think a similar volume of commit_ts data
would be pretty acceptable. Moreso considering that it's turned off by
default.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#81Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#79)
Re: tracking commit timestamps

On Mon, Nov 10, 2014 at 8:39 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:

I did the calculation above wrong btw, it's actually 20 bytes not 24 bytes
per record, I am inclined to just say we can live with that.

If you do it as 20 bytes, you'll have to do some work to squeeze out
the alignment padding. I'm inclined to think it's fine to have a few
extra padding bytes here; someone might want to use those for
something in the future, and they probably don't cost much.

Since we agreed that the (B) case is not really feasible and we are doing
the (C), I also wonder if extradata should be renamed to nodeid (even if
it's not used at this point as nodeid). And then there is question about the
size of it, since the nodeid itself can live with 2 bytes probably ("64k of
nodes ought to be enough for everybody" ;) ).
Or leave the extradata as is but use as reserved space for future use and
not expose it at this time on SQL level at all?

I vote for calling it node-ID, and for allowing at least 4 bytes for
it. Penny wise, pound foolish.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#82Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#80)
Re: tracking commit timestamps

On Mon, Nov 10, 2014 at 8:40 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Ah, right. So AFAIK we don't need to keep anything older than
RecentXmin or something like that -- which is not too old. If I recall
correctly Josh Berkus was saying in a thread about pg_multixact that it
used about 128kB or so in <= 9.2 for his customers; that one was also
limited to RecentXmin AFAIR. I think a similar volume of commit_ts data
would be pretty acceptable. Moreso considering that it's turned off by
default.

I'm not sure whether keeping it just back to RecentXmin will be enough
for everybody's needs. But we certainly don't need to keep the last
2^32 records as someone-or-other was suggesting.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#83Steve Singer
steve@ssinger.info
In reply to: Petr Jelinek (#79)
Re: tracking commit timestamps

On 11/10/2014 08:39 AM, Petr Jelinek wrote:

On 09/11/14 17:57, Steve Singer wrote:

On 11/07/2014 07:07 PM, Petr Jelinek wrote:

The list of what is useful might be long, but we can't have everything
there as there are space constraints, and LSN is another 8 bytes and I
still want to have some bytes for storing the "origin" or whatever you
want to call it there, as that's the one I personally have biggest
use-case for.
So this would be ~24bytes per txid already, hmm I wonder if we can
pull some tricks to lower that a bit.

The reason why Jim and myself are asking for the LSN and not just the
timestamp is that I want to be able to order the transactions. Jim
pointed out earlier in the thread that just ordering on timestamp allows
for multiple transactions with the same timestamp.

Maybe we don't need the entire LSN to solve that. If you already have
the commit timestamp maybe you only need another byte or two of
granularity to order transactions that are within the same microsecond.

Hmm maybe just one part of LSN, but I don't really like that either,
if we want to store LSN we should probably store it as is as somebody
might want to map it to txid for other reasons.

I did the calculation above wrong btw, it's actually 20 bytes not 24
bytes per record, I am inclined to just say we can live with that.

Since we agreed that the (B) case is not really feasible and we are
doing the (C), I also wonder if extradata should be renamed to nodeid
(even if it's not used at this point as nodeid). And then there is
question about the size of it, since the nodeid itself can live with 2
bytes probably ("64k of nodes ought to be enough for everybody" ;) ).
Or leave the extradata as is but use as reserved space for future use
and not expose it at this time on SQL level at all?

I guess Andres could answer what suits him better here.

I am happy with renaming extradata to nodeid and not exposing it at this
time.

If we feel that commit-order (ie LSN or something equivalent) is really
a different patch/feature than commit-timestamp then I am okay with that
also but we should make sure to warn users of the commit-timestamp in
the documentation that two transactions might have the same timestamp
and that the commit order might not be the same as ordering by the
commit timestamp.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#84Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#38)
Re: tracking commit timestamps

On 4 November 2014 08:23, Andres Freund <andres@2ndquadrant.com> wrote:

6) Shouldn't any value update of track_commit_timestamp be tracked in
XLogReportParameters? That's thinking about making the commit timestamp
available on standbys as well..

Yes, it should.

Agree committs should be able to run on standby, but it seems possible
to do that without it running on the master. The two should be
unconnected.

Not sure why we'd want to have parameter changes on master reported?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#85Andres Freund
andres@2ndquadrant.com
In reply to: Simon Riggs (#84)
Re: tracking commit timestamps

On 2014-11-11 16:10:47 +0000, Simon Riggs wrote:

On 4 November 2014 08:23, Andres Freund <andres@2ndquadrant.com> wrote:

6) Shouldn't any value update of track_commit_timestamp be tracked in
XLogReportParameters? That's thinking about making the commit timestamp
available on standbys as well..

Yes, it should.

Agree committs should be able to run on standby, but it seems possible
to do that without it running on the master.

I don't think that's realistic. It requires WAL to be written in some
cases, so that's not going to work. I also don't think it's a
particularly interesting ability?

The two should be unconnected.

Why?

Not sure why we'd want to have parameter changes on master reported?

So it works correctly. We're currently truncating the slru on startup
when the guc is disabled which would cause problems WAL records coming
in from the primary. I think the code also needs some TLC to correctly
work after a failover.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#86Simon Riggs
simon@2ndQuadrant.com
In reply to: Steve Singer (#73)
Re: tracking commit timestamps

On 9 November 2014 16:57, Steve Singer <steve@ssinger.info> wrote:

On 11/07/2014 07:07 PM, Petr Jelinek wrote:

The list of what is useful might be long, but we can't have everything
there as there are space constraints, and LSN is another 8 bytes and I still
want to have some bytes for storing the "origin" or whatever you want to
call it there, as that's the one I personally have biggest use-case for.
So this would be ~24bytes per txid already, hmm I wonder if we can pull
some tricks to lower that a bit.

The reason why Jim and myself are asking for the LSN and not just the
timestamp is that I want to be able to order the transactions. Jim pointed
out earlier in the thread that just ordering on timestamp allows for
multiple transactions with the same timestamp.

Maybe we don't need the entire LSN to solve that. If you already have the
commit timestamp maybe you only need another byte or two of granularity to
order transactions that are within the same microsecond.

It looks like there are quite a few potential uses for this.

If we include everything it will be too fat to use for any of the
potential uses, since each will be pulled down by the others.

Sounds like it needs to be configurable in some way.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#87Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#85)
Re: tracking commit timestamps

On 11 November 2014 16:19, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-11-11 16:10:47 +0000, Simon Riggs wrote:

On 4 November 2014 08:23, Andres Freund <andres@2ndquadrant.com> wrote:

6) Shouldn't any value update of track_commit_timestamp be tracked in
XLogReportParameters? That's thinking about making the commit timestamp
available on standbys as well..

Yes, it should.

Agree committs should be able to run on standby, but it seems possible
to do that without it running on the master.

I don't think that's realistic. It requires WAL to be written in some
cases, so that's not going to work. I also don't think it's a
particularly interesting ability?

OK, so we are saying commit timestamp will NOT be available on Standbys.

I'm fine with that, since data changes aren't generated there.

So it works correctly. We're currently truncating the slru on startup
when the guc is disabled which would cause problems WAL records coming
in from the primary. I think the code also needs some TLC to correctly
work after a failover.

OK

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#88Andres Freund
andres@2ndquadrant.com
In reply to: Simon Riggs (#87)
Re: tracking commit timestamps

On 2014-11-11 17:09:54 +0000, Simon Riggs wrote:

On 11 November 2014 16:19, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-11-11 16:10:47 +0000, Simon Riggs wrote:

On 4 November 2014 08:23, Andres Freund <andres@2ndquadrant.com> wrote:

6) Shouldn't any value update of track_commit_timestamp be tracked in
XLogReportParameters? That's thinking about making the commit timestamp
available on standbys as well..

Yes, it should.

Agree committs should be able to run on standby, but it seems possible
to do that without it running on the master.

I don't think that's realistic. It requires WAL to be written in some
cases, so that's not going to work. I also don't think it's a
particularly interesting ability?

OK, so we are saying commit timestamp will NOT be available on Standbys.

Hm? They should be available - xact.c WAL replay will redo the setting
of the timestamps and explicitly overwritten timestamps will generate
their own WAL records. What I mean is just that you can't use commit
timestamps without also using it on the primary.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#89Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Alvaro Herrera (#80)
Re: tracking commit timestamps

On 11/10/14, 7:40 AM, Alvaro Herrera wrote:

Robert Haas wrote:

On Sun, Nov 9, 2014 at 8:41 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Robert Haas wrote:

I think the key question here is the time for which the data needs to
be retained. 2^32 of anything is a lot, but why keep around that
number of records rather than more (after all, we have epochs to
distinguish one use of a given txid from another) or fewer?

The problem is not how much data we retain; is about how much data we
can address.

I thought I was responding to a concern about disk space utilization.

Ah, right. So AFAIK we don't need to keep anything older than
RecentXmin or something like that -- which is not too old. If I recall
correctly Josh Berkus was saying in a thread about pg_multixact that it
used about 128kB or so in <= 9.2 for his customers; that one was also
limited to RecentXmin AFAIR. I think a similar volume of commit_ts data
would be pretty acceptable. Moreso considering that it's turned off by
default.

FWIW, AFAICS MultiXacts are only truncated after a (auto)vacuum process is able to advance datminmxid, which will (now) only happen when an entire relation has been scanned (which should be infrequent).

I believe the low normal space usage is just an indication that most databases don't use many MultiXacts.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#90Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jim Nasby (#89)
Re: tracking commit timestamps

Jim Nasby wrote:

On 11/10/14, 7:40 AM, Alvaro Herrera wrote:

Ah, right. So AFAIK we don't need to keep anything older than
RecentXmin or something like that -- which is not too old. If I recall
correctly Josh Berkus was saying in a thread about pg_multixact that it
used about 128kB or so in <= 9.2 for his customers; that one was also
limited to RecentXmin AFAIR. I think a similar volume of commit_ts data
would be pretty acceptable. Moreso considering that it's turned off by
default.

FWIW, AFAICS MultiXacts are only truncated after a (auto)vacuum process is able to advance datminmxid, which will (now) only happen when an entire relation has been scanned (which should be infrequent).

I believe the low normal space usage is just an indication that most databases don't use many MultiXacts.

That's in 9.3. Prior to that, they were truncated much more often.
Maybe you've not heard enough about this commit:

commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed Jan 23 12:04:59 2013 -0300

Improve concurrency of foreign key locking

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#91Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Alvaro Herrera (#90)
Re: tracking commit timestamps

On 11/11/14, 2:03 PM, Alvaro Herrera wrote:

Jim Nasby wrote:

On 11/10/14, 7:40 AM, Alvaro Herrera wrote:

Ah, right. So AFAIK we don't need to keep anything older than
RecentXmin or something like that -- which is not too old. If I recall
correctly Josh Berkus was saying in a thread about pg_multixact that it
used about 128kB or so in <= 9.2 for his customers; that one was also
limited to RecentXmin AFAIR. I think a similar volume of commit_ts data
would be pretty acceptable. Moreso considering that it's turned off by
default.

FWIW, AFAICS MultiXacts are only truncated after a (auto)vacuum process is able to advance datminmxid, which will (now) only happen when an entire relation has been scanned (which should be infrequent).

I believe the low normal space usage is just an indication that most databases don't use many MultiXacts.

That's in 9.3. Prior to that, they were truncated much more often.

Well, we're talking about a new feature, so I wasn't looking in back branches. ;P

Maybe you've not heard enough about this commit:

commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182

Interestingly, git.postgresql.org hasn't either: http://git.postgresql.org/gitweb/?p=postgresql.git&amp;a=search&amp;h=HEAD&amp;st=commit&amp;s=0ac5ad5134f2769ccbaefec73844f8504c4d6182

The commit is certainly there though...
decibel@decina:[15:12]~/pgsql/HEAD/src/backend (master=)$git log 0ac5ad5134f2769ccbaefec73844f8504c4d6182|head -n1
commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#92Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jim Nasby (#91)
git.postgresql.org not finding a commit

Details below, but http://git.postgresql.org/gitweb/?p=postgresql.git&amp;a=search&amp;h=HEAD&amp;st=commit&amp;s=0ac5ad5134f2769ccbaefec73844f8504c4d6182 shows nothing, but that commit does exist:

decibel@decina:[15:12]~/pgsql/HEAD/src/backend (master=)$git log 0ac5ad5134f2769ccbaefec73844f8504c4d6182|head -n1
commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182

Our github mirror doesn't show that commit in it's search either :(

-------- Original Message --------
Subject: Re: [HACKERS] tracking commit timestamps
Date: Tue, 11 Nov 2014 15:18:17 -0600
From: Jim Nasby <Jim.Nasby@BlueTreble.com>
To: Alvaro Herrera <alvherre@2ndquadrant.com>
CC: Robert Haas <robertmhaas@gmail.com>, Petr Jelinek <petr@2ndquadrant.com>, Steve Singer <steve@ssinger.info>, Andres Freund <andres@2ndquadrant.com>, Michael Paquier <michael.paquier@gmail.com>, Anssi K��ri�inen <anssi.kaariainen@thl.fi>, Simon Riggs <simon@2ndquadrant.com>, Heikki Linnakangas <hlinnakangas@vmware.com>, "Pg Hackers" <pgsql-hackers@postgresql.org>, Jaime Casanova <jaime@2ndquadrant.com>

On 11/11/14, 2:03 PM, Alvaro Herrera wrote:

Jim Nasby wrote:

On 11/10/14, 7:40 AM, Alvaro Herrera wrote:

Ah, right. So AFAIK we don't need to keep anything older than
RecentXmin or something like that -- which is not too old. If I recall
correctly Josh Berkus was saying in a thread about pg_multixact that it
used about 128kB or so in <= 9.2 for his customers; that one was also
limited to RecentXmin AFAIR. I think a similar volume of commit_ts data
would be pretty acceptable. Moreso considering that it's turned off by
default.

FWIW, AFAICS MultiXacts are only truncated after a (auto)vacuum process is able to advance datminmxid, which will (now) only happen when an entire relation has been scanned (which should be infrequent).

I believe the low normal space usage is just an indication that most databases don't use many MultiXacts.

That's in 9.3. Prior to that, they were truncated much more often.

Well, we're talking about a new feature, so I wasn't looking in back branches. ;P

Maybe you've not heard enough about this commit:

commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182

Interestingly, git.postgresql.org hasn't either: http://git.postgresql.org/gitweb/?p=postgresql.git&amp;a=search&amp;h=HEAD&amp;st=commit&amp;s=0ac5ad5134f2769ccbaefec73844f8504c4d6182

The commit is certainly there though...
decibel@decina:[15:12]~/pgsql/HEAD/src/backend (master=)$git log 0ac5ad5134f2769ccbaefec73844f8504c4d6182|head -n1
commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Sent via pgsql-www mailing list (pgsql-www@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-www

#93Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jim Nasby (#92)
Re: git.postgresql.org not finding a commit

Jim Nasby wrote:

Details below, but http://git.postgresql.org/gitweb/?p=postgresql.git&amp;a=search&amp;h=HEAD&amp;st=commit&amp;s=0ac5ad5134f2769ccbaefec73844f8504c4d6182 shows nothing, but that commit does exist:

decibel@decina:[15:12]~/pgsql/HEAD/src/backend (master=)$git log 0ac5ad5134f2769ccbaefec73844f8504c4d6182|head -n1
commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182

Our github mirror doesn't show that commit in it's search either :(

No idea what "search" does, but it doesn't work on a commit ID. This
works:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=0ac5ad5134f2769ccbaefec73844f8504c4d6182

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-www mailing list (pgsql-www@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-www

#94Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jim Nasby (#91)
Re: tracking commit timestamps

Jim Nasby wrote:

On 11/11/14, 2:03 PM, Alvaro Herrera wrote:

Jim Nasby wrote:

On 11/10/14, 7:40 AM, Alvaro Herrera wrote:

Ah, right. So AFAIK we don't need to keep anything older than
RecentXmin or something like that -- which is not too old. If I recall
correctly Josh Berkus was saying in a thread about pg_multixact that it
used about 128kB or so in <= 9.2 for his customers; that one was also
limited to RecentXmin AFAIR. I think a similar volume of commit_ts data
would be pretty acceptable. Moreso considering that it's turned off by
default.

FWIW, AFAICS MultiXacts are only truncated after a (auto)vacuum process is able to advance datminmxid, which will (now) only happen when an entire relation has been scanned (which should be infrequent).

I believe the low normal space usage is just an indication that most databases don't use many MultiXacts.

That's in 9.3. Prior to that, they were truncated much more often.

Well, we're talking about a new feature, so I wasn't looking in back branches. ;P

Well, I did mention <= 9.2 in the text above ...

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#95Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Alvaro Herrera (#93)
Re: git.postgresql.org not finding a commit

On 11/11/14, 3:55 PM, Alvaro Herrera wrote:

Jim Nasby wrote:

Details below, but http://git.postgresql.org/gitweb/?p=postgresql.git&amp;a=search&amp;h=HEAD&amp;st=commit&amp;s=0ac5ad5134f2769ccbaefec73844f8504c4d6182 shows nothing, but that commit does exist:

decibel@decina:[15:12]~/pgsql/HEAD/src/backend (master=)$git log 0ac5ad5134f2769ccbaefec73844f8504c4d6182|head -n1
commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182

Our github mirror doesn't show that commit in it's search either :(

No idea what "search" does, but it doesn't work on a commit ID. This
works:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=0ac5ad5134f2769ccbaefec73844f8504c4d6182

Well, this is rather confusing, because the drop-down by the search box on [1] has a selection for "commit". You'd think that would allow you to search for a specific commit.

Turns out, the help [2] states that the "commit" context for search searches commit author and messages. So I guess it's as expected, albeit confusing. :(

Anyone know how hard it would be to allow a commit "search" to also look for a specific commit hash?

1: http://git.postgresql.org/gitweb/?p=postgresql.git;a=summary
2: http://git.postgresql.org/gitweb/?p=postgresql.git;a=search_help
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-www mailing list (pgsql-www@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-www

#96Magnus Hagander
magnus@hagander.net
In reply to: Jim Nasby (#95)
Re: git.postgresql.org not finding a commit

On Wed, Nov 12, 2014 at 12:08 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

On 11/11/14, 3:55 PM, Alvaro Herrera wrote:

Jim Nasby wrote:

Details below, but
http://git.postgresql.org/gitweb/?p=postgresql.git&amp;a=search&amp;h=HEAD&amp;st=commit&amp;s=0ac5ad5134f2769ccbaefec73844f8504c4d6182
shows nothing, but that commit does exist:

decibel@decina:[15:12]~/pgsql/HEAD/src/backend (master=)$git log
0ac5ad5134f2769ccbaefec73844f8504c4d6182|head -n1
commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182

Our github mirror doesn't show that commit in it's search either :(

No idea what "search" does, but it doesn't work on a commit ID. This
works:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=0ac5ad5134f2769ccbaefec73844f8504c4d6182

Well, this is rather confusing, because the drop-down by the search box on
[1] has a selection for "commit". You'd think that would allow you to search
for a specific commit.

Turns out, the help [2] states that the "commit" context for search searches
commit author and messages. So I guess it's as expected, albeit confusing.
:(

Anyone know how hard it would be to allow a commit "search" to also look for
a specific commit hash?

Preferably submit it for inclusion *upstream* as a feature. We'd
rather not end up forking gitweb.

(And while at it, feel free to fix it to be less super-slow :P)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

--
Sent via pgsql-www mailing list (pgsql-www@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-www

#97Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#81)
1 attachment(s)
Re: tracking commit timestamps

On 10/11/14 14:53, Robert Haas wrote:

On Mon, Nov 10, 2014 at 8:39 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:

I did the calculation above wrong btw, it's actually 20 bytes not 24 bytes
per record, I am inclined to just say we can live with that.

If you do it as 20 bytes, you'll have to do some work to squeeze out
the alignment padding. I'm inclined to think it's fine to have a few
extra padding bytes here; someone might want to use those for
something in the future, and they probably don't cost much.

I did get around the alignment via memcpy, so it is still 20bytes.

Since we agreed that the (B) case is not really feasible and we are doing
the (C), I also wonder if extradata should be renamed to nodeid (even if
it's not used at this point as nodeid). And then there is question about the
size of it, since the nodeid itself can live with 2 bytes probably ("64k of
nodes ought to be enough for everybody" ;) ).
Or leave the extradata as is but use as reserved space for future use and
not expose it at this time on SQL level at all?

I vote for calling it node-ID, and for allowing at least 4 bytes for
it. Penny wise, pound foolish.

Ok, I went this way.

Anyway here is v8 version of the patch, I think I addressed all the
concerns mentioned, it's also rebased against current master (BRIN
commit added some conflicts).

Brief list of changes:
- the commit timestamp record now stores timestamp, lsn and nodeid
- added code to disallow turning track_commit_timestamp on with too
small pagesize
- the get interfaces error out when track_commit_timestamp is off
- if the xid passed to get interface is out of range -infinity
timestamp is returned (I think it's bad idea to throw errors here as the
valid range is not static and same ID can start throwing errors between
calls theoretically)
- renamed the sql interfaces to pg_xact_commit_timestamp,
pg_xact_commit_timestamp_data and pg_last_committed_xact, they don't
expose the nodeid atm, I personally am not big fan of the "xact" but it
seems more consistent with existing naming
- documented pg_resetxlog changes and make all the pg_resetxlog
options alphabetically ordered
- committs is not used anymore, it's commit_ts (and CommitTs in
camelcase), I think it's not really good idea to spell the timestamp
everywhere as some interface then get 30+ chars long names...
- added WAL logging of the track_commit_timestamp GUC
- added alternative expected output of the regression test so that it
works with make installcheck when track_commit_timestamp is on
- added C interface to set default nodeid for current backend
- several minor comment and naming adjustments mostly suggested by Michael

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

committs-v8.patchtext/x-diff; name=committs-v8.patchDownload
diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..e331297 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -50,6 +50,7 @@ SUBDIRS = \
 		spi		\
 		tablefunc	\
 		tcn		\
+		test_committs	\
 		test_decoding	\
 		test_parser	\
 		test_shm_mq	\
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
index 3b8241b..f0a023f 100644
--- a/contrib/pg_upgrade/pg_upgrade.c
+++ b/contrib/pg_upgrade/pg_upgrade.c
@@ -423,8 +423,10 @@ copy_clog_xlog_xid(void)
 	/* set the next transaction id and epoch of the new cluster */
 	prep_status("Setting next transaction ID and epoch for new cluster");
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
-			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
-			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
+			  "\"%s/pg_resetxlog\" -f -x %u -c %u \"%s\"",
+			  new_cluster.bindir,
+			  old_cluster.controldata.chkpnt_nxtxid,
+			  old_cluster.controldata.chkpnt_nxtxid,
 			  new_cluster.pgdata);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
 			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index 9397198..e0af3cf 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -10,6 +10,7 @@
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/contrib/test_committs/.gitignore b/contrib/test_committs/.gitignore
new file mode 100644
index 0000000..1f95503
--- /dev/null
+++ b/contrib/test_committs/.gitignore
@@ -0,0 +1,5 @@
+# Generated subdirectories
+/log/
+/isolation_output/
+/regression_output/
+/tmp_check/
diff --git a/contrib/test_committs/Makefile b/contrib/test_committs/Makefile
new file mode 100644
index 0000000..2240749
--- /dev/null
+++ b/contrib/test_committs/Makefile
@@ -0,0 +1,45 @@
+# Note: because we don't tell the Makefile there are any regression tests,
+# we have to clean those result files explicitly
+EXTRA_CLEAN = $(pg_regress_clean_files) ./regression_output ./isolation_output
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_committs
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# We can't support installcheck because normally installcheck users don't have
+# the required track_commit_timestamp on
+installcheck:;
+
+check: regresscheck
+
+submake-regress:
+	$(MAKE) -C $(top_builddir)/src/test/regress all
+
+submake-test_committs:
+	$(MAKE) -C $(top_builddir)/contrib/test_committs
+
+REGRESSCHECKS=committs_on
+
+regresscheck: all | submake-regress submake-test_committs
+	$(MKDIR_P) regression_output
+	$(pg_regress_check) \
+	    --temp-config $(top_srcdir)/contrib/test_committs/committs.conf \
+	    --temp-install=./tmp_check \
+	    --extra-install=contrib/test_committs \
+	    --outputdir=./regression_output \
+	    $(REGRESSCHECKS)
+
+regresscheck-install-force: | submake-regress submake-test_committs
+	$(pg_regress_installcheck) \
+	    --extra-install=contrib/test_committs \
+	    $(REGRESSCHECKS)
+
+PHONY: submake-test_committs submake-regress check \
+	regresscheck regresscheck-install-force
\ No newline at end of file
diff --git a/contrib/test_committs/committs.conf b/contrib/test_committs/committs.conf
new file mode 100644
index 0000000..d221a60
--- /dev/null
+++ b/contrib/test_committs/committs.conf
@@ -0,0 +1 @@
+track_commit_timestamp = on
\ No newline at end of file
diff --git a/contrib/test_committs/expected/committs_on.out b/contrib/test_committs/expected/committs_on.out
new file mode 100644
index 0000000..1457a27
--- /dev/null
+++ b/contrib/test_committs/expected/committs_on.out
@@ -0,0 +1,21 @@
+--
+-- Commit Timestamp (on)
+--
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | ?column? | ?column? | ?column? 
+----+----------+----------+----------
+  1 | t        | t        | t
+  2 | t        | t        | t
+  3 | t        | t        | t
+(3 rows)
+
+DROP TABLE committs_test;
diff --git a/contrib/test_committs/sql/committs_on.sql b/contrib/test_committs/sql/committs_on.sql
new file mode 100644
index 0000000..0f2d064
--- /dev/null
+++ b/contrib/test_committs/sql/committs_on.sql
@@ -0,0 +1,18 @@
+--
+-- Commit Timestamp (on)
+--
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+
+DROP TABLE committs_test;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6bfb7bb..2fef80e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2673,6 +2673,20 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
+      <term><varname>track_commit_timestamp</varname> (<type>bool</type>)</term>
+      <indexterm>
+       <primary><varname>track_commit_timestamp</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Record commit time of transactions. This parameter
+        can only be set in <filename>postgresql.conf</> file or on the server
+        command line. The default value is <literal>off</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b58cfa5..e3ace51 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15918,6 +15918,43 @@ SELECT collation for ('foo' COLLATE "de_DE");
     For example <literal>10:20:10,14,15</literal> means
     <literal>xmin=10, xmax=20, xip_list=10, 14, 15</literal>.
    </para>
+
+   <para>
+    The functions shown in <xref linkend="functions-committs">
+    provide information about transactions that have been already committed.
+    These functions mainly provide information about when the transactions
+    were committed. They only provide useful data when
+    <xref linkend="guc-track-commit-timestamp"> configuration option is enabled
+    and only for transactions that were committed after it was enabled.
+   </para>
+
+   <table id="functions-committs">
+    <title>Committed transaction information</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry></row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry><literal><function>pg_xact_commit_timestamp(<parameter>xid</parameter>)</function></literal></entry>
+       <entry><type>timestamp with time zone</type></entry>
+       <entry>get commit timestamp of a transaction</entry>
+      </row>
+      <row>
+       <entry><literal><function>pg_xact_commit_timestamp_data(<parameter>xid</>)</function></literal></entry>
+       <entry> <parameter>timestamp</> <type>timestamp with time zone</>, <parameter>lsn</> <type>pg_lsn</></entry>
+       <entry>get commit timestamp and lsn of a transaction</entry>
+      </row>
+      <row>
+       <entry><literal><function>pg_last_committed_xact()</function></literal></entry>
+       <entry><parameter>xid</> <type>xid</>, <parameter>timestamp</> <type>timestamp with time zone</>, <parameter>lsn</> <type>pg_lsn</></entry>
+       <entry>get transaction Id, commit timestamp and lsn of latest transaction commit</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
   </sect1>
 
   <sect1 id="functions-admin">
diff --git a/doc/src/sgml/ref/pg_resetxlog.sgml b/doc/src/sgml/ref/pg_resetxlog.sgml
index aba7185..7117118 100644
--- a/doc/src/sgml/ref/pg_resetxlog.sgml
+++ b/doc/src/sgml/ref/pg_resetxlog.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
  <refsynopsisdiv>
   <cmdsynopsis>
    <command>pg_resetxlog</command>
+   <arg choice="opt"><option>-c</option> <replaceable class="parameter">xid</replaceable></arg>
    <arg choice="opt"><option>-f</option></arg>
    <arg choice="opt"><option>-n</option></arg>
    <arg choice="opt"><option>-o</option> <replaceable class="parameter">oid</replaceable></arg>
@@ -78,11 +79,12 @@ PostgreSQL documentation
 
   <para>
    The <option>-o</>, <option>-x</>, <option>-e</>,
-   <option>-m</>, <option>-O</>,
-   and <option>-l</>
+   <option>-m</>, <option>-O</>, <option>-l</>
+   and <option>-e</>
    options allow the next OID, next transaction ID, next transaction ID's
-   epoch, next and oldest multitransaction ID, next multitransaction offset, and WAL
-   starting address values to be set manually.  These are only needed when
+   epoch, next and oldest multitransaction ID, next multitransaction offset, WAL
+   starting address and the oldest transaction ID for which the commit time can
+   be retrieved values to be set manually.  These are only needed when
    <command>pg_resetxlog</command> is unable to determine appropriate values
    by reading <filename>pg_control</>.  Safe values can be determined as
    follows:
@@ -130,6 +132,15 @@ PostgreSQL documentation
 
     <listitem>
      <para>
+      A safe value for the oldest transaction ID for which the commit time can
+      be retrieve (<option>-c</>) can be determined by looking for the
+      numerically smallest file name in the directory <filename>pg_committs</>
+      under the data directory As above, the file names are in hexadecimal.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
       The WAL starting address (<option>-l</>) should be
       larger than any WAL segment file name currently existing in
       the directory <filename>pg_xlog</> under the data directory.
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 32cb985..0daa9bb 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,8 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o dbasedesc.o gindesc.o gistdesc.o \
-	   hashdesc.o heapdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
-	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
+	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o \
+	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/committsdesc.c b/src/backend/access/rmgrdesc/committsdesc.c
new file mode 100644
index 0000000..7802584
--- /dev/null
+++ b/src/backend/access/rmgrdesc/committsdesc.c
@@ -0,0 +1,73 @@
+/*-------------------------------------------------------------------------
+ *
+ * committsdesc.c
+ *    rmgr descriptor routines for access/transam/committs.c
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *    src/backend/access/rmgrdesc/committsdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/committs.h"
+#include "utils/timestamp.h"
+
+
+void
+commit_ts_desc(StringInfo buf, XLogRecord *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+
+	if (info == COMMIT_TS_ZEROPAGE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "zeropage: %d", pageno);
+	}
+	else if (info == COMMIT_TS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "truncate before: %d", pageno);
+	}
+	else if (info == COMMIT_TS_SETTS)
+	{
+		xl_commit_ts_set *xlrec = (xl_commit_ts_set *) rec;
+		int		i;
+
+		appendStringInfo(buf, "set commit_ts %s for: %u",
+						 timestamptz_to_str(xlrec->timestamp),
+						 xlrec->mainxid);
+		for (i = 0; i < xlrec->nsubxids; i++)
+			appendStringInfo(buf, ", %u", xlrec->subxids[i]);
+	}
+}
+
+const char *
+commit_ts_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info)
+	{
+		case COMMIT_TS_ZEROPAGE:
+			id = "ZEROPAGE";
+			break;
+		case COMMIT_TS_TRUNCATE:
+			id = "TRUNCATE";
+			break;
+		case COMMIT_TS_SETTS:
+			id = "SETTS";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index e0957ff..9919c52 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -45,7 +45,7 @@ xlog_desc(StringInfo buf, XLogRecord *record)
 		appendStringInfo(buf, "redo %X/%X; "
 						 "tli %u; prev tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
-						 "oldest running xid %u; %s",
+						 "oldest commit timestamp xid: %u; oldest running xid %u; %s",
 				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->PrevTimeLineID,
@@ -58,6 +58,7 @@ xlog_desc(StringInfo buf, XLogRecord *record)
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
 						 checkpoint->oldestMultiDB,
+						 checkpoint->oldestCommitTs,
 						 checkpoint->oldestActiveXid,
 				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 82a6c76..a1979ca 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -14,7 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
 	timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o committs.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/committs.c b/src/backend/access/transam/committs.c
new file mode 100644
index 0000000..a50a027
--- /dev/null
+++ b/src/backend/access/transam/committs.c
@@ -0,0 +1,887 @@
+/*-------------------------------------------------------------------------
+ *
+ * committs.c
+ *		PostgreSQL commit timestamp manager
+ *
+ * This module is a pg_clog-like system that stores the commit timestamp
+ * for each transaction.
+ *
+ * XLOG interactions: this module generates an XLOG record whenever a new
+ * CommitTs page is initialized to zeroes.  Also, one XLOG record is
+ * generated for setting of values when the caller requests it; this allows
+ * us to support values coming from places other than transaction commit.
+ * Other writes of CommitTS come from recording of transaction commit in
+ * xact.c, which generates its own XLOG records for these events and will
+ * re-perform the status update on redo; so we need make no additional XLOG
+ * entry here.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/committs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/committs.h"
+#include "access/htup_details.h"
+#include "access/slru.h"
+#include "access/transam.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+
+/*
+ * Defines for CommitTs page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CommitTs page numbering also wraps around at
+ * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE, and CommitTs segment numbering at
+ * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCommitTs (see CommitTsPagePrecedes).
+ */
+
+/* We need 8+4 bytes per xact */
+typedef struct CommitTimestampEntry
+{
+	TimestampTz		time;
+	XLogRecPtr		lsn;
+	NodeIdRec		nodeid;
+} CommitTimestampEntry;
+
+#define SizeOfCommitTimestampEntry sizeof(CommitTimestampEntry)
+
+/* this is limited by how much data we can fit into SLRU cache */
+#define COMMIT_TS_MIN_BLCKSZ 4096
+
+#define COMMIT_TS_XACTS_PER_PAGE \
+	(BLCKSZ / SizeOfCommitTimestampEntry)
+
+#define TransactionIdToCTsPage(xid)	\
+	((xid) / (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+#define TransactionIdToCTsEntry(xid)	\
+	((xid) % (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CommitTs control
+ */
+static SlruCtlData CommitTsCtlData;
+
+#define CommitTsCtl (&CommitTsCtlData)
+
+/*
+ * We keep a cache of the last value set in shared memory.  This is protected
+ * by CommitTsLock.
+ */
+typedef struct CommitTimestampShared
+{
+	TransactionId	xidLastCommit;
+	CommitTimestampEntry dataLastCommit;
+} CommitTimestampShared;
+
+CommitTimestampShared	*commitTsShared;
+
+
+/* GUC variable */
+bool	track_commit_timestamp;
+
+NodeIdRec CommitTsDefaultNodeId = InvalidNodeId;
+
+static void SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz ts,
+					 XLogRecPtr lsn, NodeIdRec nodeid, int pageno);
+static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
+						  XLogRecPtr lsn, NodeIdRec nodeid, int slotno);
+static int	ZeroCommitTsPage(int pageno, bool writeXlog);
+static bool CommitTsPagePrecedes(int page1, int page2);
+static void WriteZeroPageXlogRec(int pageno);
+static void WriteTruncateXlogRec(int pageno);
+static void WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 XLogRecPtr lsn, NodeIdRec nodeid);
+
+
+/*
+ * CommitTsSetCurrentNodeId
+ *
+ * Set default nodeid for current backend.
+ */
+extern void CommitTsSetDefaultNodeId(NodeIdRec nodeid)
+{
+	CommitTsDefaultNodeId = nodeid;
+}
+
+/*
+ * TransactionTreeSetCommitTsData
+ *
+ * Record the final commit timestamp of transaction entries in the commit log
+ * for a transaction and its subtransaction tree, as efficiently as possible.
+ *
+ * xid is the top level transaction id.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ * The reason why tracking just the parent xid committs is not enough is that
+ * the subtrans SLRU does not stay valid across crashes (is not permanent) so we
+ * need to keep the information about them here. If the subtrans implementation
+ * changes in the future, we might want to revisit the decision of storing
+ * committs for each subxid.
+ *
+ * The do_xlog parameter tells us whether to include a XLog record of this
+ * or not.  Normal path through RecordTransactionCommit() will be related
+ * to a transaction commit XLog record, and so should pass "false" here.
+ * Other callers probably want to pass true, so that the given values persist
+ * in case of crashes.
+ */
+void
+TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
+							   TransactionId *subxids, TimestampTz timestamp,
+							   XLogRecPtr lsn, NodeIdRec nodeid, bool do_xlog)
+{
+	int			i;
+	TransactionId headxid;
+
+	Assert(xid != InvalidTransactionId);
+
+	if (!track_commit_timestamp)
+		return;
+
+	/*
+	 * Comply with the WAL-before-data rule: if caller specified it wants
+	 * this value to be recorded in WAL, do so before touching the data.
+	 */
+	if (do_xlog)
+		WriteSetTimestampXlogRec(xid, nsubxids, subxids, timestamp, lsn, nodeid);
+
+	/*
+	 * We split the xids to set the timestamp to in groups belonging to the
+	 * same SLRU page; the first element in each such set is its head.  The
+	 * first group has the main XID as the head; subsequent sets use the
+	 * first subxid not on the previous page as head.  This way, we only have
+	 * to lock/modify each SLRU page once.
+	 */
+	for (i = 0, headxid = xid;;)
+	{
+		int			pageno = TransactionIdToCTsPage(headxid);
+		int			j;
+
+		for (j = i; j < nsubxids; j++)
+		{
+			if (TransactionIdToCTsPage(subxids[j]) != pageno)
+				break;
+		}
+		/* subxids[i..j] are on the same page as the head */
+
+		SetXidCommitTsInPage(headxid, j - i, subxids + i, timestamp, lsn,
+							 nodeid, pageno);
+
+		/* if we wrote out all subxids, we're done. */
+		if (j + 1 >= nsubxids)
+			break;
+
+		/*
+		 * Set the new head and skip over it, as well as over the subxids
+		 * we just wrote.
+		 */
+		headxid = subxids[j];
+		i += j - i + 1;
+	}
+
+	/*
+	 * Update the cached value in shared memory
+	 */
+	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+	commitTsShared->xidLastCommit = xid;
+	commitTsShared->dataLastCommit.time = timestamp;
+	commitTsShared->dataLastCommit.lsn = lsn;
+	LWLockRelease(CommitTsLock);
+}
+
+/*
+ * Record the commit timestamp of transaction entries in the commit log for all
+ * entries on a single page.  Atomic only on this page.
+ */
+static void
+SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz ts,
+					 XLogRecPtr lsn, NodeIdRec nodeid, int pageno)
+{
+	int			slotno;
+	int			i;
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
+
+	TransactionIdSetCommitTs(xid, ts, lsn, nodeid, slotno);
+	for (i = 0; i < nsubxids; i++)
+		TransactionIdSetCommitTs(subxids[i], ts, lsn, nodeid, slotno);
+
+	CommitTsCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Sets the commit timestamp of a single transaction.
+ *
+ * Must be called with CommitTsControlLock held
+ */
+static void
+TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
+						 XLogRecPtr lsn, NodeIdRec nodeid, int slotno)
+{
+	int			entryno = TransactionIdToCTsEntry(xid);
+	CommitTimestampEntry entry;
+
+	entry.time = ts;
+	entry.lsn = lsn;
+	entry.nodeid = nodeid;
+
+	memcpy(CommitTsCtl->shared->page_buffer[slotno] +
+				SizeOfCommitTimestampEntry * entryno,
+		   &entry, SizeOfCommitTimestampEntry);
+}
+
+/*
+ * Interrogate the commit timestamp of a transaction.
+ */
+void
+TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
+							 XLogRecPtr *lsn, NodeIdRec *nodeid)
+{
+	int			pageno = TransactionIdToCTsPage(xid);
+	int			entryno = TransactionIdToCTsEntry(xid);
+	int			slotno;
+	CommitTimestampEntry entry;
+	TransactionId oldestCommitTs;
+
+	/* Error if module not enabled */
+	if (!track_commit_timestamp)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Cannot get commit timestamp data because \"track_commit_timestamp\" is not enabled")));
+	}
+
+	/*
+	 * Return empty if the requested value is older than what we have or
+	 * newer than newest we have.
+	 *
+	 * XXX: should this be error instead?
+	 */
+	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+	oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	LWLockRelease(CommitTsControlLock);
+
+	if (!TransactionIdIsValid(oldestCommitTs) ||
+		TransactionIdPrecedes(xid, oldestCommitTs) ||
+		TransactionIdPrecedes(commitTsShared->xidLastCommit, xid))
+	{
+		if (ts)
+			TIMESTAMP_NOBEGIN(*ts);
+		if (lsn)
+			*lsn = InvalidXLogRecPtr;
+		if (nodeid)
+			*nodeid = InvalidNodeId;
+		return;
+	}
+
+	/*
+	 * Use an unlocked atomic read on our cached value in shared memory;
+	 * if it's a hit, acquire a lock and read the data, after verifying
+	 * that it's still what we initially read.  Otherwise, fall through
+	 * to read from SLRU.
+	 */
+	if (commitTsShared->xidLastCommit == xid)
+	{
+		LWLockAcquire(CommitTsLock, LW_SHARED);
+		if (commitTsShared->xidLastCommit == xid)
+		{
+			if (ts)
+				*ts = commitTsShared->dataLastCommit.time;
+			if (lsn)
+				*lsn = commitTsShared->dataLastCommit.lsn;
+			if (nodeid)
+				*nodeid = commitTsShared->dataLastCommit.nodeid;
+			LWLockRelease(CommitTsLock);
+			return;
+		}
+		LWLockRelease(CommitTsLock);
+	}
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+	memcpy(&entry,
+		   CommitTsCtl->shared->page_buffer[slotno] +
+				SizeOfCommitTimestampEntry * entryno,
+		   SizeOfCommitTimestampEntry);
+
+	if (ts)
+		*ts = entry.time;
+	if (lsn)
+		*lsn = entry.lsn;
+	if (nodeid)
+		*nodeid = entry.nodeid;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Return the Xid of the latest committed transaction.  (As far as this module
+ * is concerned, anyway; it's up to the caller to ensure the value is useful
+ * for its purposes.)
+ *
+ * ts and extra are filled with the corresponding data; they can be passed
+ * as NULL if not wanted.
+ */
+TransactionId
+GetLatestCommitTsData(TimestampTz *ts, XLogRecPtr *lsn, NodeIdRec *nodeid)
+{
+	TransactionId	xid;
+
+	/* Return empty if module not enabled */
+	if (!track_commit_timestamp)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Cannot get commit timestamp data because \"track_commit_timestamp\" is not enabled")));
+	}
+
+	LWLockAcquire(CommitTsLock, LW_SHARED);
+	xid = commitTsShared->xidLastCommit;
+	if (ts)
+		*ts = commitTsShared->dataLastCommit.time;
+	if (lsn)
+		*lsn = commitTsShared->dataLastCommit.lsn;
+	if (nodeid)
+		*nodeid = commitTsShared->dataLastCommit.nodeid;
+	LWLockRelease(CommitTsLock);
+
+	return xid;
+}
+
+/*
+ * SQL-callable wrapper to obtain commit time of a transaction
+ */
+Datum
+pg_xact_commit_timestamp(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	TimestampTz		ts;
+
+	TransactionIdGetCommitTsData(xid, &ts, NULL, NULL);
+
+	PG_RETURN_TIMESTAMPTZ(ts);
+}
+
+Datum
+pg_xact_commit_timestamp_data(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	TimestampTz		ts;
+	XLogRecPtr		lsn;
+	Datum       values[2];
+	bool        nulls[2];
+	TupleDesc   tupdesc;
+	HeapTuple	htup;
+
+	/*
+	 * Construct a tuple descriptor for the result row.  This must match this
+	 * function's pg_proc entry!
+	 */
+	tupdesc = CreateTemplateTupleDesc(2, false);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "timestamp",
+					   TIMESTAMPTZOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "lsn",
+					   LSNOID, -1, 0);
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	/* and construct a tuple with our data */
+	TransactionIdGetCommitTsData(xid, &ts, &lsn, NULL);
+
+	values[0] = TimestampTzGetDatum(ts);
+	nulls[0] = false;
+
+	values[1] = LSNGetDatum(lsn);
+	nulls[1] = false;
+
+	htup = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+Datum
+pg_last_committed_xact(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid;
+	TimestampTz		ts;
+	XLogRecPtr		lsn;
+	Datum       values[3];
+	bool        nulls[3];
+	TupleDesc   tupdesc;
+	HeapTuple	htup;
+
+	/*
+	 * Construct a tuple descriptor for the result row.  This must match this
+	 * function's pg_proc entry!
+	 */
+	tupdesc = CreateTemplateTupleDesc(3, false);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "xid",
+					   XIDOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "timestamp",
+					   TIMESTAMPTZOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 3, "lsn",
+					   LSNOID, -1, 0);
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	/* and construct a tuple with our data */
+	xid = GetLatestCommitTsData(&ts, &lsn, NULL);
+
+	values[0] = TransactionIdGetDatum(xid);
+	nulls[0] = false;
+
+	values[1] = TimestampTzGetDatum(ts);
+	nulls[1] = false;
+
+	values[2] = LSNGetDatum(lsn);
+	nulls[2] = false;
+
+	htup = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+/*
+ * Number of shared CommitTS buffers.
+ *
+ * We use a very similar logic as for the number of CLOG buffers; see comments
+ * in CLOGShmemBuffers.
+ */
+Size
+CommitTsShmemBuffers(void)
+{
+	return Min(16, Max(4, NBuffers / 1024));
+}
+
+/*
+ * Initialization of shared memory for CommitTs
+ */
+Size
+CommitTsShmemSize(void)
+{
+	return SimpleLruShmemSize(CommitTsShmemBuffers(), 0) +
+		sizeof(CommitTimestampShared);
+}
+
+void
+CommitTsShmemInit(void)
+{
+	bool	found;
+
+	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
+	SimpleLruInit(CommitTsCtl, "CommitTs Ctl", CommitTsShmemBuffers(), 0,
+				  CommitTsControlLock, "pg_commit_ts");
+
+	commitTsShared = ShmemInitStruct("CommitTs shared",
+									 sizeof(CommitTimestampShared),
+									 &found);
+
+	if (!IsUnderPostmaster)
+	{
+		Assert(!found);
+
+		commitTsShared->xidLastCommit = InvalidTransactionId;
+		TIMESTAMP_NOBEGIN(commitTsShared->dataLastCommit.time);
+		commitTsShared->dataLastCommit.lsn = InvalidXLogRecPtr;
+		commitTsShared->dataLastCommit.nodeid = InvalidNodeId;
+	}
+	else
+		Assert(found);
+}
+
+/*
+ * This function must be called ONCE on system install.
+ *
+ * (The CommitTs directory is assumed to have been created by initdb, and
+ * CommitTsShmemInit must have been called already.)
+ */
+void
+BootStrapCommitTs(void)
+{
+	/*
+	 * Nothing to do here at present, unlike most other SLRU modules; segments
+	 * are created when the server is started with this module enabled.
+	 * See StartupCommitTs.
+	 */
+}
+
+/*
+ * Initialize (or reinitialize) a page of CommitTs to zeroes.
+ * If writeXlog is TRUE, also emit an XLOG record saying we did this.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCommitTsPage(int pageno, bool writeXlog)
+{
+	int			slotno;
+
+	slotno = SimpleLruZeroPage(CommitTsCtl, pageno);
+
+	if (writeXlog)
+		WriteZeroPageXlogRec(pageno);
+
+	return slotno;
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ */
+void
+StartupCommitTs(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/*
+	 * Initialize our idea of the latest page number.
+	 */
+	CommitTsCtl->shared->latest_page_number = pageno;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * when commit timestamp is enabled.
+ * Must be called after recovery has finished.
+ *
+ * This is in charge of creating the currently active segment, if it's not
+ * already there.  The reason for this is that the server might have been
+ * running with this module disabled for a while and thus might have skipped
+ * the normal creation point.
+ */
+void
+InitCommitTs(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/*
+	 * Re-Initialize our idea of the latest page number.
+	 */
+	CommitTsCtl->shared->latest_page_number = pageno;
+
+	/*
+	 * If this module is not currently enabled, make sure we don't hand back
+	 * possibly-invalid data; also remove segments of old data.
+	 */
+	if (!track_commit_timestamp)
+	{
+		ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
+		LWLockRelease(CommitTsControlLock);
+
+		TruncateCommitTs(ReadNewTransactionId());
+
+		return;
+	}
+
+	/*
+	 * If CommitTs is enabled, but it wasn't in the previous server run, we
+	 * need to set the oldest value to the next Xid; that way, we will not try
+	 * to read data that might not have been set.
+	 *
+	 * XXX does this have a problem if a server is started with commitTs
+	 * enabled, then started with commitTs disabled, then restarted with it
+	 * enabled again?  It doesn't look like it does, because there should be a
+	 * checkpoint that sets the value to InvalidTransactionId at end of
+	 * recovery; and so any chance of injecting new transactions without
+	 * CommitTs values would occur after the oldestCommitTs has been set to
+	 * Invalid temporarily.
+	 */
+	if (ShmemVariableCache->oldestCommitTs == InvalidTransactionId)
+		ShmemVariableCache->oldestCommitTs = ReadNewTransactionId();
+
+	/* Finally, create the current segment file, if necessary */
+	if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
+	{
+		int		slotno;
+
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+	}
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, true);
+}
+
+/*
+ * Make sure that CommitTs has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CommitTs or xlog page to make room
+ * in shared memory.
+ *
+ * NB2: the current implementation relies on the fact that
+ * track_commit_timestamp is flagged as PGC_POSTMASTER
+ * (only possible to be set at server start).
+ */
+void
+ExtendCommitTs(TransactionId newestXact)
+{
+	int			pageno;
+
+	/* nothing to do if module not enabled */
+	if (!track_commit_timestamp)
+		return;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToCTsEntry(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToCTsPage(newestXact);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/* Zero the page and make an XLOG entry about it */
+	ZeroCommitTsPage(pageno, !InRecovery);
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Remove all CommitTs segments before the one holding the passed
+ * transaction ID
+ *
+ * Note that we don't need to flush XLOG here.
+ */
+void
+TruncateCommitTs(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate.
+	 */
+	cutoffPage = TransactionIdToCTsPage(oldestXact);
+
+	/* Check to see if there's any files that could be removed */
+	if (!SlruScanDirectory(CommitTsCtl, SlruScanDirCbReportPresence, &cutoffPage))
+		return;					/* nothing to remove */
+
+	/* Write XLOG record */
+	WriteTruncateXlogRec(cutoffPage);
+
+	/* Now we can remove the old CommitTs segment(s) */
+	SimpleLruTruncate(CommitTsCtl, cutoffPage);
+}
+
+/*
+ * Set the earliest value for which commit TS can be consulted.
+ */
+void
+SetCommitTsLimit(TransactionId oldestXact)
+{
+	/*
+	 * Be careful not to overwrite values that are either further into the
+	 * "future" or signal a disabled committs.
+	 */
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+	if (ShmemVariableCache->oldestCommitTs != InvalidTransactionId &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCommitTs, oldestXact))
+		ShmemVariableCache->oldestCommitTs = oldestXact;
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Decide which of two CLOG page numbers is "older" for truncation purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic.  However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CommitTsPagePrecedes(int page1, int page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * COMMIT_TS_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId;
+	xid2 = ((TransactionId) page2) * COMMIT_TS_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId;
+
+	return TransactionIdPrecedes(xid1, xid2);
+}
+
+
+/*
+ * Write a ZEROPAGE xlog record
+ */
+static void
+WriteZeroPageXlogRec(int pageno)
+{
+	XLogRecData rdata;
+
+	rdata.data = (char *) (&pageno);
+	rdata.len = sizeof(int);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	(void) XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_ZEROPAGE, &rdata);
+}
+
+/*
+ * Write a TRUNCATE xlog record
+ */
+static void
+WriteTruncateXlogRec(int pageno)
+{
+	XLogRecData rdata;
+
+	rdata.data = (char *) (&pageno);
+	rdata.len = sizeof(int);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_TRUNCATE, &rdata);
+}
+
+/*
+ * Write a SETTS xlog record
+ */
+static void
+WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 XLogRecPtr lsn, NodeIdRec nodeid)
+{
+	XLogRecData			rdata;
+	xl_commit_ts_set	record;
+
+	record.timestamp = timestamp;
+	record.lsn = lsn;
+	record.nodeid = nodeid;
+	record.mainxid = mainxid;
+	record.nsubxids = nsubxids;
+	memcpy(record.subxids, subxids, sizeof(TransactionId) * nsubxids);
+
+	rdata.data = (char *) &record;
+	rdata.len = offsetof(xl_commit_ts_set, subxids) +
+		nsubxids * sizeof(TransactionId);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_SETTS, &rdata);
+}
+
+
+/*
+ * CommitTS resource manager's routines
+ */
+void
+commit_ts_redo(XLogRecPtr lsn, XLogRecord *record)
+{
+	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+
+	/* Backup blocks are not used in commit_ts records */
+	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+	if (info == COMMIT_TS_ZEROPAGE)
+	{
+		int			pageno;
+		int			slotno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+
+		LWLockRelease(CommitTsControlLock);
+	}
+	else if (info == COMMIT_TS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		/*
+		 * During XLOG replay, latest_page_number isn't set up yet; insert a
+		 * suitable value to bypass the sanity test in SimpleLruTruncate.
+		 */
+		CommitTsCtl->shared->latest_page_number = pageno;
+
+		SimpleLruTruncate(CommitTsCtl, pageno);
+	}
+	else if (info == COMMIT_TS_SETTS)
+	{
+		xl_commit_ts_set *setts = (xl_commit_ts_set *) XLogRecGetData(record);
+
+		TransactionTreeSetCommitTsData(setts->mainxid, setts->nsubxids,
+									   setts->subxids, setts->timestamp,
+									   setts->lsn, setts->nodeid, false);
+	}
+	else
+		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
+}
+
+/*
+ * Helper function for GUC
+ *
+ * Check if we can enable the track_commit_timestamp.
+ */
+bool
+check_track_commit_timestamp(bool *newval, void **extra, GucSource source)
+{
+	if (*newval && BLCKSZ < COMMIT_TS_MIN_BLCKSZ)
+	{
+		GUC_check_errmsg("Commit timestamps tacking cannot be enabled for builds with page size smaller than %d",
+						 COMMIT_TS_MIN_BLCKSZ);
+		return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index befd60f..f24861c 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -8,6 +8,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index d51cca4..d3287da 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -14,6 +14,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -158,9 +159,10 @@ GetNewTransactionId(bool isSubXact)
 	 * XID before we zero the page.  Fortunately, a page of the commit log
 	 * holds 32K or more transactions, so we don't have to do this very often.
 	 *
-	 * Extend pg_subtrans too.
+	 * Extend pg_subtrans and pg_committs too.
 	 */
 	ExtendCLOG(xid);
+	ExtendCommitTs(xid);
 	ExtendSUBTRANS(xid);
 
 	/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6f92bad..fc5f7c9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -20,6 +20,7 @@
 #include <time.h>
 #include <unistd.h>
 
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1168,6 +1169,18 @@ RecordTransactionCommit(void)
 	}
 
 	/*
+	 * We don't need to log the commit timestamp separately since the commit
+	 * record logged above has all the necessary action to set the timestamp
+	 * again.
+	 */
+	if (markXidCommitted)
+	{
+		TransactionTreeSetCommitTsData(xid, nchildren, children,
+									   xactStopTimestamp, XactLastRecEnd,
+									   CommitTsDefaultNodeId, false);
+	}
+
+	/*
 	 * Check if we want to commit asynchronously.  We can allow the XLOG flush
 	 * to happen asynchronously if synchronous_commit=off, or if the current
 	 * transaction has not performed any WAL-logged operation.  The latter
@@ -4683,6 +4696,7 @@ xactGetCommittedChildren(TransactionId **ptr)
  */
 static void
 xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
+						  TimestampTz commit_time,
 						  TransactionId *sub_xids, int nsubxacts,
 						  SharedInvalidationMessage *inval_msgs, int nmsgs,
 						  RelFileNode *xnodes, int nrels,
@@ -4710,6 +4724,11 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
 		LWLockRelease(XidGenLock);
 	}
 
+	/* Set the transaction commit timestamp and metadata */
+	TransactionTreeSetCommitTsData(xid, nsubxacts, sub_xids,
+								   commit_time, lsn,
+								   CommitTsDefaultNodeId, false);
+
 	if (standbyState == STANDBY_DISABLED)
 	{
 		/*
@@ -4829,7 +4848,8 @@ xact_redo_commit(xl_xact_commit *xlrec,
 	/* invalidation messages array follows subxids */
 	inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
 
-	xact_redo_commit_internal(xid, lsn, subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  subxacts, xlrec->nsubxacts,
 							  inval_msgs, xlrec->nmsgs,
 							  xlrec->xnodes, xlrec->nrels,
 							  xlrec->dbId,
@@ -4844,7 +4864,8 @@ static void
 xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
 						 TransactionId xid, XLogRecPtr lsn)
 {
-	xact_redo_commit_internal(xid, lsn, xlrec->subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  xlrec->subxacts, xlrec->nsubxacts,
 							  NULL, 0,	/* inval msgs */
 							  NULL, 0,	/* relfilenodes */
 							  InvalidOid,		/* dbId */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 99f702c..02b1dca 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -22,6 +22,7 @@
 #include <unistd.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -4520,6 +4521,7 @@ BootStrapXLOG(void)
 	checkPoint.oldestXidDB = TemplateDbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
+	checkPoint.oldestCommitTs = InvalidTransactionId;
 	checkPoint.time = (pg_time_t) time(NULL);
 	checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -4529,6 +4531,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(InvalidTransactionId);
 
 	/* Set up the XLOG page header */
 	page->xlp_magic = XLOG_PAGE_MAGIC;
@@ -4602,6 +4605,7 @@ BootStrapXLOG(void)
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
+	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
@@ -4610,6 +4614,7 @@ BootStrapXLOG(void)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
 
@@ -5858,6 +5863,9 @@ StartupXLOG(void)
 	ereport(DEBUG1,
 			(errmsg("oldest MultiXactId: %u, in database %u",
 					checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
+	ereport(DEBUG1,
+			(errmsg("oldest commit timestamp Xid: %u",
+					checkPoint.oldestCommitTs)));
 	if (!TransactionIdIsNormal(checkPoint.nextXid))
 		ereport(PANIC,
 				(errmsg("invalid next transaction ID")));
@@ -5869,6 +5877,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(checkPoint.oldestCommitTs);
 	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
 	XLogCtl->ckptXid = checkPoint.nextXid;
@@ -6091,11 +6100,12 @@ StartupXLOG(void)
 			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
 
 			/*
-			 * Startup commit log and subtrans only. MultiXact has already
-			 * been started up and other SLRUs are not maintained during
-			 * recovery and need not be started yet.
+			 * Startup commit log, commit timestamp and subtrans
+			 * only. MultiXact has already been started up and other SLRUs are
+			 * not maintained during recovery and need not be started yet.
 			 */
 			StartupCLOG();
+			StartupCommitTs();
 			StartupSUBTRANS(oldestActiveXID);
 
 			/*
@@ -6742,12 +6752,13 @@ StartupXLOG(void)
 	LWLockRelease(ProcArrayLock);
 
 	/*
-	 * Start up the commit log and subtrans, if not already done for hot
-	 * standby.
+	 * Start up the commit log, commit timestamp and subtrans, if not already
+	 * done for hot standby.
 	 */
 	if (standbyState == STANDBY_DISABLED)
 	{
 		StartupCLOG();
+		StartupCommitTs();
 		StartupSUBTRANS(oldestActiveXID);
 	}
 
@@ -6783,6 +6794,12 @@ StartupXLOG(void)
 	XLogReportParameters();
 
 	/*
+	 * Local WAL inserts enables, so it's time to finish initialization
+	 * of commit timestamp.
+	 */
+	InitCommitTs();
+
+	/*
 	 * All done.  Allow backends to write WAL.  (Although the bool flag is
 	 * probably atomic in itself, we use the info_lck here to ensure that
 	 * there are no race conditions concerning visibility of other recent
@@ -7347,6 +7364,7 @@ ShutdownXLOG(int code, Datum arg)
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
 	ShutdownCLOG();
+	ShutdownCommitTs();
 	ShutdownSUBTRANS();
 	ShutdownMultiXact();
 
@@ -7674,6 +7692,10 @@ CreateCheckPoint(int flags)
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
 	LWLockRelease(XidGenLock);
 
+	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+	checkPoint.oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	LWLockRelease(CommitTsControlLock);
+
 	/* Increase XID epoch if we've wrapped around since last checkpoint */
 	checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
 	if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
@@ -7959,6 +7981,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointCLOG();
+	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
 	CheckPointPredicate();
@@ -8399,7 +8422,8 @@ XLogReportParameters(void)
 		MaxConnections != ControlFile->MaxConnections ||
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
-		max_locks_per_xact != ControlFile->max_locks_per_xact)
+		max_locks_per_xact != ControlFile->max_locks_per_xact ||
+		track_commit_timestamp != ControlFile->track_commit_timestamp)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -8420,6 +8444,7 @@ XLogReportParameters(void)
 			xlrec.max_locks_per_xact = max_locks_per_xact;
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
+			xlrec.track_commit_timestamp = track_commit_timestamp;
 
 			rdata.buffer = InvalidBuffer;
 			rdata.data = (char *) &xlrec;
@@ -8436,6 +8461,7 @@ XLogReportParameters(void)
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
+		ControlFile->track_commit_timestamp = track_commit_timestamp;
 		UpdateControlFile();
 	}
 }
@@ -8815,6 +8841,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
+		ControlFile->track_commit_timestamp = track_commit_timestamp;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6384dc7..23b5248 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -23,6 +23,7 @@
 #include <math.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -1071,10 +1072,12 @@ vac_truncate_clog(TransactionId frozenXID,
 		return;
 
 	/*
-	 * Truncate CLOG to the oldest computed value.  Note we don't truncate
-	 * multixacts; that will be done by the next checkpoint.
+	 * Truncate CLOG and CommitTs to the oldest computed value.
+	 * Note we don't truncate multixacts; that will be done by the next
+	 * checkpoint.
 	 */
 	TruncateCLOG(frozenXID);
+	TruncateCommitTs(frozenXID);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
@@ -1084,6 +1087,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
 	SetMultiXactIdLimit(minMulti, minmulti_datoid);
+	SetCommitTsLimit(frozenXID);
 }
 
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 8e78aaf..44898ab 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -133,6 +133,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogRecord *record)
 		case RM_SEQ_ID:
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
+		case RM_COMMIT_TS_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) buf.record.xl_rmid);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 1d04c55..9025601 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -117,6 +118,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
 		size = add_size(size, CLOGShmemSize());
+		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
 		size = add_size(size, TwoPhaseShmemSize());
 		size = add_size(size, BackgroundWorkerShmemSize());
@@ -198,6 +200,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	 */
 	XLOGShmemInit();
 	CLOGShmemInit();
+	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
 	InitBufferPool();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 719181c..4b4b4bf 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "commands/async.h"
@@ -259,6 +260,9 @@ NumLWLocks(void)
 	/* clog.c needs one per CLOG buffer */
 	numLocks += CLOGShmemBuffers();
 
+	/* committs.c needs one per CommitTs buffer */
+	numLocks += CommitTsShmemBuffers();
+
 	/* subtrans.c needs one per SubTrans buffer */
 	numLocks += NUM_SUBTRANS_BUFFERS;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index db65c76..df6c952 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -26,6 +26,7 @@
 #include <syslog.h>
 #endif
 
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -826,6 +827,15 @@ static struct config_bool ConfigureNamesBool[] =
 		check_bonjour, NULL, NULL
 	},
 	{
+		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
+			gettext_noop("Collects transaction commit time."),
+			NULL
+		},
+		&track_commit_timestamp,
+		false,
+		check_track_commit_timestamp, NULL, NULL
+	},
+	{
 		{"ssl", PGC_POSTMASTER, CONN_AUTH_SECURITY,
 			gettext_noop("Enables SSL connections."),
 			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6e8ea1e..4da89a6 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -227,6 +227,7 @@
 #wal_sender_timeout = 60s	# in milliseconds; 0 disables
 
 #max_replication_slots = 0	# max number of replication slots
+#track_commit_timestamp = off	# collect timestamp of transaction commit
 				# (change requires restart)
 
 # - Master Server -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index dc1f1df..28e6dfd 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -185,6 +185,7 @@ static const char *subdirs[] = {
 	"pg_xlog",
 	"pg_xlog/archive_status",
 	"pg_clog",
+	"pg_commit_ts",
 	"pg_dynshmem",
 	"pg_notify",
 	"pg_serial",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index b2e0793..a838bb5 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -270,6 +270,8 @@ main(int argc, char *argv[])
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
@@ -300,6 +302,8 @@ main(int argc, char *argv[])
 		   ControlFile.max_prepared_xacts);
 	printf(_("Current max_locks_per_xact setting:   %d\n"),
 		   ControlFile.max_locks_per_xact);
+	printf(_("Current track_commit_timestamp setting: %s\n"),
+		   ControlFile.track_commit_timestamp ? _("on") : _("off"));
 	printf(_("Maximum data alignment:               %u\n"),
 		   ControlFile.maxAlign);
 	/* we don't print floatFormat since can't say much useful about it */
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 2ba9946..a6bd8d5 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -63,6 +63,7 @@ static bool guessed = false;	/* T if we had to guess at any values */
 static const char *progname;
 static uint32 set_xid_epoch = (uint32) -1;
 static TransactionId set_xid = 0;
+static TransactionId set_commit_ts = 0;
 static Oid	set_oid = 0;
 static MultiXactId set_mxid = 0;
 static MultiXactOffset set_mxoff = (MultiXactOffset) -1;
@@ -112,7 +113,7 @@ main(int argc, char *argv[])
 	}
 
 
-	while ((c = getopt(argc, argv, "D:fl:m:no:O:x:e:")) != -1)
+	while ((c = getopt(argc, argv, "c:D:e:fl:m:no:O:x:")) != -1)
 	{
 		switch (c)
 		{
@@ -158,6 +159,21 @@ main(int argc, char *argv[])
 				}
 				break;
 
+			case 'c':
+				set_commit_ts = strtoul(optarg, &endptr, 0);
+				if (endptr == optarg || *endptr != '\0')
+				{
+					fprintf(stderr, _("%s: invalid argument for option -c\n"), progname);
+					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+					exit(1);
+				}
+				if (set_commit_ts == 0)
+				{
+					fprintf(stderr, _("%s: transaction ID (-c) must not be 0\n"), progname);
+					exit(1);
+				}
+				break;
+
 			case 'o':
 				set_oid = strtoul(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0')
@@ -345,6 +361,9 @@ main(int argc, char *argv[])
 		ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
 	}
 
+	if (set_commit_ts != 0)
+		ControlFile.checkPointCopy.oldestCommitTs = set_commit_ts;
+
 	if (set_oid != 0)
 		ControlFile.checkPointCopy.nextOid = set_oid;
 
@@ -539,6 +558,7 @@ GuessControlValues(void)
 
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
+	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
@@ -621,6 +641,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Maximum data alignment:               %u\n"),
 		   ControlFile.maxAlign);
 	/* we don't print floatFormat since can't say much useful about it */
@@ -702,6 +724,12 @@ PrintNewControlValues()
 		printf(_("NextXID epoch:                        %u\n"),
 			   ControlFile.checkPointCopy.nextXidEpoch);
 	}
+
+	if (set_commit_ts != 0)
+	{
+		printf(_("oldestCommitTs:                       %u\n"),
+			   ControlFile.checkPointCopy.oldestCommitTs);
+	}
 }
 
 
@@ -739,6 +767,7 @@ RewriteControlFile(void)
 	 */
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
+	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
@@ -1095,6 +1124,7 @@ usage(void)
 	printf(_("%s resets the PostgreSQL transaction log.\n\n"), progname);
 	printf(_("Usage:\n  %s [OPTION]... {[-D] DATADIR}\n\n"), progname);
 	printf(_("Options:\n"));
+	printf(_("  -c XID           set the oldest transaction with retrievable commit timestamp\n"));
 	printf(_("  -e XIDEPOCH      set next transaction ID epoch\n"));
 	printf(_("  -f               force update to be done\n"));
 	printf(_("  -l XLOGFILE      force minimum WAL starting location for new transaction log\n"));
diff --git a/src/include/access/committs.h b/src/include/access/committs.h
new file mode 100644
index 0000000..04e8203
--- /dev/null
+++ b/src/include/access/committs.h
@@ -0,0 +1,75 @@
+/*
+ * committs.h
+ *
+ * PostgreSQL commit timestamp manager
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/committs.h
+ */
+#ifndef COMMITTS_H
+#define COMMITTS_H
+
+#include "access/xlog.h"
+#include "datatype/timestamp.h"
+#include "utils/guc.h"
+
+extern PGDLLIMPORT bool	track_commit_timestamp;
+extern bool check_track_commit_timestamp(bool *newval, void **extra,
+										 GucSource source);
+
+typedef uint32 NodeIdRec;
+
+#define InvalidNodeId 0
+
+extern NodeIdRec CommitTsDefaultNodeId;
+
+extern void CommitTsSetDefaultNodeId(NodeIdRec nodeid);
+extern void TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
+										   TransactionId *subxids,
+										   TimestampTz timestamp,
+										   XLogRecPtr lsn,
+										   NodeIdRec nodeid,
+										   bool do_xlog);
+extern void TransactionIdGetCommitTsData(TransactionId xid,
+										 TimestampTz *ts,
+										 XLogRecPtr *lsn,
+										 NodeIdRec *nodeid);
+extern TransactionId GetLatestCommitTsData(TimestampTz *ts,
+										   XLogRecPtr *lsn,
+										   NodeIdRec *nodeid);
+
+extern Size CommitTsShmemBuffers(void);
+extern Size CommitTsShmemSize(void);
+extern void CommitTsShmemInit(void);
+extern void BootStrapCommitTs(void);
+extern void StartupCommitTs(void);
+extern void InitCommitTs(void);
+extern void ShutdownCommitTs(void);
+extern void CheckPointCommitTs(void);
+extern void ExtendCommitTs(TransactionId newestXact);
+extern void TruncateCommitTs(TransactionId oldestXact);
+extern void SetCommitTsLimit(TransactionId oldestXact);
+
+/* XLOG stuff */
+#define COMMIT_TS_ZEROPAGE		0x00
+#define COMMIT_TS_TRUNCATE		0x10
+#define COMMIT_TS_SETTS			0x20
+
+typedef struct xl_commit_ts_set
+{
+	TimestampTz		timestamp;
+	XLogRecPtr	    lsn;
+	NodeIdRec		nodeid;
+	TransactionId	mainxid;
+	int				nsubxids;
+	TransactionId	subxids[FLEXIBLE_ARRAY_MEMBER];
+} xl_commit_ts_set;
+
+
+extern void commit_ts_redo(XLogRecPtr lsn, XLogRecord *record);
+extern void commit_ts_desc(StringInfo buf, XLogRecord *record);
+extern const char *commit_ts_identify(uint8 info);
+
+#endif   /* COMMITTS_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 76a6421..27168c3 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -24,7 +24,7 @@
  * Changes to this list possibly need a XLOG_PAGE_MAGIC bump.
  */
 
-/* symbol name, textual name, redo, desc, startup, cleanup */
+/* symbol name, textual name, redo, desc, identify, startup, cleanup */
 PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
 PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
 PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
@@ -43,3 +43,4 @@ PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_start
 PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 32d1b29..b59fd98 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -124,6 +124,11 @@ typedef struct VariableCacheData
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
 
 	/*
+	 * These fields are protected by CommitTsControlLock
+	 */
+	TransactionId oldestCommitTs;
+
+	/*
 	 * These fields are protected by ProcArrayLock.
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 19b2ef8..56203b9 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -186,6 +186,7 @@ typedef struct xl_parameter_change
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
+	bool		track_commit_timestamp;
 } xl_parameter_change;
 
 /* logs restore point */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ba79d25..70afbd1 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -46,6 +46,7 @@ typedef struct CheckPoint
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
+	TransactionId oldestCommitTs; /* oldest Xid with valid commit timestamp */
 
 	/*
 	 * Oldest XID still running. This is only needed to initialize hot standby
@@ -176,6 +177,7 @@ typedef struct ControlFileData
 	int			max_worker_processes;
 	int			max_prepared_xacts;
 	int			max_locks_per_xact;
+	bool		track_commit_timestamp;
 
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 5d4e889..47d9f01 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3017,6 +3017,15 @@ DESCR("view two-phase transactions");
 DATA(insert OID = 3819 (  pg_get_multixact_members PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 2249 "28" "{28,28,25}" "{i,o,o}" "{multixid,xid,mode}" _null_ pg_get_multixact_members _null_ _null_ _null_ ));
 DESCR("view members of a multixactid");
 
+DATA(insert OID = 3581 ( pg_xact_commit_timestamp PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 1184 "28" _null_ _null_ _null_ _null_ pg_xact_commit_timestamp _null_ _null_ _null_ ));
+DESCR("get commit timestamp of a transaction");
+
+DATA(insert OID = 3582 ( pg_xact_commit_timestamp_data PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 2249 "28" "{28,1184,3220}" "{i,o,o}" "{xid,timestamp,lsn}" _null_ pg_xact_commit_timestamp_data _null_ _null_ _null_ ));
+DESCR("get commit timestamp and lsn of a transaction");
+
+DATA(insert OID = 3583 ( pg_last_committed_xact PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 2249 "" "{28,1184,3220}" "{o,o,o}" "{xid,timestamp,lsn}" _null_ pg_last_committed_xact _null_ _null_ _null_ ));
+DESCR("get transaction Id, commit timestamp and lsn of latest transaction commit");
+
 DATA(insert OID = 3537 (  pg_describe_object		PGNSP PGUID 12 1 0 0 0 f f f f t f s 3 0 25 "26 26 23" _null_ _null_ _null_ _null_ pg_describe_object _null_ _null_ _null_ ));
 DESCR("get identification of SQL object");
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 91cab87..09654a8 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -127,7 +127,10 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define AutoFileLock				(&MainLWLockArray[35].lock)
 #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
 #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
-#define NUM_INDIVIDUAL_LWLOCKS		38
+#define CommitTsControlLock			(&MainLWLockArray[38].lock)
+#define CommitTsLock				(&MainLWLockArray[39].lock)
+
+#define NUM_INDIVIDUAL_LWLOCKS		40
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 3ba34f8..2618e7e 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1182,6 +1182,11 @@ extern Datum pg_prepared_xact(PG_FUNCTION_ARGS);
 /* access/transam/multixact.c */
 extern Datum pg_get_multixact_members(PG_FUNCTION_ARGS);
 
+/* access/transam/committs.c */
+extern Datum pg_xact_commit_timestamp(PG_FUNCTION_ARGS);
+extern Datum pg_xact_commit_timestamp_data(PG_FUNCTION_ARGS);
+extern Datum pg_last_committed_xact(PG_FUNCTION_ARGS);
+
 /* catalogs/dependency.c */
 extern Datum pg_describe_object(PG_FUNCTION_ARGS);
 extern Datum pg_identify_object(PG_FUNCTION_ARGS);
diff --git a/src/test/regress/expected/committs.out b/src/test/regress/expected/committs.out
new file mode 100644
index 0000000..77d7a61
--- /dev/null
+++ b/src/test/regress/expected/committs.out
@@ -0,0 +1,21 @@
+--
+-- Commit Timestamp (off)
+--
+SHOW track_commit_timestamp;
+ track_commit_timestamp 
+------------------------
+ off
+(1 row)
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ERROR:  Cannot get commit timestamp data because "track_commit_timestamp" is not enabled
+DROP TABLE committs_test;
diff --git a/src/test/regress/expected/committs_1.out b/src/test/regress/expected/committs_1.out
new file mode 100644
index 0000000..1457a27
--- /dev/null
+++ b/src/test/regress/expected/committs_1.out
@@ -0,0 +1,21 @@
+--
+-- Commit Timestamp (on)
+--
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | ?column? | ?column? | ?column? 
+----+----------+----------+----------
+  1 | t        | t        | t
+  2 | t        | t        | t
+  3 | t        | t        | t
+(3 rows)
+
+DROP TABLE committs_test;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d4f02e5..ec0a7c9 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -88,7 +88,7 @@ test: brin privileges security_label collate matview lock replica_identity rowse
 # ----------
 # Another group of parallel tests
 # ----------
-test: alter_generic misc psql async
+test: alter_generic misc psql async committs
 
 # rules cannot run concurrently with any test that creates a view
 test: rules
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 611b0a8..b0c4f39 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -148,3 +148,4 @@ test: largeobject
 test: with
 test: xml
 test: stats
+test: committs
diff --git a/src/test/regress/sql/committs.sql b/src/test/regress/sql/committs.sql
new file mode 100644
index 0000000..321a30a
--- /dev/null
+++ b/src/test/regress/sql/committs.sql
@@ -0,0 +1,20 @@
+--
+-- Commit Timestamp (off)
+--
+
+SHOW track_commit_timestamp;
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+
+DROP TABLE committs_test;
#98Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Petr Jelinek (#97)
Re: tracking commit timestamps

On 11/12/14, 7:06 AM, Petr Jelinek wrote:

- if the xid passed to get interface is out of range -infinity timestamp is returned (I think it's bad idea to throw errors here as the valid range is not static and same ID can start throwing errors between calls theoretically)

Wouldn't NULL be more appropriate?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#99Michael Paquier
michael.paquier@gmail.com
In reply to: Jim Nasby (#98)
Re: tracking commit timestamps

On Thu, Nov 13, 2014 at 7:56 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

On 11/12/14, 7:06 AM, Petr Jelinek wrote:

- if the xid passed to get interface is out of range -infinity timestamp
is returned (I think it's bad idea to throw errors here as the valid range
is not static and same ID can start throwing errors between calls
theoretically)

Wouldn't NULL be more appropriate?

Definitely. Defining a given value for information not valid is awkward.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#100Michael Paquier
michael.paquier@gmail.com
In reply to: Petr Jelinek (#97)
Re: tracking commit timestamps

On Wed, Nov 12, 2014 at 10:06 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Brief list of changes:
- the commit timestamp record now stores timestamp, lsn and nodeid

Now that not only the commit timestamp is stored, calling that "commit
timestamp", "committs" or "commit_timestamp" is strange, no? If this
patch is moving toward being a more complex information provider,
calling it commit information or commit data is more adapted, no?
Documentation would need a fresh brush as well in this case.

- added code to disallow turning track_commit_timestamp on with too small
pagesize
- the get interfaces error out when track_commit_timestamp is off

OK, that's sane.

- if the xid passed to get interface is out of range -infinity timestamp is
returned (I think it's bad idea to throw errors here as the valid range is
not static and same ID can start throwing errors between calls
theoretically)

Already mentioned by Jim in a previous mail: this would be better as NULL.

- renamed the sql interfaces to pg_xact_commit_timestamp,
pg_xact_commit_timestamp_data and pg_last_committed_xact, they don't expose
the nodeid atm, I personally am not big fan of the "xact" but it seems more
consistent with existing naming

pg_xact_commit_timestamp and pg_xact_commit_timestamp_data are
overlapping. What's wrong with a single function able to return the
whole set (node ID, commit timetamp, commit LSN)? Let's say
pg_xact_commit_information or pg_xact_commit_data. Already mentioned,
but I also find using a SRF able to return all the available
information from a given XID value quite useful. And this does not
conflict with what is proposed currently, you would need just to call
the function with XID + number of entries wanted to get a single one.
Comments from other folks about that?

- documented pg_resetxlog changes and make all the pg_resetxlog options
alphabetically ordered
- added WAL logging of the track_commit_timestamp GUC
- added alternative expected output of the regression test so that it works
with make installcheck when track_commit_timestamp is on
- added C interface to set default nodeid for current backend
- several minor comment and naming adjustments mostly suggested by Michael

Thanks for those adjustments.

Then more input about the latest patch:
1) This block is not needed, option -e is listed twice:
    The <option>-o</>, <option>-x</>, <option>-e</>,
-   <option>-m</>, <option>-O</>,
-   and <option>-l</>
+   <option>-m</>, <option>-O</>, <option>-l</>
+   and <option>-e</>
2) Very small thing: a couple of files have no newlines at the end,
among them committs.conf and test_committs/Makefile.
3) pg_last_committed_xact and not pg_last_xact_commit_information or similar?
4) storage.sgml needs to be updated with the new folder pg_committs
5) Er.. node ID is missing in pg_last_committed_xact, no?
6) This XXX notice can be removed:
+       /*
+        * Return empty if the requested value is older than what we have or
+        * newer than newest we have.
+        *
+        * XXX: should this be error instead?
+        */
We are moving toward returning invalid information in the SQL
interface when the information is not in history instead of an error,
no? (Note that I am still a partisan of an error message to let the
caller know that commit info history does not have the information
requested).
7) Note that TransactionTreeSetCommitTsData still never sets do_xlog
at true and that WriteSetTimestampXlogRec never gets called. So no
information is WAL-logged with this patch. Wouldn't that be useful for
standbys as well? Perhaps I am missing why this is disabled? This code
should be activated IMO or it would be just untested.
8) As a more general point, the node ID stuff makes me uncomfortable
and is just added on top of the existing patch without much
thinking... So I am really skeptical about it. The need here is to
pass on demand a int8 that is a node ID that can only be set through a
C interface, so only extensions could play with it. The data passed to
a WAL record is always built and determined by the system and entirely
transparent to the user, inserting user-defined data like that
inconsistent with what we've been doing until now, no?

Also, a question particularly for BDR and Slony folks: do you
sometimes add a new node using the base backup of an existing node :)
See what I come up with: a duplication of this new node ID system with
the already present system ID, no?
Similarly, the LSN is added to the WAL record containing the commit
timestamp, but cannot the LSN of the WAL record containing the commit
timestamp itself be used as a point of reference for a better
ordering? That's not exactly the same as the LSN of the transaction
commit, still it provides a WAL-based reference.
Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#101Petr Jelinek
petr@2ndquadrant.com
In reply to: Michael Paquier (#100)
Re: tracking commit timestamps

On 13/11/14 07:04, Michael Paquier wrote:

On Wed, Nov 12, 2014 at 10:06 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Brief list of changes:
- the commit timestamp record now stores timestamp, lsn and nodeid

Now that not only the commit timestamp is stored, calling that "commit
timestamp", "committs" or "commit_timestamp" is strange, no? If this
patch is moving toward being a more complex information provider,
calling it commit information or commit data is more adapted, no?

It's not, since adding more info will break upgrades, I doubt we will
add more anytime soon. I was thinking about it too tbh, but don't have
better name (I don't like commit data as it seems confusing - isn't
commit data the dml you just committed?).
Maybe commit_metadata, commit_information is probably ok also, this
would need input from others, I am personally fine with keeping the
commit_timestamp name too.

- if the xid passed to get interface is out of range -infinity timestamp is
returned (I think it's bad idea to throw errors here as the valid range is
not static and same ID can start throwing errors between calls
theoretically)

Already mentioned by Jim in a previous mail: this would be better as NULL.

Yeah that make sense, I have no idea what I was thinking :)

- renamed the sql interfaces to pg_xact_commit_timestamp,
pg_xact_commit_timestamp_data and pg_last_committed_xact, they don't expose
the nodeid atm, I personally am not big fan of the "xact" but it seems more
consistent with existing naming

pg_xact_commit_timestamp and pg_xact_commit_timestamp_data are
overlapping. What's wrong with a single function able to return the
whole set (node ID, commit timetamp, commit LSN)? Let's say

That's what pg_xact_commit_timestamp_data does (it does not show nodeid
because we agreed that it should not be exposed yet on sql level). Might
make sense to rename, but let's wait for input about the general
renaming point at the beginning of the mail.

pg_xact_commit_information or pg_xact_commit_data. Already mentioned,
but I also find using a SRF able to return all the available
information from a given XID value quite useful. And this does not
conflict with what is proposed currently, you would need just to call
the function with XID + number of entries wanted to get a single one.
Comments from other folks about that?

No idea what you mean by this to be honest, there is exactly one record
stored for single XID.

Then more input about the latest patch:
1) This block is not needed, option -e is listed twice:
The <option>-o</>, <option>-x</>, <option>-e</>,
-   <option>-m</>, <option>-O</>,
-   and <option>-l</>
+   <option>-m</>, <option>-O</>, <option>-l</>
+   and <option>-e</>
2) Very small thing: a couple of files have no newlines at the end,
among them committs.conf and test_committs/Makefile.
3) pg_last_committed_xact and not pg_last_xact_commit_information or similar?

Just inspiration from DB2's rollforward (which shows among other things
"last committed transaction: <timestamp>"), but I don't feel strongly
about naming so can be changed.

4) storage.sgml needs to be updated with the new folder pg_committs

Right.

5) Er.. node ID is missing in pg_last_committed_xact, no?

That's intentional (for now).

6) This XXX notice can be removed:
+       /*
+        * Return empty if the requested value is older than what we have or
+        * newer than newest we have.
+        *
+        * XXX: should this be error instead?
+        */

Ok.

(Note that I am still a partisan of an error message to let the
caller know that commit info history does not have the information
requested).

IMHO throwing error there would be same as throwing error when WHERE
clause in SELECT does not match anything. As the xid range for which we
store data is dynamic we need to accept any xid as valid input because
the caller has no way of validating if the xid passed is out of range or
not.

7) Note that TransactionTreeSetCommitTsData still never sets do_xlog
at true and that WriteSetTimestampXlogRec never gets called. So no
information is WAL-logged with this patch. Wouldn't that be useful for
standbys as well? Perhaps I am missing why this is disabled? This code
should be activated IMO or it would be just untested.

True is only needed here when you are setting this info to different
transaction than the one you are in since the info can be reconstructed
from normal transaction WAL record (see that it's actually called from
xact_redo_commit_internal, which is how we get the WAL safety and why it
works on slave). So the true is for use by extensions only, it's not
completely uncommon that we have APIs that are used only by extensions.

8) As a more general point, the node ID stuff makes me uncomfortable
and is just added on top of the existing patch without much
thinking... So I am really skeptical about it. The need here is to
pass on demand a int8 that is a node ID that can only be set through a
C interface, so only extensions could play with it. The data passed to
a WAL record is always built and determined by the system and entirely
transparent to the user, inserting user-defined data like that
inconsistent with what we've been doing until now, no?

Again it's not exposed to SQL because I thought there was agreement to
not do that yet since we might want to build some more core stuff on top
of that before exposing it. It's part of the record now because it can
be useful already the way it is and because adding it later would break
pg_upgrade (it's int4 btw).

Also I would really not say it was added without thought, I am one of
the BDR developers and I was before one of the Londiste developers so I
did think about what I would want when in those shoes.

That being said I think I will remove the CommitTsSetDefaultNodeId
interface in next revision, as extension can already set nodeid via
TransactionTreeSetCommitTsData call and we might want to revisit the
CommitTsSetDefaultNodeId stuff once we start implementing the
replication identifiers. Not to mention that I realized in meantime that
CommitTsSetDefaultNodeId the way it's done currently isn't crash safe
(it's not hard to make it crash safe). And since it's quite simple we
can add it at later date easily if needed.

Also, a question particularly for BDR and Slony folks: do you
sometimes add a new node using the base backup of an existing node :)
See what I come up with: a duplication of this new node ID system with
the already present system ID, no?

Yes we do use basebackup sometimes and no it's not possible to use
systemid here:
- the point of nodeid is to be able to store *remote* nodeid as well
as local one (depending where the change actually originated from) so
your local systemid is quite useless there
- systemid is per Postgres instance, you need per-db identifier when
doing logical rep (2 dbs can have single db as destination or the other
way around)

Similarly, the LSN is added to the WAL record containing the commit
timestamp, but cannot the LSN of the WAL record containing the commit
timestamp itself be used as a point of reference for a better
ordering? That's not exactly the same as the LSN of the transaction
commit, still it provides a WAL-based reference.

No, again for several reasons:
- as you pointed out yourself the LSN might not be same as LSN for the xid
- more importantly we normally don't do special WAL logging for commit
timestamp

How it works is that because currently the
TransactionTreeSetCommitTsData is always called with xid of the current
transaction, the WAL record for commit of current transaction can be
used to get the info we need (both timestamp and lsn are used in fact).
As I said above, see how TransactionTreeSetCommitTsData is called from
xact_redo_commit_internal.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#102Simon Riggs
simon@2ndQuadrant.com
In reply to: Steve Singer (#73)
Re: tracking commit timestamps

On 9 November 2014 16:57, Steve Singer <steve@ssinger.info> wrote:

On 11/07/2014 07:07 PM, Petr Jelinek wrote:

The list of what is useful might be long, but we can't have everything
there as there are space constraints, and LSN is another 8 bytes and I still
want to have some bytes for storing the "origin" or whatever you want to
call it there, as that's the one I personally have biggest use-case for.
So this would be ~24bytes per txid already, hmm I wonder if we can pull
some tricks to lower that a bit.

The reason why Jim and myself are asking for the LSN and not just the
timestamp is that I want to be able to order the transactions. Jim pointed
out earlier in the thread that just ordering on timestamp allows for
multiple transactions with the same timestamp.

I think we need to be clear about the role and function of components here.

Xid timestamps allow a replication system to do post-commit conflict
resolution based upon timestamp, i.e. last update wins. That is
potentially usable by BDR, Slony, xdb and anything else that wants
that.

Ordering transactions in LSN order is very precisly the remit of the
existing logical decoding API. Any user that wishes to see a commits
in sequence can do so using that API. BDR already does this, as do
other users of the decoding API. So Slony already has access to a
useful ordering if it wishes it. We do not need to anything *on this
patch* to enable LSNs for Slony or anyone else. I don't see any reason
to provide the same facility twice, in two different ways.

So in summary... the components are
* Commit LSN order is useful for applying changes - available by
logical decoding
* Commit timestamps and nodeid are useful for conflict resolution -
available from this patch

Both components have been designed in ways that allow multiple
replication systems to use these facilities.

So, -1 to including commit LSN in the SLRU alongside commit timestamp
and nodeid.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#103Petr Jelinek
petr@2ndquadrant.com
In reply to: Simon Riggs (#102)
Re: tracking commit timestamps

On 13/11/14 14:18, Simon Riggs wrote:

So in summary... the components are
* Commit LSN order is useful for applying changes - available by
logical decoding
* Commit timestamps and nodeid are useful for conflict resolution -
available from this patch

Both components have been designed in ways that allow multiple
replication systems to use these facilities.

So, -1 to including commit LSN in the SLRU alongside commit timestamp
and nodeid.

I am of the same opinion, I added the LSN "by popular demand", but I
still personally don't see the value in having it there as it does *not*
enable us to do something that would be impossible otherwise.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#104Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#102)
Re: tracking commit timestamps

On Thu, Nov 13, 2014 at 8:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Ordering transactions in LSN order is very precisly the remit of the
existing logical decoding API. Any user that wishes to see a commits
in sequence can do so using that API. BDR already does this, as do
other users of the decoding API. So Slony already has access to a
useful ordering if it wishes it. We do not need to anything *on this
patch* to enable LSNs for Slony or anyone else. I don't see any reason
to provide the same facility twice, in two different ways.

Perhaps you could respond more specifically to concerns expressed
upthread, like:

/messages/by-id/BLU436-SMTP28B68B9312CBE5D9125441DC870@phx.gbl

I don't see that as a dumb argument on the face of it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#105Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#104)
Re: tracking commit timestamps

On 13 November 2014 21:24, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 13, 2014 at 8:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Ordering transactions in LSN order is very precisly the remit of the
existing logical decoding API. Any user that wishes to see a commits
in sequence can do so using that API. BDR already does this, as do
other users of the decoding API. So Slony already has access to a
useful ordering if it wishes it. We do not need to anything *on this
patch* to enable LSNs for Slony or anyone else. I don't see any reason
to provide the same facility twice, in two different ways.

Perhaps you could respond more specifically to concerns expressed
upthread, like:

/messages/by-id/BLU436-SMTP28B68B9312CBE5D9125441DC870@phx.gbl

I don't see that as a dumb argument on the face of it.

We have a clear "must have" argument for timestamps to support
replication conflicts.

Adding LSNs, when we already have a way to access them, is much more
of a nice to have. I don't really see it as a particularly nice to
have either, since the SLRU is in no way ordered.

Scope creep is a dangerous thing, so we shouldn't, and elsewhere
don't, collect up ideas like a bag of mixed sweets. It's easy to
overload things to the point where they don't fly at all. The whole
point of this is that we're building something faster than trigger
based systems.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#106Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#105)
Re: tracking commit timestamps

On Thu, Nov 13, 2014 at 6:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 13 November 2014 21:24, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 13, 2014 at 8:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Ordering transactions in LSN order is very precisly the remit of the
existing logical decoding API. Any user that wishes to see a commits
in sequence can do so using that API. BDR already does this, as do
other users of the decoding API. So Slony already has access to a
useful ordering if it wishes it. We do not need to anything *on this
patch* to enable LSNs for Slony or anyone else. I don't see any reason
to provide the same facility twice, in two different ways.

Perhaps you could respond more specifically to concerns expressed
upthread, like:

/messages/by-id/BLU436-SMTP28B68B9312CBE5D9125441DC870@phx.gbl

I don't see that as a dumb argument on the face of it.

We have a clear "must have" argument for timestamps to support
replication conflicts.

Adding LSNs, when we already have a way to access them, is much more
of a nice to have. I don't really see it as a particularly nice to
have either, since the SLRU is in no way ordered.

Scope creep is a dangerous thing, so we shouldn't, and elsewhere
don't, collect up ideas like a bag of mixed sweets. It's easy to
overload things to the point where they don't fly at all. The whole
point of this is that we're building something faster than trigger
based systems.

I think that's slamming the door closed and nailing it shut behind
you. If we add the feature without LSNs, how will someone go back and
add that later? It would change the on-disk format of the SLRU, so to
avoid breaking pg_upgrade, someone would have to write a conversion
utility. Even at that, it would slow pg_upgrade down when the feature
has been used.

By way of contrast, adding the feature now is quite easy. It just
requires storing a few more bytes per transaction.

I am all in favor of incremental development when possible, but not
when it so greatly magnifies the work that needs to be done. People
have been asking for the ability to determine the commit ordering for
years, and this is a way to do that, very inexpensively, as part of a
patch that is needed anyway. We are not talking about loading 20 new
requirements on top of this patch; that would be intolerable. We're
talking about adding one additional piece of information that has been
requested multiple times over the years.

The way I see it, there are at least three uses for this information.
One, trigger-based replication solutions. While logical decoding will
doubtless be preferable, I don't think trigger-based replication
solutions will go away completely, and certainly not right away.
They've wanted this since forever. Two, for user applications that
want to know the commit order for their own purposes, as in Steve's
example. Three, for O(1) snapshots. Heikki's patch to make that
happen seems to have stalled a bit, but if it's ever to go anywhere it
will need something like this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#107Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#106)
Re: tracking commit timestamps

On 14 November 2014 17:12, Robert Haas <robertmhaas@gmail.com> wrote:

We are not talking about loading 20 new
requirements on top of this patch; that would be intolerable. We're
talking about adding one additional piece of information that has been
requested multiple times over the years.

The requested information is already available, as discussed. Logical
decoding adds commit ordering for *exactly* the purpose of using it
for replication, available to all solutions. This often requested
feature has now been added and doesn't need to be added twice.

So what we are discussing is adding a completely superfluous piece of
information.

Not including the LSN info does nothing to trigger based replication,
which will no doubt live on happily for many years. But adding LSN
will slow down logical replication, for no purpose at all.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#108Steve Singer
steve@ssinger.info
In reply to: Simon Riggs (#107)
Re: tracking commit timestamps

On 11/14/2014 08:21 PM, Simon Riggs wrote:

The requested information is already available, as discussed. Logical
decoding adds commit ordering for *exactly* the purpose of using it
for replication, available to all solutions. This often requested
feature has now been added and doesn't need to be added twice.

So what we are discussing is adding a completely superfluous piece of
information.

Not including the LSN info does nothing to trigger based replication,
which will no doubt live on happily for many years. But adding LSN
will slow down logical replication, for no purpose at all.

Simon,
The use cases I'm talking about aren't really replication related. Often
I have come across systems that want to do something such as 'select *
from orders where X > the_last_row_I_saw order by X' and then do further
processing on the order.

This is kind of awkard to do today because you don't have a good
candidate for 'X' to order on. Using either a sequence or insert-row
timestamp doesn't work well because a transaction with a lower value for
X might end up committing after the higher value in in a query result.

Yes you could setup a logical wal slot and listen on the stream of
inserts into your order table but thats a lot of administration overhead
compared to just issuing an SQL query for what really is a query type
operation.

Using the commit timestamp for my X sounded very tempting but could
allow duplicates.

One could argue that this patch is about replication features, and
providing commit ordering for query purposes should be a separate patch
to add that on top of this infrastructure. I see merit to smaller more
focused patches but that requires leaving the door open to easily
extending things later.

It could also be that I'm the only one who wants to order and filter
queries in this manner (but that would surprise me). If the commit lsn
has limited appeal and we decide we don't want it at all then we
shouldn't add it. I've seen this type of requirement in a number of
different systems at a number of different companies. I've generally
seen it dealt with by either selecting rows behind the last now()
timestamp seen and then filtering out already processed rows or by
tracking the 'processed' state of each row individually (ie performing
an update on each row once its been processed) which performs poorly.

Steve

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#109Simon Riggs
simon@2ndQuadrant.com
In reply to: Steve Singer (#108)
Re: tracking commit timestamps

On 15 November 2014 04:32, Steve Singer <steve@ssinger.info> wrote:

The use cases I'm talking about aren't really replication related. Often I
have come across systems that want to do something such as 'select * from
orders where X > the_last_row_I_saw order by X' and then do further
processing on the order.

Yes, existing facilities provide mechanisms for different types of
application change queues.

If you want to write a processing queue in SQL, that isn't the best
way. You'll need some way to keep track of whether or not its been
successfully processed. That's either a column in the table, or a
column in a queue table maintained by triggers, with the row write
locked on read. You can then have multiple readers from this queue
using the new SKIP LOCKED feature, which was specifically designed to
facilitate that.

Logical decoding was intended for much more than just replication. It
provides commit order access to changed data in a form that is both
usable and efficient for high volume applicatiion needs.

I don't see any reason to add LSN into a SLRU updated at commit to
support those application needs.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#110Petr Jelinek
petr@2ndquadrant.com
In reply to: Simon Riggs (#109)
Re: tracking commit timestamps

On 15/11/14 13:36, Simon Riggs wrote:

On 15 November 2014 04:32, Steve Singer <steve@ssinger.info> wrote:

The use cases I'm talking about aren't really replication related. Often I
have come across systems that want to do something such as 'select * from
orders where X > the_last_row_I_saw order by X' and then do further
processing on the order.

Yes, existing facilities provide mechanisms for different types of
application change queues.

If you want to write a processing queue in SQL, that isn't the best
way. You'll need some way to keep track of whether or not its been
successfully processed. That's either a column in the table, or a
column in a queue table maintained by triggers, with the row write
locked on read. You can then have multiple readers from this queue
using the new SKIP LOCKED feature, which was specifically designed to
facilitate that.

Logical decoding was intended for much more than just replication. It
provides commit order access to changed data in a form that is both
usable and efficient for high volume applicatiion needs.

I don't see any reason to add LSN into a SLRU updated at commit to
support those application needs.

I am still on the fence about the LSN issue, I don't mind it from code
perspective, it's already written anyway, but I am not sure if we really
want it in the SLRU as Simon says.

Mainly because of three things:
One, this patch is not really feature patch, as you can do most of what
it does via tables already, but more a performance improvement and we
should try to make it perform as good as possible then, adding more
things does not really improve performance (according to my benchmarks
the performance difference with/without LSN is under 1% so it's not
terrible, but it's there), not to mention additional disk space.

Two, the LSN use-cases seem to still be only theoretical to me, while
the timestamp use-case has been production problem for at least a decade.

Three, even if we add LSN, I am still not convinced that the use-cases
presented here wouldn't be better served by putting that info into
actual table instead of SLRU - as people want to use it as filter in
WHERE clause, somebody mentioned exporting to different db, etc.

Maybe we need better explanation of the LSN use-case(s) to understand
why it should be stored here and why the other solutions are
significantly worse.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#111Simon Riggs
simon@2ndQuadrant.com
In reply to: Petr Jelinek (#110)
Re: tracking commit timestamps

On 19 November 2014 02:12, Petr Jelinek <petr@2ndquadrant.com> wrote:

Maybe we need better explanation of the LSN use-case(s) to understand why it
should be stored here and why the other solutions are significantly worse.

We should apply the same standard that has been applied elsewhere. If
someone can show some software that could actually make use of LSN and
there isn't a better way, then we can include it.

PostgreSQL isn't a place where we speculate about possible future needs.

I don't see why it should take 2+ years of prototypes, designs and
discussions to get something in from BDR, but then we simply wave a
hand and include something else at last minute without careful
thought. Even if that means that later additions might need to think
harder about upgrades.

Timestamp and nodeid are useful for a variety of cases; LSN doesn't
meet the same standard and should not be included now.

We still have many months before even beta for people that want LSN to
make a *separate* case for its inclusion as a separate feature.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#112Petr Jelinek
petr@2ndquadrant.com
In reply to: Simon Riggs (#111)
Re: tracking commit timestamps

On 19/11/14 12:20, Simon Riggs wrote:

On 19 November 2014 02:12, Petr Jelinek <petr@2ndquadrant.com> wrote:

Maybe we need better explanation of the LSN use-case(s) to understand why it
should be stored here and why the other solutions are significantly worse.

We should apply the same standard that has been applied elsewhere. If
someone can show some software that could actually make use of LSN and
there isn't a better way, then we can include it.

...

We still have many months before even beta for people that want LSN to
make a *separate* case for its inclusion as a separate feature.

This is good point, we are not too late in the cycle that LSN couldn't
be added later if we find that it is indeed needed (and we don't have to
care about pg_upgrade until beta).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#113Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Petr Jelinek (#112)
Re: tracking commit timestamps

Petr Jelinek wrote:

This is good point, we are not too late in the cycle that LSN couldn't be
added later if we find that it is indeed needed (and we don't have to care
about pg_upgrade until beta).

I think we're overblowing the pg_upgrade issue. Surely we don't need to
preserve commit_ts data when upgrading across major versions; and
pg_upgrade is perfectly prepared to remove old data when upgrading
(actually it just doesn't copy it; consider pg_subtrans or pg_serial,
for instance.) If we need to change binary representation in a future
major release, we can do so without any trouble.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#114Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#113)
Re: tracking commit timestamps

On Wed, Nov 19, 2014 at 8:22 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Petr Jelinek wrote:

This is good point, we are not too late in the cycle that LSN couldn't be
added later if we find that it is indeed needed (and we don't have to care
about pg_upgrade until beta).

I think we're overblowing the pg_upgrade issue. Surely we don't need to
preserve commit_ts data when upgrading across major versions; and
pg_upgrade is perfectly prepared to remove old data when upgrading
(actually it just doesn't copy it; consider pg_subtrans or pg_serial,
for instance.) If we need to change binary representation in a future
major release, we can do so without any trouble.

Actually, that's a good point. I still don't understand what the
resistance is to add something quite inexpensive that multiple people
obviously want, but at least if we don't, we still have the option to
change it later.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#115Steve Singer
steve@ssinger.info
In reply to: Alvaro Herrera (#113)
Re: tracking commit timestamps

On 11/19/2014 08:22 AM, Alvaro Herrera wrote:

I think we're overblowing the pg_upgrade issue. Surely we don't need to
preserve commit_ts data when upgrading across major versions; and
pg_upgrade is perfectly prepared to remove old data when upgrading
(actually it just doesn't copy it; consider pg_subtrans or pg_serial,
for instance.) If we need to change binary representation in a future
major release, we can do so without any trouble.

That sounds reasonable. I am okay with Petr removing the LSN portion
this patch.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#116Petr Jelinek
petr@2ndquadrant.com
In reply to: Steve Singer (#115)
1 attachment(s)
Re: tracking commit timestamps

On 19/11/14 17:30, Steve Singer wrote:

On 11/19/2014 08:22 AM, Alvaro Herrera wrote:

I think we're overblowing the pg_upgrade issue. Surely we don't need to
preserve commit_ts data when upgrading across major versions; and
pg_upgrade is perfectly prepared to remove old data when upgrading
(actually it just doesn't copy it; consider pg_subtrans or pg_serial,
for instance.) If we need to change binary representation in a future
major release, we can do so without any trouble.

That sounds reasonable. I am okay with Petr removing the LSN portion
this patch.

I did that then, v9 attached with following changes:
- removed lsn info (obviously)

- the pg_xact_commit_timestamp and pg_last_committed_xact return NULLs
when timestamp data was not found

- made the default nodeid crash safe - this also makes use of the
do_xlog parameter in TransactionTreeSetCommitTsData if nodeid is set,
although that still does not happen without extension actually using the API

- added some more regression tests

- some small comment and docs adjustments based on Michael's last email

I didn't change the pg_last_committed_xact function name and I didn't
make nodeid visible from SQL level interfaces and don't plan to in this
patch as I think it's very premature to do so before we have some C code
using it.

Just to explain once more and hopefully more clearly how the crash
safety/WAL logging is handled since there was some confusion in last review:
We only do WAL logging when nodeid is also logged (when nodeid is not 0)
because the timestamp itself can be read from WAL record of transaction
commit so it's pointless to log another WAL record just to store the
timestamp again.
Extension can either set default nodeid which is then logged
transparently, or can override the default logging mechanism by calling
TransactionTreeSetCommitTsData with whatever data it wants and do_xlog
set to true which will then write the WAL record with this overriding
information.
During WAL replay the commit timestamp is set from transaction commit
record and then if committs WAL record is found it's used to overwrite
the commit timestamp/nodeid for given xid.

Also, there is exactly one record in SLRU for each xid so there is no
point in making the SQL interfaces return multiple results.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

committs-v9.patchtext/x-diff; name=committs-v9.patchDownload
diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..e331297 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -50,6 +50,7 @@ SUBDIRS = \
 		spi		\
 		tablefunc	\
 		tcn		\
+		test_committs	\
 		test_decoding	\
 		test_parser	\
 		test_shm_mq	\
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
index 3b8241b..f0a023f 100644
--- a/contrib/pg_upgrade/pg_upgrade.c
+++ b/contrib/pg_upgrade/pg_upgrade.c
@@ -423,8 +423,10 @@ copy_clog_xlog_xid(void)
 	/* set the next transaction id and epoch of the new cluster */
 	prep_status("Setting next transaction ID and epoch for new cluster");
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
-			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
-			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
+			  "\"%s/pg_resetxlog\" -f -x %u -c %u \"%s\"",
+			  new_cluster.bindir,
+			  old_cluster.controldata.chkpnt_nxtxid,
+			  old_cluster.controldata.chkpnt_nxtxid,
 			  new_cluster.pgdata);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
 			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index 9397198..e0af3cf 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -10,6 +10,7 @@
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/contrib/test_committs/.gitignore b/contrib/test_committs/.gitignore
new file mode 100644
index 0000000..1f95503
--- /dev/null
+++ b/contrib/test_committs/.gitignore
@@ -0,0 +1,5 @@
+# Generated subdirectories
+/log/
+/isolation_output/
+/regression_output/
+/tmp_check/
diff --git a/contrib/test_committs/Makefile b/contrib/test_committs/Makefile
new file mode 100644
index 0000000..2240749
--- /dev/null
+++ b/contrib/test_committs/Makefile
@@ -0,0 +1,45 @@
+# Note: because we don't tell the Makefile there are any regression tests,
+# we have to clean those result files explicitly
+EXTRA_CLEAN = $(pg_regress_clean_files) ./regression_output ./isolation_output
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_committs
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# We can't support installcheck because normally installcheck users don't have
+# the required track_commit_timestamp on
+installcheck:;
+
+check: regresscheck
+
+submake-regress:
+	$(MAKE) -C $(top_builddir)/src/test/regress all
+
+submake-test_committs:
+	$(MAKE) -C $(top_builddir)/contrib/test_committs
+
+REGRESSCHECKS=committs_on
+
+regresscheck: all | submake-regress submake-test_committs
+	$(MKDIR_P) regression_output
+	$(pg_regress_check) \
+	    --temp-config $(top_srcdir)/contrib/test_committs/committs.conf \
+	    --temp-install=./tmp_check \
+	    --extra-install=contrib/test_committs \
+	    --outputdir=./regression_output \
+	    $(REGRESSCHECKS)
+
+regresscheck-install-force: | submake-regress submake-test_committs
+	$(pg_regress_installcheck) \
+	    --extra-install=contrib/test_committs \
+	    $(REGRESSCHECKS)
+
+PHONY: submake-test_committs submake-regress check \
+	regresscheck regresscheck-install-force
\ No newline at end of file
diff --git a/contrib/test_committs/committs.conf b/contrib/test_committs/committs.conf
new file mode 100644
index 0000000..d221a60
--- /dev/null
+++ b/contrib/test_committs/committs.conf
@@ -0,0 +1 @@
+track_commit_timestamp = on
\ No newline at end of file
diff --git a/contrib/test_committs/expected/committs_on.out b/contrib/test_committs/expected/committs_on.out
new file mode 100644
index 0000000..69465f3
--- /dev/null
+++ b/contrib/test_committs/expected/committs_on.out
@@ -0,0 +1,33 @@
+--
+-- Commit Timestamp
+--
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | ?column? | ?column? | ?column? 
+----+----------+----------+----------
+  1 | t        | t        | t
+  2 | t        | t        | t
+  3 | t        | t        | t
+(3 rows)
+
+DROP TABLE committs_test;
+SELECT pg_xact_commit_timestamp('0'::xid);
+ pg_xact_commit_timestamp 
+--------------------------
+ 
+(1 row)
+
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
+ ?column? | ?column? | ?column? 
+----------+----------+----------
+ t        | t        | t
+(1 row)
+
diff --git a/contrib/test_committs/sql/committs_on.sql b/contrib/test_committs/sql/committs_on.sql
new file mode 100644
index 0000000..a4a44d2
--- /dev/null
+++ b/contrib/test_committs/sql/committs_on.sql
@@ -0,0 +1,21 @@
+--
+-- Commit Timestamp
+--
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+
+DROP TABLE committs_test;
+
+SELECT pg_xact_commit_timestamp('0'::xid);
+
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6bfb7bb..2fef80e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2673,6 +2673,20 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
+      <term><varname>track_commit_timestamp</varname> (<type>bool</type>)</term>
+      <indexterm>
+       <primary><varname>track_commit_timestamp</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Record commit time of transactions. This parameter
+        can only be set in <filename>postgresql.conf</> file or on the server
+        command line. The default value is <literal>off</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b58cfa5..5d527d0 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15918,6 +15918,38 @@ SELECT collation for ('foo' COLLATE "de_DE");
     For example <literal>10:20:10,14,15</literal> means
     <literal>xmin=10, xmax=20, xip_list=10, 14, 15</literal>.
    </para>
+
+   <para>
+    The functions shown in <xref linkend="functions-committs">
+    provide information about transactions that have been already committed.
+    These functions mainly provide information about when the transactions
+    were committed. They only provide useful data when
+    <xref linkend="guc-track-commit-timestamp"> configuration option is enabled
+    and only for transactions that were committed after it was enabled.
+   </para>
+
+   <table id="functions-committs">
+    <title>Committed transaction information</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry></row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry><literal><function>pg_xact_commit_timestamp(<parameter>xid</parameter>)</function></literal></entry>
+       <entry><type>timestamp with time zone</type></entry>
+       <entry>get commit timestamp of a transaction</entry>
+      </row>
+      <row>
+       <entry><literal><function>pg_last_committed_xact()</function></literal></entry>
+       <entry><parameter>xid</> <type>xid</>, <parameter>timestamp</> <type>timestamp with time zone</></entry>
+       <entry>get transaction Id and commit timestamp of latest transaction commit</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
   </sect1>
 
   <sect1 id="functions-admin">
diff --git a/doc/src/sgml/ref/pg_resetxlog.sgml b/doc/src/sgml/ref/pg_resetxlog.sgml
index aba7185..3c3e658 100644
--- a/doc/src/sgml/ref/pg_resetxlog.sgml
+++ b/doc/src/sgml/ref/pg_resetxlog.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
  <refsynopsisdiv>
   <cmdsynopsis>
    <command>pg_resetxlog</command>
+   <arg choice="opt"><option>-c</option> <replaceable class="parameter">xid</replaceable></arg>
    <arg choice="opt"><option>-f</option></arg>
    <arg choice="opt"><option>-n</option></arg>
    <arg choice="opt"><option>-o</option> <replaceable class="parameter">oid</replaceable></arg>
@@ -77,12 +78,12 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The <option>-o</>, <option>-x</>, <option>-e</>,
-   <option>-m</>, <option>-O</>,
-   and <option>-l</>
+   The <option>-o</>, <option>-x</>, <option>-m</>, <option>-O</>,
+   <option>-l</> and <option>-e</>
    options allow the next OID, next transaction ID, next transaction ID's
-   epoch, next and oldest multitransaction ID, next multitransaction offset, and WAL
-   starting address values to be set manually.  These are only needed when
+   epoch, next and oldest multitransaction ID, next multitransaction offset, WAL
+   starting address and the oldest transaction ID for which the commit time can
+   be retrieved values to be set manually.  These are only needed when
    <command>pg_resetxlog</command> is unable to determine appropriate values
    by reading <filename>pg_control</>.  Safe values can be determined as
    follows:
@@ -130,6 +131,15 @@ PostgreSQL documentation
 
     <listitem>
      <para>
+      A safe value for the oldest transaction ID for which the commit time can
+      be retrieve (<option>-c</>) can be determined by looking for the
+      numerically smallest file name in the directory <filename>pg_committs</>
+      under the data directory As above, the file names are in hexadecimal.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
       The WAL starting address (<option>-l</>) should be
       larger than any WAL segment file name currently existing in
       the directory <filename>pg_xlog</> under the data directory.
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 920b5f0..cb76b98 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -67,6 +67,11 @@ Item
 </row>
 
 <row>
+ <entry><filename>pg_commit_ts</></entry>
+ <entry>Subdirectory containing transaction commit timestamp data</entry>
+</row>
+
+<row>
  <entry><filename>pg_clog</></entry>
  <entry>Subdirectory containing transaction commit status data</entry>
 </row>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 32cb985..0daa9bb 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,8 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o dbasedesc.o gindesc.o gistdesc.o \
-	   hashdesc.o heapdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
-	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
+	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o \
+	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/committsdesc.c b/src/backend/access/rmgrdesc/committsdesc.c
new file mode 100644
index 0000000..7802584
--- /dev/null
+++ b/src/backend/access/rmgrdesc/committsdesc.c
@@ -0,0 +1,73 @@
+/*-------------------------------------------------------------------------
+ *
+ * committsdesc.c
+ *    rmgr descriptor routines for access/transam/committs.c
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *    src/backend/access/rmgrdesc/committsdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/committs.h"
+#include "utils/timestamp.h"
+
+
+void
+commit_ts_desc(StringInfo buf, XLogRecord *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+
+	if (info == COMMIT_TS_ZEROPAGE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "zeropage: %d", pageno);
+	}
+	else if (info == COMMIT_TS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "truncate before: %d", pageno);
+	}
+	else if (info == COMMIT_TS_SETTS)
+	{
+		xl_commit_ts_set *xlrec = (xl_commit_ts_set *) rec;
+		int		i;
+
+		appendStringInfo(buf, "set commit_ts %s for: %u",
+						 timestamptz_to_str(xlrec->timestamp),
+						 xlrec->mainxid);
+		for (i = 0; i < xlrec->nsubxids; i++)
+			appendStringInfo(buf, ", %u", xlrec->subxids[i]);
+	}
+}
+
+const char *
+commit_ts_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info)
+	{
+		case COMMIT_TS_ZEROPAGE:
+			id = "ZEROPAGE";
+			break;
+		case COMMIT_TS_TRUNCATE:
+			id = "TRUNCATE";
+			break;
+		case COMMIT_TS_SETTS:
+			id = "SETTS";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index e0957ff..9919c52 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -45,7 +45,7 @@ xlog_desc(StringInfo buf, XLogRecord *record)
 		appendStringInfo(buf, "redo %X/%X; "
 						 "tli %u; prev tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
-						 "oldest running xid %u; %s",
+						 "oldest commit timestamp xid: %u; oldest running xid %u; %s",
 				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->PrevTimeLineID,
@@ -58,6 +58,7 @@ xlog_desc(StringInfo buf, XLogRecord *record)
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
 						 checkpoint->oldestMultiDB,
+						 checkpoint->oldestCommitTs,
 						 checkpoint->oldestActiveXid,
 				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 82a6c76..a1979ca 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -14,7 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
 	timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o committs.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/committs.c b/src/backend/access/transam/committs.c
new file mode 100644
index 0000000..6d767d5
--- /dev/null
+++ b/src/backend/access/transam/committs.c
@@ -0,0 +1,855 @@
+/*-------------------------------------------------------------------------
+ *
+ * committs.c
+ *		PostgreSQL commit timestamp manager
+ *
+ * This module is a pg_clog-like system that stores the commit timestamp
+ * for each transaction.
+ *
+ * XLOG interactions: this module generates an XLOG record whenever a new
+ * CommitTs page is initialized to zeroes.  Also, one XLOG record is
+ * generated for setting of values when the caller requests it; this allows
+ * us to support values coming from places other than transaction commit.
+ * Other writes of CommitTS come from recording of transaction commit in
+ * xact.c, which generates its own XLOG records for these events and will
+ * re-perform the status update on redo; so we need make no additional XLOG
+ * entry here.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/committs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/committs.h"
+#include "access/htup_details.h"
+#include "access/slru.h"
+#include "access/transam.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/builtins.h"
+
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+
+/*
+ * Defines for CommitTs page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CommitTs page numbering also wraps around at
+ * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE, and CommitTs segment numbering at
+ * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCommitTs (see CommitTsPagePrecedes).
+ */
+
+/* We need 8+4 bytes per xact */
+typedef struct CommitTimestampEntry
+{
+	TimestampTz		time;
+	NodeIdRec		nodeid;
+} CommitTimestampEntry;
+
+#define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, nodeid) + \
+									sizeof(NodeIdRec))
+
+/* this is limited by how much data we can fit into SLRU cache */
+#define COMMIT_TS_MIN_BLCKSZ 2048
+
+#define COMMIT_TS_XACTS_PER_PAGE \
+	(BLCKSZ / SizeOfCommitTimestampEntry)
+
+#define TransactionIdToCTsPage(xid)	\
+	((xid) / (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+#define TransactionIdToCTsEntry(xid)	\
+	((xid) % (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CommitTs control
+ */
+static SlruCtlData CommitTsCtlData;
+
+#define CommitTsCtl (&CommitTsCtlData)
+
+/*
+ * We keep a cache of the last value set in shared memory.  This is protected
+ * by CommitTsLock.
+ */
+typedef struct CommitTimestampShared
+{
+	TransactionId	xidLastCommit;
+	CommitTimestampEntry dataLastCommit;
+} CommitTimestampShared;
+
+CommitTimestampShared	*commitTsShared;
+
+
+/* GUC variable */
+bool	track_commit_timestamp;
+
+NodeIdRec CommitTsDefaultNodeId = InvalidNodeId;
+
+static void SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz ts,
+					 NodeIdRec nodeid, int pageno);
+static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
+						  NodeIdRec nodeid, int slotno);
+static int	ZeroCommitTsPage(int pageno, bool writeXlog);
+static bool CommitTsPagePrecedes(int page1, int page2);
+static void WriteZeroPageXlogRec(int pageno);
+static void WriteTruncateXlogRec(int pageno);
+static void WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 NodeIdRec nodeid);
+
+
+/*
+ * CommitTsSetDefaultNodeId
+ *
+ * Set default nodeid for current backend.
+ */
+void
+CommitTsSetDefaultNodeId(NodeIdRec nodeid)
+{
+	CommitTsDefaultNodeId = nodeid;
+}
+
+/*
+ * CommitTsGetDefaultNodeId
+ *
+ * Set default nodeid for current backend.
+ */
+NodeIdRec
+CommitTsGetDefaultNodeId(void)
+{
+	return CommitTsDefaultNodeId;
+}
+
+/*
+ * TransactionTreeSetCommitTsData
+ *
+ * Record the final commit timestamp of transaction entries in the commit log
+ * for a transaction and its subtransaction tree, as efficiently as possible.
+ *
+ * xid is the top level transaction id.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ * The reason why tracking just the parent xid committs is not enough is that
+ * the subtrans SLRU does not stay valid across crashes (is not permanent) so we
+ * need to keep the information about them here. If the subtrans implementation
+ * changes in the future, we might want to revisit the decision of storing
+ * committs for each subxid.
+ *
+ * The do_xlog parameter tells us whether to include a XLog record of this
+ * or not.  Normal path through RecordTransactionCommit() will be related
+ * to a transaction commit XLog record, and so should pass "false" here.
+ * Other callers probably want to pass true, so that the given values persist
+ * in case of crashes.
+ */
+void
+TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
+							   TransactionId *subxids, TimestampTz timestamp,
+							   NodeIdRec nodeid, bool do_xlog)
+{
+	int			i;
+	TransactionId headxid;
+
+	Assert(xid != InvalidTransactionId);
+
+	if (!track_commit_timestamp)
+		return;
+
+	/*
+	 * Comply with the WAL-before-data rule: if caller specified it wants
+	 * this value to be recorded in WAL, do so before touching the data.
+	 */
+	if (do_xlog)
+		WriteSetTimestampXlogRec(xid, nsubxids, subxids, timestamp, nodeid);
+
+	/*
+	 * We split the xids to set the timestamp to in groups belonging to the
+	 * same SLRU page; the first element in each such set is its head.  The
+	 * first group has the main XID as the head; subsequent sets use the
+	 * first subxid not on the previous page as head.  This way, we only have
+	 * to lock/modify each SLRU page once.
+	 */
+	for (i = 0, headxid = xid;;)
+	{
+		int			pageno = TransactionIdToCTsPage(headxid);
+		int			j;
+
+		for (j = i; j < nsubxids; j++)
+		{
+			if (TransactionIdToCTsPage(subxids[j]) != pageno)
+				break;
+		}
+		/* subxids[i..j] are on the same page as the head */
+
+		SetXidCommitTsInPage(headxid, j - i, subxids + i, timestamp, nodeid,
+							 pageno);
+
+		/* if we wrote out all subxids, we're done. */
+		if (j + 1 >= nsubxids)
+			break;
+
+		/*
+		 * Set the new head and skip over it, as well as over the subxids
+		 * we just wrote.
+		 */
+		headxid = subxids[j];
+		i += j - i + 1;
+	}
+
+	/*
+	 * Update the cached value in shared memory
+	 */
+	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+	commitTsShared->xidLastCommit = xid;
+	commitTsShared->dataLastCommit.time = timestamp;
+	commitTsShared->dataLastCommit.nodeid = nodeid;
+	LWLockRelease(CommitTsLock);
+}
+
+/*
+ * Record the commit timestamp of transaction entries in the commit log for all
+ * entries on a single page.  Atomic only on this page.
+ */
+static void
+SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz ts,
+					 NodeIdRec nodeid, int pageno)
+{
+	int			slotno;
+	int			i;
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
+
+	TransactionIdSetCommitTs(xid, ts, nodeid, slotno);
+	for (i = 0; i < nsubxids; i++)
+		TransactionIdSetCommitTs(subxids[i], ts, nodeid, slotno);
+
+	CommitTsCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Sets the commit timestamp of a single transaction.
+ *
+ * Must be called with CommitTsControlLock held
+ */
+static void
+TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
+						 NodeIdRec nodeid, int slotno)
+{
+	int			entryno = TransactionIdToCTsEntry(xid);
+	CommitTimestampEntry entry;
+
+	entry.time = ts;
+	entry.nodeid = nodeid;
+
+	memcpy(CommitTsCtl->shared->page_buffer[slotno] +
+				SizeOfCommitTimestampEntry * entryno,
+		   &entry, SizeOfCommitTimestampEntry);
+}
+
+/*
+ * Interrogate the commit timestamp of a transaction.
+ */
+void
+TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
+							 NodeIdRec *nodeid)
+{
+	int			pageno = TransactionIdToCTsPage(xid);
+	int			entryno = TransactionIdToCTsEntry(xid);
+	int			slotno;
+	CommitTimestampEntry entry;
+	TransactionId oldestCommitTs;
+
+	/* Error if module not enabled */
+	if (!track_commit_timestamp)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Cannot get commit timestamp data because \"track_commit_timestamp\" is not enabled")));
+	}
+
+	/*
+	 * Return empty if the requested value is older than what we have or
+	 * newer than newest we have.
+	 */
+	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+	oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	LWLockRelease(CommitTsControlLock);
+
+	if (!TransactionIdIsValid(oldestCommitTs) ||
+		TransactionIdPrecedes(xid, oldestCommitTs) ||
+		TransactionIdPrecedes(commitTsShared->xidLastCommit, xid))
+	{
+		if (ts)
+			TIMESTAMP_NOBEGIN(*ts);
+		if (nodeid)
+			*nodeid = InvalidNodeId;
+		return;
+	}
+
+	/*
+	 * Use an unlocked atomic read on our cached value in shared memory;
+	 * if it's a hit, acquire a lock and read the data, after verifying
+	 * that it's still what we initially read.  Otherwise, fall through
+	 * to read from SLRU.
+	 */
+	if (commitTsShared->xidLastCommit == xid)
+	{
+		LWLockAcquire(CommitTsLock, LW_SHARED);
+		if (commitTsShared->xidLastCommit == xid)
+		{
+			if (ts)
+				*ts = commitTsShared->dataLastCommit.time;
+			if (nodeid)
+				*nodeid = commitTsShared->dataLastCommit.nodeid;
+			LWLockRelease(CommitTsLock);
+			return;
+		}
+		LWLockRelease(CommitTsLock);
+	}
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+	memcpy(&entry,
+		   CommitTsCtl->shared->page_buffer[slotno] +
+				SizeOfCommitTimestampEntry * entryno,
+		   SizeOfCommitTimestampEntry);
+
+	if (ts)
+		*ts = entry.time;
+	if (nodeid)
+		*nodeid = entry.nodeid;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Return the Xid of the latest committed transaction.  (As far as this module
+ * is concerned, anyway; it's up to the caller to ensure the value is useful
+ * for its purposes.)
+ *
+ * ts and extra are filled with the corresponding data; they can be passed
+ * as NULL if not wanted.
+ */
+TransactionId
+GetLatestCommitTsData(TimestampTz *ts, NodeIdRec *nodeid)
+{
+	TransactionId	xid;
+
+	/* Return empty if module not enabled */
+	if (!track_commit_timestamp)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Cannot get commit timestamp data because \"track_commit_timestamp\" is not enabled")));
+	}
+
+	LWLockAcquire(CommitTsLock, LW_SHARED);
+	xid = commitTsShared->xidLastCommit;
+	if (ts)
+		*ts = commitTsShared->dataLastCommit.time;
+	if (nodeid)
+		*nodeid = commitTsShared->dataLastCommit.nodeid;
+	LWLockRelease(CommitTsLock);
+
+	return xid;
+}
+
+/*
+ * SQL-callable wrapper to obtain commit time of a transaction
+ */
+Datum
+pg_xact_commit_timestamp(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	TimestampTz		ts;
+
+	TransactionIdGetCommitTsData(xid, &ts, NULL);
+
+	if (TIMESTAMP_IS_NOBEGIN(ts))
+		PG_RETURN_NULL();
+
+	PG_RETURN_TIMESTAMPTZ(ts);
+}
+
+
+Datum
+pg_last_committed_xact(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid;
+	TimestampTz		ts;
+	Datum       values[2];
+	bool        nulls[2];
+	TupleDesc   tupdesc;
+	HeapTuple	htup;
+
+	/* and construct a tuple with our data */
+	xid = GetLatestCommitTsData(&ts, NULL);
+
+	/*
+	 * Construct a tuple descriptor for the result row.  This must match this
+	 * function's pg_proc entry!
+	 */
+	tupdesc = CreateTemplateTupleDesc(2, false);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "xid",
+					   XIDOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "timestamp",
+					   TIMESTAMPTZOID, -1, 0);
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	if (xid == InvalidTransactionId)
+	{
+		memset(nulls, true, sizeof(nulls));
+	}
+	else
+	{
+		values[0] = TransactionIdGetDatum(xid);
+		nulls[0] = false;
+
+		values[1] = TimestampTzGetDatum(ts);
+		nulls[1] = false;
+	}
+
+	htup = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+/*
+ * Number of shared CommitTS buffers.
+ *
+ * We use a very similar logic as for the number of CLOG buffers; see comments
+ * in CLOGShmemBuffers.
+ */
+Size
+CommitTsShmemBuffers(void)
+{
+	return Min(16, Max(4, NBuffers / 1024));
+}
+
+/*
+ * Initialization of shared memory for CommitTs
+ */
+Size
+CommitTsShmemSize(void)
+{
+	return SimpleLruShmemSize(CommitTsShmemBuffers(), 0) +
+		sizeof(CommitTimestampShared);
+}
+
+void
+CommitTsShmemInit(void)
+{
+	bool	found;
+
+	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
+	SimpleLruInit(CommitTsCtl, "CommitTs Ctl", CommitTsShmemBuffers(), 0,
+				  CommitTsControlLock, "pg_commit_ts");
+
+	commitTsShared = ShmemInitStruct("CommitTs shared",
+									 sizeof(CommitTimestampShared),
+									 &found);
+
+	if (!IsUnderPostmaster)
+	{
+		Assert(!found);
+
+		commitTsShared->xidLastCommit = InvalidTransactionId;
+		TIMESTAMP_NOBEGIN(commitTsShared->dataLastCommit.time);
+		commitTsShared->dataLastCommit.nodeid = InvalidNodeId;
+	}
+	else
+		Assert(found);
+}
+
+/*
+ * This function must be called ONCE on system install.
+ *
+ * (The CommitTs directory is assumed to have been created by initdb, and
+ * CommitTsShmemInit must have been called already.)
+ */
+void
+BootStrapCommitTs(void)
+{
+	/*
+	 * Nothing to do here at present, unlike most other SLRU modules; segments
+	 * are created when the server is started with this module enabled.
+	 * See StartupCommitTs.
+	 */
+}
+
+/*
+ * Initialize (or reinitialize) a page of CommitTs to zeroes.
+ * If writeXlog is TRUE, also emit an XLOG record saying we did this.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCommitTsPage(int pageno, bool writeXlog)
+{
+	int			slotno;
+
+	slotno = SimpleLruZeroPage(CommitTsCtl, pageno);
+
+	if (writeXlog)
+		WriteZeroPageXlogRec(pageno);
+
+	return slotno;
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ */
+void
+StartupCommitTs(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/*
+	 * Initialize our idea of the latest page number.
+	 */
+	CommitTsCtl->shared->latest_page_number = pageno;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * when commit timestamp is enabled.
+ * Must be called after recovery has finished.
+ *
+ * This is in charge of creating the currently active segment, if it's not
+ * already there.  The reason for this is that the server might have been
+ * running with this module disabled for a while and thus might have skipped
+ * the normal creation point.
+ */
+void
+InitCommitTs(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/*
+	 * Re-Initialize our idea of the latest page number.
+	 */
+	CommitTsCtl->shared->latest_page_number = pageno;
+
+	/*
+	 * If this module is not currently enabled, make sure we don't hand back
+	 * possibly-invalid data; also remove segments of old data.
+	 */
+	if (!track_commit_timestamp)
+	{
+		ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
+		LWLockRelease(CommitTsControlLock);
+
+		TruncateCommitTs(ReadNewTransactionId());
+
+		return;
+	}
+
+	/*
+	 * If CommitTs is enabled, but it wasn't in the previous server run, we
+	 * need to set the oldest value to the next Xid; that way, we will not try
+	 * to read data that might not have been set.
+	 *
+	 * XXX does this have a problem if a server is started with commitTs
+	 * enabled, then started with commitTs disabled, then restarted with it
+	 * enabled again?  It doesn't look like it does, because there should be a
+	 * checkpoint that sets the value to InvalidTransactionId at end of
+	 * recovery; and so any chance of injecting new transactions without
+	 * CommitTs values would occur after the oldestCommitTs has been set to
+	 * Invalid temporarily.
+	 */
+	if (ShmemVariableCache->oldestCommitTs == InvalidTransactionId)
+		ShmemVariableCache->oldestCommitTs = ReadNewTransactionId();
+
+	/* Finally, create the current segment file, if necessary */
+	if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
+	{
+		int		slotno;
+
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+	}
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, true);
+}
+
+/*
+ * Make sure that CommitTs has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CommitTs or xlog page to make room
+ * in shared memory.
+ *
+ * NB2: the current implementation relies on the fact that
+ * track_commit_timestamp is flagged as PGC_POSTMASTER
+ * (only possible to be set at server start).
+ */
+void
+ExtendCommitTs(TransactionId newestXact)
+{
+	int			pageno;
+
+	/* nothing to do if module not enabled */
+	if (!track_commit_timestamp)
+		return;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToCTsEntry(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToCTsPage(newestXact);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/* Zero the page and make an XLOG entry about it */
+	ZeroCommitTsPage(pageno, !InRecovery);
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Remove all CommitTs segments before the one holding the passed
+ * transaction ID
+ *
+ * Note that we don't need to flush XLOG here.
+ */
+void
+TruncateCommitTs(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate.
+	 */
+	cutoffPage = TransactionIdToCTsPage(oldestXact);
+
+	/* Check to see if there's any files that could be removed */
+	if (!SlruScanDirectory(CommitTsCtl, SlruScanDirCbReportPresence, &cutoffPage))
+		return;					/* nothing to remove */
+
+	/* Write XLOG record */
+	WriteTruncateXlogRec(cutoffPage);
+
+	/* Now we can remove the old CommitTs segment(s) */
+	SimpleLruTruncate(CommitTsCtl, cutoffPage);
+}
+
+/*
+ * Set the earliest value for which commit TS can be consulted.
+ */
+void
+SetCommitTsLimit(TransactionId oldestXact)
+{
+	/*
+	 * Be careful not to overwrite values that are either further into the
+	 * "future" or signal a disabled committs.
+	 */
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+	if (ShmemVariableCache->oldestCommitTs != InvalidTransactionId &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCommitTs, oldestXact))
+		ShmemVariableCache->oldestCommitTs = oldestXact;
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Decide which of two CLOG page numbers is "older" for truncation purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic.  However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CommitTsPagePrecedes(int page1, int page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * COMMIT_TS_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId;
+	xid2 = ((TransactionId) page2) * COMMIT_TS_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId;
+
+	return TransactionIdPrecedes(xid1, xid2);
+}
+
+
+/*
+ * Write a ZEROPAGE xlog record
+ */
+static void
+WriteZeroPageXlogRec(int pageno)
+{
+	XLogRecData rdata;
+
+	rdata.data = (char *) (&pageno);
+	rdata.len = sizeof(int);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	(void) XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_ZEROPAGE, &rdata);
+}
+
+/*
+ * Write a TRUNCATE xlog record
+ */
+static void
+WriteTruncateXlogRec(int pageno)
+{
+	XLogRecData rdata;
+
+	rdata.data = (char *) (&pageno);
+	rdata.len = sizeof(int);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_TRUNCATE, &rdata);
+}
+
+/*
+ * Write a SETTS xlog record
+ */
+static void
+WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 NodeIdRec nodeid)
+{
+	XLogRecData			rdata;
+	xl_commit_ts_set	record;
+
+	record.timestamp = timestamp;
+	record.nodeid = nodeid;
+	record.mainxid = mainxid;
+	record.nsubxids = nsubxids;
+	memcpy(record.subxids, subxids, sizeof(TransactionId) * nsubxids);
+
+	rdata.data = (char *) &record;
+	rdata.len = offsetof(xl_commit_ts_set, subxids) +
+		nsubxids * sizeof(TransactionId);
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+	XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_SETTS, &rdata);
+}
+
+
+/*
+ * CommitTS resource manager's routines
+ */
+void
+commit_ts_redo(XLogRecPtr lsn, XLogRecord *record)
+{
+	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+
+	/* Backup blocks are not used in commit_ts records */
+	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+	if (info == COMMIT_TS_ZEROPAGE)
+	{
+		int			pageno;
+		int			slotno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+
+		LWLockRelease(CommitTsControlLock);
+	}
+	else if (info == COMMIT_TS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		/*
+		 * During XLOG replay, latest_page_number isn't set up yet; insert a
+		 * suitable value to bypass the sanity test in SimpleLruTruncate.
+		 */
+		CommitTsCtl->shared->latest_page_number = pageno;
+
+		SimpleLruTruncate(CommitTsCtl, pageno);
+	}
+	else if (info == COMMIT_TS_SETTS)
+	{
+		xl_commit_ts_set *setts = (xl_commit_ts_set *) XLogRecGetData(record);
+
+		TransactionTreeSetCommitTsData(setts->mainxid, setts->nsubxids,
+									   setts->subxids, setts->timestamp,
+									   setts->nodeid, false);
+	}
+	else
+		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
+}
+
+/*
+ * Helper function for GUC
+ *
+ * Check if we can enable the track_commit_timestamp.
+ */
+bool
+check_track_commit_timestamp(bool *newval, void **extra, GucSource source)
+{
+	if (*newval && BLCKSZ < COMMIT_TS_MIN_BLCKSZ)
+	{
+		GUC_check_errmsg("Commit timestamps tacking cannot be enabled for builds with page size smaller than %d",
+						 COMMIT_TS_MIN_BLCKSZ);
+		return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index befd60f..f24861c 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -8,6 +8,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index d51cca4..d3287da 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -14,6 +14,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -158,9 +159,10 @@ GetNewTransactionId(bool isSubXact)
 	 * XID before we zero the page.  Fortunately, a page of the commit log
 	 * holds 32K or more transactions, so we don't have to do this very often.
 	 *
-	 * Extend pg_subtrans too.
+	 * Extend pg_subtrans and pg_committs too.
 	 */
 	ExtendCLOG(xid);
+	ExtendCommitTs(xid);
 	ExtendSUBTRANS(xid);
 
 	/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6f92bad..4670791 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -20,6 +20,7 @@
 #include <time.h>
 #include <unistd.h>
 
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1168,6 +1169,20 @@ RecordTransactionCommit(void)
 	}
 
 	/*
+	 * We only need to log the commit timestamp separately if the nodeid
+	 * is not InvalidNodeId since the commit record logged above already
+	 * contains the timestamp info and will be used to load it.
+	 */
+	if (markXidCommitted)
+	{
+		NodeIdRec nodeid = CommitTsGetDefaultNodeId();
+
+		TransactionTreeSetCommitTsData(xid, nchildren, children,
+									   xactStopTimestamp,
+									   nodeid, nodeid != InvalidNodeId);
+	}
+
+	/*
 	 * Check if we want to commit asynchronously.  We can allow the XLOG flush
 	 * to happen asynchronously if synchronous_commit=off, or if the current
 	 * transaction has not performed any WAL-logged operation.  The latter
@@ -4683,6 +4698,7 @@ xactGetCommittedChildren(TransactionId **ptr)
  */
 static void
 xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
+						  TimestampTz commit_time,
 						  TransactionId *sub_xids, int nsubxacts,
 						  SharedInvalidationMessage *inval_msgs, int nmsgs,
 						  RelFileNode *xnodes, int nrels,
@@ -4710,6 +4726,10 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
 		LWLockRelease(XidGenLock);
 	}
 
+	/* Set the transaction commit timestamp and metadata */
+	TransactionTreeSetCommitTsData(xid, nsubxacts, sub_xids,
+								   commit_time, InvalidNodeId, false);
+
 	if (standbyState == STANDBY_DISABLED)
 	{
 		/*
@@ -4829,7 +4849,8 @@ xact_redo_commit(xl_xact_commit *xlrec,
 	/* invalidation messages array follows subxids */
 	inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
 
-	xact_redo_commit_internal(xid, lsn, subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  subxacts, xlrec->nsubxacts,
 							  inval_msgs, xlrec->nmsgs,
 							  xlrec->xnodes, xlrec->nrels,
 							  xlrec->dbId,
@@ -4844,7 +4865,8 @@ static void
 xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
 						 TransactionId xid, XLogRecPtr lsn)
 {
-	xact_redo_commit_internal(xid, lsn, xlrec->subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  xlrec->subxacts, xlrec->nsubxacts,
 							  NULL, 0,	/* inval msgs */
 							  NULL, 0,	/* relfilenodes */
 							  InvalidOid,		/* dbId */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 99f702c..02b1dca 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -22,6 +22,7 @@
 #include <unistd.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -4520,6 +4521,7 @@ BootStrapXLOG(void)
 	checkPoint.oldestXidDB = TemplateDbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
+	checkPoint.oldestCommitTs = InvalidTransactionId;
 	checkPoint.time = (pg_time_t) time(NULL);
 	checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -4529,6 +4531,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(InvalidTransactionId);
 
 	/* Set up the XLOG page header */
 	page->xlp_magic = XLOG_PAGE_MAGIC;
@@ -4602,6 +4605,7 @@ BootStrapXLOG(void)
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
+	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
@@ -4610,6 +4614,7 @@ BootStrapXLOG(void)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
 
@@ -5858,6 +5863,9 @@ StartupXLOG(void)
 	ereport(DEBUG1,
 			(errmsg("oldest MultiXactId: %u, in database %u",
 					checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
+	ereport(DEBUG1,
+			(errmsg("oldest commit timestamp Xid: %u",
+					checkPoint.oldestCommitTs)));
 	if (!TransactionIdIsNormal(checkPoint.nextXid))
 		ereport(PANIC,
 				(errmsg("invalid next transaction ID")));
@@ -5869,6 +5877,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(checkPoint.oldestCommitTs);
 	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
 	XLogCtl->ckptXid = checkPoint.nextXid;
@@ -6091,11 +6100,12 @@ StartupXLOG(void)
 			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
 
 			/*
-			 * Startup commit log and subtrans only. MultiXact has already
-			 * been started up and other SLRUs are not maintained during
-			 * recovery and need not be started yet.
+			 * Startup commit log, commit timestamp and subtrans
+			 * only. MultiXact has already been started up and other SLRUs are
+			 * not maintained during recovery and need not be started yet.
 			 */
 			StartupCLOG();
+			StartupCommitTs();
 			StartupSUBTRANS(oldestActiveXID);
 
 			/*
@@ -6742,12 +6752,13 @@ StartupXLOG(void)
 	LWLockRelease(ProcArrayLock);
 
 	/*
-	 * Start up the commit log and subtrans, if not already done for hot
-	 * standby.
+	 * Start up the commit log, commit timestamp and subtrans, if not already
+	 * done for hot standby.
 	 */
 	if (standbyState == STANDBY_DISABLED)
 	{
 		StartupCLOG();
+		StartupCommitTs();
 		StartupSUBTRANS(oldestActiveXID);
 	}
 
@@ -6783,6 +6794,12 @@ StartupXLOG(void)
 	XLogReportParameters();
 
 	/*
+	 * Local WAL inserts enables, so it's time to finish initialization
+	 * of commit timestamp.
+	 */
+	InitCommitTs();
+
+	/*
 	 * All done.  Allow backends to write WAL.  (Although the bool flag is
 	 * probably atomic in itself, we use the info_lck here to ensure that
 	 * there are no race conditions concerning visibility of other recent
@@ -7347,6 +7364,7 @@ ShutdownXLOG(int code, Datum arg)
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
 	ShutdownCLOG();
+	ShutdownCommitTs();
 	ShutdownSUBTRANS();
 	ShutdownMultiXact();
 
@@ -7674,6 +7692,10 @@ CreateCheckPoint(int flags)
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
 	LWLockRelease(XidGenLock);
 
+	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+	checkPoint.oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	LWLockRelease(CommitTsControlLock);
+
 	/* Increase XID epoch if we've wrapped around since last checkpoint */
 	checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
 	if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
@@ -7959,6 +7981,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointCLOG();
+	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
 	CheckPointPredicate();
@@ -8399,7 +8422,8 @@ XLogReportParameters(void)
 		MaxConnections != ControlFile->MaxConnections ||
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
-		max_locks_per_xact != ControlFile->max_locks_per_xact)
+		max_locks_per_xact != ControlFile->max_locks_per_xact ||
+		track_commit_timestamp != ControlFile->track_commit_timestamp)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -8420,6 +8444,7 @@ XLogReportParameters(void)
 			xlrec.max_locks_per_xact = max_locks_per_xact;
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
+			xlrec.track_commit_timestamp = track_commit_timestamp;
 
 			rdata.buffer = InvalidBuffer;
 			rdata.data = (char *) &xlrec;
@@ -8436,6 +8461,7 @@ XLogReportParameters(void)
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
+		ControlFile->track_commit_timestamp = track_commit_timestamp;
 		UpdateControlFile();
 	}
 }
@@ -8815,6 +8841,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
+		ControlFile->track_commit_timestamp = track_commit_timestamp;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6384dc7..23b5248 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -23,6 +23,7 @@
 #include <math.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -1071,10 +1072,12 @@ vac_truncate_clog(TransactionId frozenXID,
 		return;
 
 	/*
-	 * Truncate CLOG to the oldest computed value.  Note we don't truncate
-	 * multixacts; that will be done by the next checkpoint.
+	 * Truncate CLOG and CommitTs to the oldest computed value.
+	 * Note we don't truncate multixacts; that will be done by the next
+	 * checkpoint.
 	 */
 	TruncateCLOG(frozenXID);
+	TruncateCommitTs(frozenXID);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
@@ -1084,6 +1087,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
 	SetMultiXactIdLimit(minMulti, minmulti_datoid);
+	SetCommitTsLimit(frozenXID);
 }
 
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 8e78aaf..44898ab 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -133,6 +133,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogRecord *record)
 		case RM_SEQ_ID:
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
+		case RM_COMMIT_TS_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) buf.record.xl_rmid);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 1d04c55..9025601 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -117,6 +118,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
 		size = add_size(size, CLOGShmemSize());
+		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
 		size = add_size(size, TwoPhaseShmemSize());
 		size = add_size(size, BackgroundWorkerShmemSize());
@@ -198,6 +200,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	 */
 	XLOGShmemInit();
 	CLOGShmemInit();
+	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
 	InitBufferPool();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 719181c..4b4b4bf 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "commands/async.h"
@@ -259,6 +260,9 @@ NumLWLocks(void)
 	/* clog.c needs one per CLOG buffer */
 	numLocks += CLOGShmemBuffers();
 
+	/* committs.c needs one per CommitTs buffer */
+	numLocks += CommitTsShmemBuffers();
+
 	/* subtrans.c needs one per SubTrans buffer */
 	numLocks += NUM_SUBTRANS_BUFFERS;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index db65c76..df6c952 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -26,6 +26,7 @@
 #include <syslog.h>
 #endif
 
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -826,6 +827,15 @@ static struct config_bool ConfigureNamesBool[] =
 		check_bonjour, NULL, NULL
 	},
 	{
+		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
+			gettext_noop("Collects transaction commit time."),
+			NULL
+		},
+		&track_commit_timestamp,
+		false,
+		check_track_commit_timestamp, NULL, NULL
+	},
+	{
 		{"ssl", PGC_POSTMASTER, CONN_AUTH_SECURITY,
 			gettext_noop("Enables SSL connections."),
 			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6e8ea1e..4da89a6 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -227,6 +227,7 @@
 #wal_sender_timeout = 60s	# in milliseconds; 0 disables
 
 #max_replication_slots = 0	# max number of replication slots
+#track_commit_timestamp = off	# collect timestamp of transaction commit
 				# (change requires restart)
 
 # - Master Server -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index dc1f1df..28e6dfd 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -185,6 +185,7 @@ static const char *subdirs[] = {
 	"pg_xlog",
 	"pg_xlog/archive_status",
 	"pg_clog",
+	"pg_commit_ts",
 	"pg_dynshmem",
 	"pg_notify",
 	"pg_serial",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index b2e0793..a838bb5 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -270,6 +270,8 @@ main(int argc, char *argv[])
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
@@ -300,6 +302,8 @@ main(int argc, char *argv[])
 		   ControlFile.max_prepared_xacts);
 	printf(_("Current max_locks_per_xact setting:   %d\n"),
 		   ControlFile.max_locks_per_xact);
+	printf(_("Current track_commit_timestamp setting: %s\n"),
+		   ControlFile.track_commit_timestamp ? _("on") : _("off"));
 	printf(_("Maximum data alignment:               %u\n"),
 		   ControlFile.maxAlign);
 	/* we don't print floatFormat since can't say much useful about it */
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 2ba9946..a6bd8d5 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -63,6 +63,7 @@ static bool guessed = false;	/* T if we had to guess at any values */
 static const char *progname;
 static uint32 set_xid_epoch = (uint32) -1;
 static TransactionId set_xid = 0;
+static TransactionId set_commit_ts = 0;
 static Oid	set_oid = 0;
 static MultiXactId set_mxid = 0;
 static MultiXactOffset set_mxoff = (MultiXactOffset) -1;
@@ -112,7 +113,7 @@ main(int argc, char *argv[])
 	}
 
 
-	while ((c = getopt(argc, argv, "D:fl:m:no:O:x:e:")) != -1)
+	while ((c = getopt(argc, argv, "c:D:e:fl:m:no:O:x:")) != -1)
 	{
 		switch (c)
 		{
@@ -158,6 +159,21 @@ main(int argc, char *argv[])
 				}
 				break;
 
+			case 'c':
+				set_commit_ts = strtoul(optarg, &endptr, 0);
+				if (endptr == optarg || *endptr != '\0')
+				{
+					fprintf(stderr, _("%s: invalid argument for option -c\n"), progname);
+					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+					exit(1);
+				}
+				if (set_commit_ts == 0)
+				{
+					fprintf(stderr, _("%s: transaction ID (-c) must not be 0\n"), progname);
+					exit(1);
+				}
+				break;
+
 			case 'o':
 				set_oid = strtoul(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0')
@@ -345,6 +361,9 @@ main(int argc, char *argv[])
 		ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
 	}
 
+	if (set_commit_ts != 0)
+		ControlFile.checkPointCopy.oldestCommitTs = set_commit_ts;
+
 	if (set_oid != 0)
 		ControlFile.checkPointCopy.nextOid = set_oid;
 
@@ -539,6 +558,7 @@ GuessControlValues(void)
 
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
+	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
@@ -621,6 +641,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Maximum data alignment:               %u\n"),
 		   ControlFile.maxAlign);
 	/* we don't print floatFormat since can't say much useful about it */
@@ -702,6 +724,12 @@ PrintNewControlValues()
 		printf(_("NextXID epoch:                        %u\n"),
 			   ControlFile.checkPointCopy.nextXidEpoch);
 	}
+
+	if (set_commit_ts != 0)
+	{
+		printf(_("oldestCommitTs:                       %u\n"),
+			   ControlFile.checkPointCopy.oldestCommitTs);
+	}
 }
 
 
@@ -739,6 +767,7 @@ RewriteControlFile(void)
 	 */
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
+	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
@@ -1095,6 +1124,7 @@ usage(void)
 	printf(_("%s resets the PostgreSQL transaction log.\n\n"), progname);
 	printf(_("Usage:\n  %s [OPTION]... {[-D] DATADIR}\n\n"), progname);
 	printf(_("Options:\n"));
+	printf(_("  -c XID           set the oldest transaction with retrievable commit timestamp\n"));
 	printf(_("  -e XIDEPOCH      set next transaction ID epoch\n"));
 	printf(_("  -f               force update to be done\n"));
 	printf(_("  -l XLOGFILE      force minimum WAL starting location for new transaction log\n"));
diff --git a/src/include/access/committs.h b/src/include/access/committs.h
new file mode 100644
index 0000000..9ca7559
--- /dev/null
+++ b/src/include/access/committs.h
@@ -0,0 +1,70 @@
+/*
+ * committs.h
+ *
+ * PostgreSQL commit timestamp manager
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/committs.h
+ */
+#ifndef COMMITTS_H
+#define COMMITTS_H
+
+#include "access/xlog.h"
+#include "datatype/timestamp.h"
+#include "utils/guc.h"
+
+extern PGDLLIMPORT bool	track_commit_timestamp;
+extern bool check_track_commit_timestamp(bool *newval, void **extra,
+										 GucSource source);
+
+typedef uint32 NodeIdRec;
+
+#define InvalidNodeId 0
+
+extern void CommitTsSetDefaultNodeId(NodeIdRec nodeid);
+extern NodeIdRec CommitTsGetDefaultNodeId(void);
+extern void TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
+										   TransactionId *subxids,
+										   TimestampTz timestamp,
+										   NodeIdRec nodeid,
+										   bool do_xlog);
+extern void TransactionIdGetCommitTsData(TransactionId xid,
+										 TimestampTz *ts,
+										 NodeIdRec *nodeid);
+extern TransactionId GetLatestCommitTsData(TimestampTz *ts,
+										   NodeIdRec *nodeid);
+
+extern Size CommitTsShmemBuffers(void);
+extern Size CommitTsShmemSize(void);
+extern void CommitTsShmemInit(void);
+extern void BootStrapCommitTs(void);
+extern void StartupCommitTs(void);
+extern void InitCommitTs(void);
+extern void ShutdownCommitTs(void);
+extern void CheckPointCommitTs(void);
+extern void ExtendCommitTs(TransactionId newestXact);
+extern void TruncateCommitTs(TransactionId oldestXact);
+extern void SetCommitTsLimit(TransactionId oldestXact);
+
+/* XLOG stuff */
+#define COMMIT_TS_ZEROPAGE		0x00
+#define COMMIT_TS_TRUNCATE		0x10
+#define COMMIT_TS_SETTS			0x20
+
+typedef struct xl_commit_ts_set
+{
+	TimestampTz		timestamp;
+	NodeIdRec		nodeid;
+	TransactionId	mainxid;
+	int				nsubxids;
+	TransactionId	subxids[FLEXIBLE_ARRAY_MEMBER];
+} xl_commit_ts_set;
+
+
+extern void commit_ts_redo(XLogRecPtr lsn, XLogRecord *record);
+extern void commit_ts_desc(StringInfo buf, XLogRecord *record);
+extern const char *commit_ts_identify(uint8 info);
+
+#endif   /* COMMITTS_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 76a6421..27168c3 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -24,7 +24,7 @@
  * Changes to this list possibly need a XLOG_PAGE_MAGIC bump.
  */
 
-/* symbol name, textual name, redo, desc, startup, cleanup */
+/* symbol name, textual name, redo, desc, identify, startup, cleanup */
 PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
 PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
 PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
@@ -43,3 +43,4 @@ PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_start
 PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 32d1b29..b59fd98 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -124,6 +124,11 @@ typedef struct VariableCacheData
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
 
 	/*
+	 * These fields are protected by CommitTsControlLock
+	 */
+	TransactionId oldestCommitTs;
+
+	/*
 	 * These fields are protected by ProcArrayLock.
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 19b2ef8..56203b9 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -186,6 +186,7 @@ typedef struct xl_parameter_change
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
+	bool		track_commit_timestamp;
 } xl_parameter_change;
 
 /* logs restore point */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ba79d25..70afbd1 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -46,6 +46,7 @@ typedef struct CheckPoint
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
+	TransactionId oldestCommitTs; /* oldest Xid with valid commit timestamp */
 
 	/*
 	 * Oldest XID still running. This is only needed to initialize hot standby
@@ -176,6 +177,7 @@ typedef struct ControlFileData
 	int			max_worker_processes;
 	int			max_prepared_xacts;
 	int			max_locks_per_xact;
+	bool		track_commit_timestamp;
 
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 5d4e889..da93201 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3017,6 +3017,12 @@ DESCR("view two-phase transactions");
 DATA(insert OID = 3819 (  pg_get_multixact_members PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 2249 "28" "{28,28,25}" "{i,o,o}" "{multixid,xid,mode}" _null_ pg_get_multixact_members _null_ _null_ _null_ ));
 DESCR("view members of a multixactid");
 
+DATA(insert OID = 3581 ( pg_xact_commit_timestamp PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 1184 "28" _null_ _null_ _null_ _null_ pg_xact_commit_timestamp _null_ _null_ _null_ ));
+DESCR("get commit timestamp of a transaction");
+
+DATA(insert OID = 3583 ( pg_last_committed_xact PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 2249 "" "{28,1184}" "{o,o}" "{xid,timestamp}" _null_ pg_last_committed_xact _null_ _null_ _null_ ));
+DESCR("get transaction Id and commit timestamp of latest transaction commit");
+
 DATA(insert OID = 3537 (  pg_describe_object		PGNSP PGUID 12 1 0 0 0 f f f f t f s 3 0 25 "26 26 23" _null_ _null_ _null_ _null_ pg_describe_object _null_ _null_ _null_ ));
 DESCR("get identification of SQL object");
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 91cab87..09654a8 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -127,7 +127,10 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define AutoFileLock				(&MainLWLockArray[35].lock)
 #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
 #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
-#define NUM_INDIVIDUAL_LWLOCKS		38
+#define CommitTsControlLock			(&MainLWLockArray[38].lock)
+#define CommitTsLock				(&MainLWLockArray[39].lock)
+
+#define NUM_INDIVIDUAL_LWLOCKS		40
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 3ba34f8..519ea7e 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1182,6 +1182,10 @@ extern Datum pg_prepared_xact(PG_FUNCTION_ARGS);
 /* access/transam/multixact.c */
 extern Datum pg_get_multixact_members(PG_FUNCTION_ARGS);
 
+/* access/transam/committs.c */
+extern Datum pg_xact_commit_timestamp(PG_FUNCTION_ARGS);
+extern Datum pg_last_committed_xact(PG_FUNCTION_ARGS);
+
 /* catalogs/dependency.c */
 extern Datum pg_describe_object(PG_FUNCTION_ARGS);
 extern Datum pg_identify_object(PG_FUNCTION_ARGS);
diff --git a/src/test/regress/expected/committs.out b/src/test/regress/expected/committs.out
new file mode 100644
index 0000000..cb1ea46
--- /dev/null
+++ b/src/test/regress/expected/committs.out
@@ -0,0 +1,25 @@
+--
+-- Commit Timestamp
+--
+SHOW track_commit_timestamp;
+ track_commit_timestamp 
+------------------------
+ off
+(1 row)
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ERROR:  Cannot get commit timestamp data because "track_commit_timestamp" is not enabled
+DROP TABLE committs_test;
+SELECT pg_xact_commit_timestamp('0'::xid);
+ERROR:  Cannot get commit timestamp data because "track_commit_timestamp" is not enabled
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
+ERROR:  Cannot get commit timestamp data because "track_commit_timestamp" is not enabled
diff --git a/src/test/regress/expected/committs_1.out b/src/test/regress/expected/committs_1.out
new file mode 100644
index 0000000..c1d24c5
--- /dev/null
+++ b/src/test/regress/expected/committs_1.out
@@ -0,0 +1,39 @@
+--
+-- Commit Timestamp
+--
+SHOW track_commit_timestamp;
+ track_commit_timestamp 
+------------------------
+ on
+(1 row)
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | ?column? | ?column? | ?column? 
+----+----------+----------+----------
+  1 | t        | t        | t
+  2 | t        | t        | t
+  3 | t        | t        | t
+(3 rows)
+
+DROP TABLE committs_test;
+SELECT pg_xact_commit_timestamp('0'::xid);
+ pg_xact_commit_timestamp 
+--------------------------
+ 
+(1 row)
+
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
+ ?column? | ?column? | ?column? 
+----------+----------+----------
+ t        | t        | t
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d4f02e5..ec0a7c9 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -88,7 +88,7 @@ test: brin privileges security_label collate matview lock replica_identity rowse
 # ----------
 # Another group of parallel tests
 # ----------
-test: alter_generic misc psql async
+test: alter_generic misc psql async committs
 
 # rules cannot run concurrently with any test that creates a view
 test: rules
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 611b0a8..b0c4f39 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -148,3 +148,4 @@ test: largeobject
 test: with
 test: xml
 test: stats
+test: committs
diff --git a/src/test/regress/sql/committs.sql b/src/test/regress/sql/committs.sql
new file mode 100644
index 0000000..a72705d
--- /dev/null
+++ b/src/test/regress/sql/committs.sql
@@ -0,0 +1,23 @@
+--
+-- Commit Timestamp
+--
+SHOW track_commit_timestamp;
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+
+DROP TABLE committs_test;
+
+SELECT pg_xact_commit_timestamp('0'::xid);
+
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
#117Petr Jelinek
petr@2ndquadrant.com
In reply to: Petr Jelinek (#116)
1 attachment(s)
Re: tracking commit timestamps

On 21/11/14 00:17, Petr Jelinek wrote:

On 19/11/14 17:30, Steve Singer wrote:

On 11/19/2014 08:22 AM, Alvaro Herrera wrote:

I think we're overblowing the pg_upgrade issue. Surely we don't need to
preserve commit_ts data when upgrading across major versions; and
pg_upgrade is perfectly prepared to remove old data when upgrading
(actually it just doesn't copy it; consider pg_subtrans or pg_serial,
for instance.) If we need to change binary representation in a future
major release, we can do so without any trouble.

That sounds reasonable. I am okay with Petr removing the LSN portion
this patch.

I did that then, v9 attached with following changes:
- removed lsn info (obviously)

- the pg_xact_commit_timestamp and pg_last_committed_xact return NULLs
when timestamp data was not found

- made the default nodeid crash safe - this also makes use of the
do_xlog parameter in TransactionTreeSetCommitTsData if nodeid is set,
although that still does not happen without extension actually using the
API

- added some more regression tests

- some small comment and docs adjustments based on Michael's last email

I didn't change the pg_last_committed_xact function name and I didn't
make nodeid visible from SQL level interfaces and don't plan to in this
patch as I think it's very premature to do so before we have some C code
using it.

Just to explain once more and hopefully more clearly how the crash
safety/WAL logging is handled since there was some confusion in last
review:
We only do WAL logging when nodeid is also logged (when nodeid is not 0)
because the timestamp itself can be read from WAL record of transaction
commit so it's pointless to log another WAL record just to store the
timestamp again.
Extension can either set default nodeid which is then logged
transparently, or can override the default logging mechanism by calling
TransactionTreeSetCommitTsData with whatever data it wants and do_xlog
set to true which will then write the WAL record with this overriding
information.
During WAL replay the commit timestamp is set from transaction commit
record and then if committs WAL record is found it's used to overwrite
the commit timestamp/nodeid for given xid.

Also, there is exactly one record in SLRU for each xid so there is no
point in making the SQL interfaces return multiple results.

And here is v10 which fixes conflicts with Heikki's WAL API changes (no
changes otherwise).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

committs-v10.patchtext/x-diff; name=committs-v10.patchDownload
diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..e331297 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -50,6 +50,7 @@ SUBDIRS = \
 		spi		\
 		tablefunc	\
 		tcn		\
+		test_committs	\
 		test_decoding	\
 		test_parser	\
 		test_shm_mq	\
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
index 3b8241b..f0a023f 100644
--- a/contrib/pg_upgrade/pg_upgrade.c
+++ b/contrib/pg_upgrade/pg_upgrade.c
@@ -423,8 +423,10 @@ copy_clog_xlog_xid(void)
 	/* set the next transaction id and epoch of the new cluster */
 	prep_status("Setting next transaction ID and epoch for new cluster");
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
-			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
-			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
+			  "\"%s/pg_resetxlog\" -f -x %u -c %u \"%s\"",
+			  new_cluster.bindir,
+			  old_cluster.controldata.chkpnt_nxtxid,
+			  old_cluster.controldata.chkpnt_nxtxid,
 			  new_cluster.pgdata);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
 			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index 9397198..e0af3cf 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -10,6 +10,7 @@
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/contrib/test_committs/.gitignore b/contrib/test_committs/.gitignore
new file mode 100644
index 0000000..1f95503
--- /dev/null
+++ b/contrib/test_committs/.gitignore
@@ -0,0 +1,5 @@
+# Generated subdirectories
+/log/
+/isolation_output/
+/regression_output/
+/tmp_check/
diff --git a/contrib/test_committs/Makefile b/contrib/test_committs/Makefile
new file mode 100644
index 0000000..2240749
--- /dev/null
+++ b/contrib/test_committs/Makefile
@@ -0,0 +1,45 @@
+# Note: because we don't tell the Makefile there are any regression tests,
+# we have to clean those result files explicitly
+EXTRA_CLEAN = $(pg_regress_clean_files) ./regression_output ./isolation_output
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_committs
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# We can't support installcheck because normally installcheck users don't have
+# the required track_commit_timestamp on
+installcheck:;
+
+check: regresscheck
+
+submake-regress:
+	$(MAKE) -C $(top_builddir)/src/test/regress all
+
+submake-test_committs:
+	$(MAKE) -C $(top_builddir)/contrib/test_committs
+
+REGRESSCHECKS=committs_on
+
+regresscheck: all | submake-regress submake-test_committs
+	$(MKDIR_P) regression_output
+	$(pg_regress_check) \
+	    --temp-config $(top_srcdir)/contrib/test_committs/committs.conf \
+	    --temp-install=./tmp_check \
+	    --extra-install=contrib/test_committs \
+	    --outputdir=./regression_output \
+	    $(REGRESSCHECKS)
+
+regresscheck-install-force: | submake-regress submake-test_committs
+	$(pg_regress_installcheck) \
+	    --extra-install=contrib/test_committs \
+	    $(REGRESSCHECKS)
+
+PHONY: submake-test_committs submake-regress check \
+	regresscheck regresscheck-install-force
\ No newline at end of file
diff --git a/contrib/test_committs/committs.conf b/contrib/test_committs/committs.conf
new file mode 100644
index 0000000..d221a60
--- /dev/null
+++ b/contrib/test_committs/committs.conf
@@ -0,0 +1 @@
+track_commit_timestamp = on
\ No newline at end of file
diff --git a/contrib/test_committs/expected/committs_on.out b/contrib/test_committs/expected/committs_on.out
new file mode 100644
index 0000000..69465f3
--- /dev/null
+++ b/contrib/test_committs/expected/committs_on.out
@@ -0,0 +1,33 @@
+--
+-- Commit Timestamp
+--
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | ?column? | ?column? | ?column? 
+----+----------+----------+----------
+  1 | t        | t        | t
+  2 | t        | t        | t
+  3 | t        | t        | t
+(3 rows)
+
+DROP TABLE committs_test;
+SELECT pg_xact_commit_timestamp('0'::xid);
+ pg_xact_commit_timestamp 
+--------------------------
+ 
+(1 row)
+
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
+ ?column? | ?column? | ?column? 
+----------+----------+----------
+ t        | t        | t
+(1 row)
+
diff --git a/contrib/test_committs/sql/committs_on.sql b/contrib/test_committs/sql/committs_on.sql
new file mode 100644
index 0000000..a4a44d2
--- /dev/null
+++ b/contrib/test_committs/sql/committs_on.sql
@@ -0,0 +1,21 @@
+--
+-- Commit Timestamp
+--
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+
+DROP TABLE committs_test;
+
+SELECT pg_xact_commit_timestamp('0'::xid);
+
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ab8c263..e3713d3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2673,6 +2673,20 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
+      <term><varname>track_commit_timestamp</varname> (<type>bool</type>)</term>
+      <indexterm>
+       <primary><varname>track_commit_timestamp</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Record commit time of transactions. This parameter
+        can only be set in <filename>postgresql.conf</> file or on the server
+        command line. The default value is <literal>off</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 90a3460..83a7fb7 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15923,6 +15923,38 @@ SELECT collation for ('foo' COLLATE "de_DE");
     For example <literal>10:20:10,14,15</literal> means
     <literal>xmin=10, xmax=20, xip_list=10, 14, 15</literal>.
    </para>
+
+   <para>
+    The functions shown in <xref linkend="functions-committs">
+    provide information about transactions that have been already committed.
+    These functions mainly provide information about when the transactions
+    were committed. They only provide useful data when
+    <xref linkend="guc-track-commit-timestamp"> configuration option is enabled
+    and only for transactions that were committed after it was enabled.
+   </para>
+
+   <table id="functions-committs">
+    <title>Committed transaction information</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry></row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry><literal><function>pg_xact_commit_timestamp(<parameter>xid</parameter>)</function></literal></entry>
+       <entry><type>timestamp with time zone</type></entry>
+       <entry>get commit timestamp of a transaction</entry>
+      </row>
+      <row>
+       <entry><literal><function>pg_last_committed_xact()</function></literal></entry>
+       <entry><parameter>xid</> <type>xid</>, <parameter>timestamp</> <type>timestamp with time zone</></entry>
+       <entry>get transaction Id and commit timestamp of latest transaction commit</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
   </sect1>
 
   <sect1 id="functions-admin">
diff --git a/doc/src/sgml/ref/pg_resetxlog.sgml b/doc/src/sgml/ref/pg_resetxlog.sgml
index aba7185..3c3e658 100644
--- a/doc/src/sgml/ref/pg_resetxlog.sgml
+++ b/doc/src/sgml/ref/pg_resetxlog.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
  <refsynopsisdiv>
   <cmdsynopsis>
    <command>pg_resetxlog</command>
+   <arg choice="opt"><option>-c</option> <replaceable class="parameter">xid</replaceable></arg>
    <arg choice="opt"><option>-f</option></arg>
    <arg choice="opt"><option>-n</option></arg>
    <arg choice="opt"><option>-o</option> <replaceable class="parameter">oid</replaceable></arg>
@@ -77,12 +78,12 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The <option>-o</>, <option>-x</>, <option>-e</>,
-   <option>-m</>, <option>-O</>,
-   and <option>-l</>
+   The <option>-o</>, <option>-x</>, <option>-m</>, <option>-O</>,
+   <option>-l</> and <option>-e</>
    options allow the next OID, next transaction ID, next transaction ID's
-   epoch, next and oldest multitransaction ID, next multitransaction offset, and WAL
-   starting address values to be set manually.  These are only needed when
+   epoch, next and oldest multitransaction ID, next multitransaction offset, WAL
+   starting address and the oldest transaction ID for which the commit time can
+   be retrieved values to be set manually.  These are only needed when
    <command>pg_resetxlog</command> is unable to determine appropriate values
    by reading <filename>pg_control</>.  Safe values can be determined as
    follows:
@@ -130,6 +131,15 @@ PostgreSQL documentation
 
     <listitem>
      <para>
+      A safe value for the oldest transaction ID for which the commit time can
+      be retrieve (<option>-c</>) can be determined by looking for the
+      numerically smallest file name in the directory <filename>pg_committs</>
+      under the data directory As above, the file names are in hexadecimal.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
       The WAL starting address (<option>-l</>) should be
       larger than any WAL segment file name currently existing in
       the directory <filename>pg_xlog</> under the data directory.
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 920b5f0..cb76b98 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -67,6 +67,11 @@ Item
 </row>
 
 <row>
+ <entry><filename>pg_commit_ts</></entry>
+ <entry>Subdirectory containing transaction commit timestamp data</entry>
+</row>
+
+<row>
  <entry><filename>pg_clog</></entry>
  <entry>Subdirectory containing transaction commit status data</entry>
 </row>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 32cb985..0daa9bb 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,8 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o dbasedesc.o gindesc.o gistdesc.o \
-	   hashdesc.o heapdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
-	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
+	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o \
+	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/committsdesc.c b/src/backend/access/rmgrdesc/committsdesc.c
new file mode 100644
index 0000000..4221353
--- /dev/null
+++ b/src/backend/access/rmgrdesc/committsdesc.c
@@ -0,0 +1,73 @@
+/*-------------------------------------------------------------------------
+ *
+ * committsdesc.c
+ *    rmgr descriptor routines for access/transam/committs.c
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *    src/backend/access/rmgrdesc/committsdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/committs.h"
+#include "utils/timestamp.h"
+
+
+void
+commit_ts_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == COMMIT_TS_ZEROPAGE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "zeropage: %d", pageno);
+	}
+	else if (info == COMMIT_TS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "truncate before: %d", pageno);
+	}
+	else if (info == COMMIT_TS_SETTS)
+	{
+		xl_commit_ts_set *xlrec = (xl_commit_ts_set *) rec;
+		int		i;
+
+		appendStringInfo(buf, "set commit_ts %s for: %u",
+						 timestamptz_to_str(xlrec->timestamp),
+						 xlrec->mainxid);
+		for (i = 0; i < xlrec->nsubxids; i++)
+			appendStringInfo(buf, ", %u", xlrec->subxids[i]);
+	}
+}
+
+const char *
+commit_ts_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info)
+	{
+		case COMMIT_TS_ZEROPAGE:
+			id = "ZEROPAGE";
+			break;
+		case COMMIT_TS_TRUNCATE:
+			id = "TRUNCATE";
+			break;
+		case COMMIT_TS_SETTS:
+			id = "SETTS";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 4088ba9..6f79397 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -45,7 +45,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "redo %X/%X; "
 						 "tli %u; prev tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
-						 "oldest running xid %u; %s",
+						 "oldest commit timestamp xid: %u; oldest running xid %u; %s",
 				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->PrevTimeLineID,
@@ -58,6 +58,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
 						 checkpoint->oldestMultiDB,
+						 checkpoint->oldestCommitTs,
 						 checkpoint->oldestActiveXid,
 				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 82a6c76..a1979ca 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -14,7 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
 	timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o committs.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/committs.c b/src/backend/access/transam/committs.c
new file mode 100644
index 0000000..1751a87
--- /dev/null
+++ b/src/backend/access/transam/committs.c
@@ -0,0 +1,844 @@
+/*-------------------------------------------------------------------------
+ *
+ * committs.c
+ *		PostgreSQL commit timestamp manager
+ *
+ * This module is a pg_clog-like system that stores the commit timestamp
+ * for each transaction.
+ *
+ * XLOG interactions: this module generates an XLOG record whenever a new
+ * CommitTs page is initialized to zeroes.  Also, one XLOG record is
+ * generated for setting of values when the caller requests it; this allows
+ * us to support values coming from places other than transaction commit.
+ * Other writes of CommitTS come from recording of transaction commit in
+ * xact.c, which generates its own XLOG records for these events and will
+ * re-perform the status update on redo; so we need make no additional XLOG
+ * entry here.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/committs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/committs.h"
+#include "access/htup_details.h"
+#include "access/slru.h"
+#include "access/transam.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/builtins.h"
+
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+
+/*
+ * Defines for CommitTs page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CommitTs page numbering also wraps around at
+ * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE, and CommitTs segment numbering at
+ * 0xFFFFFFFF/COMMITTS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCommitTs (see CommitTsPagePrecedes).
+ */
+
+/* We need 8+4 bytes per xact */
+typedef struct CommitTimestampEntry
+{
+	TimestampTz		time;
+	NodeIdRec		nodeid;
+} CommitTimestampEntry;
+
+#define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, nodeid) + \
+									sizeof(NodeIdRec))
+
+/* this is limited by how much data we can fit into SLRU cache */
+#define COMMIT_TS_MIN_BLCKSZ 2048
+
+#define COMMIT_TS_XACTS_PER_PAGE \
+	(BLCKSZ / SizeOfCommitTimestampEntry)
+
+#define TransactionIdToCTsPage(xid)	\
+	((xid) / (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+#define TransactionIdToCTsEntry(xid)	\
+	((xid) % (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CommitTs control
+ */
+static SlruCtlData CommitTsCtlData;
+
+#define CommitTsCtl (&CommitTsCtlData)
+
+/*
+ * We keep a cache of the last value set in shared memory.  This is protected
+ * by CommitTsLock.
+ */
+typedef struct CommitTimestampShared
+{
+	TransactionId	xidLastCommit;
+	CommitTimestampEntry dataLastCommit;
+} CommitTimestampShared;
+
+CommitTimestampShared	*commitTsShared;
+
+
+/* GUC variable */
+bool	track_commit_timestamp;
+
+NodeIdRec CommitTsDefaultNodeId = InvalidNodeId;
+
+static void SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz ts,
+					 NodeIdRec nodeid, int pageno);
+static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
+						  NodeIdRec nodeid, int slotno);
+static int	ZeroCommitTsPage(int pageno, bool writeXlog);
+static bool CommitTsPagePrecedes(int page1, int page2);
+static void WriteZeroPageXlogRec(int pageno);
+static void WriteTruncateXlogRec(int pageno);
+static void WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 NodeIdRec nodeid);
+
+
+/*
+ * CommitTsSetDefaultNodeId
+ *
+ * Set default nodeid for current backend.
+ */
+void
+CommitTsSetDefaultNodeId(NodeIdRec nodeid)
+{
+	CommitTsDefaultNodeId = nodeid;
+}
+
+/*
+ * CommitTsGetDefaultNodeId
+ *
+ * Set default nodeid for current backend.
+ */
+NodeIdRec
+CommitTsGetDefaultNodeId(void)
+{
+	return CommitTsDefaultNodeId;
+}
+
+/*
+ * TransactionTreeSetCommitTsData
+ *
+ * Record the final commit timestamp of transaction entries in the commit log
+ * for a transaction and its subtransaction tree, as efficiently as possible.
+ *
+ * xid is the top level transaction id.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ * The reason why tracking just the parent xid committs is not enough is that
+ * the subtrans SLRU does not stay valid across crashes (is not permanent) so we
+ * need to keep the information about them here. If the subtrans implementation
+ * changes in the future, we might want to revisit the decision of storing
+ * committs for each subxid.
+ *
+ * The do_xlog parameter tells us whether to include a XLog record of this
+ * or not.  Normal path through RecordTransactionCommit() will be related
+ * to a transaction commit XLog record, and so should pass "false" here.
+ * Other callers probably want to pass true, so that the given values persist
+ * in case of crashes.
+ */
+void
+TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
+							   TransactionId *subxids, TimestampTz timestamp,
+							   NodeIdRec nodeid, bool do_xlog)
+{
+	int			i;
+	TransactionId headxid;
+
+	Assert(xid != InvalidTransactionId);
+
+	if (!track_commit_timestamp)
+		return;
+
+	/*
+	 * Comply with the WAL-before-data rule: if caller specified it wants
+	 * this value to be recorded in WAL, do so before touching the data.
+	 */
+	if (do_xlog)
+		WriteSetTimestampXlogRec(xid, nsubxids, subxids, timestamp, nodeid);
+
+	/*
+	 * We split the xids to set the timestamp to in groups belonging to the
+	 * same SLRU page; the first element in each such set is its head.  The
+	 * first group has the main XID as the head; subsequent sets use the
+	 * first subxid not on the previous page as head.  This way, we only have
+	 * to lock/modify each SLRU page once.
+	 */
+	for (i = 0, headxid = xid;;)
+	{
+		int			pageno = TransactionIdToCTsPage(headxid);
+		int			j;
+
+		for (j = i; j < nsubxids; j++)
+		{
+			if (TransactionIdToCTsPage(subxids[j]) != pageno)
+				break;
+		}
+		/* subxids[i..j] are on the same page as the head */
+
+		SetXidCommitTsInPage(headxid, j - i, subxids + i, timestamp, nodeid,
+							 pageno);
+
+		/* if we wrote out all subxids, we're done. */
+		if (j + 1 >= nsubxids)
+			break;
+
+		/*
+		 * Set the new head and skip over it, as well as over the subxids
+		 * we just wrote.
+		 */
+		headxid = subxids[j];
+		i += j - i + 1;
+	}
+
+	/*
+	 * Update the cached value in shared memory
+	 */
+	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+	commitTsShared->xidLastCommit = xid;
+	commitTsShared->dataLastCommit.time = timestamp;
+	commitTsShared->dataLastCommit.nodeid = nodeid;
+	LWLockRelease(CommitTsLock);
+}
+
+/*
+ * Record the commit timestamp of transaction entries in the commit log for all
+ * entries on a single page.  Atomic only on this page.
+ */
+static void
+SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz ts,
+					 NodeIdRec nodeid, int pageno)
+{
+	int			slotno;
+	int			i;
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
+
+	TransactionIdSetCommitTs(xid, ts, nodeid, slotno);
+	for (i = 0; i < nsubxids; i++)
+		TransactionIdSetCommitTs(subxids[i], ts, nodeid, slotno);
+
+	CommitTsCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Sets the commit timestamp of a single transaction.
+ *
+ * Must be called with CommitTsControlLock held
+ */
+static void
+TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
+						 NodeIdRec nodeid, int slotno)
+{
+	int			entryno = TransactionIdToCTsEntry(xid);
+	CommitTimestampEntry entry;
+
+	entry.time = ts;
+	entry.nodeid = nodeid;
+
+	memcpy(CommitTsCtl->shared->page_buffer[slotno] +
+				SizeOfCommitTimestampEntry * entryno,
+		   &entry, SizeOfCommitTimestampEntry);
+}
+
+/*
+ * Interrogate the commit timestamp of a transaction.
+ */
+void
+TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
+							 NodeIdRec *nodeid)
+{
+	int			pageno = TransactionIdToCTsPage(xid);
+	int			entryno = TransactionIdToCTsEntry(xid);
+	int			slotno;
+	CommitTimestampEntry entry;
+	TransactionId oldestCommitTs;
+
+	/* Error if module not enabled */
+	if (!track_commit_timestamp)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Cannot get commit timestamp data because \"track_commit_timestamp\" is not enabled")));
+	}
+
+	/*
+	 * Return empty if the requested value is older than what we have or
+	 * newer than newest we have.
+	 */
+	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+	oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	LWLockRelease(CommitTsControlLock);
+
+	if (!TransactionIdIsValid(oldestCommitTs) ||
+		TransactionIdPrecedes(xid, oldestCommitTs) ||
+		TransactionIdPrecedes(commitTsShared->xidLastCommit, xid))
+	{
+		if (ts)
+			TIMESTAMP_NOBEGIN(*ts);
+		if (nodeid)
+			*nodeid = InvalidNodeId;
+		return;
+	}
+
+	/*
+	 * Use an unlocked atomic read on our cached value in shared memory;
+	 * if it's a hit, acquire a lock and read the data, after verifying
+	 * that it's still what we initially read.  Otherwise, fall through
+	 * to read from SLRU.
+	 */
+	if (commitTsShared->xidLastCommit == xid)
+	{
+		LWLockAcquire(CommitTsLock, LW_SHARED);
+		if (commitTsShared->xidLastCommit == xid)
+		{
+			if (ts)
+				*ts = commitTsShared->dataLastCommit.time;
+			if (nodeid)
+				*nodeid = commitTsShared->dataLastCommit.nodeid;
+			LWLockRelease(CommitTsLock);
+			return;
+		}
+		LWLockRelease(CommitTsLock);
+	}
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+	memcpy(&entry,
+		   CommitTsCtl->shared->page_buffer[slotno] +
+				SizeOfCommitTimestampEntry * entryno,
+		   SizeOfCommitTimestampEntry);
+
+	if (ts)
+		*ts = entry.time;
+	if (nodeid)
+		*nodeid = entry.nodeid;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Return the Xid of the latest committed transaction.  (As far as this module
+ * is concerned, anyway; it's up to the caller to ensure the value is useful
+ * for its purposes.)
+ *
+ * ts and extra are filled with the corresponding data; they can be passed
+ * as NULL if not wanted.
+ */
+TransactionId
+GetLatestCommitTsData(TimestampTz *ts, NodeIdRec *nodeid)
+{
+	TransactionId	xid;
+
+	/* Return empty if module not enabled */
+	if (!track_commit_timestamp)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Cannot get commit timestamp data because \"track_commit_timestamp\" is not enabled")));
+	}
+
+	LWLockAcquire(CommitTsLock, LW_SHARED);
+	xid = commitTsShared->xidLastCommit;
+	if (ts)
+		*ts = commitTsShared->dataLastCommit.time;
+	if (nodeid)
+		*nodeid = commitTsShared->dataLastCommit.nodeid;
+	LWLockRelease(CommitTsLock);
+
+	return xid;
+}
+
+/*
+ * SQL-callable wrapper to obtain commit time of a transaction
+ */
+Datum
+pg_xact_commit_timestamp(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	TimestampTz		ts;
+
+	TransactionIdGetCommitTsData(xid, &ts, NULL);
+
+	if (TIMESTAMP_IS_NOBEGIN(ts))
+		PG_RETURN_NULL();
+
+	PG_RETURN_TIMESTAMPTZ(ts);
+}
+
+
+Datum
+pg_last_committed_xact(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid;
+	TimestampTz		ts;
+	Datum       values[2];
+	bool        nulls[2];
+	TupleDesc   tupdesc;
+	HeapTuple	htup;
+
+	/* and construct a tuple with our data */
+	xid = GetLatestCommitTsData(&ts, NULL);
+
+	/*
+	 * Construct a tuple descriptor for the result row.  This must match this
+	 * function's pg_proc entry!
+	 */
+	tupdesc = CreateTemplateTupleDesc(2, false);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "xid",
+					   XIDOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "timestamp",
+					   TIMESTAMPTZOID, -1, 0);
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	if (xid == InvalidTransactionId)
+	{
+		memset(nulls, true, sizeof(nulls));
+	}
+	else
+	{
+		values[0] = TransactionIdGetDatum(xid);
+		nulls[0] = false;
+
+		values[1] = TimestampTzGetDatum(ts);
+		nulls[1] = false;
+	}
+
+	htup = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+/*
+ * Number of shared CommitTS buffers.
+ *
+ * We use a very similar logic as for the number of CLOG buffers; see comments
+ * in CLOGShmemBuffers.
+ */
+Size
+CommitTsShmemBuffers(void)
+{
+	return Min(16, Max(4, NBuffers / 1024));
+}
+
+/*
+ * Initialization of shared memory for CommitTs
+ */
+Size
+CommitTsShmemSize(void)
+{
+	return SimpleLruShmemSize(CommitTsShmemBuffers(), 0) +
+		sizeof(CommitTimestampShared);
+}
+
+void
+CommitTsShmemInit(void)
+{
+	bool	found;
+
+	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
+	SimpleLruInit(CommitTsCtl, "CommitTs Ctl", CommitTsShmemBuffers(), 0,
+				  CommitTsControlLock, "pg_commit_ts");
+
+	commitTsShared = ShmemInitStruct("CommitTs shared",
+									 sizeof(CommitTimestampShared),
+									 &found);
+
+	if (!IsUnderPostmaster)
+	{
+		Assert(!found);
+
+		commitTsShared->xidLastCommit = InvalidTransactionId;
+		TIMESTAMP_NOBEGIN(commitTsShared->dataLastCommit.time);
+		commitTsShared->dataLastCommit.nodeid = InvalidNodeId;
+	}
+	else
+		Assert(found);
+}
+
+/*
+ * This function must be called ONCE on system install.
+ *
+ * (The CommitTs directory is assumed to have been created by initdb, and
+ * CommitTsShmemInit must have been called already.)
+ */
+void
+BootStrapCommitTs(void)
+{
+	/*
+	 * Nothing to do here at present, unlike most other SLRU modules; segments
+	 * are created when the server is started with this module enabled.
+	 * See StartupCommitTs.
+	 */
+}
+
+/*
+ * Initialize (or reinitialize) a page of CommitTs to zeroes.
+ * If writeXlog is TRUE, also emit an XLOG record saying we did this.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCommitTsPage(int pageno, bool writeXlog)
+{
+	int			slotno;
+
+	slotno = SimpleLruZeroPage(CommitTsCtl, pageno);
+
+	if (writeXlog)
+		WriteZeroPageXlogRec(pageno);
+
+	return slotno;
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ */
+void
+StartupCommitTs(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/*
+	 * Initialize our idea of the latest page number.
+	 */
+	CommitTsCtl->shared->latest_page_number = pageno;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * when commit timestamp is enabled.
+ * Must be called after recovery has finished.
+ *
+ * This is in charge of creating the currently active segment, if it's not
+ * already there.  The reason for this is that the server might have been
+ * running with this module disabled for a while and thus might have skipped
+ * the normal creation point.
+ */
+void
+InitCommitTs(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/*
+	 * Re-Initialize our idea of the latest page number.
+	 */
+	CommitTsCtl->shared->latest_page_number = pageno;
+
+	/*
+	 * If this module is not currently enabled, make sure we don't hand back
+	 * possibly-invalid data; also remove segments of old data.
+	 */
+	if (!track_commit_timestamp)
+	{
+		ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
+		LWLockRelease(CommitTsControlLock);
+
+		TruncateCommitTs(ReadNewTransactionId());
+
+		return;
+	}
+
+	/*
+	 * If CommitTs is enabled, but it wasn't in the previous server run, we
+	 * need to set the oldest value to the next Xid; that way, we will not try
+	 * to read data that might not have been set.
+	 *
+	 * XXX does this have a problem if a server is started with commitTs
+	 * enabled, then started with commitTs disabled, then restarted with it
+	 * enabled again?  It doesn't look like it does, because there should be a
+	 * checkpoint that sets the value to InvalidTransactionId at end of
+	 * recovery; and so any chance of injecting new transactions without
+	 * CommitTs values would occur after the oldestCommitTs has been set to
+	 * Invalid temporarily.
+	 */
+	if (ShmemVariableCache->oldestCommitTs == InvalidTransactionId)
+		ShmemVariableCache->oldestCommitTs = ReadNewTransactionId();
+
+	/* Finally, create the current segment file, if necessary */
+	if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
+	{
+		int		slotno;
+
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+	}
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, true);
+}
+
+/*
+ * Make sure that CommitTs has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CommitTs or xlog page to make room
+ * in shared memory.
+ *
+ * NB2: the current implementation relies on the fact that
+ * track_commit_timestamp is flagged as PGC_POSTMASTER
+ * (only possible to be set at server start).
+ */
+void
+ExtendCommitTs(TransactionId newestXact)
+{
+	int			pageno;
+
+	/* nothing to do if module not enabled */
+	if (!track_commit_timestamp)
+		return;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToCTsEntry(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToCTsPage(newestXact);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/* Zero the page and make an XLOG entry about it */
+	ZeroCommitTsPage(pageno, !InRecovery);
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Remove all CommitTs segments before the one holding the passed
+ * transaction ID
+ *
+ * Note that we don't need to flush XLOG here.
+ */
+void
+TruncateCommitTs(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate.
+	 */
+	cutoffPage = TransactionIdToCTsPage(oldestXact);
+
+	/* Check to see if there's any files that could be removed */
+	if (!SlruScanDirectory(CommitTsCtl, SlruScanDirCbReportPresence, &cutoffPage))
+		return;					/* nothing to remove */
+
+	/* Write XLOG record */
+	WriteTruncateXlogRec(cutoffPage);
+
+	/* Now we can remove the old CommitTs segment(s) */
+	SimpleLruTruncate(CommitTsCtl, cutoffPage);
+}
+
+/*
+ * Set the earliest value for which commit TS can be consulted.
+ */
+void
+SetCommitTsLimit(TransactionId oldestXact)
+{
+	/*
+	 * Be careful not to overwrite values that are either further into the
+	 * "future" or signal a disabled committs.
+	 */
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+	if (ShmemVariableCache->oldestCommitTs != InvalidTransactionId &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCommitTs, oldestXact))
+		ShmemVariableCache->oldestCommitTs = oldestXact;
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Decide which of two CLOG page numbers is "older" for truncation purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic.  However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CommitTsPagePrecedes(int page1, int page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * COMMIT_TS_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId;
+	xid2 = ((TransactionId) page2) * COMMIT_TS_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId;
+
+	return TransactionIdPrecedes(xid1, xid2);
+}
+
+
+/*
+ * Write a ZEROPAGE xlog record
+ */
+static void
+WriteZeroPageXlogRec(int pageno)
+{
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&pageno), sizeof(int));
+	(void) XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_ZEROPAGE);
+}
+
+/*
+ * Write a TRUNCATE xlog record
+ */
+static void
+WriteTruncateXlogRec(int pageno)
+{
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&pageno), sizeof(int));
+	(void) XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_TRUNCATE);
+}
+
+/*
+ * Write a SETTS xlog record
+ */
+static void
+WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 NodeIdRec nodeid)
+{
+	xl_commit_ts_set	record;
+
+	record.timestamp = timestamp;
+	record.nodeid = nodeid;
+	record.mainxid = mainxid;
+	record.nsubxids = nsubxids;
+	memcpy(record.subxids, subxids, sizeof(TransactionId) * nsubxids);
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&record), offsetof(xl_commit_ts_set, subxids) +
+										 nsubxids * sizeof(TransactionId));
+	XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_SETTS);
+}
+
+
+/*
+ * CommitTS resource manager's routines
+ */
+void
+commit_ts_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	/* Backup blocks are not used in commit_ts records */
+	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+	if (info == COMMIT_TS_ZEROPAGE)
+	{
+		int			pageno;
+		int			slotno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+
+		LWLockRelease(CommitTsControlLock);
+	}
+	else if (info == COMMIT_TS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		/*
+		 * During XLOG replay, latest_page_number isn't set up yet; insert a
+		 * suitable value to bypass the sanity test in SimpleLruTruncate.
+		 */
+		CommitTsCtl->shared->latest_page_number = pageno;
+
+		SimpleLruTruncate(CommitTsCtl, pageno);
+	}
+	else if (info == COMMIT_TS_SETTS)
+	{
+		xl_commit_ts_set *setts = (xl_commit_ts_set *) XLogRecGetData(record);
+
+		TransactionTreeSetCommitTsData(setts->mainxid, setts->nsubxids,
+									   setts->subxids, setts->timestamp,
+									   setts->nodeid, false);
+	}
+	else
+		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
+}
+
+/*
+ * Helper function for GUC
+ *
+ * Check if we can enable the track_commit_timestamp.
+ */
+bool
+check_track_commit_timestamp(bool *newval, void **extra, GucSource source)
+{
+	if (*newval && BLCKSZ < COMMIT_TS_MIN_BLCKSZ)
+	{
+		GUC_check_errmsg("Commit timestamps tacking cannot be enabled for builds with page size smaller than %d",
+						 COMMIT_TS_MIN_BLCKSZ);
+		return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index befd60f..f24861c 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -8,6 +8,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index d51cca4..d3287da 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -14,6 +14,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -158,9 +159,10 @@ GetNewTransactionId(bool isSubXact)
 	 * XID before we zero the page.  Fortunately, a page of the commit log
 	 * holds 32K or more transactions, so we don't have to do this very often.
 	 *
-	 * Extend pg_subtrans too.
+	 * Extend pg_subtrans and pg_committs too.
 	 */
 	ExtendCLOG(xid);
+	ExtendCommitTs(xid);
 	ExtendSUBTRANS(xid);
 
 	/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 763e9de..d72b9e5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -20,6 +20,7 @@
 #include <time.h>
 #include <unistd.h>
 
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1135,6 +1136,20 @@ RecordTransactionCommit(void)
 	}
 
 	/*
+	 * We only need to log the commit timestamp separately if the nodeid
+	 * is not InvalidNodeId since the commit record logged above already
+	 * contains the timestamp info and will be used to load it.
+	 */
+	if (markXidCommitted)
+	{
+		NodeIdRec nodeid = CommitTsGetDefaultNodeId();
+
+		TransactionTreeSetCommitTsData(xid, nchildren, children,
+									   xactStopTimestamp,
+									   nodeid, nodeid != InvalidNodeId);
+	}
+
+	/*
 	 * Check if we want to commit asynchronously.  We can allow the XLOG flush
 	 * to happen asynchronously if synchronous_commit=off, or if the current
 	 * transaction has not performed any WAL-logged operation.  The latter
@@ -4644,6 +4659,7 @@ xactGetCommittedChildren(TransactionId **ptr)
  */
 static void
 xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
+						  TimestampTz commit_time,
 						  TransactionId *sub_xids, int nsubxacts,
 						  SharedInvalidationMessage *inval_msgs, int nmsgs,
 						  RelFileNode *xnodes, int nrels,
@@ -4671,6 +4687,10 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
 		LWLockRelease(XidGenLock);
 	}
 
+	/* Set the transaction commit timestamp and metadata */
+	TransactionTreeSetCommitTsData(xid, nsubxacts, sub_xids,
+								   commit_time, InvalidNodeId, false);
+
 	if (standbyState == STANDBY_DISABLED)
 	{
 		/*
@@ -4790,7 +4810,8 @@ xact_redo_commit(xl_xact_commit *xlrec,
 	/* invalidation messages array follows subxids */
 	inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
 
-	xact_redo_commit_internal(xid, lsn, subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  subxacts, xlrec->nsubxacts,
 							  inval_msgs, xlrec->nmsgs,
 							  xlrec->xnodes, xlrec->nrels,
 							  xlrec->dbId,
@@ -4805,7 +4826,8 @@ static void
 xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
 						 TransactionId xid, XLogRecPtr lsn)
 {
-	xact_redo_commit_internal(xid, lsn, xlrec->subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  xlrec->subxacts, xlrec->nsubxacts,
 							  NULL, 0,	/* inval msgs */
 							  NULL, 0,	/* relfilenodes */
 							  InvalidOid,		/* dbId */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2059bbe..6277c77 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -22,6 +22,7 @@
 #include <unistd.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -4518,6 +4519,7 @@ BootStrapXLOG(void)
 	checkPoint.oldestXidDB = TemplateDbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
+	checkPoint.oldestCommitTs = InvalidTransactionId;
 	checkPoint.time = (pg_time_t) time(NULL);
 	checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -4527,6 +4529,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(InvalidTransactionId);
 
 	/* Set up the XLOG page header */
 	page->xlp_magic = XLOG_PAGE_MAGIC;
@@ -4606,6 +4609,7 @@ BootStrapXLOG(void)
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
+	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
@@ -4614,6 +4618,7 @@ BootStrapXLOG(void)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
 
@@ -5865,6 +5870,9 @@ StartupXLOG(void)
 	ereport(DEBUG1,
 			(errmsg("oldest MultiXactId: %u, in database %u",
 					checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
+	ereport(DEBUG1,
+			(errmsg("oldest commit timestamp Xid: %u",
+					checkPoint.oldestCommitTs)));
 	if (!TransactionIdIsNormal(checkPoint.nextXid))
 		ereport(PANIC,
 				(errmsg("invalid next transaction ID")));
@@ -5876,6 +5884,7 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(checkPoint.oldestCommitTs);
 	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
 	XLogCtl->ckptXid = checkPoint.nextXid;
@@ -6098,11 +6107,12 @@ StartupXLOG(void)
 			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
 
 			/*
-			 * Startup commit log and subtrans only. MultiXact has already
-			 * been started up and other SLRUs are not maintained during
-			 * recovery and need not be started yet.
+			 * Startup commit log, commit timestamp and subtrans
+			 * only. MultiXact has already been started up and other SLRUs are
+			 * not maintained during recovery and need not be started yet.
 			 */
 			StartupCLOG();
+			StartupCommitTs();
 			StartupSUBTRANS(oldestActiveXID);
 
 			/*
@@ -6751,12 +6761,13 @@ StartupXLOG(void)
 	LWLockRelease(ProcArrayLock);
 
 	/*
-	 * Start up the commit log and subtrans, if not already done for hot
-	 * standby.
+	 * Start up the commit log, commit timestamp and subtrans, if not already
+	 * done for hot standby.
 	 */
 	if (standbyState == STANDBY_DISABLED)
 	{
 		StartupCLOG();
+		StartupCommitTs();
 		StartupSUBTRANS(oldestActiveXID);
 	}
 
@@ -6792,6 +6803,12 @@ StartupXLOG(void)
 	XLogReportParameters();
 
 	/*
+	 * Local WAL inserts enables, so it's time to finish initialization
+	 * of commit timestamp.
+	 */
+	InitCommitTs();
+
+	/*
 	 * All done.  Allow backends to write WAL.  (Although the bool flag is
 	 * probably atomic in itself, we use the info_lck here to ensure that
 	 * there are no race conditions concerning visibility of other recent
@@ -7358,6 +7375,7 @@ ShutdownXLOG(int code, Datum arg)
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
 	ShutdownCLOG();
+	ShutdownCommitTs();
 	ShutdownSUBTRANS();
 	ShutdownMultiXact();
 
@@ -7684,6 +7702,10 @@ CreateCheckPoint(int flags)
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
 	LWLockRelease(XidGenLock);
 
+	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+	checkPoint.oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	LWLockRelease(CommitTsControlLock);
+
 	/* Increase XID epoch if we've wrapped around since last checkpoint */
 	checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
 	if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
@@ -7961,6 +7983,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointCLOG();
+	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
 	CheckPointPredicate();
@@ -8389,7 +8412,8 @@ XLogReportParameters(void)
 		MaxConnections != ControlFile->MaxConnections ||
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
-		max_locks_per_xact != ControlFile->max_locks_per_xact)
+		max_locks_per_xact != ControlFile->max_locks_per_xact ||
+		track_commit_timestamp != ControlFile->track_commit_timestamp)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -8409,6 +8433,7 @@ XLogReportParameters(void)
 			xlrec.max_locks_per_xact = max_locks_per_xact;
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
+			xlrec.track_commit_timestamp = track_commit_timestamp;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -8423,6 +8448,7 @@ XLogReportParameters(void)
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
+		ControlFile->track_commit_timestamp = track_commit_timestamp;
 		UpdateControlFile();
 	}
 }
@@ -8795,6 +8821,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
+		ControlFile->track_commit_timestamp = track_commit_timestamp;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6384dc7..23b5248 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -23,6 +23,7 @@
 #include <math.h>
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -1071,10 +1072,12 @@ vac_truncate_clog(TransactionId frozenXID,
 		return;
 
 	/*
-	 * Truncate CLOG to the oldest computed value.  Note we don't truncate
-	 * multixacts; that will be done by the next checkpoint.
+	 * Truncate CLOG and CommitTs to the oldest computed value.
+	 * Note we don't truncate multixacts; that will be done by the next
+	 * checkpoint.
 	 */
 	TruncateCLOG(frozenXID);
+	TruncateCommitTs(frozenXID);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
@@ -1084,6 +1087,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
 	SetMultiXactIdLimit(minMulti, minmulti_datoid);
+	SetCommitTsLimit(frozenXID);
 }
 
 
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 1c7dac3..bc2574c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -133,6 +133,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SEQ_ID:
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
+		case RM_COMMIT_TS_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 1d04c55..9025601 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -117,6 +118,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
 		size = add_size(size, CLOGShmemSize());
+		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
 		size = add_size(size, TwoPhaseShmemSize());
 		size = add_size(size, BackgroundWorkerShmemSize());
@@ -198,6 +200,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	 */
 	XLOGShmemInit();
 	CLOGShmemInit();
+	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
 	InitBufferPool();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 719181c..4b4b4bf 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "commands/async.h"
@@ -259,6 +260,9 @@ NumLWLocks(void)
 	/* clog.c needs one per CLOG buffer */
 	numLocks += CLOGShmemBuffers();
 
+	/* committs.c needs one per CommitTs buffer */
+	numLocks += CommitTsShmemBuffers();
+
 	/* subtrans.c needs one per SubTrans buffer */
 	numLocks += NUM_SUBTRANS_BUFFERS;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 23cbe90..0bd3616 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -26,6 +26,7 @@
 #include <syslog.h>
 #endif
 
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -826,6 +827,15 @@ static struct config_bool ConfigureNamesBool[] =
 		check_bonjour, NULL, NULL
 	},
 	{
+		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
+			gettext_noop("Collects transaction commit time."),
+			NULL
+		},
+		&track_commit_timestamp,
+		false,
+		check_track_commit_timestamp, NULL, NULL
+	},
+	{
 		{"ssl", PGC_POSTMASTER, CONN_AUTH_SECURITY,
 			gettext_noop("Enables SSL connections."),
 			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4a89cb7..49141b2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -227,6 +227,7 @@
 #wal_sender_timeout = 60s	# in milliseconds; 0 disables
 
 #max_replication_slots = 0	# max number of replication slots
+#track_commit_timestamp = off	# collect timestamp of transaction commit
 				# (change requires restart)
 
 # - Master Server -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3b52867..3bee657 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -186,6 +186,7 @@ static const char *subdirs[] = {
 	"pg_xlog",
 	"pg_xlog/archive_status",
 	"pg_clog",
+	"pg_commit_ts",
 	"pg_dynshmem",
 	"pg_notify",
 	"pg_serial",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index b2e0793..a838bb5 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -270,6 +270,8 @@ main(int argc, char *argv[])
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
@@ -300,6 +302,8 @@ main(int argc, char *argv[])
 		   ControlFile.max_prepared_xacts);
 	printf(_("Current max_locks_per_xact setting:   %d\n"),
 		   ControlFile.max_locks_per_xact);
+	printf(_("Current track_commit_timestamp setting: %s\n"),
+		   ControlFile.track_commit_timestamp ? _("on") : _("off"));
 	printf(_("Maximum data alignment:               %u\n"),
 		   ControlFile.maxAlign);
 	/* we don't print floatFormat since can't say much useful about it */
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 666e8db..8f67c18 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -63,6 +63,7 @@ static bool guessed = false;	/* T if we had to guess at any values */
 static const char *progname;
 static uint32 set_xid_epoch = (uint32) -1;
 static TransactionId set_xid = 0;
+static TransactionId set_commit_ts = 0;
 static Oid	set_oid = 0;
 static MultiXactId set_mxid = 0;
 static MultiXactOffset set_mxoff = (MultiXactOffset) -1;
@@ -112,7 +113,7 @@ main(int argc, char *argv[])
 	}
 
 
-	while ((c = getopt(argc, argv, "D:fl:m:no:O:x:e:")) != -1)
+	while ((c = getopt(argc, argv, "c:D:e:fl:m:no:O:x:")) != -1)
 	{
 		switch (c)
 		{
@@ -158,6 +159,21 @@ main(int argc, char *argv[])
 				}
 				break;
 
+			case 'c':
+				set_commit_ts = strtoul(optarg, &endptr, 0);
+				if (endptr == optarg || *endptr != '\0')
+				{
+					fprintf(stderr, _("%s: invalid argument for option -c\n"), progname);
+					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+					exit(1);
+				}
+				if (set_commit_ts == 0)
+				{
+					fprintf(stderr, _("%s: transaction ID (-c) must not be 0\n"), progname);
+					exit(1);
+				}
+				break;
+
 			case 'o':
 				set_oid = strtoul(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0')
@@ -345,6 +361,9 @@ main(int argc, char *argv[])
 		ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
 	}
 
+	if (set_commit_ts != 0)
+		ControlFile.checkPointCopy.oldestCommitTs = set_commit_ts;
+
 	if (set_oid != 0)
 		ControlFile.checkPointCopy.nextOid = set_oid;
 
@@ -539,6 +558,7 @@ GuessControlValues(void)
 
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
+	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
@@ -621,6 +641,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Maximum data alignment:               %u\n"),
 		   ControlFile.maxAlign);
 	/* we don't print floatFormat since can't say much useful about it */
@@ -702,6 +724,12 @@ PrintNewControlValues()
 		printf(_("NextXID epoch:                        %u\n"),
 			   ControlFile.checkPointCopy.nextXidEpoch);
 	}
+
+	if (set_commit_ts != 0)
+	{
+		printf(_("oldestCommitTs:                       %u\n"),
+			   ControlFile.checkPointCopy.oldestCommitTs);
+	}
 }
 
 
@@ -739,6 +767,7 @@ RewriteControlFile(void)
 	 */
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
+	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
@@ -1099,6 +1128,7 @@ usage(void)
 	printf(_("%s resets the PostgreSQL transaction log.\n\n"), progname);
 	printf(_("Usage:\n  %s [OPTION]... {[-D] DATADIR}\n\n"), progname);
 	printf(_("Options:\n"));
+	printf(_("  -c XID           set the oldest transaction with retrievable commit timestamp\n"));
 	printf(_("  -e XIDEPOCH      set next transaction ID epoch\n"));
 	printf(_("  -f               force update to be done\n"));
 	printf(_("  -l XLOGFILE      force minimum WAL starting location for new transaction log\n"));
diff --git a/src/include/access/committs.h b/src/include/access/committs.h
new file mode 100644
index 0000000..05507ca
--- /dev/null
+++ b/src/include/access/committs.h
@@ -0,0 +1,70 @@
+/*
+ * committs.h
+ *
+ * PostgreSQL commit timestamp manager
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/committs.h
+ */
+#ifndef COMMITTS_H
+#define COMMITTS_H
+
+#include "access/xlog.h"
+#include "datatype/timestamp.h"
+#include "utils/guc.h"
+
+extern PGDLLIMPORT bool	track_commit_timestamp;
+extern bool check_track_commit_timestamp(bool *newval, void **extra,
+										 GucSource source);
+
+typedef uint32 NodeIdRec;
+
+#define InvalidNodeId 0
+
+extern void CommitTsSetDefaultNodeId(NodeIdRec nodeid);
+extern NodeIdRec CommitTsGetDefaultNodeId(void);
+extern void TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
+										   TransactionId *subxids,
+										   TimestampTz timestamp,
+										   NodeIdRec nodeid,
+										   bool do_xlog);
+extern void TransactionIdGetCommitTsData(TransactionId xid,
+										 TimestampTz *ts,
+										 NodeIdRec *nodeid);
+extern TransactionId GetLatestCommitTsData(TimestampTz *ts,
+										   NodeIdRec *nodeid);
+
+extern Size CommitTsShmemBuffers(void);
+extern Size CommitTsShmemSize(void);
+extern void CommitTsShmemInit(void);
+extern void BootStrapCommitTs(void);
+extern void StartupCommitTs(void);
+extern void InitCommitTs(void);
+extern void ShutdownCommitTs(void);
+extern void CheckPointCommitTs(void);
+extern void ExtendCommitTs(TransactionId newestXact);
+extern void TruncateCommitTs(TransactionId oldestXact);
+extern void SetCommitTsLimit(TransactionId oldestXact);
+
+/* XLOG stuff */
+#define COMMIT_TS_ZEROPAGE		0x00
+#define COMMIT_TS_TRUNCATE		0x10
+#define COMMIT_TS_SETTS			0x20
+
+typedef struct xl_commit_ts_set
+{
+	TimestampTz		timestamp;
+	NodeIdRec		nodeid;
+	TransactionId	mainxid;
+	int				nsubxids;
+	TransactionId	subxids[FLEXIBLE_ARRAY_MEMBER];
+} xl_commit_ts_set;
+
+
+extern void commit_ts_redo(XLogReaderState *record);
+extern void commit_ts_desc(StringInfo buf, XLogReaderState *record);
+extern const char *commit_ts_identify(uint8 info);
+
+#endif   /* COMMITTS_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 76a6421..27168c3 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -24,7 +24,7 @@
  * Changes to this list possibly need a XLOG_PAGE_MAGIC bump.
  */
 
-/* symbol name, textual name, redo, desc, startup, cleanup */
+/* symbol name, textual name, redo, desc, identify, startup, cleanup */
 PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
 PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
 PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
@@ -43,3 +43,4 @@ PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_start
 PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 32d1b29..b59fd98 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -124,6 +124,11 @@ typedef struct VariableCacheData
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
 
 	/*
+	 * These fields are protected by CommitTsControlLock
+	 */
+	TransactionId oldestCommitTs;
+
+	/*
 	 * These fields are protected by ProcArrayLock.
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 423ef4d..7245d8d 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -186,6 +186,7 @@ typedef struct xl_parameter_change
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
+	bool		track_commit_timestamp;
 } xl_parameter_change;
 
 /* logs restore point */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ba79d25..70afbd1 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -46,6 +46,7 @@ typedef struct CheckPoint
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
+	TransactionId oldestCommitTs; /* oldest Xid with valid commit timestamp */
 
 	/*
 	 * Oldest XID still running. This is only needed to initialize hot standby
@@ -176,6 +177,7 @@ typedef struct ControlFileData
 	int			max_worker_processes;
 	int			max_prepared_xacts;
 	int			max_locks_per_xact;
+	bool		track_commit_timestamp;
 
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 5d4e889..da93201 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3017,6 +3017,12 @@ DESCR("view two-phase transactions");
 DATA(insert OID = 3819 (  pg_get_multixact_members PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 2249 "28" "{28,28,25}" "{i,o,o}" "{multixid,xid,mode}" _null_ pg_get_multixact_members _null_ _null_ _null_ ));
 DESCR("view members of a multixactid");
 
+DATA(insert OID = 3581 ( pg_xact_commit_timestamp PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 1184 "28" _null_ _null_ _null_ _null_ pg_xact_commit_timestamp _null_ _null_ _null_ ));
+DESCR("get commit timestamp of a transaction");
+
+DATA(insert OID = 3583 ( pg_last_committed_xact PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 2249 "" "{28,1184}" "{o,o}" "{xid,timestamp}" _null_ pg_last_committed_xact _null_ _null_ _null_ ));
+DESCR("get transaction Id and commit timestamp of latest transaction commit");
+
 DATA(insert OID = 3537 (  pg_describe_object		PGNSP PGUID 12 1 0 0 0 f f f f t f s 3 0 25 "26 26 23" _null_ _null_ _null_ _null_ pg_describe_object _null_ _null_ _null_ ));
 DESCR("get identification of SQL object");
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 91cab87..09654a8 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -127,7 +127,10 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define AutoFileLock				(&MainLWLockArray[35].lock)
 #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
 #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
-#define NUM_INDIVIDUAL_LWLOCKS		38
+#define CommitTsControlLock			(&MainLWLockArray[38].lock)
+#define CommitTsLock				(&MainLWLockArray[39].lock)
+
+#define NUM_INDIVIDUAL_LWLOCKS		40
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 417fd17..565cff3 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1187,6 +1187,10 @@ extern Datum pg_prepared_xact(PG_FUNCTION_ARGS);
 /* access/transam/multixact.c */
 extern Datum pg_get_multixact_members(PG_FUNCTION_ARGS);
 
+/* access/transam/committs.c */
+extern Datum pg_xact_commit_timestamp(PG_FUNCTION_ARGS);
+extern Datum pg_last_committed_xact(PG_FUNCTION_ARGS);
+
 /* catalogs/dependency.c */
 extern Datum pg_describe_object(PG_FUNCTION_ARGS);
 extern Datum pg_identify_object(PG_FUNCTION_ARGS);
diff --git a/src/test/regress/expected/committs.out b/src/test/regress/expected/committs.out
new file mode 100644
index 0000000..cb1ea46
--- /dev/null
+++ b/src/test/regress/expected/committs.out
@@ -0,0 +1,25 @@
+--
+-- Commit Timestamp
+--
+SHOW track_commit_timestamp;
+ track_commit_timestamp 
+------------------------
+ off
+(1 row)
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ERROR:  Cannot get commit timestamp data because "track_commit_timestamp" is not enabled
+DROP TABLE committs_test;
+SELECT pg_xact_commit_timestamp('0'::xid);
+ERROR:  Cannot get commit timestamp data because "track_commit_timestamp" is not enabled
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
+ERROR:  Cannot get commit timestamp data because "track_commit_timestamp" is not enabled
diff --git a/src/test/regress/expected/committs_1.out b/src/test/regress/expected/committs_1.out
new file mode 100644
index 0000000..c1d24c5
--- /dev/null
+++ b/src/test/regress/expected/committs_1.out
@@ -0,0 +1,39 @@
+--
+-- Commit Timestamp
+--
+SHOW track_commit_timestamp;
+ track_commit_timestamp 
+------------------------
+ on
+(1 row)
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | ?column? | ?column? | ?column? 
+----+----------+----------+----------
+  1 | t        | t        | t
+  2 | t        | t        | t
+  3 | t        | t        | t
+(3 rows)
+
+DROP TABLE committs_test;
+SELECT pg_xact_commit_timestamp('0'::xid);
+ pg_xact_commit_timestamp 
+--------------------------
+ 
+(1 row)
+
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
+ ?column? | ?column? | ?column? 
+----------+----------+----------
+ t        | t        | t
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index e1afd4b..324b01a 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -88,7 +88,7 @@ test: brin gin gist spgist privileges security_label collate matview lock replic
 # ----------
 # Another group of parallel tests
 # ----------
-test: alter_generic misc psql async
+test: alter_generic misc psql async committs
 
 # rules cannot run concurrently with any test that creates a view
 test: rules
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index e609ab0..4de606a 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -151,3 +151,4 @@ test: largeobject
 test: with
 test: xml
 test: stats
+test: committs
diff --git a/src/test/regress/sql/committs.sql b/src/test/regress/sql/committs.sql
new file mode 100644
index 0000000..a72705d
--- /dev/null
+++ b/src/test/regress/sql/committs.sql
@@ -0,0 +1,23 @@
+--
+-- Commit Timestamp
+--
+SHOW track_commit_timestamp;
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+
+DROP TABLE committs_test;
+
+SELECT pg_xact_commit_timestamp('0'::xid);
+
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
#118Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Petr Jelinek (#117)
1 attachment(s)
Re: tracking commit timestamps

And here is v10 which fixes conflicts with Heikki's WAL API changes (no
changes otherwise).

After some slight additional changes, here's v11, which I intend to
commit early tomorrow. The main change is moving the test module from
contrib to src/test/modules.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

committs-v11.patchtext/x-diff; charset=us-asciiDownload
*** a/contrib/pg_upgrade/pg_upgrade.c
--- b/contrib/pg_upgrade/pg_upgrade.c
***************
*** 423,430 **** copy_clog_xlog_xid(void)
  	/* set the next transaction id and epoch of the new cluster */
  	prep_status("Setting next transaction ID and epoch for new cluster");
  	exec_prog(UTILITY_LOG_FILE, NULL, true,
! 			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
! 			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
  			  new_cluster.pgdata);
  	exec_prog(UTILITY_LOG_FILE, NULL, true,
  			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
--- 423,432 ----
  	/* set the next transaction id and epoch of the new cluster */
  	prep_status("Setting next transaction ID and epoch for new cluster");
  	exec_prog(UTILITY_LOG_FILE, NULL, true,
! 			  "\"%s/pg_resetxlog\" -f -x %u -c %u \"%s\"",
! 			  new_cluster.bindir,
! 			  old_cluster.controldata.chkpnt_nxtxid,
! 			  old_cluster.controldata.chkpnt_nxtxid,
  			  new_cluster.pgdata);
  	exec_prog(UTILITY_LOG_FILE, NULL, true,
  			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
*** a/contrib/pg_xlogdump/rmgrdesc.c
--- b/contrib/pg_xlogdump/rmgrdesc.c
***************
*** 10,15 ****
--- 10,16 ----
  
  #include "access/brin_xlog.h"
  #include "access/clog.h"
+ #include "access/commit_ts.h"
  #include "access/gin.h"
  #include "access/gist_private.h"
  #include "access/hash.h"
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2673,2678 **** include_dir 'conf.d'
--- 2673,2692 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
+       <term><varname>track_commit_timestamp</varname> (<type>bool</type>)</term>
+       <indexterm>
+        <primary><varname>track_commit_timestamp</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Record commit time of transactions. This parameter
+         can only be set in <filename>postgresql.conf</> file or on the server
+         command line. The default value is <literal>off</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       </variablelist>
      </sect2>
  
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 15923,15928 **** SELECT collation for ('foo' COLLATE "de_DE");
--- 15923,15960 ----
      For example <literal>10:20:10,14,15</literal> means
      <literal>xmin=10, xmax=20, xip_list=10, 14, 15</literal>.
     </para>
+ 
+    <para>
+     The functions shown in <xref linkend="functions-committs">
+     provide information about transactions that have been already committed.
+     These functions mainly provide information about when the transactions
+     were committed. They only provide useful data when
+     <xref linkend="guc-track-commit-timestamp"> configuration option is enabled
+     and only for transactions that were committed after it was enabled.
+    </para>
+ 
+    <table id="functions-committs">
+     <title>Committed transaction information</title>
+     <tgroup cols="3">
+      <thead>
+       <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry></row>
+      </thead>
+ 
+      <tbody>
+       <row>
+        <entry><literal><function>pg_xact_commit_timestamp(<parameter>xid</parameter>)</function></literal></entry>
+        <entry><type>timestamp with time zone</type></entry>
+        <entry>get commit timestamp of a transaction</entry>
+       </row>
+       <row>
+        <entry><literal><function>pg_last_committed_xact()</function></literal></entry>
+        <entry><parameter>xid</> <type>xid</>, <parameter>timestamp</> <type>timestamp with time zone</></entry>
+        <entry>get transaction Id and commit timestamp of latest transaction commit</entry>
+       </row>
+      </tbody>
+     </tgroup>
+    </table>
+ 
    </sect1>
  
    <sect1 id="functions-admin">
*** a/doc/src/sgml/ref/pg_resetxlog.sgml
--- b/doc/src/sgml/ref/pg_resetxlog.sgml
***************
*** 22,27 **** PostgreSQL documentation
--- 22,28 ----
   <refsynopsisdiv>
    <cmdsynopsis>
     <command>pg_resetxlog</command>
+    <arg choice="opt"><option>-c</option> <replaceable class="parameter">xid</replaceable></arg>
     <arg choice="opt"><option>-f</option></arg>
     <arg choice="opt"><option>-n</option></arg>
     <arg choice="opt"><option>-o</option> <replaceable class="parameter">oid</replaceable></arg>
***************
*** 77,88 **** PostgreSQL documentation
    </para>
  
    <para>
!    The <option>-o</>, <option>-x</>, <option>-e</>,
!    <option>-m</>, <option>-O</>,
!    and <option>-l</>
     options allow the next OID, next transaction ID, next transaction ID's
!    epoch, next and oldest multitransaction ID, next multitransaction offset, and WAL
!    starting address values to be set manually.  These are only needed when
     <command>pg_resetxlog</command> is unable to determine appropriate values
     by reading <filename>pg_control</>.  Safe values can be determined as
     follows:
--- 78,89 ----
    </para>
  
    <para>
!    The <option>-o</>, <option>-x</>, <option>-m</>, <option>-O</>,
!    <option>-l</> and <option>-e</>
     options allow the next OID, next transaction ID, next transaction ID's
!    epoch, next and oldest multitransaction ID, next multitransaction offset, WAL
!    starting address and the oldest transaction ID for which the commit time can
!    be retrieved values to be set manually.  These are only needed when
     <command>pg_resetxlog</command> is unable to determine appropriate values
     by reading <filename>pg_control</>.  Safe values can be determined as
     follows:
***************
*** 130,135 **** PostgreSQL documentation
--- 131,145 ----
  
      <listitem>
       <para>
+       A safe value for the oldest transaction ID for which the commit time can
+       be retrieved (<option>-c</>) can be determined by looking for the
+       numerically smallest file name in the directory <filename>pg_committs</>
+       under the data directory.  As above, the file names are in hexadecimal.
+      </para>
+     </listitem>
+ 
+     <listitem>
+      <para>
        The WAL starting address (<option>-l</>) should be
        larger than any WAL segment file name currently existing in
        the directory <filename>pg_xlog</> under the data directory.
*** a/doc/src/sgml/storage.sgml
--- b/doc/src/sgml/storage.sgml
***************
*** 67,72 **** Item
--- 67,77 ----
  </row>
  
  <row>
+  <entry><filename>pg_commit_ts</></entry>
+  <entry>Subdirectory containing transaction commit timestamp data</entry>
+ </row>
+ 
+ <row>
   <entry><filename>pg_clog</></entry>
   <entry>Subdirectory containing transaction commit status data</entry>
  </row>
*** a/src/backend/access/rmgrdesc/Makefile
--- b/src/backend/access/rmgrdesc/Makefile
***************
*** 8,14 **** subdir = src/backend/access/rmgrdesc
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = brindesc.o clogdesc.o dbasedesc.o gindesc.o gistdesc.o \
  	   hashdesc.o heapdesc.o \
  	   mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
  	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
--- 8,14 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
  	   hashdesc.o heapdesc.o \
  	   mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
  	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
*** /dev/null
--- b/src/backend/access/rmgrdesc/committsdesc.c
***************
*** 0 ****
--- 1,82 ----
+ /*-------------------------------------------------------------------------
+  *
+  * committsdesc.c
+  *    rmgr descriptor routines for access/transam/committs.c
+  *
+  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *    src/backend/access/rmgrdesc/committsdesc.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+ 
+ #include "access/commit_ts.h"
+ #include "utils/timestamp.h"
+ 
+ 
+ void
+ commit_ts_desc(StringInfo buf, XLogReaderState *record)
+ {
+ 	char	   *rec = XLogRecGetData(record);
+ 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+ 
+ 	if (info == COMMIT_TS_ZEROPAGE)
+ 	{
+ 		int			pageno;
+ 
+ 		memcpy(&pageno, rec, sizeof(int));
+ 		appendStringInfo(buf, "%d", pageno);
+ 	}
+ 	else if (info == COMMIT_TS_TRUNCATE)
+ 	{
+ 		int			pageno;
+ 
+ 		memcpy(&pageno, rec, sizeof(int));
+ 		appendStringInfo(buf, "%d", pageno);
+ 	}
+ 	else if (info == COMMIT_TS_SETTS)
+ 	{
+ 		xl_commit_ts_set *xlrec = (xl_commit_ts_set *) rec;
+ 		int		nsubxids;
+ 
+ 		appendStringInfo(buf, "set %s/%d for: %u",
+ 						 timestamptz_to_str(xlrec->timestamp),
+ 						 xlrec->nodeid,
+ 						 xlrec->mainxid);
+ 		nsubxids = ((XLogRecGetDataLen(record) - SizeOfCommitTsSet) /
+ 					sizeof(TransactionId));
+ 		if (nsubxids > 0)
+ 		{
+ 			int		i;
+ 			TransactionId *subxids;
+ 
+ 			subxids = palloc(sizeof(TransactionId) * nsubxids);
+ 			memcpy(subxids,
+ 				   XLogRecGetData(record) + SizeOfCommitTsSet,
+ 				   sizeof(TransactionId) * nsubxids);
+ 			for (i = 0; i < nsubxids; i++)
+ 				appendStringInfo(buf, ", %u", subxids[i]);
+ 			pfree(subxids);
+ 		}
+ 	}
+ }
+ 
+ const char *
+ commit_ts_identify(uint8 info)
+ {
+ 	switch (info)
+ 	{
+ 		case COMMIT_TS_ZEROPAGE:
+ 			return "ZEROPAGE";
+ 		case COMMIT_TS_TRUNCATE:
+ 			return "TRUNCATE";
+ 		case COMMIT_TS_SETTS:
+ 			return "SETTS";
+ 		default:
+ 			return NULL;
+ 	}
+ }
*** a/src/backend/access/rmgrdesc/xlogdesc.c
--- b/src/backend/access/rmgrdesc/xlogdesc.c
***************
*** 45,51 **** xlog_desc(StringInfo buf, XLogReaderState *record)
  		appendStringInfo(buf, "redo %X/%X; "
  						 "tli %u; prev tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
  						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
! 						 "oldest running xid %u; %s",
  				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
  						 checkpoint->ThisTimeLineID,
  						 checkpoint->PrevTimeLineID,
--- 45,51 ----
  		appendStringInfo(buf, "redo %X/%X; "
  						 "tli %u; prev tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
  						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
! 						 "oldest commit timestamp xid: %u; oldest running xid %u; %s",
  				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
  						 checkpoint->ThisTimeLineID,
  						 checkpoint->PrevTimeLineID,
***************
*** 58,63 **** xlog_desc(StringInfo buf, XLogReaderState *record)
--- 58,64 ----
  						 checkpoint->oldestXidDB,
  						 checkpoint->oldestMulti,
  						 checkpoint->oldestMultiDB,
+ 						 checkpoint->oldestCommitTs,
  						 checkpoint->oldestActiveXid,
  				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
  	}
*** a/src/backend/access/transam/Makefile
--- b/src/backend/access/transam/Makefile
***************
*** 12,19 **** subdir = src/backend/access/transam
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
! 	timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
  	xloginsert.o xlogreader.o xlogutils.o
  
  include $(top_srcdir)/src/backend/common.mk
--- 12,20 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = clog.o commit_ts.o multixact.o rmgr.o slru.o subtrans.o \
! 	timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
! 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
  	xloginsert.o xlogreader.o xlogutils.o
  
  include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/transam/clog.c
--- b/src/backend/access/transam/clog.c
***************
*** 419,425 **** TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
   *
   * Testing during the PostgreSQL 9.2 development cycle revealed that on a
   * large multi-processor system, it was possible to have more CLOG page
!  * requests in flight at one time than the numebr of CLOG buffers which existed
   * at that time, which was hardcoded to 8.  Further testing revealed that
   * performance dropped off with more than 32 CLOG buffers, possibly because
   * the linear buffer search algorithm doesn't scale well.
--- 419,425 ----
   *
   * Testing during the PostgreSQL 9.2 development cycle revealed that on a
   * large multi-processor system, it was possible to have more CLOG page
!  * requests in flight at one time than the number of CLOG buffers which existed
   * at that time, which was hardcoded to 8.  Further testing revealed that
   * performance dropped off with more than 32 CLOG buffers, possibly because
   * the linear buffer search algorithm doesn't scale well.
*** /dev/null
--- b/src/backend/access/transam/commit_ts.c
***************
*** 0 ****
--- 1,848 ----
+ /*-------------------------------------------------------------------------
+  *
+  * commit_ts.c
+  *		PostgreSQL commit timestamp manager
+  *
+  * This module is a pg_clog-like system that stores the commit timestamp
+  * for each transaction.
+  *
+  * XLOG interactions: this module generates an XLOG record whenever a new
+  * CommitTs page is initialized to zeroes.  Also, one XLOG record is
+  * generated for setting of values when the caller requests it; this allows
+  * us to support values coming from places other than transaction commit.
+  * Other writes of CommitTS come from recording of transaction commit in
+  * xact.c, which generates its own XLOG records for these events and will
+  * re-perform the status update on redo; so we need make no additional XLOG
+  * entry here.
+  *
+  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/backend/access/transam/commit_ts.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+ 
+ #include "access/commit_ts.h"
+ #include "access/htup_details.h"
+ #include "access/slru.h"
+ #include "access/transam.h"
+ #include "catalog/pg_type.h"
+ #include "funcapi.h"
+ #include "miscadmin.h"
+ #include "pg_trace.h"
+ #include "utils/builtins.h"
+ #include "utils/snapmgr.h"
+ #include "utils/timestamp.h"
+ 
+ /*
+  * Defines for CommitTs page sizes.  A page is the same BLCKSZ as is used
+  * everywhere else in Postgres.
+  *
+  * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+  * CommitTs page numbering also wraps around at
+  * 0xFFFFFFFF/COMMIT_TS_XACTS_PER_PAGE, and CommitTs segment numbering at
+  * 0xFFFFFFFF/COMMIT_TS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+  * explicit notice of that fact in this module, except when comparing segment
+  * and page numbers in TruncateCommitTs (see CommitTsPagePrecedes).
+  */
+ 
+ /*
+  * We need 8+4 bytes per xact.  Note that enlarging this struct might mean
+  * the largest possible file name is more than 5 chars long; see
+  * SlruScanDirectory.
+  */
+ typedef struct CommitTimestampEntry
+ {
+ 	TimestampTz		time;
+ 	CommitTsNodeId	nodeid;
+ } CommitTimestampEntry;
+ 
+ #define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, nodeid) + \
+ 									sizeof(CommitTsNodeId))
+ 
+ #define COMMIT_TS_XACTS_PER_PAGE \
+ 	(BLCKSZ / SizeOfCommitTimestampEntry)
+ 
+ #define TransactionIdToCTsPage(xid)	\
+ 	((xid) / (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+ #define TransactionIdToCTsEntry(xid)	\
+ 	((xid) % (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+ 
+ /*
+  * Link to shared-memory data structures for CommitTs control
+  */
+ static SlruCtlData CommitTsCtlData;
+ 
+ #define CommitTsCtl (&CommitTsCtlData)
+ 
+ /*
+  * We keep a cache of the last value set in shared memory.  This is protected
+  * by CommitTsLock.
+  */
+ typedef struct CommitTimestampShared
+ {
+ 	TransactionId	xidLastCommit;
+ 	CommitTimestampEntry dataLastCommit;
+ } CommitTimestampShared;
+ 
+ CommitTimestampShared	*commitTsShared;
+ 
+ 
+ /* GUC variable */
+ bool	track_commit_timestamp;
+ 
+ static CommitTsNodeId default_node_id = InvalidCommitTsNodeId;
+ 
+ static void SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+ 					 TransactionId *subxids, TimestampTz ts,
+ 					 CommitTsNodeId nodeid, int pageno);
+ static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
+ 						  CommitTsNodeId nodeid, int slotno);
+ static int	ZeroCommitTsPage(int pageno, bool writeXlog);
+ static bool CommitTsPagePrecedes(int page1, int page2);
+ static void WriteZeroPageXlogRec(int pageno);
+ static void WriteTruncateXlogRec(int pageno);
+ static void WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+ 						 TransactionId *subxids, TimestampTz timestamp,
+ 						 CommitTsNodeId nodeid);
+ 
+ 
+ /*
+  * CommitTsSetDefaultNodeId
+  *
+  * Set default nodeid for current backend.
+  */
+ void
+ CommitTsSetDefaultNodeId(CommitTsNodeId nodeid)
+ {
+ 	default_node_id = nodeid;
+ }
+ 
+ /*
+  * CommitTsGetDefaultNodeId
+  *
+  * Set default nodeid for current backend.
+  */
+ CommitTsNodeId
+ CommitTsGetDefaultNodeId(void)
+ {
+ 	return default_node_id;
+ }
+ 
+ /*
+  * TransactionTreeSetCommitTsData
+  *
+  * Record the final commit timestamp of transaction entries in the commit log
+  * for a transaction and its subtransaction tree, as efficiently as possible.
+  *
+  * xid is the top level transaction id.
+  *
+  * subxids is an array of xids of length nsubxids, representing subtransactions
+  * in the tree of xid. In various cases nsubxids may be zero.
+  * The reason why tracking just the parent xid commit timestamp is not enough
+  * is that the subtrans SLRU does not stay valid across crashes (it's not
+  * permanent) so we need to keep the information about them here. If the
+  * subtrans implementation changes in the future, we might want to revisit the
+  * decision of storing timestamp info for each subxid.
+  *
+  * The do_xlog parameter tells us whether to include a XLog record of this
+  * or not.  Normal path through RecordTransactionCommit() will be related
+  * to a transaction commit XLog record, and so should pass "false" here.
+  * Other callers probably want to pass true, so that the given values persist
+  * in case of crashes.
+  */
+ void
+ TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
+ 							   TransactionId *subxids, TimestampTz timestamp,
+ 							   CommitTsNodeId nodeid, bool do_xlog)
+ {
+ 	int			i;
+ 	TransactionId headxid;
+ 
+ 	Assert(xid != InvalidTransactionId);
+ 
+ 	if (!track_commit_timestamp)
+ 		return;
+ 
+ 	/*
+ 	 * Comply with the WAL-before-data rule: if caller specified it wants
+ 	 * this value to be recorded in WAL, do so before touching the data.
+ 	 */
+ 	if (do_xlog)
+ 		WriteSetTimestampXlogRec(xid, nsubxids, subxids, timestamp, nodeid);
+ 
+ 	/*
+ 	 * We split the xids to set the timestamp to in groups belonging to the
+ 	 * same SLRU page; the first element in each such set is its head.  The
+ 	 * first group has the main XID as the head; subsequent sets use the
+ 	 * first subxid not on the previous page as head.  This way, we only have
+ 	 * to lock/modify each SLRU page once.
+ 	 */
+ 	for (i = 0, headxid = xid;;)
+ 	{
+ 		int			pageno = TransactionIdToCTsPage(headxid);
+ 		int			j;
+ 
+ 		for (j = i; j < nsubxids; j++)
+ 		{
+ 			if (TransactionIdToCTsPage(subxids[j]) != pageno)
+ 				break;
+ 		}
+ 		/* subxids[i..j] are on the same page as the head */
+ 
+ 		SetXidCommitTsInPage(headxid, j - i, subxids + i, timestamp, nodeid,
+ 							 pageno);
+ 
+ 		/* if we wrote out all subxids, we're done. */
+ 		if (j + 1 >= nsubxids)
+ 			break;
+ 
+ 		/*
+ 		 * Set the new head and skip over it, as well as over the subxids
+ 		 * we just wrote.
+ 		 */
+ 		headxid = subxids[j];
+ 		i += j - i + 1;
+ 	}
+ 
+ 	/*
+ 	 * Update the cached value in shared memory
+ 	 */
+ 	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+ 	commitTsShared->xidLastCommit = xid;
+ 	commitTsShared->dataLastCommit.time = timestamp;
+ 	commitTsShared->dataLastCommit.nodeid = nodeid;
+ 	LWLockRelease(CommitTsLock);
+ }
+ 
+ /*
+  * Record the commit timestamp of transaction entries in the commit log for all
+  * entries on a single page.  Atomic only on this page.
+  */
+ static void
+ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+ 					 TransactionId *subxids, TimestampTz ts,
+ 					 CommitTsNodeId nodeid, int pageno)
+ {
+ 	int			slotno;
+ 	int			i;
+ 
+ 	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+ 
+ 	slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
+ 
+ 	TransactionIdSetCommitTs(xid, ts, nodeid, slotno);
+ 	for (i = 0; i < nsubxids; i++)
+ 		TransactionIdSetCommitTs(subxids[i], ts, nodeid, slotno);
+ 
+ 	CommitTsCtl->shared->page_dirty[slotno] = true;
+ 
+ 	LWLockRelease(CommitTsControlLock);
+ }
+ 
+ /*
+  * Sets the commit timestamp of a single transaction.
+  *
+  * Must be called with CommitTsControlLock held
+  */
+ static void
+ TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
+ 						 CommitTsNodeId nodeid, int slotno)
+ {
+ 	int			entryno = TransactionIdToCTsEntry(xid);
+ 	CommitTimestampEntry entry;
+ 
+ 	entry.time = ts;
+ 	entry.nodeid = nodeid;
+ 
+ 	memcpy(CommitTsCtl->shared->page_buffer[slotno] +
+ 		   SizeOfCommitTimestampEntry * entryno,
+ 		   &entry, SizeOfCommitTimestampEntry);
+ }
+ 
+ /*
+  * Interrogate the commit timestamp of a transaction.
+  */
+ void
+ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
+ 							 CommitTsNodeId *nodeid)
+ {
+ 	int			pageno = TransactionIdToCTsPage(xid);
+ 	int			entryno = TransactionIdToCTsEntry(xid);
+ 	int			slotno;
+ 	CommitTimestampEntry entry;
+ 	TransactionId oldestCommitTs;
+ 
+ 	/* Error if module not enabled */
+ 	if (!track_commit_timestamp)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("could not get commit timestamp data"),
+ 				 errhint("Make sure the configuration parameter \"%s\" is set.",
+ 						 "track_commit_timestamp")));
+ 
+ 	/*
+ 	 * Return empty if the requested value is older than what we have or newer
+ 	 * than newest we have.  The reason it's acceptable to use an unlocked read
+ 	 * for xidLastCommit is that that value can only move forwards, and it's
+ 	 * okay to read a value slightly older than the one we read below.
+ 	 */
+ 	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+ 	oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+ 	LWLockRelease(CommitTsControlLock);
+ 
+ 	if (!TransactionIdIsValid(oldestCommitTs) ||
+ 		TransactionIdPrecedes(xid, oldestCommitTs) ||
+ 		TransactionIdPrecedes(commitTsShared->xidLastCommit, xid))
+ 	{
+ 		if (ts)
+ 			TIMESTAMP_NOBEGIN(*ts);
+ 		if (nodeid)
+ 			*nodeid = InvalidCommitTsNodeId;
+ 		return;
+ 	}
+ 
+ 	/*
+ 	 * Use an unlocked atomic read on our cached value in shared memory; if
+ 	 * it's a hit, acquire a lock and read the data, after verifying that it's
+ 	 * still what we initially read.  Otherwise, fall through to read from
+ 	 * SLRU.
+ 	 */
+ 	if (commitTsShared->xidLastCommit == xid)
+ 	{
+ 		LWLockAcquire(CommitTsLock, LW_SHARED);
+ 		if (commitTsShared->xidLastCommit == xid)
+ 		{
+ 			if (ts)
+ 				*ts = commitTsShared->dataLastCommit.time;
+ 			if (nodeid)
+ 				*nodeid = commitTsShared->dataLastCommit.nodeid;
+ 			LWLockRelease(CommitTsLock);
+ 			return;
+ 		}
+ 		LWLockRelease(CommitTsLock);
+ 	}
+ 
+ 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+ 	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+ 	memcpy(&entry,
+ 		   CommitTsCtl->shared->page_buffer[slotno] +
+ 		   SizeOfCommitTimestampEntry * entryno,
+ 		   SizeOfCommitTimestampEntry);
+ 
+ 	if (ts)
+ 		*ts = entry.time;
+ 	if (nodeid)
+ 		*nodeid = entry.nodeid;
+ 
+ 	LWLockRelease(CommitTsControlLock);
+ }
+ 
+ /*
+  * Return the Xid of the latest committed transaction.  (As far as this module
+  * is concerned, anyway; it's up to the caller to ensure the value is useful
+  * for its purposes.)
+  *
+  * ts and extra are filled with the corresponding data; they can be passed
+  * as NULL if not wanted.
+  */
+ TransactionId
+ GetLatestCommitTsData(TimestampTz *ts, CommitTsNodeId *nodeid)
+ {
+ 	TransactionId	xid;
+ 
+ 	/* Error if module not enabled */
+ 	if (!track_commit_timestamp)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("could not get commit timestamp data"),
+ 				 errhint("Make sure the configuration parameter \"%s\" is set.",
+ 						 "track_commit_timestamp")));
+ 
+ 	LWLockAcquire(CommitTsLock, LW_SHARED);
+ 	xid = commitTsShared->xidLastCommit;
+ 	if (ts)
+ 		*ts = commitTsShared->dataLastCommit.time;
+ 	if (nodeid)
+ 		*nodeid = commitTsShared->dataLastCommit.nodeid;
+ 	LWLockRelease(CommitTsLock);
+ 
+ 	return xid;
+ }
+ 
+ /*
+  * SQL-callable wrapper to obtain commit time of a transaction
+  */
+ Datum
+ pg_xact_commit_timestamp(PG_FUNCTION_ARGS)
+ {
+ 	TransactionId	xid = PG_GETARG_UINT32(0);
+ 	TimestampTz		ts;
+ 
+ 	TransactionIdGetCommitTsData(xid, &ts, NULL);
+ 
+ 	if (TIMESTAMP_IS_NOBEGIN(ts))
+ 		PG_RETURN_NULL();
+ 
+ 	PG_RETURN_TIMESTAMPTZ(ts);
+ }
+ 
+ 
+ Datum
+ pg_last_committed_xact(PG_FUNCTION_ARGS)
+ {
+ 	TransactionId	xid;
+ 	TimestampTz		ts;
+ 	Datum       values[2];
+ 	bool        nulls[2];
+ 	TupleDesc   tupdesc;
+ 	HeapTuple	htup;
+ 
+ 	/* and construct a tuple with our data */
+ 	xid = GetLatestCommitTsData(&ts, NULL);
+ 
+ 	/*
+ 	 * Construct a tuple descriptor for the result row.  This must match this
+ 	 * function's pg_proc entry!
+ 	 */
+ 	tupdesc = CreateTemplateTupleDesc(2, false);
+ 	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "xid",
+ 					   XIDOID, -1, 0);
+ 	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "timestamp",
+ 					   TIMESTAMPTZOID, -1, 0);
+ 	tupdesc = BlessTupleDesc(tupdesc);
+ 
+ 	if (xid == InvalidTransactionId)
+ 	{
+ 		memset(nulls, true, sizeof(nulls));
+ 	}
+ 	else
+ 	{
+ 		values[0] = TransactionIdGetDatum(xid);
+ 		nulls[0] = false;
+ 
+ 		values[1] = TimestampTzGetDatum(ts);
+ 		nulls[1] = false;
+ 	}
+ 
+ 	htup = heap_form_tuple(tupdesc, values, nulls);
+ 
+ 	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
+ 
+ 
+ /*
+  * Number of shared CommitTS buffers.
+  *
+  * We use a very similar logic as for the number of CLOG buffers; see comments
+  * in CLOGShmemBuffers.
+  */
+ Size
+ CommitTsShmemBuffers(void)
+ {
+ 	return Min(16, Max(4, NBuffers / 1024));
+ }
+ 
+ /*
+  * Shared memory sizing for CommitTs
+  */
+ Size
+ CommitTsShmemSize(void)
+ {
+ 	return SimpleLruShmemSize(CommitTsShmemBuffers(), 0) +
+ 		sizeof(CommitTimestampShared);
+ }
+ 
+ /*
+  * Initialize CommitTs at system startup (postmaster start or standalone
+  * backend)
+  */
+ void
+ CommitTsShmemInit(void)
+ {
+ 	bool	found;
+ 
+ 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
+ 	SimpleLruInit(CommitTsCtl, "CommitTs Ctl", CommitTsShmemBuffers(), 0,
+ 				  CommitTsControlLock, "pg_commit_ts");
+ 
+ 	commitTsShared = ShmemInitStruct("CommitTs shared",
+ 									 sizeof(CommitTimestampShared),
+ 									 &found);
+ 
+ 	if (!IsUnderPostmaster)
+ 	{
+ 		Assert(!found);
+ 
+ 		commitTsShared->xidLastCommit = InvalidTransactionId;
+ 		TIMESTAMP_NOBEGIN(commitTsShared->dataLastCommit.time);
+ 		commitTsShared->dataLastCommit.nodeid = InvalidCommitTsNodeId;
+ 	}
+ 	else
+ 		Assert(found);
+ }
+ 
+ /*
+  * This function must be called ONCE on system install.
+  *
+  * (The CommitTs directory is assumed to have been created by initdb, and
+  * CommitTsShmemInit must have been called already.)
+  */
+ void
+ BootStrapCommitTs(void)
+ {
+ 	/*
+ 	 * Nothing to do here at present, unlike most other SLRU modules; segments
+ 	 * are created when the server is started with this module enabled.
+ 	 * See StartupCommitTs.
+ 	 */
+ }
+ 
+ /*
+  * Initialize (or reinitialize) a page of CommitTs to zeroes.
+  * If writeXlog is TRUE, also emit an XLOG record saying we did this.
+  *
+  * The page is not actually written, just set up in shared memory.
+  * The slot number of the new page is returned.
+  *
+  * Control lock must be held at entry, and will be held at exit.
+  */
+ static int
+ ZeroCommitTsPage(int pageno, bool writeXlog)
+ {
+ 	int			slotno;
+ 
+ 	slotno = SimpleLruZeroPage(CommitTsCtl, pageno);
+ 
+ 	if (writeXlog)
+ 		WriteZeroPageXlogRec(pageno);
+ 
+ 	return slotno;
+ }
+ 
+ /*
+  * This must be called ONCE during postmaster or standalone-backend startup,
+  * after StartupXLOG has initialized ShmemVariableCache->nextXid.
+  */
+ void
+ StartupCommitTs(void)
+ {
+ 	TransactionId xid = ShmemVariableCache->nextXid;
+ 	int			pageno = TransactionIdToCTsPage(xid);
+ 
+ 	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+ 
+ 	/*
+ 	 * Initialize our idea of the latest page number.
+ 	 */
+ 	CommitTsCtl->shared->latest_page_number = pageno;
+ 
+ 	LWLockRelease(CommitTsControlLock);
+ }
+ 
+ /*
+  * This must be called ONCE during postmaster or standalone-backend startup,
+  * when commit timestamp is enabled.  Must be called after recovery has
+  * finished.
+  *
+  * This is in charge of creating the currently active segment, if it's not
+  * already there.  The reason for this is that the server might have been
+  * running with this module disabled for a while and thus might have skipped
+  * the normal creation point.
+  */
+ void
+ CompleteCommitTsInitialization(void)
+ {
+ 	TransactionId xid = ShmemVariableCache->nextXid;
+ 	int			pageno = TransactionIdToCTsPage(xid);
+ 
+ 	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+ 
+ 	/*
+ 	 * Re-Initialize our idea of the latest page number.
+ 	 */
+ 	CommitTsCtl->shared->latest_page_number = pageno;
+ 
+ 	/*
+ 	 * If this module is not currently enabled, make sure we don't hand back
+ 	 * possibly-invalid data; also remove segments of old data.
+ 	 */
+ 	if (!track_commit_timestamp)
+ 	{
+ 		ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
+ 		LWLockRelease(CommitTsControlLock);
+ 
+ 		TruncateCommitTs(ReadNewTransactionId());
+ 
+ 		return;
+ 	}
+ 
+ 	/*
+ 	 * If CommitTs is enabled, but it wasn't in the previous server run, we
+ 	 * need to set the oldest value to the next Xid; that way, we will not try
+ 	 * to read data that might not have been set.
+ 	 *
+ 	 * XXX does this have a problem if a server is started with commitTs
+ 	 * enabled, then started with commitTs disabled, then restarted with it
+ 	 * enabled again?  It doesn't look like it does, because there should be a
+ 	 * checkpoint that sets the value to InvalidTransactionId at end of
+ 	 * recovery; and so any chance of injecting new transactions without
+ 	 * CommitTs values would occur after the oldestCommitTs has been set to
+ 	 * Invalid temporarily.
+ 	 */
+ 	if (ShmemVariableCache->oldestCommitTs == InvalidTransactionId)
+ 		ShmemVariableCache->oldestCommitTs = ReadNewTransactionId();
+ 
+ 	/* Finally, create the current segment file, if necessary */
+ 	if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
+ 	{
+ 		int		slotno;
+ 
+ 		slotno = ZeroCommitTsPage(pageno, false);
+ 		SimpleLruWritePage(CommitTsCtl, slotno);
+ 		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+ 	}
+ 
+ 	LWLockRelease(CommitTsControlLock);
+ }
+ 
+ /*
+  * This must be called ONCE during postmaster or standalone-backend shutdown
+  */
+ void
+ ShutdownCommitTs(void)
+ {
+ 	/* Flush dirty CommitTs pages to disk */
+ 	SimpleLruFlush(CommitTsCtl, false);
+ }
+ 
+ /*
+  * Perform a checkpoint --- either during shutdown, or on-the-fly
+  */
+ void
+ CheckPointCommitTs(void)
+ {
+ 	/* Flush dirty CommitTs pages to disk */
+ 	SimpleLruFlush(CommitTsCtl, true);
+ }
+ 
+ /*
+  * Make sure that CommitTs has room for a newly-allocated XID.
+  *
+  * NB: this is called while holding XidGenLock.  We want it to be very fast
+  * most of the time; even when it's not so fast, no actual I/O need happen
+  * unless we're forced to write out a dirty CommitTs or xlog page to make room
+  * in shared memory.
+  *
+  * NB: the current implementation relies on track_commit_timestamp being
+  * PGC_POSTMASTER.
+  */
+ void
+ ExtendCommitTs(TransactionId newestXact)
+ {
+ 	int			pageno;
+ 
+ 	/* nothing to do if module not enabled */
+ 	if (!track_commit_timestamp)
+ 		return;
+ 
+ 	/*
+ 	 * No work except at first XID of a page.  But beware: just after
+ 	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+ 	 */
+ 	if (TransactionIdToCTsEntry(newestXact) != 0 &&
+ 		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+ 		return;
+ 
+ 	pageno = TransactionIdToCTsPage(newestXact);
+ 
+ 	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+ 
+ 	/* Zero the page and make an XLOG entry about it */
+ 	ZeroCommitTsPage(pageno, !InRecovery);
+ 
+ 	LWLockRelease(CommitTsControlLock);
+ }
+ 
+ /*
+  * Remove all CommitTs segments before the one holding the passed
+  * transaction ID.
+  *
+  * Note that we don't need to flush XLOG here.
+  */
+ void
+ TruncateCommitTs(TransactionId oldestXact)
+ {
+ 	int			cutoffPage;
+ 
+ 	/*
+ 	 * The cutoff point is the start of the segment containing oldestXact. We
+ 	 * pass the *page* containing oldestXact to SimpleLruTruncate.
+ 	 */
+ 	cutoffPage = TransactionIdToCTsPage(oldestXact);
+ 
+ 	/* Check to see if there's any files that could be removed */
+ 	if (!SlruScanDirectory(CommitTsCtl, SlruScanDirCbReportPresence,
+ 						   &cutoffPage))
+ 		return;					/* nothing to remove */
+ 
+ 	/* Write XLOG record */
+ 	WriteTruncateXlogRec(cutoffPage);
+ 
+ 	/* Now we can remove the old CommitTs segment(s) */
+ 	SimpleLruTruncate(CommitTsCtl, cutoffPage);
+ }
+ 
+ /*
+  * Set the earliest value for which commit TS can be consulted.
+  */
+ void
+ SetCommitTsLimit(TransactionId oldestXact)
+ {
+ 	/*
+ 	 * Be careful not to overwrite values that are either further into the
+ 	 * "future" or signal a disabled committs.
+ 	 */
+ 	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+ 	if (ShmemVariableCache->oldestCommitTs != InvalidTransactionId &&
+ 		TransactionIdPrecedes(ShmemVariableCache->oldestCommitTs, oldestXact))
+ 		ShmemVariableCache->oldestCommitTs = oldestXact;
+ 	LWLockRelease(CommitTsControlLock);
+ }
+ 
+ /*
+  * Decide which of two CLOG page numbers is "older" for truncation purposes.
+  *
+  * We need to use comparison of TransactionIds here in order to do the right
+  * thing with wraparound XID arithmetic.  However, if we are asked about
+  * page number zero, we don't want to hand InvalidTransactionId to
+  * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+  * offset both xids by FirstNormalTransactionId to avoid that.
+  */
+ static bool
+ CommitTsPagePrecedes(int page1, int page2)
+ {
+ 	TransactionId xid1;
+ 	TransactionId xid2;
+ 
+ 	xid1 = ((TransactionId) page1) * COMMIT_TS_XACTS_PER_PAGE;
+ 	xid1 += FirstNormalTransactionId;
+ 	xid2 = ((TransactionId) page2) * COMMIT_TS_XACTS_PER_PAGE;
+ 	xid2 += FirstNormalTransactionId;
+ 
+ 	return TransactionIdPrecedes(xid1, xid2);
+ }
+ 
+ 
+ /*
+  * Write a ZEROPAGE xlog record
+  */
+ static void
+ WriteZeroPageXlogRec(int pageno)
+ {
+ 	XLogBeginInsert();
+ 	XLogRegisterData((char *) (&pageno), sizeof(int));
+ 	(void) XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_ZEROPAGE);
+ }
+ 
+ /*
+  * Write a TRUNCATE xlog record
+  */
+ static void
+ WriteTruncateXlogRec(int pageno)
+ {
+ 	XLogBeginInsert();
+ 	XLogRegisterData((char *) (&pageno), sizeof(int));
+ 	(void) XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_TRUNCATE);
+ }
+ 
+ /*
+  * Write a SETTS xlog record
+  */
+ static void
+ WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+ 						 TransactionId *subxids, TimestampTz timestamp,
+ 						 CommitTsNodeId nodeid)
+ {
+ 	xl_commit_ts_set	record;
+ 
+ 	record.timestamp = timestamp;
+ 	record.nodeid = nodeid;
+ 	record.mainxid = mainxid;
+ 
+ 	XLogBeginInsert();
+ 	XLogRegisterData((char *) &record,
+ 					 offsetof(xl_commit_ts_set, mainxid) +
+ 					 sizeof(TransactionId));
+ 	XLogRegisterData((char *) subxids, nsubxids * sizeof(TransactionId));
+ 	XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_SETTS);
+ }
+ 
+ /*
+  * CommitTS resource manager's routines
+  */
+ void
+ commit_ts_redo(XLogReaderState *record)
+ {
+ 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+ 
+ 	/* Backup blocks are not used in commit_ts records */
+ 	Assert(!XLogRecHasAnyBlockRefs(record));
+ 
+ 	if (info == COMMIT_TS_ZEROPAGE)
+ 	{
+ 		int			pageno;
+ 		int			slotno;
+ 
+ 		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ 
+ 		LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+ 
+ 		slotno = ZeroCommitTsPage(pageno, false);
+ 		SimpleLruWritePage(CommitTsCtl, slotno);
+ 		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+ 
+ 		LWLockRelease(CommitTsControlLock);
+ 	}
+ 	else if (info == COMMIT_TS_TRUNCATE)
+ 	{
+ 		int			pageno;
+ 
+ 		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ 
+ 		/*
+ 		 * During XLOG replay, latest_page_number isn't set up yet; insert a
+ 		 * suitable value to bypass the sanity test in SimpleLruTruncate.
+ 		 */
+ 		CommitTsCtl->shared->latest_page_number = pageno;
+ 
+ 		SimpleLruTruncate(CommitTsCtl, pageno);
+ 	}
+ 	else if (info == COMMIT_TS_SETTS)
+ 	{
+ 		xl_commit_ts_set *setts = (xl_commit_ts_set *) XLogRecGetData(record);
+ 		int			nsubxids;
+ 		TransactionId *subxids;
+ 
+ 		nsubxids = ((XLogRecGetDataLen(record) - SizeOfCommitTsSet) /
+ 					sizeof(TransactionId));
+ 		if (nsubxids > 0)
+ 		{
+ 			subxids = palloc(sizeof(TransactionId) * nsubxids);
+ 			memcpy(subxids,
+ 				   XLogRecGetData(record) + SizeOfCommitTsSet,
+ 				   sizeof(TransactionId) * nsubxids);
+ 		}
+ 		else
+ 			subxids = NULL;
+ 
+ 		TransactionTreeSetCommitTsData(setts->mainxid, nsubxids, subxids,
+ 									   setts->timestamp, setts->nodeid, false);
+ 		if (subxids)
+ 			pfree(subxids);
+ 	}
+ 	else
+ 		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
+ }
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***************
*** 8,13 ****
--- 8,14 ----
  #include "postgres.h"
  
  #include "access/clog.h"
+ #include "access/commit_ts.h"
  #include "access/gin.h"
  #include "access/gist_private.h"
  #include "access/hash.h"
*** a/src/backend/access/transam/slru.c
--- b/src/backend/access/transam/slru.c
***************
*** 1297,1303 **** SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
  
  		len = strlen(clde->d_name);
  
! 		if ((len == 4 || len == 5) &&
  			strspn(clde->d_name, "0123456789ABCDEF") == len)
  		{
  			segno = (int) strtol(clde->d_name, NULL, 16);
--- 1297,1303 ----
  
  		len = strlen(clde->d_name);
  
! 		if ((len == 4 || len == 5 || len == 6) &&
  			strspn(clde->d_name, "0123456789ABCDEF") == len)
  		{
  			segno = (int) strtol(clde->d_name, NULL, 16);
*** a/src/backend/access/transam/varsup.c
--- b/src/backend/access/transam/varsup.c
***************
*** 14,19 ****
--- 14,20 ----
  #include "postgres.h"
  
  #include "access/clog.h"
+ #include "access/commit_ts.h"
  #include "access/subtrans.h"
  #include "access/transam.h"
  #include "access/xact.h"
***************
*** 158,166 **** GetNewTransactionId(bool isSubXact)
  	 * XID before we zero the page.  Fortunately, a page of the commit log
  	 * holds 32K or more transactions, so we don't have to do this very often.
  	 *
! 	 * Extend pg_subtrans too.
  	 */
  	ExtendCLOG(xid);
  	ExtendSUBTRANS(xid);
  
  	/*
--- 159,168 ----
  	 * XID before we zero the page.  Fortunately, a page of the commit log
  	 * holds 32K or more transactions, so we don't have to do this very often.
  	 *
! 	 * Extend pg_subtrans and pg_commit_ts too.
  	 */
  	ExtendCLOG(xid);
+ 	ExtendCommitTs(xid);
  	ExtendSUBTRANS(xid);
  
  	/*
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 20,25 ****
--- 20,26 ----
  #include <time.h>
  #include <unistd.h>
  
+ #include "access/commit_ts.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
  #include "access/transam.h"
***************
*** 1135,1140 **** RecordTransactionCommit(void)
--- 1136,1156 ----
  	}
  
  	/*
+ 	 * We only need to log the commit timestamp separately if the node
+ 	 * identifier is a valid value; the commit record above already contains
+ 	 * the timestamp info otherwise, and will be used to load it.
+ 	 */
+ 	if (markXidCommitted)
+ 	{
+ 		CommitTsNodeId		node_id;
+ 
+ 		node_id = CommitTsGetDefaultNodeId();
+ 		TransactionTreeSetCommitTsData(xid, nchildren, children,
+ 									   xactStopTimestamp,
+ 									   node_id, node_id != InvalidCommitTsNodeId);
+ 	}
+ 
+ 	/*
  	 * Check if we want to commit asynchronously.  We can allow the XLOG flush
  	 * to happen asynchronously if synchronous_commit=off, or if the current
  	 * transaction has not performed any WAL-logged operation.  The latter
***************
*** 4644,4649 **** xactGetCommittedChildren(TransactionId **ptr)
--- 4660,4666 ----
   */
  static void
  xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
+ 						  TimestampTz commit_time,
  						  TransactionId *sub_xids, int nsubxacts,
  						  SharedInvalidationMessage *inval_msgs, int nmsgs,
  						  RelFileNode *xnodes, int nrels,
***************
*** 4671,4676 **** xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
--- 4688,4697 ----
  		LWLockRelease(XidGenLock);
  	}
  
+ 	/* Set the transaction commit timestamp and metadata */
+ 	TransactionTreeSetCommitTsData(xid, nsubxacts, sub_xids,
+ 								   commit_time, InvalidCommitTsNodeId, false);
+ 
  	if (standbyState == STANDBY_DISABLED)
  	{
  		/*
***************
*** 4790,4796 **** xact_redo_commit(xl_xact_commit *xlrec,
  	/* invalidation messages array follows subxids */
  	inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
  
! 	xact_redo_commit_internal(xid, lsn, subxacts, xlrec->nsubxacts,
  							  inval_msgs, xlrec->nmsgs,
  							  xlrec->xnodes, xlrec->nrels,
  							  xlrec->dbId,
--- 4811,4818 ----
  	/* invalidation messages array follows subxids */
  	inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
  
! 	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
! 							  subxacts, xlrec->nsubxacts,
  							  inval_msgs, xlrec->nmsgs,
  							  xlrec->xnodes, xlrec->nrels,
  							  xlrec->dbId,
***************
*** 4805,4811 **** static void
  xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
  						 TransactionId xid, XLogRecPtr lsn)
  {
! 	xact_redo_commit_internal(xid, lsn, xlrec->subxacts, xlrec->nsubxacts,
  							  NULL, 0,	/* inval msgs */
  							  NULL, 0,	/* relfilenodes */
  							  InvalidOid,		/* dbId */
--- 4827,4834 ----
  xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
  						 TransactionId xid, XLogRecPtr lsn)
  {
! 	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
! 							  xlrec->subxacts, xlrec->nsubxacts,
  							  NULL, 0,	/* inval msgs */
  							  NULL, 0,	/* relfilenodes */
  							  InvalidOid,		/* dbId */
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 22,27 ****
--- 22,28 ----
  #include <unistd.h>
  
  #include "access/clog.h"
+ #include "access/commit_ts.h"
  #include "access/multixact.h"
  #include "access/rewriteheap.h"
  #include "access/subtrans.h"
***************
*** 4518,4523 **** BootStrapXLOG(void)
--- 4519,4525 ----
  	checkPoint.oldestXidDB = TemplateDbOid;
  	checkPoint.oldestMulti = FirstMultiXactId;
  	checkPoint.oldestMultiDB = TemplateDbOid;
+ 	checkPoint.oldestCommitTs = InvalidTransactionId;
  	checkPoint.time = (pg_time_t) time(NULL);
  	checkPoint.oldestActiveXid = InvalidTransactionId;
  
***************
*** 4527,4532 **** BootStrapXLOG(void)
--- 4529,4535 ----
  	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
  	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+ 	SetCommitTsLimit(InvalidTransactionId);
  
  	/* Set up the XLOG page header */
  	page->xlp_magic = XLOG_PAGE_MAGIC;
***************
*** 4606,4611 **** BootStrapXLOG(void)
--- 4609,4615 ----
  	ControlFile->max_locks_per_xact = max_locks_per_xact;
  	ControlFile->wal_level = wal_level;
  	ControlFile->wal_log_hints = wal_log_hints;
+ 	ControlFile->track_commit_timestamp = track_commit_timestamp;
  	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
  
  	/* some additional ControlFile fields are set in WriteControlFile() */
***************
*** 4614,4619 **** BootStrapXLOG(void)
--- 4618,4624 ----
  
  	/* Bootstrap the commit log, too */
  	BootStrapCLOG();
+ 	BootStrapCommitTs();
  	BootStrapSUBTRANS();
  	BootStrapMultiXact();
  
***************
*** 5865,5870 **** StartupXLOG(void)
--- 5870,5878 ----
  	ereport(DEBUG1,
  			(errmsg("oldest MultiXactId: %u, in database %u",
  					checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
+ 	ereport(DEBUG1,
+ 			(errmsg("oldest commit timestamp Xid: %u",
+ 					checkPoint.oldestCommitTs)));
  	if (!TransactionIdIsNormal(checkPoint.nextXid))
  		ereport(PANIC,
  				(errmsg("invalid next transaction ID")));
***************
*** 5876,5881 **** StartupXLOG(void)
--- 5884,5890 ----
  	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
  	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+ 	SetCommitTsLimit(checkPoint.oldestCommitTs);
  	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
  	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
  	XLogCtl->ckptXid = checkPoint.nextXid;
***************
*** 6098,6108 **** StartupXLOG(void)
  			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
  
  			/*
! 			 * Startup commit log and subtrans only. MultiXact has already
! 			 * been started up and other SLRUs are not maintained during
! 			 * recovery and need not be started yet.
  			 */
  			StartupCLOG();
  			StartupSUBTRANS(oldestActiveXID);
  
  			/*
--- 6107,6118 ----
  			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
  
  			/*
! 			 * Startup commit log, commit timestamp and subtrans only.
! 			 * MultiXact has already been started up and other SLRUs are not
! 			 * maintained during recovery and need not be started yet.
  			 */
  			StartupCLOG();
+ 			StartupCommitTs();
  			StartupSUBTRANS(oldestActiveXID);
  
  			/*
***************
*** 6751,6762 **** StartupXLOG(void)
  	LWLockRelease(ProcArrayLock);
  
  	/*
! 	 * Start up the commit log and subtrans, if not already done for hot
! 	 * standby.
  	 */
  	if (standbyState == STANDBY_DISABLED)
  	{
  		StartupCLOG();
  		StartupSUBTRANS(oldestActiveXID);
  	}
  
--- 6761,6773 ----
  	LWLockRelease(ProcArrayLock);
  
  	/*
! 	 * Start up the commit log, commit timestamp and subtrans, if not already
! 	 * done for hot standby.
  	 */
  	if (standbyState == STANDBY_DISABLED)
  	{
  		StartupCLOG();
+ 		StartupCommitTs();
  		StartupSUBTRANS(oldestActiveXID);
  	}
  
***************
*** 6792,6797 **** StartupXLOG(void)
--- 6803,6814 ----
  	XLogReportParameters();
  
  	/*
+ 	 * Local WAL inserts enabled, so it's time to finish initialization
+ 	 * of commit timestamp.
+ 	 */
+ 	CompleteCommitTsInitialization();
+ 
+ 	/*
  	 * All done.  Allow backends to write WAL.  (Although the bool flag is
  	 * probably atomic in itself, we use the info_lck here to ensure that
  	 * there are no race conditions concerning visibility of other recent
***************
*** 7358,7363 **** ShutdownXLOG(int code, Datum arg)
--- 7375,7381 ----
  		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
  	}
  	ShutdownCLOG();
+ 	ShutdownCommitTs();
  	ShutdownSUBTRANS();
  	ShutdownMultiXact();
  
***************
*** 7684,7689 **** CreateCheckPoint(int flags)
--- 7702,7711 ----
  	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
  	LWLockRelease(XidGenLock);
  
+ 	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+ 	checkPoint.oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+ 	LWLockRelease(CommitTsControlLock);
+ 
  	/* Increase XID epoch if we've wrapped around since last checkpoint */
  	checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
  	if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
***************
*** 7961,7966 **** static void
--- 7983,7989 ----
  CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
  {
  	CheckPointCLOG();
+ 	CheckPointCommitTs();
  	CheckPointSUBTRANS();
  	CheckPointMultiXact();
  	CheckPointPredicate();
***************
*** 8389,8395 **** XLogReportParameters(void)
  		MaxConnections != ControlFile->MaxConnections ||
  		max_worker_processes != ControlFile->max_worker_processes ||
  		max_prepared_xacts != ControlFile->max_prepared_xacts ||
! 		max_locks_per_xact != ControlFile->max_locks_per_xact)
  	{
  		/*
  		 * The change in number of backend slots doesn't need to be WAL-logged
--- 8412,8419 ----
  		MaxConnections != ControlFile->MaxConnections ||
  		max_worker_processes != ControlFile->max_worker_processes ||
  		max_prepared_xacts != ControlFile->max_prepared_xacts ||
! 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
! 		track_commit_timestamp != ControlFile->track_commit_timestamp)
  	{
  		/*
  		 * The change in number of backend slots doesn't need to be WAL-logged
***************
*** 8409,8414 **** XLogReportParameters(void)
--- 8433,8439 ----
  			xlrec.max_locks_per_xact = max_locks_per_xact;
  			xlrec.wal_level = wal_level;
  			xlrec.wal_log_hints = wal_log_hints;
+ 			xlrec.track_commit_timestamp = track_commit_timestamp;
  
  			XLogBeginInsert();
  			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
***************
*** 8423,8428 **** XLogReportParameters(void)
--- 8448,8454 ----
  		ControlFile->max_locks_per_xact = max_locks_per_xact;
  		ControlFile->wal_level = wal_level;
  		ControlFile->wal_log_hints = wal_log_hints;
+ 		ControlFile->track_commit_timestamp = track_commit_timestamp;
  		UpdateControlFile();
  	}
  }
***************
*** 8799,8804 **** xlog_redo(XLogReaderState *record)
--- 8825,8831 ----
  		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
  		ControlFile->wal_level = xlrec.wal_level;
  		ControlFile->wal_log_hints = wal_log_hints;
+ 		ControlFile->track_commit_timestamp = track_commit_timestamp;
  
  		/*
  		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
*** a/src/backend/access/transam/xloginsert.c
--- b/src/backend/access/transam/xloginsert.c
***************
*** 299,305 **** XLogRegisterBlock(uint8 block_id, RelFileNode *rnode, ForkNumber forknum,
   * Add data to the WAL record that's being constructed.
   *
   * The data is appended to the "main chunk", available at replay with
!  * XLogGetRecData().
   */
  void
  XLogRegisterData(char *data, int len)
--- 299,305 ----
   * Add data to the WAL record that's being constructed.
   *
   * The data is appended to the "main chunk", available at replay with
!  * XLogRecGetData().
   */
  void
  XLogRegisterData(char *data, int len)
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
***************
*** 23,28 ****
--- 23,29 ----
  #include <math.h>
  
  #include "access/clog.h"
+ #include "access/commit_ts.h"
  #include "access/genam.h"
  #include "access/heapam.h"
  #include "access/htup_details.h"
***************
*** 1071,1080 **** vac_truncate_clog(TransactionId frozenXID,
  		return;
  
  	/*
! 	 * Truncate CLOG to the oldest computed value.  Note we don't truncate
! 	 * multixacts; that will be done by the next checkpoint.
  	 */
  	TruncateCLOG(frozenXID);
  
  	/*
  	 * Update the wrap limit for GetNewTransactionId and creation of new
--- 1072,1083 ----
  		return;
  
  	/*
! 	 * Truncate CLOG and CommitTs to the oldest computed value.
! 	 * Note we don't truncate multixacts; that will be done by the next
! 	 * checkpoint.
  	 */
  	TruncateCLOG(frozenXID);
+ 	TruncateCommitTs(frozenXID);
  
  	/*
  	 * Update the wrap limit for GetNewTransactionId and creation of new
***************
*** 1084,1089 **** vac_truncate_clog(TransactionId frozenXID,
--- 1087,1093 ----
  	 */
  	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
  	SetMultiXactIdLimit(minMulti, minmulti_datoid);
+ 	SetCommitTsLimit(frozenXID);
  }
  
  
*** a/src/backend/libpq/hba.c
--- b/src/backend/libpq/hba.c
***************
*** 1440,1446 **** parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline, int line_num)
  				ereport(LOG,
  						(errcode(ERRCODE_CONFIG_FILE_ERROR),
  						 errmsg("client certificates can only be checked if a root certificate store is available"),
! 						 errhint("Make sure the configuration parameter \"ssl_ca_file\" is set."),
  						 errcontext("line %d of configuration file \"%s\"",
  									line_num, HbaFileName)));
  				return false;
--- 1440,1446 ----
  				ereport(LOG,
  						(errcode(ERRCODE_CONFIG_FILE_ERROR),
  						 errmsg("client certificates can only be checked if a root certificate store is available"),
! 						 errhint("Make sure the configuration parameter \"%s\" is set.", "ssl_ca_file"),
  						 errcontext("line %d of configuration file \"%s\"",
  									line_num, HbaFileName)));
  				return false;
*** a/src/backend/replication/logical/decode.c
--- b/src/backend/replication/logical/decode.c
***************
*** 133,138 **** LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
--- 133,139 ----
  		case RM_SEQ_ID:
  		case RM_SPGIST_ID:
  		case RM_BRIN_ID:
+ 		case RM_COMMIT_TS_ID:
  			break;
  		case RM_NEXT_ID:
  			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
*** a/src/backend/storage/ipc/ipci.c
--- b/src/backend/storage/ipc/ipci.c
***************
*** 15,20 ****
--- 15,21 ----
  #include "postgres.h"
  
  #include "access/clog.h"
+ #include "access/commit_ts.h"
  #include "access/heapam.h"
  #include "access/multixact.h"
  #include "access/nbtree.h"
***************
*** 117,122 **** CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
--- 118,124 ----
  		size = add_size(size, ProcGlobalShmemSize());
  		size = add_size(size, XLOGShmemSize());
  		size = add_size(size, CLOGShmemSize());
+ 		size = add_size(size, CommitTsShmemSize());
  		size = add_size(size, SUBTRANSShmemSize());
  		size = add_size(size, TwoPhaseShmemSize());
  		size = add_size(size, BackgroundWorkerShmemSize());
***************
*** 198,203 **** CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
--- 200,206 ----
  	 */
  	XLOGShmemInit();
  	CLOGShmemInit();
+ 	CommitTsShmemInit();
  	SUBTRANSShmemInit();
  	MultiXactShmemInit();
  	InitBufferPool();
*** a/src/backend/storage/lmgr/lwlock.c
--- b/src/backend/storage/lmgr/lwlock.c
***************
*** 29,34 ****
--- 29,35 ----
  #include "postgres.h"
  
  #include "access/clog.h"
+ #include "access/commit_ts.h"
  #include "access/multixact.h"
  #include "access/subtrans.h"
  #include "commands/async.h"
***************
*** 259,264 **** NumLWLocks(void)
--- 260,268 ----
  	/* clog.c needs one per CLOG buffer */
  	numLocks += CLOGShmemBuffers();
  
+ 	/* commit_ts.c needs one per CommitTs buffer */
+ 	numLocks += CommitTsShmemBuffers();
+ 
  	/* subtrans.c needs one per SubTrans buffer */
  	numLocks += NUM_SUBTRANS_BUFFERS;
  
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 26,31 ****
--- 26,32 ----
  #include <syslog.h>
  #endif
  
+ #include "access/commit_ts.h"
  #include "access/gin.h"
  #include "access/transam.h"
  #include "access/twophase.h"
***************
*** 826,831 **** static struct config_bool ConfigureNamesBool[] =
--- 827,841 ----
  		check_bonjour, NULL, NULL
  	},
  	{
+ 		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
+ 			gettext_noop("Collects transaction commit time."),
+ 			NULL
+ 		},
+ 		&track_commit_timestamp,
+ 		false,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"ssl", PGC_POSTMASTER, CONN_AUTH_SECURITY,
  			gettext_noop("Enables SSL connections."),
  			NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 228,233 ****
--- 228,235 ----
  
  #max_replication_slots = 0	# max number of replication slots
  				# (change requires restart)
+ #track_commit_timestamp = off	# collect timestamp of transaction commit
+ 				# (change requires restart)
  
  # - Master Server -
  
*** a/src/bin/initdb/initdb.c
--- b/src/bin/initdb/initdb.c
***************
*** 186,191 **** static const char *subdirs[] = {
--- 186,192 ----
  	"pg_xlog",
  	"pg_xlog/archive_status",
  	"pg_clog",
+ 	"pg_commit_ts",
  	"pg_dynshmem",
  	"pg_notify",
  	"pg_serial",
*** a/src/bin/pg_controldata/pg_controldata.c
--- b/src/bin/pg_controldata/pg_controldata.c
***************
*** 270,275 **** main(int argc, char *argv[])
--- 270,277 ----
  		   ControlFile.checkPointCopy.oldestMulti);
  	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
  		   ControlFile.checkPointCopy.oldestMultiDB);
+ 	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+ 		   ControlFile.checkPointCopy.oldestCommitTs);
  	printf(_("Time of latest checkpoint:            %s\n"),
  		   ckpttime_str);
  	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
***************
*** 300,305 **** main(int argc, char *argv[])
--- 302,309 ----
  		   ControlFile.max_prepared_xacts);
  	printf(_("Current max_locks_per_xact setting:   %d\n"),
  		   ControlFile.max_locks_per_xact);
+ 	printf(_("Current track_commit_timestamp setting: %s\n"),
+ 		   ControlFile.track_commit_timestamp ? _("on") : _("off"));
  	printf(_("Maximum data alignment:               %u\n"),
  		   ControlFile.maxAlign);
  	/* we don't print floatFormat since can't say much useful about it */
*** a/src/bin/pg_resetxlog/pg_resetxlog.c
--- b/src/bin/pg_resetxlog/pg_resetxlog.c
***************
*** 63,68 **** static bool guessed = false;	/* T if we had to guess at any values */
--- 63,69 ----
  static const char *progname;
  static uint32 set_xid_epoch = (uint32) -1;
  static TransactionId set_xid = 0;
+ static TransactionId set_commit_ts = 0;
  static Oid	set_oid = 0;
  static MultiXactId set_mxid = 0;
  static MultiXactOffset set_mxoff = (MultiXactOffset) -1;
***************
*** 112,118 **** main(int argc, char *argv[])
  	}
  
  
! 	while ((c = getopt(argc, argv, "D:fl:m:no:O:x:e:")) != -1)
  	{
  		switch (c)
  		{
--- 113,119 ----
  	}
  
  
! 	while ((c = getopt(argc, argv, "c:D:e:fl:m:no:O:x:")) != -1)
  	{
  		switch (c)
  		{
***************
*** 158,163 **** main(int argc, char *argv[])
--- 159,179 ----
  				}
  				break;
  
+ 			case 'c':
+ 				set_commit_ts = strtoul(optarg, &endptr, 0);
+ 				if (endptr == optarg || *endptr != '\0')
+ 				{
+ 					fprintf(stderr, _("%s: invalid argument for option -c\n"), progname);
+ 					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+ 					exit(1);
+ 				}
+ 				if (set_commit_ts == 0)
+ 				{
+ 					fprintf(stderr, _("%s: transaction ID (-c) must not be 0\n"), progname);
+ 					exit(1);
+ 				}
+ 				break;
+ 
  			case 'o':
  				set_oid = strtoul(optarg, &endptr, 0);
  				if (endptr == optarg || *endptr != '\0')
***************
*** 345,350 **** main(int argc, char *argv[])
--- 361,369 ----
  		ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
  	}
  
+ 	if (set_commit_ts != 0)
+ 		ControlFile.checkPointCopy.oldestCommitTs = set_commit_ts;
+ 
  	if (set_oid != 0)
  		ControlFile.checkPointCopy.nextOid = set_oid;
  
***************
*** 539,544 **** GuessControlValues(void)
--- 558,564 ----
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.wal_log_hints = false;
+ 	ControlFile.track_commit_timestamp = false;
  	ControlFile.MaxConnections = 100;
  	ControlFile.max_worker_processes = 8;
  	ControlFile.max_prepared_xacts = 0;
***************
*** 621,626 **** PrintControlValues(bool guessed)
--- 641,648 ----
  		   ControlFile.checkPointCopy.oldestMulti);
  	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
  		   ControlFile.checkPointCopy.oldestMultiDB);
+ 	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+ 		   ControlFile.checkPointCopy.oldestCommitTs);
  	printf(_("Maximum data alignment:               %u\n"),
  		   ControlFile.maxAlign);
  	/* we don't print floatFormat since can't say much useful about it */
***************
*** 702,707 **** PrintNewControlValues()
--- 724,735 ----
  		printf(_("NextXID epoch:                        %u\n"),
  			   ControlFile.checkPointCopy.nextXidEpoch);
  	}
+ 
+ 	if (set_commit_ts != 0)
+ 	{
+ 		printf(_("oldestCommitTs:                       %u\n"),
+ 			   ControlFile.checkPointCopy.oldestCommitTs);
+ 	}
  }
  
  
***************
*** 739,744 **** RewriteControlFile(void)
--- 767,773 ----
  	 */
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.wal_log_hints = false;
+ 	ControlFile.track_commit_timestamp = false;
  	ControlFile.MaxConnections = 100;
  	ControlFile.max_worker_processes = 8;
  	ControlFile.max_prepared_xacts = 0;
***************
*** 1099,1104 **** usage(void)
--- 1128,1134 ----
  	printf(_("%s resets the PostgreSQL transaction log.\n\n"), progname);
  	printf(_("Usage:\n  %s [OPTION]... {[-D] DATADIR}\n\n"), progname);
  	printf(_("Options:\n"));
+ 	printf(_("  -c XID           set the oldest transaction with retrievable commit timestamp\n"));
  	printf(_("  -e XIDEPOCH      set next transaction ID epoch\n"));
  	printf(_("  -f               force update to be done\n"));
  	printf(_("  -l XLOGFILE      force minimum WAL starting location for new transaction log\n"));
*** /dev/null
--- b/src/include/access/commit_ts.h
***************
*** 0 ****
--- 1,70 ----
+ /*
+  * commit_ts.h
+  *
+  * PostgreSQL commit timestamp manager
+  *
+  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/access/commit_ts.h
+  */
+ #ifndef COMMIT_TS_H
+ #define COMMIT_TS_H
+ 
+ #include "access/xlog.h"
+ #include "datatype/timestamp.h"
+ #include "utils/guc.h"
+ 
+ 
+ extern PGDLLIMPORT bool	track_commit_timestamp;
+ 
+ extern bool check_track_commit_timestamp(bool *newval, void **extra,
+ 							 GucSource source);
+ 
+ typedef uint32 CommitTsNodeId;
+ #define InvalidCommitTsNodeId 0
+ 
+ extern void CommitTsSetDefaultNodeId(CommitTsNodeId nodeid);
+ extern CommitTsNodeId CommitTsGetDefaultNodeId(void);
+ extern void TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
+ 							   TransactionId *subxids, TimestampTz timestamp,
+ 							   CommitTsNodeId nodeid, bool do_xlog);
+ extern void TransactionIdGetCommitTsData(TransactionId xid,
+ 							 TimestampTz *ts, CommitTsNodeId *nodeid);
+ extern TransactionId GetLatestCommitTsData(TimestampTz *ts,
+ 					  CommitTsNodeId *nodeid);
+ 
+ extern Size CommitTsShmemBuffers(void);
+ extern Size CommitTsShmemSize(void);
+ extern void CommitTsShmemInit(void);
+ extern void BootStrapCommitTs(void);
+ extern void StartupCommitTs(void);
+ extern void CompleteCommitTsInitialization(void);
+ extern void ShutdownCommitTs(void);
+ extern void CheckPointCommitTs(void);
+ extern void ExtendCommitTs(TransactionId newestXact);
+ extern void TruncateCommitTs(TransactionId oldestXact);
+ extern void SetCommitTsLimit(TransactionId oldestXact);
+ 
+ /* XLOG stuff */
+ #define COMMIT_TS_ZEROPAGE		0x00
+ #define COMMIT_TS_TRUNCATE		0x10
+ #define COMMIT_TS_SETTS			0x20
+ 
+ typedef struct xl_commit_ts_set
+ {
+ 	TimestampTz		timestamp;
+ 	CommitTsNodeId	nodeid;
+ 	TransactionId	mainxid;
+ 	/* subxact Xids follow */
+ } xl_commit_ts_set;
+ 
+ #define SizeOfCommitTsSet	(offsetof(xl_commit_ts_set, mainxid) + \
+ 							 sizeof(TransactionId))
+ 
+ 
+ extern void commit_ts_redo(XLogReaderState *record);
+ extern void commit_ts_desc(StringInfo buf, XLogReaderState *record);
+ extern const char *commit_ts_identify(uint8 info);
+ 
+ #endif   /* COMMITTS_H */
*** a/src/include/access/rmgrlist.h
--- b/src/include/access/rmgrlist.h
***************
*** 24,30 ****
   * Changes to this list possibly need a XLOG_PAGE_MAGIC bump.
   */
  
! /* symbol name, textual name, redo, desc, startup, cleanup */
  PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
  PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
  PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
--- 24,30 ----
   * Changes to this list possibly need a XLOG_PAGE_MAGIC bump.
   */
  
! /* symbol name, textual name, redo, desc, identify, startup, cleanup */
  PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
  PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
  PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
***************
*** 43,45 **** PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_start
--- 43,46 ----
  PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
  PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
  PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
+ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
*** a/src/include/access/transam.h
--- b/src/include/access/transam.h
***************
*** 124,129 **** typedef struct VariableCacheData
--- 124,134 ----
  	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
  
  	/*
+ 	 * These fields are protected by CommitTsControlLock
+ 	 */
+ 	TransactionId oldestCommitTs;
+ 
+ 	/*
  	 * These fields are protected by ProcArrayLock.
  	 */
  	TransactionId latestCompletedXid;	/* newest XID that has committed or
*** a/src/include/access/xlog_internal.h
--- b/src/include/access/xlog_internal.h
***************
*** 186,191 **** typedef struct xl_parameter_change
--- 186,192 ----
  	int			max_locks_per_xact;
  	int			wal_level;
  	bool		wal_log_hints;
+ 	bool		track_commit_timestamp;
  } xl_parameter_change;
  
  /* logs restore point */
*** a/src/include/catalog/catversion.h
--- b/src/include/catalog/catversion.h
***************
*** 53,58 ****
   */
  
  /*							yyyymmddN */
! #define CATALOG_VERSION_NO	201411241
  
  #endif
--- 53,58 ----
   */
  
  /*							yyyymmddN */
! #define CATALOG_VERSION_NO	201411242
  
  #endif
*** a/src/include/catalog/pg_control.h
--- b/src/include/catalog/pg_control.h
***************
*** 46,51 **** typedef struct CheckPoint
--- 46,52 ----
  	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
  	Oid			oldestMultiDB;	/* database with minimum datminmxid */
  	pg_time_t	time;			/* time stamp of checkpoint */
+ 	TransactionId oldestCommitTs; /* oldest Xid with valid commit timestamp */
  
  	/*
  	 * Oldest XID still running. This is only needed to initialize hot standby
***************
*** 177,182 **** typedef struct ControlFileData
--- 178,184 ----
  	int			max_worker_processes;
  	int			max_prepared_xacts;
  	int			max_locks_per_xact;
+ 	bool		track_commit_timestamp;
  
  	/*
  	 * This data is used to check for hardware-architecture compatibility of
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 3017,3022 **** DESCR("view two-phase transactions");
--- 3017,3028 ----
  DATA(insert OID = 3819 (  pg_get_multixact_members PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 2249 "28" "{28,28,25}" "{i,o,o}" "{multixid,xid,mode}" _null_ pg_get_multixact_members _null_ _null_ _null_ ));
  DESCR("view members of a multixactid");
  
+ DATA(insert OID = 3581 ( pg_xact_commit_timestamp PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 1184 "28" _null_ _null_ _null_ _null_ pg_xact_commit_timestamp _null_ _null_ _null_ ));
+ DESCR("get commit timestamp of a transaction");
+ 
+ DATA(insert OID = 3583 ( pg_last_committed_xact PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 2249 "" "{28,1184}" "{o,o}" "{xid,timestamp}" _null_ pg_last_committed_xact _null_ _null_ _null_ ));
+ DESCR("get transaction Id and commit timestamp of latest transaction commit");
+ 
  DATA(insert OID = 3537 (  pg_describe_object		PGNSP PGUID 12 1 0 0 0 f f f f t f s 3 0 25 "26 26 23" _null_ _null_ _null_ _null_ pg_describe_object _null_ _null_ _null_ ));
  DESCR("get identification of SQL object");
  
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 127,133 **** extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  #define AutoFileLock				(&MainLWLockArray[35].lock)
  #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
  #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
! #define NUM_INDIVIDUAL_LWLOCKS		38
  
  /*
   * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
--- 127,136 ----
  #define AutoFileLock				(&MainLWLockArray[35].lock)
  #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
  #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
! #define CommitTsControlLock			(&MainLWLockArray[38].lock)
! #define CommitTsLock				(&MainLWLockArray[39].lock)
! 
! #define NUM_INDIVIDUAL_LWLOCKS		40
  
  /*
   * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
***************
*** 1187,1192 **** extern Datum pg_prepared_xact(PG_FUNCTION_ARGS);
--- 1187,1196 ----
  /* access/transam/multixact.c */
  extern Datum pg_get_multixact_members(PG_FUNCTION_ARGS);
  
+ /* access/transam/committs.c */
+ extern Datum pg_xact_commit_timestamp(PG_FUNCTION_ARGS);
+ extern Datum pg_last_committed_xact(PG_FUNCTION_ARGS);
+ 
  /* catalogs/dependency.c */
  extern Datum pg_describe_object(PG_FUNCTION_ARGS);
  extern Datum pg_identify_object(PG_FUNCTION_ARGS);
*** a/src/test/Makefile
--- b/src/test/Makefile
***************
*** 12,17 **** subdir = src/test
  top_builddir = ../..
  include $(top_builddir)/src/Makefile.global
  
! SUBDIRS = regress isolation
  
  $(recurse)
--- 12,17 ----
  top_builddir = ../..
  include $(top_builddir)/src/Makefile.global
  
! SUBDIRS = regress isolation modules
  
  $(recurse)
*** /dev/null
--- b/src/test/modules/Makefile
***************
*** 0 ****
--- 1,10 ----
+ # src/test/modules/Makefile
+ 
+ subdir = src/test/modules
+ top_builddir = ../../..
+ include $(top_builddir)/src/Makefile.global
+ 
+ SUBDIRS = \
+ 		  commit_ts
+ 
+ $(recurse)
*** /dev/null
--- b/src/test/modules/commit_ts/.gitignore
***************
*** 0 ****
--- 1,5 ----
+ # Generated subdirectories
+ /log/
+ /isolation_output/
+ /regression_output/
+ /tmp_check/
*** /dev/null
--- b/src/test/modules/commit_ts/Makefile
***************
*** 0 ****
--- 1,39 ----
+ # Note: because we don't tell the Makefile there are any regression tests,
+ # we have to clean those result files explicitly
+ EXTRA_CLEAN = $(pg_regress_clean_files) ./regression_output
+ 
+ subdir = src/test/modules/commit_ts
+ top_builddir = ../../../..
+ include $(top_builddir)/src/Makefile.global
+ include $(top_srcdir)/contrib/contrib-global.mk
+ 
+ # We can't support installcheck because normally installcheck users don't have
+ # the required track_commit_timestamp on
+ installcheck:;
+ 
+ check: regresscheck
+ 
+ submake-regress:
+ 	$(MAKE) -C $(top_builddir)/src/test/regress all
+ 
+ submake-test_commit_ts:
+ 	$(MAKE) -C $(top_builddir)/src/test/modules/commit_ts
+ 
+ REGRESSCHECKS=commit_timestamp
+ 
+ regresscheck: all | submake-regress submake-test_commit_ts
+ 	$(MKDIR_P) regression_output
+ 	$(pg_regress_check) \
+ 	    --temp-config $(top_srcdir)/src/test/modules/commit_ts/commit_ts.conf \
+ 	    --temp-install=./tmp_check \
+ 	    --extra-install=src/test/modules/commit_ts \
+ 	    --outputdir=./regression_output \
+ 	    $(REGRESSCHECKS)
+ 
+ regresscheck-install-force: | submake-regress submake-test_commit_ts
+ 	$(pg_regress_installcheck) \
+ 	    --extra-install=src/test/modules/commit_ts \
+ 	    $(REGRESSCHECKS)
+ 
+ PHONY: submake-test_commit_ts submake-regress check \
+ 	regresscheck regresscheck-install-force
*** /dev/null
--- b/src/test/modules/commit_ts/commit_ts.conf
***************
*** 0 ****
--- 1 ----
+ track_commit_timestamp = on
\ No newline at end of file
\ No newline at end of file
*** /dev/null
--- b/src/test/modules/commit_ts/expected/commit_timestamp.out
***************
*** 0 ****
--- 1,33 ----
+ --
+ -- Commit Timestamp
+ --
+ CREATE TABLE committs_test(id serial, ts timestamptz default now());
+ INSERT INTO committs_test DEFAULT VALUES;
+ INSERT INTO committs_test DEFAULT VALUES;
+ INSERT INTO committs_test DEFAULT VALUES;
+ SELECT id,
+        pg_xact_commit_timestamp(xmin) >= ts,
+        pg_xact_commit_timestamp(xmin) < now(),
+        pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+ FROM committs_test
+ ORDER BY id;
+  id | ?column? | ?column? | ?column? 
+ ----+----------+----------+----------
+   1 | t        | t        | t
+   2 | t        | t        | t
+   3 | t        | t        | t
+ (3 rows)
+ 
+ DROP TABLE committs_test;
+ SELECT pg_xact_commit_timestamp('0'::xid);
+  pg_xact_commit_timestamp 
+ --------------------------
+  
+ (1 row)
+ 
+ SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
+  ?column? | ?column? | ?column? 
+ ----------+----------+----------
+  t        | t        | t
+ (1 row)
+ 
*** /dev/null
--- b/src/test/modules/commit_ts/sql/commit_timestamp.sql
***************
*** 0 ****
--- 1,21 ----
+ --
+ -- Commit Timestamp
+ --
+ CREATE TABLE committs_test(id serial, ts timestamptz default now());
+ 
+ INSERT INTO committs_test DEFAULT VALUES;
+ INSERT INTO committs_test DEFAULT VALUES;
+ INSERT INTO committs_test DEFAULT VALUES;
+ 
+ SELECT id,
+        pg_xact_commit_timestamp(xmin) >= ts,
+        pg_xact_commit_timestamp(xmin) < now(),
+        pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+ FROM committs_test
+ ORDER BY id;
+ 
+ DROP TABLE committs_test;
+ 
+ SELECT pg_xact_commit_timestamp('0'::xid);
+ 
+ SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
*** /dev/null
--- b/src/test/regress/expected/commit_ts.out
***************
*** 0 ****
--- 1,28 ----
+ --
+ -- Commit Timestamp
+ --
+ SHOW track_commit_timestamp;
+  track_commit_timestamp 
+ ------------------------
+  off
+ (1 row)
+ 
+ CREATE TABLE committs_test(id serial, ts timestamptz default now());
+ INSERT INTO committs_test DEFAULT VALUES;
+ INSERT INTO committs_test DEFAULT VALUES;
+ INSERT INTO committs_test DEFAULT VALUES;
+ SELECT id,
+        pg_xact_commit_timestamp(xmin) >= ts,
+        pg_xact_commit_timestamp(xmin) < now(),
+        pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+ FROM committs_test
+ ORDER BY id;
+ ERROR:  could not get commit timestamp data
+ HINT:   Make sure the configuration parameter "track_commit_timestamp" is set.
+ DROP TABLE committs_test;
+ SELECT pg_xact_commit_timestamp('0'::xid);
+ ERROR:  could not get commit timestamp data
+ HINT:   Make sure the configuration parameter "track_commit_timestamp" is set.
+ SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
+ ERROR:  could not get commit timestamp data
+ HINT:   Make sure the configuration parameter "track_commit_timestamp" is set.
*** /dev/null
--- b/src/test/regress/expected/commit_ts_1.out
***************
*** 0 ****
--- 1,39 ----
+ --
+ -- Commit Timestamp
+ --
+ SHOW track_commit_timestamp;
+  track_commit_timestamp 
+ ------------------------
+  on
+ (1 row)
+ 
+ CREATE TABLE committs_test(id serial, ts timestamptz default now());
+ INSERT INTO committs_test DEFAULT VALUES;
+ INSERT INTO committs_test DEFAULT VALUES;
+ INSERT INTO committs_test DEFAULT VALUES;
+ SELECT id,
+        pg_xact_commit_timestamp(xmin) >= ts,
+        pg_xact_commit_timestamp(xmin) < now(),
+        pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+ FROM committs_test
+ ORDER BY id;
+  id | ?column? | ?column? | ?column? 
+ ----+----------+----------+----------
+   1 | t        | t        | t
+   2 | t        | t        | t
+   3 | t        | t        | t
+ (3 rows)
+ 
+ DROP TABLE committs_test;
+ SELECT pg_xact_commit_timestamp('0'::xid);
+  pg_xact_commit_timestamp 
+ --------------------------
+  
+ (1 row)
+ 
+ SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
+  ?column? | ?column? | ?column? 
+ ----------+----------+----------
+  t        | t        | t
+ (1 row)
+ 
*** a/src/test/regress/parallel_schedule
--- b/src/test/regress/parallel_schedule
***************
*** 88,94 **** test: brin gin gist spgist privileges security_label collate matview lock replic
  # ----------
  # Another group of parallel tests
  # ----------
! test: alter_generic misc psql async
  
  # rules cannot run concurrently with any test that creates a view
  test: rules
--- 88,94 ----
  # ----------
  # Another group of parallel tests
  # ----------
! test: alter_generic misc psql async commit_ts
  
  # rules cannot run concurrently with any test that creates a view
  test: rules
*** a/src/test/regress/serial_schedule
--- b/src/test/regress/serial_schedule
***************
*** 110,115 **** test: alter_generic
--- 110,116 ----
  test: misc
  test: psql
  test: async
+ test: commit_ts
  test: rules
  test: event_trigger
  test: select_views
*** /dev/null
--- b/src/test/regress/sql/commit_ts.sql
***************
*** 0 ****
--- 1,23 ----
+ --
+ -- Commit Timestamp
+ --
+ SHOW track_commit_timestamp;
+ 
+ CREATE TABLE committs_test(id serial, ts timestamptz default now());
+ 
+ INSERT INTO committs_test DEFAULT VALUES;
+ INSERT INTO committs_test DEFAULT VALUES;
+ INSERT INTO committs_test DEFAULT VALUES;
+ 
+ SELECT id,
+        pg_xact_commit_timestamp(xmin) >= ts,
+        pg_xact_commit_timestamp(xmin) < now(),
+        pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+ FROM committs_test
+ ORDER BY id;
+ 
+ DROP TABLE committs_test;
+ 
+ SELECT pg_xact_commit_timestamp('0'::xid);
+ 
+ SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
#119Fujii Masao
masao.fujii@gmail.com
In reply to: Alvaro Herrera (#118)
Re: tracking commit timestamps

On Tue, Nov 25, 2014 at 7:58 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

And here is v10 which fixes conflicts with Heikki's WAL API changes (no
changes otherwise).

After some slight additional changes, here's v11, which I intend to
commit early tomorrow. The main change is moving the test module from
contrib to src/test/modules.

When I specify the XID of the aborted transaction in pg_xact_commit_timestamp(),
it always returns 2000-01-01 09:00:00+09. Is this intentional?

Can I check my understanding? Probably we cannot use this feature to calculate
the actual replication lag by, for example, comparing the result of
pg_last_committed_xact() in the master and that of
pg_last_xact_replay_timestamp()
in the standby. Because pg_last_xact_replay_timestamp() can return even
the timestamp of aborted transaction, but pg_last_committed_xact()
cannot. Right?

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#120Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Fujii Masao (#119)
Re: tracking commit timestamps

Fujii Masao wrote:

On Tue, Nov 25, 2014 at 7:58 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

And here is v10 which fixes conflicts with Heikki's WAL API changes (no
changes otherwise).

After some slight additional changes, here's v11, which I intend to
commit early tomorrow. The main change is moving the test module from
contrib to src/test/modules.

When I specify the XID of the aborted transaction in pg_xact_commit_timestamp(),
it always returns 2000-01-01 09:00:00+09. Is this intentional?

Well, when a transaction has not committed, nothing is written so on
reading we get all zeroes which corresponds to the timestamp you give.
So yeah, it is intentional. We could alternatively check pg_clog and
raise an error if the transaction is not marked either COMMITTED or
SUBCOMMITTED, but I'm not real sure there's much point.

The other option is to record a "commit" time for aborted transactions
too, but that doesn't seem very good either: first, this doesn't do
anything for crashed or for in-progress transactions; and second, how
does it make sense to have a "commit" time for a transaction that
doesn't actually commit?

Can I check my understanding? Probably we cannot use this feature to calculate
the actual replication lag by, for example, comparing the result of
pg_last_committed_xact() in the master and that of
pg_last_xact_replay_timestamp()
in the standby. Because pg_last_xact_replay_timestamp() can return even
the timestamp of aborted transaction, but pg_last_committed_xact()
cannot. Right?

I don't think it's suited for that. I guess if you recorded the time
of the last transaction that actually committed, you can use that.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#121Fujii Masao
masao.fujii@gmail.com
In reply to: Alvaro Herrera (#120)
Re: tracking commit timestamps

On Tue, Nov 25, 2014 at 11:19 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Fujii Masao wrote:

On Tue, Nov 25, 2014 at 7:58 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

And here is v10 which fixes conflicts with Heikki's WAL API changes (no
changes otherwise).

After some slight additional changes, here's v11, which I intend to
commit early tomorrow. The main change is moving the test module from
contrib to src/test/modules.

When I specify the XID of the aborted transaction in pg_xact_commit_timestamp(),
it always returns 2000-01-01 09:00:00+09. Is this intentional?

Well, when a transaction has not committed, nothing is written so on
reading we get all zeroes which corresponds to the timestamp you give.
So yeah, it is intentional. We could alternatively check pg_clog and
raise an error if the transaction is not marked either COMMITTED or
SUBCOMMITTED, but I'm not real sure there's much point.

The other option is to record a "commit" time for aborted transactions
too, but that doesn't seem very good either: first, this doesn't do
anything for crashed or for in-progress transactions; and second, how
does it make sense to have a "commit" time for a transaction that
doesn't actually commit?

What about the PREPARE and COMMIT PREPARED transactions?
ISTM that this feature tracks neither of them.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#122Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#120)
Re: tracking commit timestamps

On Tue, Nov 25, 2014 at 9:19 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Fujii Masao wrote:

On Tue, Nov 25, 2014 at 7:58 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

And here is v10 which fixes conflicts with Heikki's WAL API changes (no
changes otherwise).

After some slight additional changes, here's v11, which I intend to
commit early tomorrow. The main change is moving the test module from
contrib to src/test/modules.

When I specify the XID of the aborted transaction in pg_xact_commit_timestamp(),
it always returns 2000-01-01 09:00:00+09. Is this intentional?

Well, when a transaction has not committed, nothing is written so on
reading we get all zeroes which corresponds to the timestamp you give.
So yeah, it is intentional. We could alternatively check pg_clog and
raise an error if the transaction is not marked either COMMITTED or
SUBCOMMITTED, but I'm not real sure there's much point.

Maybe 0 should get translated to a NULL return, instead of a bogus timestamp.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#123Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#122)
Re: tracking commit timestamps

Robert Haas wrote:

On Tue, Nov 25, 2014 at 9:19 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Fujii Masao wrote:

On Tue, Nov 25, 2014 at 7:58 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

And here is v10 which fixes conflicts with Heikki's WAL API changes (no
changes otherwise).

After some slight additional changes, here's v11, which I intend to
commit early tomorrow. The main change is moving the test module from
contrib to src/test/modules.

When I specify the XID of the aborted transaction in pg_xact_commit_timestamp(),
it always returns 2000-01-01 09:00:00+09. Is this intentional?

Well, when a transaction has not committed, nothing is written so on
reading we get all zeroes which corresponds to the timestamp you give.
So yeah, it is intentional. We could alternatively check pg_clog and
raise an error if the transaction is not marked either COMMITTED or
SUBCOMMITTED, but I'm not real sure there's much point.

Maybe 0 should get translated to a NULL return, instead of a bogus timestamp.

That's one idea --- surely no transaction is going to commit at 00:00:00
on 2000-01-01 anymore. Yet this is somewhat discomforting.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#124Petr Jelinek
petr@2ndquadrant.com
In reply to: Alvaro Herrera (#123)
Re: tracking commit timestamps

On 25/11/14 16:23, Alvaro Herrera wrote:

Robert Haas wrote:

On Tue, Nov 25, 2014 at 9:19 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Fujii Masao wrote:

On Tue, Nov 25, 2014 at 7:58 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

And here is v10 which fixes conflicts with Heikki's WAL API changes (no
changes otherwise).

After some slight additional changes, here's v11, which I intend to
commit early tomorrow. The main change is moving the test module from
contrib to src/test/modules.

When I specify the XID of the aborted transaction in pg_xact_commit_timestamp(),
it always returns 2000-01-01 09:00:00+09. Is this intentional?

Well, when a transaction has not committed, nothing is written so on
reading we get all zeroes which corresponds to the timestamp you give.
So yeah, it is intentional. We could alternatively check pg_clog and
raise an error if the transaction is not marked either COMMITTED or
SUBCOMMITTED, but I'm not real sure there's much point.

Maybe 0 should get translated to a NULL return, instead of a bogus timestamp.

That's one idea --- surely no transaction is going to commit at 00:00:00
on 2000-01-01 anymore. Yet this is somewhat discomforting.

I solved it for xids that are out of range by returning -infinity and
then changing that to NULL in sql interface, but no idea how to do that
for aborted transactions.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#125Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Petr Jelinek (#124)
Re: tracking commit timestamps

Petr Jelinek wrote:

On 25/11/14 16:23, Alvaro Herrera wrote:

Robert Haas wrote:

Maybe 0 should get translated to a NULL return, instead of a bogus timestamp.

That's one idea --- surely no transaction is going to commit at 00:00:00
on 2000-01-01 anymore. Yet this is somewhat discomforting.

I solved it for xids that are out of range by returning -infinity and then
changing that to NULL in sql interface, but no idea how to do that for
aborted transactions.

I guess the idea is that we just read the value from the slru and if it
exactly matches allballs we do the same -infinity return and translation
to NULL. (Do we really love this -infinity idea? If it's just an
internal API we can use a boolean instead.)

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#126Petr Jelinek
petr@2ndquadrant.com
In reply to: Alvaro Herrera (#125)
Re: tracking commit timestamps

On 25/11/14 16:30, Alvaro Herrera wrote:

Petr Jelinek wrote:

On 25/11/14 16:23, Alvaro Herrera wrote:

Robert Haas wrote:

Maybe 0 should get translated to a NULL return, instead of a bogus timestamp.

That's one idea --- surely no transaction is going to commit at 00:00:00
on 2000-01-01 anymore. Yet this is somewhat discomforting.

I solved it for xids that are out of range by returning -infinity and then
changing that to NULL in sql interface, but no idea how to do that for
aborted transactions.

I guess the idea is that we just read the value from the slru and if it
exactly matches allballs we do the same -infinity return and translation
to NULL. (Do we really love this -infinity idea? If it's just an
internal API we can use a boolean instead.)

As in returning boolean instead of void as "found"? That works for me
(for the C interface).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#127Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#119)
Re: tracking commit timestamps

On 25 November 2014 at 13:35, Fujii Masao <masao.fujii@gmail.com> wrote:

Can I check my understanding? Probably we cannot use this feature to calculate
the actual replication lag by, for example, comparing the result of
pg_last_committed_xact() in the master and that of
pg_last_xact_replay_timestamp()
in the standby. Because pg_last_xact_replay_timestamp() can return even
the timestamp of aborted transaction, but pg_last_committed_xact()
cannot. Right?

It was intended for that, but I forgot that
pg_last_xact_replay_timestamp() includes abort as well.

I suggest we add a function that returns both the xid and timestamp on
the standby:
* pg_last_commit_replay_info() - which returns both the xid and
timestamp of the last commit replayed on standby
* then we use the xid from the standby to lookup the commit timestamp
on the master.
We then have two timestamps that refer to the same xid and can
subtract to give us an exact replication lag.

That can be done manually by user if requested.

We can also do that by sending the replay info back as a feedback
message from standby to master, so the information can be calculated
by pg_stat_replication when requested.

I'll work on that once we have this current patch committed.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#128Petr Jelinek
petr@2ndquadrant.com
In reply to: Simon Riggs (#127)
Re: tracking commit timestamps

On 25/11/14 17:16, Simon Riggs wrote:

On 25 November 2014 at 13:35, Fujii Masao <masao.fujii@gmail.com> wrote:

Can I check my understanding? Probably we cannot use this feature to calculate
the actual replication lag by, for example, comparing the result of
pg_last_committed_xact() in the master and that of
pg_last_xact_replay_timestamp()
in the standby. Because pg_last_xact_replay_timestamp() can return even
the timestamp of aborted transaction, but pg_last_committed_xact()
cannot. Right?

It was intended for that, but I forgot that
pg_last_xact_replay_timestamp() includes abort as well.

I suggest we add a function that returns both the xid and timestamp on
the standby:
* pg_last_commit_replay_info() - which returns both the xid and
timestamp of the last commit replayed on standby
* then we use the xid from the standby to lookup the commit timestamp
on the master.
We then have two timestamps that refer to the same xid and can
subtract to give us an exact replication lag.

Won't the pg_last_committed_xact() on slave + pg_xact_commit_timestamp()
on master with the xid returned by slave accomplish the same thing?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#129Simon Riggs
simon@2ndQuadrant.com
In reply to: Petr Jelinek (#128)
Re: tracking commit timestamps

On 25 November 2014 at 16:18, Petr Jelinek <petr@2ndquadrant.com> wrote:

Won't the pg_last_committed_xact() on slave + pg_xact_commit_timestamp() on
master with the xid returned by slave accomplish the same thing?

Surely the pg_last_committed_xact() will return the same value on
standby as it did on the master?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#130Michael Paquier
michael.paquier@gmail.com
In reply to: Simon Riggs (#129)
Re: tracking commit timestamps

On Wed, Nov 26, 2014 at 1:51 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 25 November 2014 at 16:18, Petr Jelinek <petr@2ndquadrant.com> wrote:

Won't the pg_last_committed_xact() on slave + pg_xact_commit_timestamp() on
master with the xid returned by slave accomplish the same thing?

Surely the pg_last_committed_xact() will return the same value on
standby as it did on the master?

It should. Now it needs some extra help as well as in its current
shape this patch will WAL log a commit timestamp if the Node ID is
valid, per RecordTransactionCommit. The node ID can be set only
through CommitTsSetDefaultNodeId, which is called nowhere actually. So
if an extension or an extra library needs to do some leg work to have
to allow this information to be replayed on other nodes.
Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#131Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Petr Jelinek (#126)
1 attachment(s)
Re: tracking commit timestamps

Petr Jelinek wrote:

On 25/11/14 16:30, Alvaro Herrera wrote:

Petr Jelinek wrote:

On 25/11/14 16:23, Alvaro Herrera wrote:

Robert Haas wrote:

Maybe 0 should get translated to a NULL return, instead of a bogus timestamp.

That's one idea --- surely no transaction is going to commit at 00:00:00
on 2000-01-01 anymore. Yet this is somewhat discomforting.

I solved it for xids that are out of range by returning -infinity and then
changing that to NULL in sql interface, but no idea how to do that for
aborted transactions.

I guess the idea is that we just read the value from the slru and if it
exactly matches allballs we do the same -infinity return and translation
to NULL. (Do we really love this -infinity idea? If it's just an
internal API we can use a boolean instead.)

As in returning boolean instead of void as "found"? That works for me
(for the C interface).

Petr sent me privately some changes to implement this idea. I also
reworked the tests so that they only happen on src/test/modules (getting
rid of the one in core regress) and made them work with both enabled and
disabled track_commit_timestamps; in make installcheck, they pass
regardless of the setting of the installed server, and in make check,
they run a server with the setting enabled.

I made two more changes:
1. introduce newestCommitTs. Original code was using lastCommitXact to
check that no "future" transaction is asked for, but this doesn't really
work if a long-running transaction is committed, because asking for
transactions with a higher Xid but which were committed earlier would
raise an error.

2. change CommitTsControlLock to CommitTsLock as the lock that protects
the stuff we keep in ShmemVariableCache. The Control one is for SLRU
access, and so it might be slow at times. This is important because we
fill the checkpoint struct from values protected by that lock, so a
checkpoint might be delayed if it happens to land in the middle of a
slru IO operation.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

committs_v12.patchtext/x-diff; charset=us-asciiDownload
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
index 3b8241b..f0a023f 100644
--- a/contrib/pg_upgrade/pg_upgrade.c
+++ b/contrib/pg_upgrade/pg_upgrade.c
@@ -423,8 +423,10 @@ copy_clog_xlog_xid(void)
 	/* set the next transaction id and epoch of the new cluster */
 	prep_status("Setting next transaction ID and epoch for new cluster");
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
-			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
-			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
+			  "\"%s/pg_resetxlog\" -f -x %u -c %u \"%s\"",
+			  new_cluster.bindir,
+			  old_cluster.controldata.chkpnt_nxtxid,
+			  old_cluster.controldata.chkpnt_nxtxid,
 			  new_cluster.pgdata);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
 			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index 9397198..180818d 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -10,6 +10,7 @@
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
+#include "access/commit_ts.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ab8c263..e3713d3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2673,6 +2673,20 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
+      <term><varname>track_commit_timestamp</varname> (<type>bool</type>)</term>
+      <indexterm>
+       <primary><varname>track_commit_timestamp</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Record commit time of transactions. This parameter
+        can only be set in <filename>postgresql.conf</> file or on the server
+        command line. The default value is <literal>off</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index baf81ee..62ec275 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15938,6 +15938,45 @@ SELECT collation for ('foo' COLLATE "de_DE");
     For example <literal>10:20:10,14,15</literal> means
     <literal>xmin=10, xmax=20, xip_list=10, 14, 15</literal>.
    </para>
+
+   <para>
+    The functions shown in <xref linkend="functions-commit-timestamp">
+    provide information about transactions that have been already committed.
+    These functions mainly provide information about when the transactions
+    were committed. They only provide useful data when
+    <xref linkend="guc-track-commit-timestamp"> configuration option is enabled
+    and only for transactions that were committed after it was enabled.
+   </para>
+
+   <table id="functions-commit-timestamp">
+    <title>Committed transaction information</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry></row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <indexterm><primary>pg_xact_commit_timestamp</primary></indexterm>
+        <literal><function>pg_xact_commit_timestamp(<parameter>xid</parameter>)</function></literal>
+       </entry>
+       <entry><type>timestamp with time zone</type></entry>
+       <entry>get commit timestamp of a transaction</entry>
+      </row>
+
+      <row>
+       <entry>
+        <indexterm><primary>pg_last_committed_xact</primary></indexterm>
+        <literal><function>pg_last_committed_xact()</function></literal>
+       </entry>
+       <entry><parameter>xid</> <type>xid</>, <parameter>timestamp</> <type>timestamp with time zone</></entry>
+       <entry>get transaction ID and commit timestamp of latest committed transaction</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
   </sect1>
 
   <sect1 id="functions-admin">
diff --git a/doc/src/sgml/ref/pg_resetxlog.sgml b/doc/src/sgml/ref/pg_resetxlog.sgml
index aba7185..f97a052 100644
--- a/doc/src/sgml/ref/pg_resetxlog.sgml
+++ b/doc/src/sgml/ref/pg_resetxlog.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
  <refsynopsisdiv>
   <cmdsynopsis>
    <command>pg_resetxlog</command>
+   <arg choice="opt"><option>-c</option> <replaceable class="parameter">xid</replaceable></arg>
    <arg choice="opt"><option>-f</option></arg>
    <arg choice="opt"><option>-n</option></arg>
    <arg choice="opt"><option>-o</option> <replaceable class="parameter">oid</replaceable></arg>
@@ -77,12 +78,12 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The <option>-o</>, <option>-x</>, <option>-e</>,
-   <option>-m</>, <option>-O</>,
-   and <option>-l</>
+   The <option>-o</>, <option>-x</>, <option>-m</>, <option>-O</>,
+   <option>-l</> and <option>-e</>
    options allow the next OID, next transaction ID, next transaction ID's
-   epoch, next and oldest multitransaction ID, next multitransaction offset, and WAL
-   starting address values to be set manually.  These are only needed when
+   epoch, next and oldest multitransaction ID, next multitransaction offset, WAL
+   starting address and the oldest transaction ID for which the commit time can
+   be retrieved values to be set manually.  These are only needed when
    <command>pg_resetxlog</command> is unable to determine appropriate values
    by reading <filename>pg_control</>.  Safe values can be determined as
    follows:
@@ -130,6 +131,15 @@ PostgreSQL documentation
 
     <listitem>
      <para>
+      A safe value for the oldest transaction ID for which the commit time can
+      be retrieved (<option>-c</>) can be determined by looking for the
+      numerically smallest file name in the directory <filename>pg_committs</>
+      under the data directory.  As above, the file names are in hexadecimal.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
       The WAL starting address (<option>-l</>) should be
       larger than any WAL segment file name currently existing in
       the directory <filename>pg_xlog</> under the data directory.
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 920b5f0..cb76b98 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -67,6 +67,11 @@ Item
 </row>
 
 <row>
+ <entry><filename>pg_commit_ts</></entry>
+ <entry>Subdirectory containing transaction commit timestamp data</entry>
+</row>
+
+<row>
  <entry><filename>pg_clog</></entry>
  <entry>Subdirectory containing transaction commit status data</entry>
 </row>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 32cb985..d18e8ec 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,7 +8,7 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o dbasedesc.o gindesc.o gistdesc.o \
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
 	   hashdesc.o heapdesc.o \
 	   mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
diff --git a/src/backend/access/rmgrdesc/committsdesc.c b/src/backend/access/rmgrdesc/committsdesc.c
new file mode 100644
index 0000000..a10c7df
--- /dev/null
+++ b/src/backend/access/rmgrdesc/committsdesc.c
@@ -0,0 +1,82 @@
+/*-------------------------------------------------------------------------
+ *
+ * committsdesc.c
+ *    rmgr descriptor routines for access/transam/committs.c
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *    src/backend/access/rmgrdesc/committsdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/commit_ts.h"
+#include "utils/timestamp.h"
+
+
+void
+commit_ts_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == COMMIT_TS_ZEROPAGE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "%d", pageno);
+	}
+	else if (info == COMMIT_TS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, rec, sizeof(int));
+		appendStringInfo(buf, "%d", pageno);
+	}
+	else if (info == COMMIT_TS_SETTS)
+	{
+		xl_commit_ts_set *xlrec = (xl_commit_ts_set *) rec;
+		int		nsubxids;
+
+		appendStringInfo(buf, "set %s/%d for: %u",
+						 timestamptz_to_str(xlrec->timestamp),
+						 xlrec->nodeid,
+						 xlrec->mainxid);
+		nsubxids = ((XLogRecGetDataLen(record) - SizeOfCommitTsSet) /
+					sizeof(TransactionId));
+		if (nsubxids > 0)
+		{
+			int		i;
+			TransactionId *subxids;
+
+			subxids = palloc(sizeof(TransactionId) * nsubxids);
+			memcpy(subxids,
+				   XLogRecGetData(record) + SizeOfCommitTsSet,
+				   sizeof(TransactionId) * nsubxids);
+			for (i = 0; i < nsubxids; i++)
+				appendStringInfo(buf, ", %u", subxids[i]);
+			pfree(subxids);
+		}
+	}
+}
+
+const char *
+commit_ts_identify(uint8 info)
+{
+	switch (info)
+	{
+		case COMMIT_TS_ZEROPAGE:
+			return "ZEROPAGE";
+		case COMMIT_TS_TRUNCATE:
+			return "TRUNCATE";
+		case COMMIT_TS_SETTS:
+			return "SETTS";
+		default:
+			return NULL;
+	}
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index eba046d..4d1fe43 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -45,7 +45,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "redo %X/%X; "
 						 "tli %u; prev tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest multi %u in DB %u; "
-						 "oldest running xid %u; %s",
+						 "oldest commit timestamp xid: %u; oldest running xid %u; %s",
 				(uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->PrevTimeLineID,
@@ -58,6 +58,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 checkpoint->oldestXidDB,
 						 checkpoint->oldestMulti,
 						 checkpoint->oldestMultiDB,
+						 checkpoint->oldestCommitTs,
 						 checkpoint->oldestActiveXid,
 				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 82a6c76..9d4d5db 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -12,8 +12,9 @@ subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
-	timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
+OBJS = clog.o commit_ts.o multixact.o rmgr.o slru.o subtrans.o \
+	timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
+	xact.o xlog.o xlogarchive.o xlogfuncs.o \
 	xloginsert.o xlogreader.o xlogutils.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index b619de5..bc68b47 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -840,7 +840,7 @@ parent transaction to complete.
 
 Not all transactional behaviour is emulated, for example we do not insert
 a transaction entry into the lock table, nor do we maintain the transaction
-stack in memory. Clog and multixact entries are made normally.
+stack in memory. Clog, multixact and commit_ts entries are made normally.
 Subtrans is maintained during recovery but the details of the transaction
 tree are ignored and all subtransactions reference the top-level TransactionId
 directly. Since commit is atomic this provides correct lock wait behaviour
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 313bd04..cb7ef28 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -419,7 +419,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
  *
  * Testing during the PostgreSQL 9.2 development cycle revealed that on a
  * large multi-processor system, it was possible to have more CLOG page
- * requests in flight at one time than the numebr of CLOG buffers which existed
+ * requests in flight at one time than the number of CLOG buffers which existed
  * at that time, which was hardcoded to 8.  Further testing revealed that
  * performance dropped off with more than 32 CLOG buffers, possibly because
  * the linear buffer search algorithm doesn't scale well.
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
new file mode 100644
index 0000000..d3085ac
--- /dev/null
+++ b/src/backend/access/transam/commit_ts.c
@@ -0,0 +1,902 @@
+/*-------------------------------------------------------------------------
+ *
+ * commit_ts.c
+ *		PostgreSQL commit timestamp manager
+ *
+ * This module is a pg_clog-like system that stores the commit timestamp
+ * for each transaction.
+ *
+ * XLOG interactions: this module generates an XLOG record whenever a new
+ * CommitTs page is initialized to zeroes.  Also, one XLOG record is
+ * generated for setting of values when the caller requests it; this allows
+ * us to support values coming from places other than transaction commit.
+ * Other writes of CommitTS come from recording of transaction commit in
+ * xact.c, which generates its own XLOG records for these events and will
+ * re-perform the status update on redo; so we need make no additional XLOG
+ * entry here.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/commit_ts.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/commit_ts.h"
+#include "access/htup_details.h"
+#include "access/slru.h"
+#include "access/transam.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+
+/*
+ * Defines for CommitTs page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * CommitTs page numbering also wraps around at
+ * 0xFFFFFFFF/COMMIT_TS_XACTS_PER_PAGE, and CommitTs segment numbering at
+ * 0xFFFFFFFF/COMMIT_TS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateCommitTs (see CommitTsPagePrecedes).
+ */
+
+/*
+ * We need 8+4 bytes per xact.  Note that enlarging this struct might mean
+ * the largest possible file name is more than 5 chars long; see
+ * SlruScanDirectory.
+ */
+typedef struct CommitTimestampEntry
+{
+	TimestampTz		time;
+	CommitTsNodeId	nodeid;
+} CommitTimestampEntry;
+
+#define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, nodeid) + \
+									sizeof(CommitTsNodeId))
+
+#define COMMIT_TS_XACTS_PER_PAGE \
+	(BLCKSZ / SizeOfCommitTimestampEntry)
+
+#define TransactionIdToCTsPage(xid)	\
+	((xid) / (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+#define TransactionIdToCTsEntry(xid)	\
+	((xid) % (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CommitTs control
+ */
+static SlruCtlData CommitTsCtlData;
+
+#define CommitTsCtl (&CommitTsCtlData)
+
+/*
+ * We keep a cache of the last value set in shared memory.  This is protected
+ * by CommitTsLock.
+ */
+typedef struct CommitTimestampShared
+{
+	TransactionId	xidLastCommit;
+	CommitTimestampEntry dataLastCommit;
+} CommitTimestampShared;
+
+CommitTimestampShared	*commitTsShared;
+
+
+/* GUC variable */
+bool	track_commit_timestamp;
+
+static CommitTsNodeId default_node_id = InvalidCommitTsNodeId;
+
+static void SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz ts,
+					 CommitTsNodeId nodeid, int pageno);
+static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
+						  CommitTsNodeId nodeid, int slotno);
+static int	ZeroCommitTsPage(int pageno, bool writeXlog);
+static bool CommitTsPagePrecedes(int page1, int page2);
+static void WriteZeroPageXlogRec(int pageno);
+static void WriteTruncateXlogRec(int pageno);
+static void WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 CommitTsNodeId nodeid);
+
+
+/*
+ * CommitTsSetDefaultNodeId
+ *
+ * Set default nodeid for current backend.
+ */
+void
+CommitTsSetDefaultNodeId(CommitTsNodeId nodeid)
+{
+	default_node_id = nodeid;
+}
+
+/*
+ * CommitTsGetDefaultNodeId
+ *
+ * Set default nodeid for current backend.
+ */
+CommitTsNodeId
+CommitTsGetDefaultNodeId(void)
+{
+	return default_node_id;
+}
+
+/*
+ * TransactionTreeSetCommitTsData
+ *
+ * Record the final commit timestamp of transaction entries in the commit log
+ * for a transaction and its subtransaction tree, as efficiently as possible.
+ *
+ * xid is the top level transaction id.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ * The reason why tracking just the parent xid commit timestamp is not enough
+ * is that the subtrans SLRU does not stay valid across crashes (it's not
+ * permanent) so we need to keep the information about them here. If the
+ * subtrans implementation changes in the future, we might want to revisit the
+ * decision of storing timestamp info for each subxid.
+ *
+ * The do_xlog parameter tells us whether to include a XLog record of this
+ * or not.  Normal path through RecordTransactionCommit() will be related
+ * to a transaction commit XLog record, and so should pass "false" here.
+ * Other callers probably want to pass true, so that the given values persist
+ * in case of crashes.
+ */
+void
+TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
+							   TransactionId *subxids, TimestampTz timestamp,
+							   CommitTsNodeId nodeid, bool do_xlog)
+{
+	int			i;
+	TransactionId headxid;
+	TransactionId newestXact;
+
+	Assert(xid != InvalidTransactionId);
+
+	if (!track_commit_timestamp)
+		return;
+
+	/*
+	 * Comply with the WAL-before-data rule: if caller specified it wants
+	 * this value to be recorded in WAL, do so before touching the data.
+	 */
+	if (do_xlog)
+		WriteSetTimestampXlogRec(xid, nsubxids, subxids, timestamp, nodeid);
+
+	/*
+	 * Figure out the latest Xid in this batch: either the last subxid if
+	 * there's any, otherwise the parent xid.
+	 */
+	if (nsubxids > 0)
+		newestXact = subxids[nsubxids - 1];
+	else
+		newestXact = xid;
+
+	/*
+	 * We split the xids to set the timestamp to in groups belonging to the
+	 * same SLRU page; the first element in each such set is its head.  The
+	 * first group has the main XID as the head; subsequent sets use the
+	 * first subxid not on the previous page as head.  This way, we only have
+	 * to lock/modify each SLRU page once.
+	 */
+	for (i = 0, headxid = xid;;)
+	{
+		int			pageno = TransactionIdToCTsPage(headxid);
+		int			j;
+
+		for (j = i; j < nsubxids; j++)
+		{
+			if (TransactionIdToCTsPage(subxids[j]) != pageno)
+				break;
+		}
+		/* subxids[i..j] are on the same page as the head */
+
+		SetXidCommitTsInPage(headxid, j - i, subxids + i, timestamp, nodeid,
+							 pageno);
+
+		/* if we wrote out all subxids, we're done. */
+		if (j + 1 >= nsubxids)
+			break;
+
+		/*
+		 * Set the new head and skip over it, as well as over the subxids
+		 * we just wrote.
+		 */
+		headxid = subxids[j];
+		i += j - i + 1;
+	}
+
+	/*
+	 * Update the cached value in shared memory
+	 */
+	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+	commitTsShared->xidLastCommit = xid;
+	commitTsShared->dataLastCommit.time = timestamp;
+	commitTsShared->dataLastCommit.nodeid = nodeid;
+	LWLockRelease(CommitTsLock);
+
+	/* and move forwards our endpoint, if needed */
+	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+	if (TransactionIdPrecedes(ShmemVariableCache->newestCommitTs, newestXact))
+		ShmemVariableCache->newestCommitTs = newestXact;
+	LWLockRelease(CommitTsLock);
+}
+
+/*
+ * Record the commit timestamp of transaction entries in the commit log for all
+ * entries on a single page.  Atomic only on this page.
+ */
+static void
+SetXidCommitTsInPage(TransactionId xid, int nsubxids,
+					 TransactionId *subxids, TimestampTz ts,
+					 CommitTsNodeId nodeid, int pageno)
+{
+	int			slotno;
+	int			i;
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
+
+	TransactionIdSetCommitTs(xid, ts, nodeid, slotno);
+	for (i = 0; i < nsubxids; i++)
+		TransactionIdSetCommitTs(subxids[i], ts, nodeid, slotno);
+
+	CommitTsCtl->shared->page_dirty[slotno] = true;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Sets the commit timestamp of a single transaction.
+ *
+ * Must be called with CommitTsControlLock held
+ */
+static void
+TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
+						 CommitTsNodeId nodeid, int slotno)
+{
+	int			entryno = TransactionIdToCTsEntry(xid);
+	CommitTimestampEntry entry;
+
+	entry.time = ts;
+	entry.nodeid = nodeid;
+
+	memcpy(CommitTsCtl->shared->page_buffer[slotno] +
+		   SizeOfCommitTimestampEntry * entryno,
+		   &entry, SizeOfCommitTimestampEntry);
+}
+
+/*
+ * Interrogate the commit timestamp of a transaction.
+ *
+ * Return value indicates whether commit timestamp record was found for
+ * given xid.
+ */
+bool
+TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
+							 CommitTsNodeId *nodeid)
+{
+	int			pageno = TransactionIdToCTsPage(xid);
+	int			entryno = TransactionIdToCTsEntry(xid);
+	int			slotno;
+	CommitTimestampEntry entry;
+	TransactionId oldestCommitTs;
+	TransactionId newestCommitTs;
+
+	/* Error if module not enabled */
+	if (!track_commit_timestamp)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("could not get commit timestamp data"),
+				 errhint("Make sure the configuration parameter \"%s\" is set.",
+						 "track_commit_timestamp")));
+
+	Assert(TransactionIdIsNormal(xid));
+
+	/*
+	 * Return empty if the requested value is outside our valid range.
+	 */
+	LWLockAcquire(CommitTsLock, LW_SHARED);
+	oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	newestCommitTs = ShmemVariableCache->newestCommitTs;
+	/* neither is invalid, or both are */
+	Assert(TransactionIdIsValid(oldestCommitTs) == TransactionIdIsValid(newestCommitTs));
+	LWLockRelease(CommitTsLock);
+
+	if (!TransactionIdIsValid(oldestCommitTs) ||
+		TransactionIdPrecedes(xid, oldestCommitTs) ||
+		TransactionIdPrecedes(newestCommitTs, xid))
+	{
+		if (ts)
+			*ts = 0;
+		if (nodeid)
+			*nodeid = InvalidCommitTsNodeId;
+		return false;
+	}
+
+	/*
+	 * Use an unlocked atomic read on our cached value in shared memory; if
+	 * it's a hit, acquire a lock and read the data, after verifying that it's
+	 * still what we initially read.  Otherwise, fall through to read from
+	 * SLRU.
+	 */
+	if (commitTsShared->xidLastCommit == xid)
+	{
+		LWLockAcquire(CommitTsLock, LW_SHARED);
+		if (commitTsShared->xidLastCommit == xid)
+		{
+			if (ts)
+				*ts = commitTsShared->dataLastCommit.time;
+			if (nodeid)
+				*nodeid = commitTsShared->dataLastCommit.nodeid;
+
+			LWLockRelease(CommitTsLock);
+			return *ts != 0;
+		}
+		LWLockRelease(CommitTsLock);
+	}
+
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+	memcpy(&entry,
+		   CommitTsCtl->shared->page_buffer[slotno] +
+		   SizeOfCommitTimestampEntry * entryno,
+		   SizeOfCommitTimestampEntry);
+
+	if (ts)
+		*ts = entry.time;
+	if (nodeid)
+		*nodeid = entry.nodeid;
+
+	LWLockRelease(CommitTsControlLock);
+	return *ts != 0;
+}
+
+/*
+ * Return the Xid of the latest committed transaction.  (As far as this module
+ * is concerned, anyway; it's up to the caller to ensure the value is useful
+ * for its purposes.)
+ *
+ * ts and extra are filled with the corresponding data; they can be passed
+ * as NULL if not wanted.
+ */
+TransactionId
+GetLatestCommitTsData(TimestampTz *ts, CommitTsNodeId *nodeid)
+{
+	TransactionId	xid;
+
+	/* Error if module not enabled */
+	if (!track_commit_timestamp)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("could not get commit timestamp data"),
+				 errhint("Make sure the configuration parameter \"%s\" is set.",
+						 "track_commit_timestamp")));
+
+	LWLockAcquire(CommitTsLock, LW_SHARED);
+	xid = commitTsShared->xidLastCommit;
+	if (ts)
+		*ts = commitTsShared->dataLastCommit.time;
+	if (nodeid)
+		*nodeid = commitTsShared->dataLastCommit.nodeid;
+	LWLockRelease(CommitTsLock);
+
+	return xid;
+}
+
+/*
+ * SQL-callable wrapper to obtain commit time of a transaction
+ */
+Datum
+pg_xact_commit_timestamp(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid = PG_GETARG_UINT32(0);
+	TimestampTz		ts;
+	bool			found;
+
+	found = TransactionIdGetCommitTsData(xid, &ts, NULL);
+
+	if (!found)
+		PG_RETURN_NULL();
+
+	PG_RETURN_TIMESTAMPTZ(ts);
+}
+
+
+Datum
+pg_last_committed_xact(PG_FUNCTION_ARGS)
+{
+	TransactionId	xid;
+	TimestampTz		ts;
+	Datum       values[2];
+	bool        nulls[2];
+	TupleDesc   tupdesc;
+	HeapTuple	htup;
+
+	/* and construct a tuple with our data */
+	xid = GetLatestCommitTsData(&ts, NULL);
+
+	/*
+	 * Construct a tuple descriptor for the result row.  This must match this
+	 * function's pg_proc entry!
+	 */
+	tupdesc = CreateTemplateTupleDesc(2, false);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "xid",
+					   XIDOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "timestamp",
+					   TIMESTAMPTZOID, -1, 0);
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	if (xid == InvalidTransactionId)
+	{
+		memset(nulls, true, sizeof(nulls));
+	}
+	else
+	{
+		values[0] = TransactionIdGetDatum(xid);
+		nulls[0] = false;
+
+		values[1] = TimestampTzGetDatum(ts);
+		nulls[1] = false;
+	}
+
+	htup = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+
+/*
+ * Number of shared CommitTS buffers.
+ *
+ * We use a very similar logic as for the number of CLOG buffers; see comments
+ * in CLOGShmemBuffers.
+ */
+Size
+CommitTsShmemBuffers(void)
+{
+	return Min(16, Max(4, NBuffers / 1024));
+}
+
+/*
+ * Shared memory sizing for CommitTs
+ */
+Size
+CommitTsShmemSize(void)
+{
+	return SimpleLruShmemSize(CommitTsShmemBuffers(), 0) +
+		sizeof(CommitTimestampShared);
+}
+
+/*
+ * Initialize CommitTs at system startup (postmaster start or standalone
+ * backend)
+ */
+void
+CommitTsShmemInit(void)
+{
+	bool	found;
+
+	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
+	SimpleLruInit(CommitTsCtl, "CommitTs Ctl", CommitTsShmemBuffers(), 0,
+				  CommitTsControlLock, "pg_commit_ts");
+
+	commitTsShared = ShmemInitStruct("CommitTs shared",
+									 sizeof(CommitTimestampShared),
+									 &found);
+
+	if (!IsUnderPostmaster)
+	{
+		Assert(!found);
+
+		commitTsShared->xidLastCommit = InvalidTransactionId;
+		TIMESTAMP_NOBEGIN(commitTsShared->dataLastCommit.time);
+		commitTsShared->dataLastCommit.nodeid = InvalidCommitTsNodeId;
+	}
+	else
+		Assert(found);
+}
+
+/*
+ * This function must be called ONCE on system install.
+ *
+ * (The CommitTs directory is assumed to have been created by initdb, and
+ * CommitTsShmemInit must have been called already.)
+ */
+void
+BootStrapCommitTs(void)
+{
+	/*
+	 * Nothing to do here at present, unlike most other SLRU modules; segments
+	 * are created when the server is started with this module enabled.
+	 * See StartupCommitTs.
+	 */
+}
+
+/*
+ * Initialize (or reinitialize) a page of CommitTs to zeroes.
+ * If writeXlog is TRUE, also emit an XLOG record saying we did this.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroCommitTsPage(int pageno, bool writeXlog)
+{
+	int			slotno;
+
+	slotno = SimpleLruZeroPage(CommitTsCtl, pageno);
+
+	if (writeXlog)
+		WriteZeroPageXlogRec(pageno);
+
+	return slotno;
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * after StartupXLOG has initialized ShmemVariableCache->nextXid.
+ */
+void
+StartupCommitTs(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/*
+	 * Initialize our idea of the latest page number.
+	 */
+	CommitTsCtl->shared->latest_page_number = pageno;
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * when commit timestamp is enabled.  Must be called after recovery has
+ * finished.
+ *
+ * This is in charge of creating the currently active segment, if it's not
+ * already there.  The reason for this is that the server might have been
+ * running with this module disabled for a while and thus might have skipped
+ * the normal creation point.
+ */
+void
+CompleteCommitTsInitialization(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	/*
+	 * Re-Initialize our idea of the latest page number.
+	 */
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+	CommitTsCtl->shared->latest_page_number = pageno;
+	LWLockRelease(CommitTsControlLock);
+
+	/*
+	 * If this module is not currently enabled, make sure we don't hand back
+	 * possibly-invalid data; also remove segments of old data.
+	 */
+	if (!track_commit_timestamp)
+	{
+		LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+		ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
+		ShmemVariableCache->newestCommitTs = InvalidTransactionId;
+		LWLockRelease(CommitTsLock);
+
+		TruncateCommitTs(ReadNewTransactionId());
+
+		return;
+	}
+
+	/*
+	 * If CommitTs is enabled, but it wasn't in the previous server run, we
+	 * need to set the oldest and newest values to the next Xid; that way, we
+	 * will not try to read data that might not have been set.
+	 *
+	 * XXX does this have a problem if a server is started with commitTs
+	 * enabled, then started with commitTs disabled, then restarted with it
+	 * enabled again?  It doesn't look like it does, because there should be a
+	 * checkpoint that sets the value to InvalidTransactionId at end of
+	 * recovery; and so any chance of injecting new transactions without
+	 * CommitTs values would occur after the oldestCommitTs has been set to
+	 * Invalid temporarily.
+	 */
+	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+	if (ShmemVariableCache->oldestCommitTs == InvalidTransactionId)
+	{
+		ShmemVariableCache->oldestCommitTs =
+			ShmemVariableCache->newestCommitTs = ReadNewTransactionId();
+	}
+	LWLockRelease(CommitTsLock);
+
+	/* Finally, create the current segment file, if necessary */
+	if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
+	{
+		int		slotno;
+
+		LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+		LWLockRelease(CommitTsControlLock);
+	}
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointCommitTs(void)
+{
+	/* Flush dirty CommitTs pages to disk */
+	SimpleLruFlush(CommitTsCtl, true);
+}
+
+/*
+ * Make sure that CommitTs has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty CommitTs or xlog page to make room
+ * in shared memory.
+ *
+ * NB: the current implementation relies on track_commit_timestamp being
+ * PGC_POSTMASTER.
+ */
+void
+ExtendCommitTs(TransactionId newestXact)
+{
+	int			pageno;
+
+	/* nothing to do if module not enabled */
+	if (!track_commit_timestamp)
+		return;
+
+	/*
+	 * No work except at first XID of a page.  But beware: just after
+	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
+	 */
+	if (TransactionIdToCTsEntry(newestXact) != 0 &&
+		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
+		return;
+
+	pageno = TransactionIdToCTsPage(newestXact);
+
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+	/* Zero the page and make an XLOG entry about it */
+	ZeroCommitTsPage(pageno, !InRecovery);
+
+	LWLockRelease(CommitTsControlLock);
+}
+
+/*
+ * Remove all CommitTs segments before the one holding the passed
+ * transaction ID.
+ *
+ * Note that we don't need to flush XLOG here.
+ */
+void
+TruncateCommitTs(TransactionId oldestXact)
+{
+	int			cutoffPage;
+
+	/*
+	 * The cutoff point is the start of the segment containing oldestXact. We
+	 * pass the *page* containing oldestXact to SimpleLruTruncate.
+	 */
+	cutoffPage = TransactionIdToCTsPage(oldestXact);
+
+	/* Check to see if there's any files that could be removed */
+	if (!SlruScanDirectory(CommitTsCtl, SlruScanDirCbReportPresence,
+						   &cutoffPage))
+		return;					/* nothing to remove */
+
+	/* Write XLOG record */
+	WriteTruncateXlogRec(cutoffPage);
+
+	/* Now we can remove the old CommitTs segment(s) */
+	SimpleLruTruncate(CommitTsCtl, cutoffPage);
+}
+
+/*
+ * Set the limit values between which commit TS can be consulted.
+ */
+void
+SetCommitTsLimit(TransactionId oldestXact, TransactionId newestXact)
+{
+	/*
+	 * Be careful not to overwrite values that are either further into the
+	 * "future" or signal a disabled committs.
+	 */
+	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+	if (ShmemVariableCache->oldestCommitTs != InvalidTransactionId)
+	{
+		if (TransactionIdPrecedes(ShmemVariableCache->oldestCommitTs, oldestXact))
+			ShmemVariableCache->oldestCommitTs = oldestXact;
+		if (TransactionIdPrecedes(newestXact, ShmemVariableCache->newestCommitTs))
+			ShmemVariableCache->newestCommitTs = newestXact;
+	}
+	else
+	{
+		Assert(ShmemVariableCache->newestCommitTs == InvalidTransactionId);
+	}
+	LWLockRelease(CommitTsLock);
+}
+
+/*
+ * Move forwards the oldest commitTS value that can be consulted
+ */
+void
+AdvanceOldestCommitTs(TransactionId oldestXact)
+{
+	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+	if (ShmemVariableCache->oldestCommitTs != InvalidTransactionId &&
+		TransactionIdPrecedes(ShmemVariableCache->oldestCommitTs, oldestXact))
+		ShmemVariableCache->oldestCommitTs = oldestXact;
+	LWLockRelease(CommitTsLock);
+}
+
+
+/*
+ * Decide which of two CLOG page numbers is "older" for truncation purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic.  However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+CommitTsPagePrecedes(int page1, int page2)
+{
+	TransactionId xid1;
+	TransactionId xid2;
+
+	xid1 = ((TransactionId) page1) * COMMIT_TS_XACTS_PER_PAGE;
+	xid1 += FirstNormalTransactionId;
+	xid2 = ((TransactionId) page2) * COMMIT_TS_XACTS_PER_PAGE;
+	xid2 += FirstNormalTransactionId;
+
+	return TransactionIdPrecedes(xid1, xid2);
+}
+
+
+/*
+ * Write a ZEROPAGE xlog record
+ */
+static void
+WriteZeroPageXlogRec(int pageno)
+{
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&pageno), sizeof(int));
+	(void) XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_ZEROPAGE);
+}
+
+/*
+ * Write a TRUNCATE xlog record
+ */
+static void
+WriteTruncateXlogRec(int pageno)
+{
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&pageno), sizeof(int));
+	(void) XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_TRUNCATE);
+}
+
+/*
+ * Write a SETTS xlog record
+ */
+static void
+WriteSetTimestampXlogRec(TransactionId mainxid, int nsubxids,
+						 TransactionId *subxids, TimestampTz timestamp,
+						 CommitTsNodeId nodeid)
+{
+	xl_commit_ts_set	record;
+
+	record.timestamp = timestamp;
+	record.nodeid = nodeid;
+	record.mainxid = mainxid;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &record,
+					 offsetof(xl_commit_ts_set, mainxid) +
+					 sizeof(TransactionId));
+	XLogRegisterData((char *) subxids, nsubxids * sizeof(TransactionId));
+	XLogInsert(RM_COMMIT_TS_ID, COMMIT_TS_SETTS);
+}
+
+/*
+ * CommitTS resource manager's routines
+ */
+void
+commit_ts_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	/* Backup blocks are not used in commit_ts records */
+	Assert(!XLogRecHasAnyBlockRefs(record));
+
+	if (info == COMMIT_TS_ZEROPAGE)
+	{
+		int			pageno;
+		int			slotno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+
+		slotno = ZeroCommitTsPage(pageno, false);
+		SimpleLruWritePage(CommitTsCtl, slotno);
+		Assert(!CommitTsCtl->shared->page_dirty[slotno]);
+
+		LWLockRelease(CommitTsControlLock);
+	}
+	else if (info == COMMIT_TS_TRUNCATE)
+	{
+		int			pageno;
+
+		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+
+		/*
+		 * During XLOG replay, latest_page_number isn't set up yet; insert a
+		 * suitable value to bypass the sanity test in SimpleLruTruncate.
+		 */
+		CommitTsCtl->shared->latest_page_number = pageno;
+
+		SimpleLruTruncate(CommitTsCtl, pageno);
+	}
+	else if (info == COMMIT_TS_SETTS)
+	{
+		xl_commit_ts_set *setts = (xl_commit_ts_set *) XLogRecGetData(record);
+		int			nsubxids;
+		TransactionId *subxids;
+
+		nsubxids = ((XLogRecGetDataLen(record) - SizeOfCommitTsSet) /
+					sizeof(TransactionId));
+		if (nsubxids > 0)
+		{
+			subxids = palloc(sizeof(TransactionId) * nsubxids);
+			memcpy(subxids,
+				   XLogRecGetData(record) + SizeOfCommitTsSet,
+				   sizeof(TransactionId) * nsubxids);
+		}
+		else
+			subxids = NULL;
+
+		TransactionTreeSetCommitTsData(setts->mainxid, nsubxids, subxids,
+									   setts->timestamp, setts->nodeid, false);
+		if (subxids)
+			pfree(subxids);
+	}
+	else
+		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index befd60f..dcf423b 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -8,6 +8,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/commit_ts.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 1f9a100..15596c7 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1297,7 +1297,7 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
 
 		len = strlen(clde->d_name);
 
-		if ((len == 4 || len == 5) &&
+		if ((len == 4 || len == 5 || len == 6) &&
 			strspn(clde->d_name, "0123456789ABCDEF") == len)
 		{
 			segno = (int) strtol(clde->d_name, NULL, 16);
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index d51cca4..c541156 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -14,6 +14,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/commit_ts.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -158,9 +159,10 @@ GetNewTransactionId(bool isSubXact)
 	 * XID before we zero the page.  Fortunately, a page of the commit log
 	 * holds 32K or more transactions, so we don't have to do this very often.
 	 *
-	 * Extend pg_subtrans too.
+	 * Extend pg_subtrans and pg_commit_ts too.
 	 */
 	ExtendCLOG(xid);
+	ExtendCommitTs(xid);
 	ExtendSUBTRANS(xid);
 
 	/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 763e9de..8b2f714 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -20,6 +20,7 @@
 #include <time.h>
 #include <unistd.h>
 
+#include "access/commit_ts.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1135,6 +1136,21 @@ RecordTransactionCommit(void)
 	}
 
 	/*
+	 * We only need to log the commit timestamp separately if the node
+	 * identifier is a valid value; the commit record above already contains
+	 * the timestamp info otherwise, and will be used to load it.
+	 */
+	if (markXidCommitted)
+	{
+		CommitTsNodeId		node_id;
+
+		node_id = CommitTsGetDefaultNodeId();
+		TransactionTreeSetCommitTsData(xid, nchildren, children,
+									   xactStopTimestamp,
+									   node_id, node_id != InvalidCommitTsNodeId);
+	}
+
+	/*
 	 * Check if we want to commit asynchronously.  We can allow the XLOG flush
 	 * to happen asynchronously if synchronous_commit=off, or if the current
 	 * transaction has not performed any WAL-logged operation.  The latter
@@ -4644,6 +4660,7 @@ xactGetCommittedChildren(TransactionId **ptr)
  */
 static void
 xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
+						  TimestampTz commit_time,
 						  TransactionId *sub_xids, int nsubxacts,
 						  SharedInvalidationMessage *inval_msgs, int nmsgs,
 						  RelFileNode *xnodes, int nrels,
@@ -4671,6 +4688,10 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
 		LWLockRelease(XidGenLock);
 	}
 
+	/* Set the transaction commit timestamp and metadata */
+	TransactionTreeSetCommitTsData(xid, nsubxacts, sub_xids,
+								   commit_time, InvalidCommitTsNodeId, false);
+
 	if (standbyState == STANDBY_DISABLED)
 	{
 		/*
@@ -4790,7 +4811,8 @@ xact_redo_commit(xl_xact_commit *xlrec,
 	/* invalidation messages array follows subxids */
 	inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]);
 
-	xact_redo_commit_internal(xid, lsn, subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  subxacts, xlrec->nsubxacts,
 							  inval_msgs, xlrec->nmsgs,
 							  xlrec->xnodes, xlrec->nrels,
 							  xlrec->dbId,
@@ -4805,7 +4827,8 @@ static void
 xact_redo_commit_compact(xl_xact_commit_compact *xlrec,
 						 TransactionId xid, XLogRecPtr lsn)
 {
-	xact_redo_commit_internal(xid, lsn, xlrec->subxacts, xlrec->nsubxacts,
+	xact_redo_commit_internal(xid, lsn, xlrec->xact_time,
+							  xlrec->subxacts, xlrec->nsubxacts,
 							  NULL, 0,	/* inval msgs */
 							  NULL, 0,	/* relfilenodes */
 							  InvalidOid,		/* dbId */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a2ad5eb..afea3c3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -22,6 +22,7 @@
 #include <unistd.h>
 
 #include "access/clog.h"
+#include "access/commit_ts.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -4518,6 +4519,8 @@ BootStrapXLOG(void)
 	checkPoint.oldestXidDB = TemplateDbOid;
 	checkPoint.oldestMulti = FirstMultiXactId;
 	checkPoint.oldestMultiDB = TemplateDbOid;
+	checkPoint.oldestCommitTs = InvalidTransactionId;
+	checkPoint.newestCommitTs = InvalidTransactionId;
 	checkPoint.time = (pg_time_t) time(NULL);
 	checkPoint.oldestActiveXid = InvalidTransactionId;
 
@@ -4527,6 +4530,7 @@ BootStrapXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
 
 	/* Set up the XLOG page header */
 	page->xlp_magic = XLOG_PAGE_MAGIC;
@@ -4606,6 +4610,7 @@ BootStrapXLOG(void)
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
+	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
@@ -4614,6 +4619,7 @@ BootStrapXLOG(void)
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
+	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
 
@@ -5920,6 +5926,10 @@ StartupXLOG(void)
 	ereport(DEBUG1,
 			(errmsg("oldest MultiXactId: %u, in database %u",
 					checkPoint.oldestMulti, checkPoint.oldestMultiDB)));
+	ereport(DEBUG1,
+			(errmsg("commit timestamp Xid oldest/newest: %u/%u",
+					checkPoint.oldestCommitTs,
+					checkPoint.newestCommitTs)));
 	if (!TransactionIdIsNormal(checkPoint.nextXid))
 		ereport(PANIC,
 				(errmsg("invalid next transaction ID")));
@@ -5931,6 +5941,8 @@ StartupXLOG(void)
 	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
 	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
+	SetCommitTsLimit(checkPoint.oldestCommitTs,
+					 checkPoint.newestCommitTs);
 	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
 	XLogCtl->ckptXid = checkPoint.nextXid;
@@ -6153,11 +6165,12 @@ StartupXLOG(void)
 			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
 
 			/*
-			 * Startup commit log and subtrans only. MultiXact has already
-			 * been started up and other SLRUs are not maintained during
-			 * recovery and need not be started yet.
+			 * Startup commit log, commit timestamp and subtrans only.
+			 * MultiXact has already been started up and other SLRUs are not
+			 * maintained during recovery and need not be started yet.
 			 */
 			StartupCLOG();
+			StartupCommitTs();
 			StartupSUBTRANS(oldestActiveXID);
 
 			/*
@@ -6827,12 +6840,13 @@ StartupXLOG(void)
 	LWLockRelease(ProcArrayLock);
 
 	/*
-	 * Start up the commit log and subtrans, if not already done for hot
-	 * standby.
+	 * Start up the commit log, commit timestamp and subtrans, if not already
+	 * done for hot standby.
 	 */
 	if (standbyState == STANDBY_DISABLED)
 	{
 		StartupCLOG();
+		StartupCommitTs();
 		StartupSUBTRANS(oldestActiveXID);
 	}
 
@@ -6868,6 +6882,12 @@ StartupXLOG(void)
 	XLogReportParameters();
 
 	/*
+	 * Local WAL inserts enabled, so it's time to finish initialization
+	 * of commit timestamp.
+	 */
+	CompleteCommitTsInitialization();
+
+	/*
 	 * All done.  Allow backends to write WAL.  (Although the bool flag is
 	 * probably atomic in itself, we use the info_lck here to ensure that
 	 * there are no race conditions concerning visibility of other recent
@@ -7433,6 +7453,7 @@ ShutdownXLOG(int code, Datum arg)
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
 	ShutdownCLOG();
+	ShutdownCommitTs();
 	ShutdownSUBTRANS();
 	ShutdownMultiXact();
 
@@ -7769,6 +7790,11 @@ CreateCheckPoint(int flags)
 	checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
 	LWLockRelease(XidGenLock);
 
+	LWLockAcquire(CommitTsControlLock, LW_SHARED);
+	checkPoint.oldestCommitTs = ShmemVariableCache->oldestCommitTs;
+	checkPoint.newestCommitTs = ShmemVariableCache->newestCommitTs;
+	LWLockRelease(CommitTsControlLock);
+
 	/* Increase XID epoch if we've wrapped around since last checkpoint */
 	checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
 	if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
@@ -8046,6 +8072,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointCLOG();
+	CheckPointCommitTs();
 	CheckPointSUBTRANS();
 	CheckPointMultiXact();
 	CheckPointPredicate();
@@ -8474,7 +8501,8 @@ XLogReportParameters(void)
 		MaxConnections != ControlFile->MaxConnections ||
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
-		max_locks_per_xact != ControlFile->max_locks_per_xact)
+		max_locks_per_xact != ControlFile->max_locks_per_xact ||
+		track_commit_timestamp != ControlFile->track_commit_timestamp)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -8494,6 +8522,7 @@ XLogReportParameters(void)
 			xlrec.max_locks_per_xact = max_locks_per_xact;
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
+			xlrec.track_commit_timestamp = track_commit_timestamp;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -8508,6 +8537,7 @@ XLogReportParameters(void)
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
+		ControlFile->track_commit_timestamp = track_commit_timestamp;
 		UpdateControlFile();
 	}
 }
@@ -8884,6 +8914,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
+		ControlFile->track_commit_timestamp = track_commit_timestamp;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index f1bf728..f3d610f 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -299,7 +299,7 @@ XLogRegisterBlock(uint8 block_id, RelFileNode *rnode, ForkNumber forknum,
  * Add data to the WAL record that's being constructed.
  *
  * The data is appended to the "main chunk", available at replay with
- * XLogGetRecData().
+ * XLogRecGetData().
  */
 void
 XLogRegisterData(char *data, int len)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6384dc7..e32e039 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -23,6 +23,7 @@
 #include <math.h>
 
 #include "access/clog.h"
+#include "access/commit_ts.h"
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -1071,10 +1072,12 @@ vac_truncate_clog(TransactionId frozenXID,
 		return;
 
 	/*
-	 * Truncate CLOG to the oldest computed value.  Note we don't truncate
-	 * multixacts; that will be done by the next checkpoint.
+	 * Truncate CLOG and CommitTs to the oldest computed value.
+	 * Note we don't truncate multixacts; that will be done by the next
+	 * checkpoint.
 	 */
 	TruncateCLOG(frozenXID);
+	TruncateCommitTs(frozenXID);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
@@ -1084,6 +1087,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
 	SetMultiXactIdLimit(minMulti, minmulti_datoid);
+	AdvanceOldestCommitTs(frozenXID);
 }
 
 
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 800dcd9..d43c8ff 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -1438,7 +1438,7 @@ parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline, int line_num)
 				ereport(LOG,
 						(errcode(ERRCODE_CONFIG_FILE_ERROR),
 						 errmsg("client certificates can only be checked if a root certificate store is available"),
-						 errhint("Make sure the configuration parameter \"ssl_ca_file\" is set."),
+						 errhint("Make sure the configuration parameter \"%s\" is set.", "ssl_ca_file"),
 						 errcontext("line %d of configuration file \"%s\"",
 									line_num, HbaFileName)));
 				return false;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 4e81322..a455b92 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -133,6 +133,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SEQ_ID:
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
+		case RM_COMMIT_TS_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 1d04c55..b9577cd 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/commit_ts.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -117,6 +118,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
 		size = add_size(size, CLOGShmemSize());
+		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
 		size = add_size(size, TwoPhaseShmemSize());
 		size = add_size(size, BackgroundWorkerShmemSize());
@@ -198,6 +200,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	 */
 	XLOGShmemInit();
 	CLOGShmemInit();
+	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
 	InitBufferPool();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 719181c..c9f8657 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/clog.h"
+#include "access/commit_ts.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "commands/async.h"
@@ -259,6 +260,9 @@ NumLWLocks(void)
 	/* clog.c needs one per CLOG buffer */
 	numLocks += CLOGShmemBuffers();
 
+	/* commit_ts.c needs one per CommitTs buffer */
+	numLocks += CommitTsShmemBuffers();
+
 	/* subtrans.c needs one per SubTrans buffer */
 	numLocks += NUM_SUBTRANS_BUFFERS;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d4d74ba..b1bff7f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -26,6 +26,7 @@
 #include <syslog.h>
 #endif
 
+#include "access/commit_ts.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -836,6 +837,15 @@ static struct config_bool ConfigureNamesBool[] =
 		check_bonjour, NULL, NULL
 	},
 	{
+		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
+			gettext_noop("Collects transaction commit time."),
+			NULL
+		},
+		&track_commit_timestamp,
+		false,
+		NULL, NULL, NULL
+	},
+	{
 		{"ssl", PGC_POSTMASTER, CONN_AUTH_SECURITY,
 			gettext_noop("Enables SSL connections."),
 			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4a89cb7..c4b546e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -228,6 +228,8 @@
 
 #max_replication_slots = 0	# max number of replication slots
 				# (change requires restart)
+#track_commit_timestamp = off	# collect timestamp of transaction commit
+				# (change requires restart)
 
 # - Master Server -
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3b52867..3bee657 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -186,6 +186,7 @@ static const char *subdirs[] = {
 	"pg_xlog",
 	"pg_xlog/archive_status",
 	"pg_clog",
+	"pg_commit_ts",
 	"pg_dynshmem",
 	"pg_notify",
 	"pg_serial",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index b2e0793..a838bb5 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -270,6 +270,8 @@ main(int argc, char *argv[])
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
@@ -300,6 +302,8 @@ main(int argc, char *argv[])
 		   ControlFile.max_prepared_xacts);
 	printf(_("Current max_locks_per_xact setting:   %d\n"),
 		   ControlFile.max_locks_per_xact);
+	printf(_("Current track_commit_timestamp setting: %s\n"),
+		   ControlFile.track_commit_timestamp ? _("on") : _("off"));
 	printf(_("Maximum data alignment:               %u\n"),
 		   ControlFile.maxAlign);
 	/* we don't print floatFormat since can't say much useful about it */
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 666e8db..8f67c18 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -63,6 +63,7 @@ static bool guessed = false;	/* T if we had to guess at any values */
 static const char *progname;
 static uint32 set_xid_epoch = (uint32) -1;
 static TransactionId set_xid = 0;
+static TransactionId set_commit_ts = 0;
 static Oid	set_oid = 0;
 static MultiXactId set_mxid = 0;
 static MultiXactOffset set_mxoff = (MultiXactOffset) -1;
@@ -112,7 +113,7 @@ main(int argc, char *argv[])
 	}
 
 
-	while ((c = getopt(argc, argv, "D:fl:m:no:O:x:e:")) != -1)
+	while ((c = getopt(argc, argv, "c:D:e:fl:m:no:O:x:")) != -1)
 	{
 		switch (c)
 		{
@@ -158,6 +159,21 @@ main(int argc, char *argv[])
 				}
 				break;
 
+			case 'c':
+				set_commit_ts = strtoul(optarg, &endptr, 0);
+				if (endptr == optarg || *endptr != '\0')
+				{
+					fprintf(stderr, _("%s: invalid argument for option -c\n"), progname);
+					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+					exit(1);
+				}
+				if (set_commit_ts == 0)
+				{
+					fprintf(stderr, _("%s: transaction ID (-c) must not be 0\n"), progname);
+					exit(1);
+				}
+				break;
+
 			case 'o':
 				set_oid = strtoul(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != '\0')
@@ -345,6 +361,9 @@ main(int argc, char *argv[])
 		ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
 	}
 
+	if (set_commit_ts != 0)
+		ControlFile.checkPointCopy.oldestCommitTs = set_commit_ts;
+
 	if (set_oid != 0)
 		ControlFile.checkPointCopy.nextOid = set_oid;
 
@@ -539,6 +558,7 @@ GuessControlValues(void)
 
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
+	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
@@ -621,6 +641,8 @@ PrintControlValues(bool guessed)
 		   ControlFile.checkPointCopy.oldestMulti);
 	printf(_("Latest checkpoint's oldestMulti's DB: %u\n"),
 		   ControlFile.checkPointCopy.oldestMultiDB);
+	printf(_("Latest checkpoint's oldestCommitTs:   %u\n"),
+		   ControlFile.checkPointCopy.oldestCommitTs);
 	printf(_("Maximum data alignment:               %u\n"),
 		   ControlFile.maxAlign);
 	/* we don't print floatFormat since can't say much useful about it */
@@ -702,6 +724,12 @@ PrintNewControlValues()
 		printf(_("NextXID epoch:                        %u\n"),
 			   ControlFile.checkPointCopy.nextXidEpoch);
 	}
+
+	if (set_commit_ts != 0)
+	{
+		printf(_("oldestCommitTs:                       %u\n"),
+			   ControlFile.checkPointCopy.oldestCommitTs);
+	}
 }
 
 
@@ -739,6 +767,7 @@ RewriteControlFile(void)
 	 */
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
+	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
@@ -1099,6 +1128,7 @@ usage(void)
 	printf(_("%s resets the PostgreSQL transaction log.\n\n"), progname);
 	printf(_("Usage:\n  %s [OPTION]... {[-D] DATADIR}\n\n"), progname);
 	printf(_("Options:\n"));
+	printf(_("  -c XID           set the oldest transaction with retrievable commit timestamp\n"));
 	printf(_("  -e XIDEPOCH      set next transaction ID epoch\n"));
 	printf(_("  -f               force update to be done\n"));
 	printf(_("  -l XLOGFILE      force minimum WAL starting location for new transaction log\n"));
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
new file mode 100644
index 0000000..903c82c
--- /dev/null
+++ b/src/include/access/commit_ts.h
@@ -0,0 +1,72 @@
+/*
+ * commit_ts.h
+ *
+ * PostgreSQL commit timestamp manager
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/commit_ts.h
+ */
+#ifndef COMMIT_TS_H
+#define COMMIT_TS_H
+
+#include "access/xlog.h"
+#include "datatype/timestamp.h"
+#include "utils/guc.h"
+
+
+extern PGDLLIMPORT bool	track_commit_timestamp;
+
+extern bool check_track_commit_timestamp(bool *newval, void **extra,
+							 GucSource source);
+
+typedef uint32 CommitTsNodeId;
+#define InvalidCommitTsNodeId 0
+
+extern void CommitTsSetDefaultNodeId(CommitTsNodeId nodeid);
+extern CommitTsNodeId CommitTsGetDefaultNodeId(void);
+extern void TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
+							   TransactionId *subxids, TimestampTz timestamp,
+							   CommitTsNodeId nodeid, bool do_xlog);
+extern bool TransactionIdGetCommitTsData(TransactionId xid,
+							 TimestampTz *ts, CommitTsNodeId *nodeid);
+extern TransactionId GetLatestCommitTsData(TimestampTz *ts,
+					  CommitTsNodeId *nodeid);
+
+extern Size CommitTsShmemBuffers(void);
+extern Size CommitTsShmemSize(void);
+extern void CommitTsShmemInit(void);
+extern void BootStrapCommitTs(void);
+extern void StartupCommitTs(void);
+extern void CompleteCommitTsInitialization(void);
+extern void ShutdownCommitTs(void);
+extern void CheckPointCommitTs(void);
+extern void ExtendCommitTs(TransactionId newestXact);
+extern void TruncateCommitTs(TransactionId oldestXact);
+extern void SetCommitTsLimit(TransactionId oldestXact,
+				 TransactionId newestXact);
+extern void AdvanceOldestCommitTs(TransactionId oldestXact);
+
+/* XLOG stuff */
+#define COMMIT_TS_ZEROPAGE		0x00
+#define COMMIT_TS_TRUNCATE		0x10
+#define COMMIT_TS_SETTS			0x20
+
+typedef struct xl_commit_ts_set
+{
+	TimestampTz		timestamp;
+	CommitTsNodeId	nodeid;
+	TransactionId	mainxid;
+	/* subxact Xids follow */
+} xl_commit_ts_set;
+
+#define SizeOfCommitTsSet	(offsetof(xl_commit_ts_set, mainxid) + \
+							 sizeof(TransactionId))
+
+
+extern void commit_ts_redo(XLogReaderState *record);
+extern void commit_ts_desc(StringInfo buf, XLogReaderState *record);
+extern const char *commit_ts_identify(uint8 info);
+
+#endif   /* COMMITTS_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 76a6421..27168c3 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -24,7 +24,7 @@
  * Changes to this list possibly need a XLOG_PAGE_MAGIC bump.
  */
 
-/* symbol name, textual name, redo, desc, startup, cleanup */
+/* symbol name, textual name, redo, desc, identify, startup, cleanup */
 PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
 PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
 PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
@@ -43,3 +43,4 @@ PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_start
 PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 32d1b29..6666434 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -124,6 +124,12 @@ typedef struct VariableCacheData
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
 
 	/*
+	 * These fields are protected by CommitTsLock
+	 */
+	TransactionId oldestCommitTs;
+	TransactionId newestCommitTs;
+
+	/*
 	 * These fields are protected by ProcArrayLock.
 	 */
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 85b3fe7..825cf54 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -186,6 +186,7 @@ typedef struct xl_parameter_change
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
+	bool		track_commit_timestamp;
 } xl_parameter_change;
 
 /* logs restore point */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 8093260..6c1d650 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	201411281
+#define CATALOG_VERSION_NO	201412011
 
 #endif
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 15f81e4..6e9cac9 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -46,6 +46,8 @@ typedef struct CheckPoint
 	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
 	Oid			oldestMultiDB;	/* database with minimum datminmxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
+	TransactionId oldestCommitTs; /* oldest Xid with valid commit timestamp */
+	TransactionId newestCommitTs; /* newest Xid with valid commit timestamp */
 
 	/*
 	 * Oldest XID still running. This is only needed to initialize hot standby
@@ -177,6 +179,7 @@ typedef struct ControlFileData
 	int			max_worker_processes;
 	int			max_prepared_xacts;
 	int			max_locks_per_xact;
+	bool		track_commit_timestamp;
 
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 56399ac..d0b4709 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3023,6 +3023,12 @@ DESCR("view two-phase transactions");
 DATA(insert OID = 3819 (  pg_get_multixact_members PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 2249 "28" "{28,28,25}" "{i,o,o}" "{multixid,xid,mode}" _null_ pg_get_multixact_members _null_ _null_ _null_ ));
 DESCR("view members of a multixactid");
 
+DATA(insert OID = 3581 ( pg_xact_commit_timestamp PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 1184 "28" _null_ _null_ _null_ _null_ pg_xact_commit_timestamp _null_ _null_ _null_ ));
+DESCR("get commit timestamp of a transaction");
+
+DATA(insert OID = 3583 ( pg_last_committed_xact PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 2249 "" "{28,1184}" "{o,o}" "{xid,timestamp}" _null_ pg_last_committed_xact _null_ _null_ _null_ ));
+DESCR("get transaction Id and commit timestamp of latest transaction commit");
+
 DATA(insert OID = 3537 (  pg_describe_object		PGNSP PGUID 12 1 0 0 0 f f f f t f s 3 0 25 "26 26 23" _null_ _null_ _null_ _null_ pg_describe_object _null_ _null_ _null_ ));
 DESCR("get identification of SQL object");
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 91cab87..09654a8 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -127,7 +127,10 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define AutoFileLock				(&MainLWLockArray[35].lock)
 #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
 #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
-#define NUM_INDIVIDUAL_LWLOCKS		38
+#define CommitTsControlLock			(&MainLWLockArray[38].lock)
+#define CommitTsLock				(&MainLWLockArray[39].lock)
+
+#define NUM_INDIVIDUAL_LWLOCKS		40
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 417fd17..565cff3 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1187,6 +1187,10 @@ extern Datum pg_prepared_xact(PG_FUNCTION_ARGS);
 /* access/transam/multixact.c */
 extern Datum pg_get_multixact_members(PG_FUNCTION_ARGS);
 
+/* access/transam/committs.c */
+extern Datum pg_xact_commit_timestamp(PG_FUNCTION_ARGS);
+extern Datum pg_last_committed_xact(PG_FUNCTION_ARGS);
+
 /* catalogs/dependency.c */
 extern Datum pg_describe_object(PG_FUNCTION_ARGS);
 extern Datum pg_identify_object(PG_FUNCTION_ARGS);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9d5aa97..5f1cbb0 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -5,6 +5,7 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS = \
+		  commit_ts \
 		  worker_spi \
 		  dummy_seclabel \
 		  test_shm_mq \
diff --git a/src/test/modules/commit_ts/.gitignore b/src/test/modules/commit_ts/.gitignore
new file mode 100644
index 0000000..1f95503
--- /dev/null
+++ b/src/test/modules/commit_ts/.gitignore
@@ -0,0 +1,5 @@
+# Generated subdirectories
+/log/
+/isolation_output/
+/regression_output/
+/tmp_check/
diff --git a/src/test/modules/commit_ts/Makefile b/src/test/modules/commit_ts/Makefile
new file mode 100644
index 0000000..b3cb315
--- /dev/null
+++ b/src/test/modules/commit_ts/Makefile
@@ -0,0 +1,15 @@
+# src/test/modules/commit_ts/Makefile
+
+REGRESS = commit_timestamp
+REGRESS_OPTS = --temp-config=$(top_srcdir)/src/test/modules/commit_ts/commit_ts.conf
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/commit_ts
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/commit_ts/commit_ts.conf b/src/test/modules/commit_ts/commit_ts.conf
new file mode 100644
index 0000000..d221a60
--- /dev/null
+++ b/src/test/modules/commit_ts/commit_ts.conf
@@ -0,0 +1 @@
+track_commit_timestamp = on
\ No newline at end of file
diff --git a/src/test/modules/commit_ts/expected/commit_timestamp.out b/src/test/modules/commit_ts/expected/commit_timestamp.out
new file mode 100644
index 0000000..c1d24c5
--- /dev/null
+++ b/src/test/modules/commit_ts/expected/commit_timestamp.out
@@ -0,0 +1,39 @@
+--
+-- Commit Timestamp
+--
+SHOW track_commit_timestamp;
+ track_commit_timestamp 
+------------------------
+ on
+(1 row)
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | ?column? | ?column? | ?column? 
+----+----------+----------+----------
+  1 | t        | t        | t
+  2 | t        | t        | t
+  3 | t        | t        | t
+(3 rows)
+
+DROP TABLE committs_test;
+SELECT pg_xact_commit_timestamp('0'::xid);
+ pg_xact_commit_timestamp 
+--------------------------
+ 
+(1 row)
+
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
+ ?column? | ?column? | ?column? 
+----------+----------+----------
+ t        | t        | t
+(1 row)
+
diff --git a/src/test/modules/commit_ts/expected/commit_timestamp_1.out b/src/test/modules/commit_ts/expected/commit_timestamp_1.out
new file mode 100644
index 0000000..60d73e3
--- /dev/null
+++ b/src/test/modules/commit_ts/expected/commit_timestamp_1.out
@@ -0,0 +1,28 @@
+--
+-- Commit Timestamp
+--
+SHOW track_commit_timestamp;
+ track_commit_timestamp 
+------------------------
+ off
+(1 row)
+
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ERROR:  could not get commit timestamp data
+HINT:  Make sure the configuration parameter "track_commit_timestamp" is set.
+DROP TABLE committs_test;
+SELECT pg_xact_commit_timestamp('0'::xid);
+ERROR:  could not get commit timestamp data
+HINT:  Make sure the configuration parameter "track_commit_timestamp" is set.
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
+ERROR:  could not get commit timestamp data
+HINT:  Make sure the configuration parameter "track_commit_timestamp" is set.
diff --git a/src/test/modules/commit_ts/sql/commit_timestamp.sql b/src/test/modules/commit_ts/sql/commit_timestamp.sql
new file mode 100644
index 0000000..acd6de0
--- /dev/null
+++ b/src/test/modules/commit_ts/sql/commit_timestamp.sql
@@ -0,0 +1,22 @@
+--
+-- Commit Timestamp
+--
+SHOW track_commit_timestamp;
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+
+SELECT id,
+       pg_xact_commit_timestamp(xmin) >= ts,
+       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+
+DROP TABLE committs_test;
+
+SELECT pg_xact_commit_timestamp('0'::xid);
+
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
#132Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#131)
Re: tracking commit timestamps

Pushed with some extra cosmetic tweaks.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#133Petr Jelinek
petr@2ndquadrant.com
In reply to: Alvaro Herrera (#132)
Re: tracking commit timestamps

On 03/12/14 15:54, Alvaro Herrera wrote:

Pushed with some extra cosmetic tweaks.

Cool, thanks!

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#134Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#131)
Re: tracking commit timestamps

On Mon, Dec 1, 2014 at 5:34 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I made two more changes:
1. introduce newestCommitTs. Original code was using lastCommitXact to
check that no "future" transaction is asked for, but this doesn't really
work if a long-running transaction is committed, because asking for
transactions with a higher Xid but which were committed earlier would
raise an error.

I'm kind of disappointed that, in spite of previous review comments,
this got committed with extensive use of the CommitTs naming. I think
that's confusing, but it's also something that will be awkward if we
want to add other data, such as the much-discussed commit LSN, to the
facility.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#135Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#134)
Re: tracking commit timestamps

Robert Haas wrote:

On Mon, Dec 1, 2014 at 5:34 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I made two more changes:
1. introduce newestCommitTs. Original code was using lastCommitXact to
check that no "future" transaction is asked for, but this doesn't really
work if a long-running transaction is committed, because asking for
transactions with a higher Xid but which were committed earlier would
raise an error.

I'm kind of disappointed that, in spite of previous review comments,
this got committed with extensive use of the CommitTs naming. I think
that's confusing, but it's also something that will be awkward if we
want to add other data, such as the much-discussed commit LSN, to the
facility.

I never saw a comment that CommitTs was an unwanted name. There were
some that said that committs wasn't liked because it looked like a
misspelling, so we added an underscore -- stuff in lower case is
commit_ts everywhere. Stuff in camel case didn't get the underscore
because it didn't seem necessary. But other than that issue, the name
wasn't questioned, as far as I'm aware.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#136Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#135)
Re: tracking commit timestamps

Alvaro Herrera wrote:

Robert Haas wrote:

I'm kind of disappointed that, in spite of previous review comments,
this got committed with extensive use of the CommitTs naming. I think
that's confusing, but it's also something that will be awkward if we
want to add other data, such as the much-discussed commit LSN, to the
facility.

I never saw a comment that CommitTs was an unwanted name. There were
some that said that committs wasn't liked because it looked like a
misspelling, so we added an underscore -- stuff in lower case is
commit_ts everywhere. Stuff in camel case didn't get the underscore
because it didn't seem necessary. But other than that issue, the name
wasn't questioned, as far as I'm aware.

I found one email where you said you didn't like committs and preferred
commit_timestamp instead. I don't see how making that change would have
made you happy wrt the concern you just expressed.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#137Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#136)
Re: tracking commit timestamps

On Wed, Dec 3, 2014 at 2:36 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Alvaro Herrera wrote:

Robert Haas wrote:

I'm kind of disappointed that, in spite of previous review comments,
this got committed with extensive use of the CommitTs naming. I think
that's confusing, but it's also something that will be awkward if we
want to add other data, such as the much-discussed commit LSN, to the
facility.

I never saw a comment that CommitTs was an unwanted name. There were
some that said that committs wasn't liked because it looked like a
misspelling, so we added an underscore -- stuff in lower case is
commit_ts everywhere. Stuff in camel case didn't get the underscore
because it didn't seem necessary. But other than that issue, the name
wasn't questioned, as far as I'm aware.

I found one email where you said you didn't like committs and preferred
commit_timestamp instead. I don't see how making that change would have
made you happy wrt the concern you just expressed.

Fair point.

I'm still not sure we got this one right, but I don't know that I want
to spend more time wrangling about it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#138Fujii Masao
masao.fujii@gmail.com
In reply to: Alvaro Herrera (#132)
Re: tracking commit timestamps

On Wed, Dec 3, 2014 at 11:54 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Pushed with some extra cosmetic tweaks.

I got the following assertion failure when I executed pg_xact_commit_timestamp()
in the standby server.

=# select pg_xact_commit_timestamp('1000'::xid);
TRAP: FailedAssertion("!(((oldestCommitTs) != ((TransactionId) 0)) ==
((newestCommitTs) != ((TransactionId) 0)))", File: "commit_ts.c",
Line: 315)
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2014-12-04
12:01:08 JST sby1 LOG: server process (PID 15545) was terminated by
signal 6: Aborted
2014-12-04 12:01:08 JST sby1 DETAIL: Failed process was running:
select pg_xact_commit_timestamp('1000'::xid);

The way to reproduce this problem is

#1. set up and start the master and standby servers with
track_commit_timestamp disabled
#2. enable track_commit_timestamp in the master and restart the master
#3. run some write transactions
#4. enable track_commit_timestamp in the standby and restart the standby
#5. execute "select pg_xact_commit_timestamp('1000'::xid)" in the standby

BTW, at the step #4, I got the following log messages. This might be a hint for
this problem.

LOG: file "pg_commit_ts/0000" doesn't exist, reading as zeroes
CONTEXT: xlog redo Transaction/COMMIT: 2014-12-04 12:00:16.428702+09;
inval msgs: catcache 59 catcache 58 catcache 59 catcache 58 catcache
45 catcache 44 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608
relcache 16384

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#139Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#138)
Re: tracking commit timestamps

On 4 December 2014 at 03:08, Fujii Masao <masao.fujii@gmail.com> wrote:

#1. set up and start the master and standby servers with
track_commit_timestamp disabled
#2. enable track_commit_timestamp in the master and restart the master
#3. run some write transactions
#4. enable track_commit_timestamp in the standby and restart the standby
#5. execute "select pg_xact_commit_timestamp('1000'::xid)" in the standby

I'm not sure what step4 is supposed to do?

Surely if steps 1-3 generate any WAL then the standby should replay
it, whether or not track_commit_timestamp is enabled.

So what effect does setting that parameter on the standby?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#140Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#139)
Re: tracking commit timestamps

On Thu, Dec 4, 2014 at 12:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 4 December 2014 at 03:08, Fujii Masao <masao.fujii@gmail.com> wrote:

#1. set up and start the master and standby servers with
track_commit_timestamp disabled
#2. enable track_commit_timestamp in the master and restart the master
#3. run some write transactions
#4. enable track_commit_timestamp in the standby and restart the standby
#5. execute "select pg_xact_commit_timestamp('1000'::xid)" in the standby

I'm not sure what step4 is supposed to do?

Surely if steps 1-3 generate any WAL then the standby should replay
it, whether or not track_commit_timestamp is enabled.

So what effect does setting that parameter on the standby?

At least track_commit_timestamp seems to need to be enabled even in the standby
when we want to call pg_xact_commit_timestamp() and pg_last_committed_xact()
in the standby. I'm not sure if this is good design, though.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#141Noah Misch
noah@leadboat.com
In reply to: Alvaro Herrera (#132)
1 attachment(s)
Re: tracking commit timestamps

On Wed, Dec 03, 2014 at 11:54:38AM -0300, Alvaro Herrera wrote:

Pushed with some extra cosmetic tweaks.

The commit_ts test suite gives me the attached diff on a 32-bit MinGW build
running on 64-bit Windows Server 2003. I have not checked other Windows
configurations; the suite does pass on GNU/Linux.

Attachments:

commit_ts-regression.diffstext/plain; charset=us-asciiDownload
*** Z:/nm/postgresql/src/test/modules/commit_ts/expected/commit_timestamp.out	2014-12-05 05:43:01.074420000 +0000
--- Z:/nm/postgresql/src/test/modules/commit_ts/results/commit_timestamp.out	2014-12-05 08:24:13.094705200 +0000
***************
*** 19,27 ****
  ORDER BY id;
   id | ?column? | ?column? | ?column? 
  ----+----------+----------+----------
!   1 | t        | t        | t
!   2 | t        | t        | t
!   3 | t        | t        | t
  (3 rows)
  
  DROP TABLE committs_test;
--- 19,27 ----
  ORDER BY id;
   id | ?column? | ?column? | ?column? 
  ----+----------+----------+----------
!   1 | t        | f        | t
!   2 | t        | f        | t
!   3 | t        | f        | t
  (3 rows)
  
  DROP TABLE committs_test;
***************
*** 34,39 ****
  SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
   ?column? | ?column? | ?column? 
  ----------+----------+----------
!  t        | t        | t
  (1 row)
  
--- 34,39 ----
  SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
   ?column? | ?column? | ?column? 
  ----------+----------+----------
!  t        | t        | f
  (1 row)
  

======================================================================

#142Petr Jelinek
petr@2ndquadrant.com
In reply to: Noah Misch (#141)
Re: tracking commit timestamps

On 08/12/14 00:56, Noah Misch wrote:

On Wed, Dec 03, 2014 at 11:54:38AM -0300, Alvaro Herrera wrote:

Pushed with some extra cosmetic tweaks.

The commit_ts test suite gives me the attached diff on a 32-bit MinGW build
running on 64-bit Windows Server 2003. I have not checked other Windows
configurations; the suite does pass on GNU/Linux.

Hmm I wonder if "< now()" needs to be changed to "<= now()" in those
queries to make them work correctly on that plarform, I don't have
machine with that environment handy right now, so I would appreciate if
you could try that, in case you don't have time for that, I will try to
setup something later...

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#143Noah Misch
noah@leadboat.com
In reply to: Petr Jelinek (#142)
Re: tracking commit timestamps

On Mon, Dec 08, 2014 at 02:23:39AM +0100, Petr Jelinek wrote:

On 08/12/14 00:56, Noah Misch wrote:

The commit_ts test suite gives me the attached diff on a 32-bit MinGW build
running on 64-bit Windows Server 2003. I have not checked other Windows
configurations; the suite does pass on GNU/Linux.

Hmm I wonder if "< now()" needs to be changed to "<= now()" in those queries
to make them work correctly on that plarform, I don't have machine with that
environment handy right now, so I would appreciate if you could try that, in
case you don't have time for that, I will try to setup something later...

I will try that, though perhaps not until next week.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#144Michael Paquier
michael.paquier@gmail.com
In reply to: Noah Misch (#143)
1 attachment(s)
Re: tracking commit timestamps

On Wed, Dec 10, 2014 at 6:50 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 08, 2014 at 02:23:39AM +0100, Petr Jelinek wrote:

On 08/12/14 00:56, Noah Misch wrote:

The commit_ts test suite gives me the attached diff on a 32-bit MinGW build
running on 64-bit Windows Server 2003. I have not checked other Windows
configurations; the suite does pass on GNU/Linux.

Hmm I wonder if "< now()" needs to be changed to "<= now()" in those queries
to make them work correctly on that plarform, I don't have machine with that
environment handy right now, so I would appreciate if you could try that, in
case you don't have time for that, I will try to setup something later...

I will try that, though perhaps not until next week.

FWIW, I just tried that with MinGW-32 and I can see the error on Win7.
I also checked that changing "< now()" to "<= now()" fixed the
problem, so your assumption was right, Petr.
Regards,
--
Michael

Attachments:

20141215_committs_mingw_fix.patchapplication/octet-stream; name=20141215_committs_mingw_fix.patchDownload
diff --git a/src/test/modules/commit_ts/expected/commit_timestamp.out b/src/test/modules/commit_ts/expected/commit_timestamp.out
index e40e28c..99f3322 100644
--- a/src/test/modules/commit_ts/expected/commit_timestamp.out
+++ b/src/test/modules/commit_ts/expected/commit_timestamp.out
@@ -13,7 +13,7 @@ INSERT INTO committs_test DEFAULT VALUES;
 INSERT INTO committs_test DEFAULT VALUES;
 SELECT id,
        pg_xact_commit_timestamp(xmin) >= ts,
-       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) <= now(),
        pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
 FROM committs_test
 ORDER BY id;
@@ -31,7 +31,7 @@ SELECT pg_xact_commit_timestamp('1'::xid);
 ERROR:  cannot retrieve commit timestamp for transaction 1
 SELECT pg_xact_commit_timestamp('2'::xid);
 ERROR:  cannot retrieve commit timestamp for transaction 2
-SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp <= now() FROM pg_last_committed_xact() x;
  ?column? | ?column? | ?column? 
 ----------+----------+----------
  t        | t        | t
diff --git a/src/test/modules/commit_ts/sql/commit_timestamp.sql b/src/test/modules/commit_ts/sql/commit_timestamp.sql
index 9beb78a..4e041a5 100644
--- a/src/test/modules/commit_ts/sql/commit_timestamp.sql
+++ b/src/test/modules/commit_ts/sql/commit_timestamp.sql
@@ -10,7 +10,7 @@ INSERT INTO committs_test DEFAULT VALUES;
 
 SELECT id,
        pg_xact_commit_timestamp(xmin) >= ts,
-       pg_xact_commit_timestamp(xmin) < now(),
+       pg_xact_commit_timestamp(xmin) <= now(),
        pg_xact_commit_timestamp(xmin) - ts < '60s' -- 60s should give a lot of reserve
 FROM committs_test
 ORDER BY id;
@@ -21,4 +21,4 @@ SELECT pg_xact_commit_timestamp('0'::xid);
 SELECT pg_xact_commit_timestamp('1'::xid);
 SELECT pg_xact_commit_timestamp('2'::xid);
 
-SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp < now() FROM pg_last_committed_xact() x;
+SELECT x.xid::text::bigint > 0, x.timestamp > '-infinity'::timestamptz, x.timestamp <= now() FROM pg_last_committed_xact() x;
#145Petr Jelinek
petr@2ndquadrant.com
In reply to: Michael Paquier (#144)
Re: tracking commit timestamps

On 15/12/14 09:12, Michael Paquier wrote:

On Wed, Dec 10, 2014 at 6:50 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 08, 2014 at 02:23:39AM +0100, Petr Jelinek wrote:

On 08/12/14 00:56, Noah Misch wrote:

The commit_ts test suite gives me the attached diff on a 32-bit MinGW build
running on 64-bit Windows Server 2003. I have not checked other Windows
configurations; the suite does pass on GNU/Linux.

Hmm I wonder if "< now()" needs to be changed to "<= now()" in those queries
to make them work correctly on that plarform, I don't have machine with that
environment handy right now, so I would appreciate if you could try that, in
case you don't have time for that, I will try to setup something later...

I will try that, though perhaps not until next week.

FWIW, I just tried that with MinGW-32 and I can see the error on Win7.
I also checked that changing "< now()" to "<= now()" fixed the
problem, so your assumption was right, Petr.
Regards,

Cool, thanks, I think it was the time granularity problem in Windows.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#146Noah Misch
noah@leadboat.com
In reply to: Michael Paquier (#144)
Re: tracking commit timestamps

On Mon, Dec 15, 2014 at 12:12:10AM -0800, Michael Paquier wrote:

On Wed, Dec 10, 2014 at 6:50 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 08, 2014 at 02:23:39AM +0100, Petr Jelinek wrote:

On 08/12/14 00:56, Noah Misch wrote:

The commit_ts test suite gives me the attached diff on a 32-bit MinGW build
running on 64-bit Windows Server 2003. I have not checked other Windows
configurations; the suite does pass on GNU/Linux.

Hmm I wonder if "< now()" needs to be changed to "<= now()" in those queries
to make them work correctly on that plarform, I don't have machine with that
environment handy right now, so I would appreciate if you could try that, in
case you don't have time for that, I will try to setup something later...

I will try that, though perhaps not until next week.

FWIW, I just tried that with MinGW-32 and I can see the error on Win7.
I also checked that changing "< now()" to "<= now()" fixed the
problem, so your assumption was right, Petr.

Committed, after fixing the alternate expected output.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#147Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Noah Misch (#146)
Re: tracking commit timestamps

Noah Misch wrote:

On Mon, Dec 15, 2014 at 12:12:10AM -0800, Michael Paquier wrote:

FWIW, I just tried that with MinGW-32 and I can see the error on Win7.
I also checked that changing "< now()" to "<= now()" fixed the
problem, so your assumption was right, Petr.

Committed, after fixing the alternate expected output.

Thanks. I admit I don't understand what the issue is.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#148Noah Misch
noah@leadboat.com
In reply to: Alvaro Herrera (#147)
Re: tracking commit timestamps

On Tue, Dec 16, 2014 at 01:05:31AM -0300, Alvaro Herrera wrote:

Noah Misch wrote:

On Mon, Dec 15, 2014 at 12:12:10AM -0800, Michael Paquier wrote:

FWIW, I just tried that with MinGW-32 and I can see the error on Win7.
I also checked that changing "< now()" to "<= now()" fixed the
problem, so your assumption was right, Petr.

Committed, after fixing the alternate expected output.

Thanks. I admit I don't understand what the issue is.

The test assumed that no two transactions of a given backend will get the same
timestamp value from now(). That holds so long as ticks of the system time
are small enough. Not so on at least some Windows configurations. Notice the
repeated timestamp values:

Windows Server 2003 x64, 32-bit build w/ VS2010

localhost template1=# select clock_timestamp(), pg_sleep(.1 * (n % 2)) from generate_series(0,7) t(n);
clock_timestamp | pg_sleep
-------------------------------+----------
2014-12-18 08:34:34.522126+00 |
2014-12-18 08:34:34.522126+00 |
2014-12-18 08:34:34.631508+00 |
2014-12-18 08:34:34.631508+00 |
2014-12-18 08:34:34.74089+00 |
2014-12-18 08:34:34.74089+00 |
2014-12-18 08:34:34.850272+00 |
2014-12-18 08:34:34.850272+00 |
(8 rows)

GNU/Linux

[local] test=# select clock_timestamp(), pg_sleep(.1 * (n % 2)) from generate_series(0,7) t(n);
clock_timestamp | pg_sleep
-------------------------------+----------
2014-12-19 06:49:47.590556+00 |
2014-12-19 06:49:47.590611+00 |
2014-12-19 06:49:47.691488+00 |
2014-12-19 06:49:47.691508+00 |
2014-12-19 06:49:47.801483+00 |
2014-12-19 06:49:47.801502+00 |
2014-12-19 06:49:47.921486+00 |
2014-12-19 06:49:47.921505+00 |
(8 rows)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#149Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#138)
Re: tracking commit timestamps

On Thu, Dec 4, 2014 at 12:08 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Dec 3, 2014 at 11:54 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Pushed with some extra cosmetic tweaks.

I got the following assertion failure when I executed pg_xact_commit_timestamp()
in the standby server.

=# select pg_xact_commit_timestamp('1000'::xid);
TRAP: FailedAssertion("!(((oldestCommitTs) != ((TransactionId) 0)) ==
((newestCommitTs) != ((TransactionId) 0)))", File: "commit_ts.c",
Line: 315)
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2014-12-04
12:01:08 JST sby1 LOG: server process (PID 15545) was terminated by
signal 6: Aborted
2014-12-04 12:01:08 JST sby1 DETAIL: Failed process was running:
select pg_xact_commit_timestamp('1000'::xid);

The way to reproduce this problem is

#1. set up and start the master and standby servers with
track_commit_timestamp disabled
#2. enable track_commit_timestamp in the master and restart the master
#3. run some write transactions
#4. enable track_commit_timestamp in the standby and restart the standby
#5. execute "select pg_xact_commit_timestamp('1000'::xid)" in the standby

BTW, at the step #4, I got the following log messages. This might be a hint for
this problem.

LOG: file "pg_commit_ts/0000" doesn't exist, reading as zeroes
CONTEXT: xlog redo Transaction/COMMIT: 2014-12-04 12:00:16.428702+09;
inval msgs: catcache 59 catcache 58 catcache 59 catcache 58 catcache
45 catcache 44 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608
relcache 16384

This problem still happens in the master.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#150Craig Ringer
craig@2ndquadrant.com
In reply to: Noah Misch (#148)
Re: tracking commit timestamps

On 12/19/2014 02:53 PM, Noah Misch wrote:

The test assumed that no two transactions of a given backend will get the same
timestamp value from now(). That holds so long as ticks of the system time
are small enough. Not so on at least some Windows configurations.

Most Windows systems with nothing else running will have 15 ms timer
granularity. So multiple timestamps allocated within the same
millisecond will have the same value for timestamps captured within that
interval.

If you're running other programs that use the multimedia timer APIs
(including Google Chrome, MS SQL Server, and all sorts of other apps you
might not expect) you'll probably have 1ms timer granularity instead.

Since PostgreSQL 9.4 and below capture time on Windows using
GetSystemTime the sub-millisecond part is lost anyway. On 9.5 it's
retained but will usually be some fixed value because the timer tick is
still 1ms.

If you're on Windows 8 or Windows 2012 and running PostgreSQL 9.5
(master), but not earlier versions, you'll get sub-microsecond
resolution like on sensible platforms.

Some details here: https://github.com/2ndQuadrant/pg_sysdatetime

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#151Petr Jelinek
petr@2ndquadrant.com
In reply to: Fujii Masao (#149)
1 attachment(s)
Re: tracking commit timestamps

On 05/01/15 07:28, Fujii Masao wrote:

On Thu, Dec 4, 2014 at 12:08 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Dec 3, 2014 at 11:54 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Pushed with some extra cosmetic tweaks.

I got the following assertion failure when I executed pg_xact_commit_timestamp()
in the standby server.

=# select pg_xact_commit_timestamp('1000'::xid);
TRAP: FailedAssertion("!(((oldestCommitTs) != ((TransactionId) 0)) ==
((newestCommitTs) != ((TransactionId) 0)))", File: "commit_ts.c",
Line: 315)
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2014-12-04
12:01:08 JST sby1 LOG: server process (PID 15545) was terminated by
signal 6: Aborted
2014-12-04 12:01:08 JST sby1 DETAIL: Failed process was running:
select pg_xact_commit_timestamp('1000'::xid);

The way to reproduce this problem is

#1. set up and start the master and standby servers with
track_commit_timestamp disabled
#2. enable track_commit_timestamp in the master and restart the master
#3. run some write transactions
#4. enable track_commit_timestamp in the standby and restart the standby
#5. execute "select pg_xact_commit_timestamp('1000'::xid)" in the standby

BTW, at the step #4, I got the following log messages. This might be a hint for
this problem.

LOG: file "pg_commit_ts/0000" doesn't exist, reading as zeroes
CONTEXT: xlog redo Transaction/COMMIT: 2014-12-04 12:00:16.428702+09;
inval msgs: catcache 59 catcache 58 catcache 59 catcache 58 catcache
45 catcache 44 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608
relcache 16384

This problem still happens in the master.

Regards,

Attached patch fixes it, I am not sure how happy I am with the way I did
it though.

And while at it I noticed that redo code for XLOG_PARAMETER_CHANGE sets
ControlFile->wal_log_hints = wal_log_hints;
shouldn't it be
ControlFile->wal_log_hints = xlrec.wal_log_hints;
instead?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

commit_ts_slave_activation_fix.patchtext/x-diff; name=commit_ts_slave_activation_fix.patchDownload
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index ca074da..fcfccf8 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -557,6 +557,12 @@ StartupCommitTs(void)
 	TransactionId xid = ShmemVariableCache->nextXid;
 	int			pageno = TransactionIdToCTsPage(xid);
 
+	if (track_commit_timestamp)
+	{
+		ActivateCommitTs();
+		return;
+	}
+
 	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
 
 	/*
@@ -571,14 +577,25 @@ StartupCommitTs(void)
  * This must be called ONCE during postmaster or standalone-backend startup,
  * when commit timestamp is enabled.  Must be called after recovery has
  * finished.
+ */
+void
+CompleteCommitTsInitialization(void)
+{
+	if (!track_commit_timestamp)
+		DeactivateCommitTs(true);
+}
+
+/*
+ * This must be called when track_commit_timestamp is turned on.
+ * Note that this only happens during postmaster or standalone-backend startup
+ * or during WAL replay.
  *
  * This is in charge of creating the currently active segment, if it's not
  * already there.  The reason for this is that the server might have been
  * running with this module disabled for a while and thus might have skipped
  * the normal creation point.
  */
-void
-CompleteCommitTsInitialization(void)
+void ActivateCommitTs(void)
 {
 	TransactionId xid = ShmemVariableCache->nextXid;
 	int			pageno = TransactionIdToCTsPage(xid);
@@ -591,22 +608,6 @@ CompleteCommitTsInitialization(void)
 	LWLockRelease(CommitTsControlLock);
 
 	/*
-	 * If this module is not currently enabled, make sure we don't hand back
-	 * possibly-invalid data; also remove segments of old data.
-	 */
-	if (!track_commit_timestamp)
-	{
-		LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
-		ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
-		ShmemVariableCache->newestCommitTs = InvalidTransactionId;
-		LWLockRelease(CommitTsLock);
-
-		TruncateCommitTs(ReadNewTransactionId());
-
-		return;
-	}
-
-	/*
 	 * If CommitTs is enabled, but it wasn't in the previous server run, we
 	 * need to set the oldest and newest values to the next Xid; that way, we
 	 * will not try to read data that might not have been set.
@@ -641,6 +642,35 @@ CompleteCommitTsInitialization(void)
 }
 
 /*
+ * This must be called when track_commit_timestamp is turned off.
+ * Note that this only happens during postmaster or standalone-backend startup
+ * or during WAL replay.
+ *
+ * Resets CommitTs into invalid state to make sure we don't hand back
+ * possibly-invalid data; also removes segments of old data.
+ */
+void
+DeactivateCommitTs(bool do_wal)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	/*
+	 * Re-Initialize our idea of the latest page number.
+	 */
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+	CommitTsCtl->shared->latest_page_number = pageno;
+	LWLockRelease(CommitTsControlLock);
+
+	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+	ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
+	ShmemVariableCache->newestCommitTs = InvalidTransactionId;
+	LWLockRelease(CommitTsLock);
+
+	TruncateCommitTs(ReadNewTransactionId(), do_wal);
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
@@ -705,7 +735,7 @@ ExtendCommitTs(TransactionId newestXact)
  * Note that we don't need to flush XLOG here.
  */
 void
-TruncateCommitTs(TransactionId oldestXact)
+TruncateCommitTs(TransactionId oldestXact, bool do_wal)
 {
 	int			cutoffPage;
 
@@ -721,7 +751,8 @@ TruncateCommitTs(TransactionId oldestXact)
 		return;					/* nothing to remove */
 
 	/* Write XLOG record */
-	WriteTruncateXlogRec(cutoffPage);
+	if (do_wal)
+		WriteTruncateXlogRec(cutoffPage);
 
 	/* Now we can remove the old CommitTs segment(s) */
 	SimpleLruTruncate(CommitTsCtl, cutoffPage);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5cc7e47..7ba7436 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5587,6 +5587,16 @@ do { \
 						minValue))); \
 } while(0)
 
+#define RecoveryRequiresBoolParameter(param_name, currValue, masterValue) \
+do { \
+	if (!(currValue) && (masterValue)) \
+		ereport(ERROR, \
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE), \
+				 errmsg("hot standby is not possible because " \
+						"%s is disabled but master server has it enabled ", \
+						param_name))); \
+} while(0)
+
 /*
  * Check to see if required parameters are set high enough on this server
  * for various aspects of recovery operation.
@@ -5629,6 +5639,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresBoolParameter("track_commit_timestamp",
+									  track_commit_timestamp,
+									  ControlFile->track_commit_timestamp);
 	}
 }
 
@@ -8968,7 +8981,6 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
-		ControlFile->track_commit_timestamp = track_commit_timestamp;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
@@ -8986,6 +8998,20 @@ xlog_redo(XLogReaderState *record)
 			ControlFile->minRecoveryPointTLI = ThisTimeLineID;
 		}
 
+		/*
+		 * Update the commit timestamp tracking. If there was a change
+		 * it needs to be activated or deactivated accordingly.
+		 */
+		if (track_commit_timestamp != xlrec.track_commit_timestamp)
+		{
+			track_commit_timestamp = xlrec.track_commit_timestamp;
+			ControlFile->track_commit_timestamp = track_commit_timestamp;
+			if (track_commit_timestamp)
+				ActivateCommitTs();
+			else
+				DeactivateCommitTs(false);
+		}
+
 		UpdateControlFile();
 		LWLockRelease(ControlFileLock);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e32e039..ced78ff 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1077,7 +1077,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 * checkpoint.
 	 */
 	TruncateCLOG(frozenXID);
-	TruncateCommitTs(frozenXID);
+	TruncateCommitTs(frozenXID, true);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 903c82c..70ca968 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -39,11 +39,13 @@ extern Size CommitTsShmemSize(void);
 extern void CommitTsShmemInit(void);
 extern void BootStrapCommitTs(void);
 extern void StartupCommitTs(void);
+extern void ActivateCommitTs(void);
+extern void DeactivateCommitTs(bool do_wal);
 extern void CompleteCommitTsInitialization(void);
 extern void ShutdownCommitTs(void);
 extern void CheckPointCommitTs(void);
 extern void ExtendCommitTs(TransactionId newestXact);
-extern void TruncateCommitTs(TransactionId oldestXact);
+extern void TruncateCommitTs(TransactionId oldestXact, bool do_wal);
 extern void SetCommitTsLimit(TransactionId oldestXact,
 				 TransactionId newestXact);
 extern void AdvanceOldestCommitTs(TransactionId oldestXact);
#152Petr Jelinek
petr@2ndquadrant.com
In reply to: Petr Jelinek (#151)
1 attachment(s)
Re: tracking commit timestamps

On 05/01/15 16:17, Petr Jelinek wrote:

On 05/01/15 07:28, Fujii Masao wrote:

On Thu, Dec 4, 2014 at 12:08 PM, Fujii Masao <masao.fujii@gmail.com>
wrote:

On Wed, Dec 3, 2014 at 11:54 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Pushed with some extra cosmetic tweaks.

I got the following assertion failure when I executed
pg_xact_commit_timestamp()
in the standby server.

=# select pg_xact_commit_timestamp('1000'::xid);
TRAP: FailedAssertion("!(((oldestCommitTs) != ((TransactionId) 0)) ==
((newestCommitTs) != ((TransactionId) 0)))", File: "commit_ts.c",
Line: 315)
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2014-12-04
12:01:08 JST sby1 LOG: server process (PID 15545) was terminated by
signal 6: Aborted
2014-12-04 12:01:08 JST sby1 DETAIL: Failed process was running:
select pg_xact_commit_timestamp('1000'::xid);

The way to reproduce this problem is

#1. set up and start the master and standby servers with
track_commit_timestamp disabled
#2. enable track_commit_timestamp in the master and restart the master
#3. run some write transactions
#4. enable track_commit_timestamp in the standby and restart the standby
#5. execute "select pg_xact_commit_timestamp('1000'::xid)" in the
standby

BTW, at the step #4, I got the following log messages. This might be
a hint for
this problem.

LOG: file "pg_commit_ts/0000" doesn't exist, reading as zeroes
CONTEXT: xlog redo Transaction/COMMIT: 2014-12-04 12:00:16.428702+09;
inval msgs: catcache 59 catcache 58 catcache 59 catcache 58 catcache
45 catcache 44 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608
relcache 16384

This problem still happens in the master.

Regards,

Attached patch fixes it, I am not sure how happy I am with the way I did
it though.

Added more comments and made the error message bit clearer.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

commit_ts_slave_activation_fix-v2.patchtext/x-diff; name=commit_ts_slave_activation_fix-v2.patchDownload
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index ca074da..59d19a0 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -557,6 +557,12 @@ StartupCommitTs(void)
 	TransactionId xid = ShmemVariableCache->nextXid;
 	int			pageno = TransactionIdToCTsPage(xid);
 
+	if (track_commit_timestamp)
+	{
+		ActivateCommitTs();
+		return;
+	}
+
 	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
 
 	/*
@@ -571,14 +577,30 @@ StartupCommitTs(void)
  * This must be called ONCE during postmaster or standalone-backend startup,
  * when commit timestamp is enabled.  Must be called after recovery has
  * finished.
+ */
+void
+CompleteCommitTsInitialization(void)
+{
+	if (!track_commit_timestamp)
+		DeactivateCommitTs(true);
+}
+
+/*
+ * This must be called when track_commit_timestamp is turned on.
+ * Note that this only happens during postmaster or standalone-backend startup
+ * or during WAL replay.
+ *
+ * The reason why this SLRU needs separate activation/deactivation functions is
+ * that it can be enabled/disabled during start and the activation/deactivation
+ * on master is propagated to slave via replay. Other SLRUs don't have this
+ * property and they can be just initialized during normal startup.
  *
  * This is in charge of creating the currently active segment, if it's not
  * already there.  The reason for this is that the server might have been
  * running with this module disabled for a while and thus might have skipped
  * the normal creation point.
  */
-void
-CompleteCommitTsInitialization(void)
+void ActivateCommitTs(void)
 {
 	TransactionId xid = ShmemVariableCache->nextXid;
 	int			pageno = TransactionIdToCTsPage(xid);
@@ -591,22 +613,6 @@ CompleteCommitTsInitialization(void)
 	LWLockRelease(CommitTsControlLock);
 
 	/*
-	 * If this module is not currently enabled, make sure we don't hand back
-	 * possibly-invalid data; also remove segments of old data.
-	 */
-	if (!track_commit_timestamp)
-	{
-		LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
-		ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
-		ShmemVariableCache->newestCommitTs = InvalidTransactionId;
-		LWLockRelease(CommitTsLock);
-
-		TruncateCommitTs(ReadNewTransactionId());
-
-		return;
-	}
-
-	/*
 	 * If CommitTs is enabled, but it wasn't in the previous server run, we
 	 * need to set the oldest and newest values to the next Xid; that way, we
 	 * will not try to read data that might not have been set.
@@ -641,6 +647,35 @@ CompleteCommitTsInitialization(void)
 }
 
 /*
+ * This must be called when track_commit_timestamp is turned off.
+ * Note that this only happens during postmaster or standalone-backend startup
+ * or during WAL replay.
+ *
+ * Resets CommitTs into invalid state to make sure we don't hand back
+ * possibly-invalid data; also removes segments of old data.
+ */
+void
+DeactivateCommitTs(bool do_wal)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int			pageno = TransactionIdToCTsPage(xid);
+
+	/*
+	 * Re-Initialize our idea of the latest page number.
+	 */
+	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
+	CommitTsCtl->shared->latest_page_number = pageno;
+	LWLockRelease(CommitTsControlLock);
+
+	LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
+	ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
+	ShmemVariableCache->newestCommitTs = InvalidTransactionId;
+	LWLockRelease(CommitTsLock);
+
+	TruncateCommitTs(ReadNewTransactionId(), do_wal);
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
@@ -705,7 +740,7 @@ ExtendCommitTs(TransactionId newestXact)
  * Note that we don't need to flush XLOG here.
  */
 void
-TruncateCommitTs(TransactionId oldestXact)
+TruncateCommitTs(TransactionId oldestXact, bool do_wal)
 {
 	int			cutoffPage;
 
@@ -721,7 +756,8 @@ TruncateCommitTs(TransactionId oldestXact)
 		return;					/* nothing to remove */
 
 	/* Write XLOG record */
-	WriteTruncateXlogRec(cutoffPage);
+	if (do_wal)
+		WriteTruncateXlogRec(cutoffPage);
 
 	/* Now we can remove the old CommitTs segment(s) */
 	SimpleLruTruncate(CommitTsCtl, cutoffPage);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5cc7e47..4117560 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5587,6 +5587,21 @@ do { \
 						minValue))); \
 } while(0)
 
+#define RecoveryRequiresBoolParameter(param_name, currValue, masterValue) \
+do { \
+	bool _currValue = (currValue); \
+	bool _masterValue = (masterValue); \
+	if (_currValue != _masterValue) \
+		ereport(ERROR, \
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE), \
+				 errmsg("hot standby is not possible because it requires " \
+						"\"%s\" to be same on master and standby" \
+						"(master has \"%s\", standby has \"%s\")", \
+						param_name, \
+						_masterValue ? "true" : "false", \
+						_currValue ? "true" : "false"))); \
+} while(0)
+
 /*
  * Check to see if required parameters are set high enough on this server
  * for various aspects of recovery operation.
@@ -5629,6 +5644,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresBoolParameter("track_commit_timestamp",
+									  track_commit_timestamp,
+									  ControlFile->track_commit_timestamp);
 	}
 }
 
@@ -8968,7 +8986,6 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
-		ControlFile->track_commit_timestamp = track_commit_timestamp;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
@@ -8986,6 +9003,25 @@ xlog_redo(XLogReaderState *record)
 			ControlFile->minRecoveryPointTLI = ThisTimeLineID;
 		}
 
+		/*
+		 * Update the commit timestamp tracking. If there was a change
+		 * it needs to be activated or deactivated accordingly.
+		 */
+		if (track_commit_timestamp != xlrec.track_commit_timestamp)
+		{
+			track_commit_timestamp = xlrec.track_commit_timestamp;
+			ControlFile->track_commit_timestamp = track_commit_timestamp;
+			if (track_commit_timestamp)
+				ActivateCommitTs();
+			else
+				/*
+				 * Recovery can't create ne WAL records, but that's ok as
+				 * master did the WAL logging and we will replay the record
+				 * from master in case we crash here.
+				 */
+				DeactivateCommitTs(false);
+		}
+
 		UpdateControlFile();
 		LWLockRelease(ControlFileLock);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e32e039..ced78ff 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1077,7 +1077,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 * checkpoint.
 	 */
 	TruncateCLOG(frozenXID);
-	TruncateCommitTs(frozenXID);
+	TruncateCommitTs(frozenXID, true);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 903c82c..70ca968 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -39,11 +39,13 @@ extern Size CommitTsShmemSize(void);
 extern void CommitTsShmemInit(void);
 extern void BootStrapCommitTs(void);
 extern void StartupCommitTs(void);
+extern void ActivateCommitTs(void);
+extern void DeactivateCommitTs(bool do_wal);
 extern void CompleteCommitTsInitialization(void);
 extern void ShutdownCommitTs(void);
 extern void CheckPointCommitTs(void);
 extern void ExtendCommitTs(TransactionId newestXact);
-extern void TruncateCommitTs(TransactionId oldestXact);
+extern void TruncateCommitTs(TransactionId oldestXact, bool do_wal);
 extern void SetCommitTsLimit(TransactionId oldestXact,
 				 TransactionId newestXact);
 extern void AdvanceOldestCommitTs(TransactionId oldestXact);
#153Michael Paquier
michael.paquier@gmail.com
In reply to: Noah Misch (#148)
Re: tracking commit timestamps

On Fri, Dec 19, 2014 at 3:53 PM, Noah Misch <noah@leadboat.com> wrote:

localhost template1=# select clock_timestamp(), pg_sleep(.1 * (n % 2)) from generate_series(0,7) t(n);
clock_timestamp | pg_sleep
-------------------------------+----------
2014-12-18 08:34:34.522126+00 |
2014-12-18 08:34:34.522126+00 |
2014-12-18 08:34:34.631508+00 |
2014-12-18 08:34:34.631508+00 |
2014-12-18 08:34:34.74089+00 |
2014-12-18 08:34:34.74089+00 |
2014-12-18 08:34:34.850272+00 |
2014-12-18 08:34:34.850272+00 |
(8 rows)

So, we would need additional information other than the node ID *and*
the timestamp to ensure proper transaction commit ordering on Windows.
That's not cool and makes this feature very limited on this platform.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#154Petr Jelinek
petr@2ndquadrant.com
In reply to: Michael Paquier (#153)
Re: tracking commit timestamps

On 06/01/15 08:58, Michael Paquier wrote:

On Fri, Dec 19, 2014 at 3:53 PM, Noah Misch <noah@leadboat.com> wrote:

localhost template1=# select clock_timestamp(), pg_sleep(.1 * (n % 2)) from generate_series(0,7) t(n);
clock_timestamp | pg_sleep
-------------------------------+----------
2014-12-18 08:34:34.522126+00 |
2014-12-18 08:34:34.522126+00 |
2014-12-18 08:34:34.631508+00 |
2014-12-18 08:34:34.631508+00 |
2014-12-18 08:34:34.74089+00 |
2014-12-18 08:34:34.74089+00 |
2014-12-18 08:34:34.850272+00 |
2014-12-18 08:34:34.850272+00 |
(8 rows)

So, we would need additional information other than the node ID *and*
the timestamp to ensure proper transaction commit ordering on Windows.
That's not cool and makes this feature very limited on this platform.

Well that's Windows time api for you, it affects everything that deals
with timestamps though, not just commit ts. Note that the precision
depends on hardware and other software that was running on the computer
(there is undocumented api to increase the resolution, also use of
multimedia timer increases resolution, etc).

The good news is that MS provides new high precision time API in Windows
8 and Windows Server 2012 which we are using thanks to
519b0757a37254452e013ea0ac95f4e56391608c so we are good at least on
modern systems.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#155Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#153)
Re: tracking commit timestamps

On Tue, Jan 6, 2015 at 2:58 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

So, we would need additional information other than the node ID *and*
the timestamp to ensure proper transaction commit ordering on Windows.
That's not cool and makes this feature very limited on this platform.

You can't use the timestamp alone for commit ordering on any platform.
Eventually, two transactions will manage to commit in a single clock
tick, no matter how short that is.

Now, if we'd included the LSN in there...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#156Petr Jelinek
petr@2ndquadrant.com
In reply to: Petr Jelinek (#152)
Re: tracking commit timestamps

On 05/01/15 17:50, Petr Jelinek wrote:

On 05/01/15 16:17, Petr Jelinek wrote:

On 05/01/15 07:28, Fujii Masao wrote:

On Thu, Dec 4, 2014 at 12:08 PM, Fujii Masao <masao.fujii@gmail.com>
wrote:

On Wed, Dec 3, 2014 at 11:54 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Pushed with some extra cosmetic tweaks.

I got the following assertion failure when I executed
pg_xact_commit_timestamp()
in the standby server.

=# select pg_xact_commit_timestamp('1000'::xid);
TRAP: FailedAssertion("!(((oldestCommitTs) != ((TransactionId) 0)) ==
((newestCommitTs) != ((TransactionId) 0)))", File: "commit_ts.c",
Line: 315)
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2014-12-04
12:01:08 JST sby1 LOG: server process (PID 15545) was terminated by
signal 6: Aborted
2014-12-04 12:01:08 JST sby1 DETAIL: Failed process was running:
select pg_xact_commit_timestamp('1000'::xid);

The way to reproduce this problem is

#1. set up and start the master and standby servers with
track_commit_timestamp disabled
#2. enable track_commit_timestamp in the master and restart the master
#3. run some write transactions
#4. enable track_commit_timestamp in the standby and restart the
standby
#5. execute "select pg_xact_commit_timestamp('1000'::xid)" in the
standby

BTW, at the step #4, I got the following log messages. This might be
a hint for
this problem.

LOG: file "pg_commit_ts/0000" doesn't exist, reading as zeroes
CONTEXT: xlog redo Transaction/COMMIT: 2014-12-04 12:00:16.428702+09;
inval msgs: catcache 59 catcache 58 catcache 59 catcache 58 catcache
45 catcache 44 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608
relcache 16384

This problem still happens in the master.

Regards,

Attached patch fixes it, I am not sure how happy I am with the way I did
it though.

Added more comments and made the error message bit clearer.

Fujii, Alvaro, did one of you had chance to look at this fix?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#157Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Petr Jelinek (#152)
Re: tracking commit timestamps

Petr Jelinek wrote:

On Thu, Dec 4, 2014 at 12:08 PM, Fujii Masao <masao.fujii@gmail.com>
wrote:

I got the following assertion failure when I executed
pg_xact_commit_timestamp()
in the standby server.

=# select pg_xact_commit_timestamp('1000'::xid);
TRAP: FailedAssertion("!(((oldestCommitTs) != ((TransactionId) 0)) ==
((newestCommitTs) != ((TransactionId) 0)))", File: "commit_ts.c",
Line: 315)

Attached patch fixes it, I am not sure how happy I am with the way I did
it though.

Pushed, thanks.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers