synchronized snapshots

Started by Joachim Wielandover 14 years ago49 messages

Joachim Wieland

joe@mcknight.de

over 14 years ago

1 attachment(s)

This is a patch to implement synchronized snapshots. It is based on
Alvaro's specifications in:

http://archives.postgresql.org/pgsql-hackers/2011-02/msg02074.php

In short, this is how it works:

SELECT pg_export_snapshot();
pg_export_snapshot
--------------------
000003A1-1
(1 row)

(and then in a different session)

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '000003A1-1');

The one thing that it does not implement is leaving the transaction in
an aborted state if the BEGIN TRANSACTION command failed for an
invalid snapshot identifier. I can certainly see that this would be
useful but I am not sure if it justifies introducing this
inconsistency. We would have a BEGIN TRANSACTION command that left the
session in a different state depending on why it failed...

Also I was unsure if we really need to do further checking beyond the
existence of the file, why exactly is this necessary?

The patch is adding an extra "stemplate" parameter to the GetSnapshot
functions, the primary reason for this is to make it work with SSI,
which gets a snapshot and then does stuff with it. The alternative
would have been splitting up the SSI function so that we can smuggle
in our own snapshot but that didn't seem to be less ugly. The way it
works now is that the lowest function checks if a template is being
passed from higher up and if so, it doesn't get a fresh snapshot but
returns just a copy of the template.

I am wondering if pg_export_snapshot() is still the right name, since
the snapshot is no longer exported to the user. It is exported to a
file but that's an implementation detail.

Joachim

Attachments:

syncSnapshots.1.difftext/x-patch; charset=US-ASCII; name=syncSnapshots.1.diffDownload

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4c3e232..d7d2fbe 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*************** FOR EACH ROW EXECUTE PROCEDURE suppress_
*** 15012,15015 ****
--- 15012,15089 ----
          <xref linkend="SQL-CREATETRIGGER">.
      </para>
    </sect1>
+ 
+   <sect1 id="functions-snapshotsync">
+    <title>Snapshot Synchronization Functions</title>
+ 
+    <indexterm>
+      <primary>pg_export_snapshot</primary>
+    </indexterm>
+ 
+    <para>
+      <productname>PostgreSQL</> allows different sessions to synchronize their
+      snapshots. A database snapshot determines which data is visible to
+      the client that is using this snapshot. Synchronized snapshots are necessary when
+      two clients need to see the same content in the database. If these two clients
+      just connected to the database and opened their transactions, then they could
+      never be sure that there was no data modification right between both
+      connections.
+    </para>
+    <para>
+      As a solution, <productname>PostgreSQL</> offers the function
+      <function>pg_export_snapshot</> which saves the snapshot internally and
+      from then on until the end of the saving transaction, the snapshot can be
+      used on a <xref linkend="sql-begin"> (or <xref
+      linkend="sql-start-transaction">) command to open a second transaction with the
+      exact same snapshot. Now both transactions are guaranteed to see the exact same
+      data even though they might have connected at different times.
+    </para>
+    <para>
+      Note that a snapshot can only be used to start a new transaction as long
+      as the transaction that originally saved it is held open. Also note that even
+      after the synchronization both clients still run their own independent
+      transactions. As a consequence, even though synchronized with respect to
+      reading pre-existing data, both transactions won't be able to see each other's
+      uncommitted data.
+    </para>
+    <table id="functions-snapshot-synchronization">
+     <title>Snapshot Synchronization Functions</title>
+     <tgroup cols="3">
+      <thead>
+       <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+       </row>
+      </thead>
+ 
+      <tbody>
+       <row>
+        <entry>
+         <literal><function>pg_export_snapshot()</function></literal>
+        </entry>
+        <entry><type>text</type></entry>
+        <entry>Save the snapshot and return its identifier</entry>
+       </row>
+      </tbody>
+     </tgroup>
+    </table>
+ 
+    <para>
+       The function <function>pg_export_snapshot</> does not take an argument
+       and returns the snapshot's identifier as <type>text</type> data. Internally the
+       function will save the snapshot data to a file so that it can be retrieved
+       from a different backend process later on. Note that as soon as the
+       transaction ends, any saved snapshots become invalid and their
+       identifiers cannot be used to start other transactions anymore. If the function
+       has been executed, the transaction cannot be prepared anymore with <xref
+       linkend="sql-prepare-transaction">.
+    </para>
+ <programlisting>
+ SELECT pg_export_snapshot();
+ 
+  pg_export_snapshot
+ --------------------
+  000003A1-1
+ (1 row)
+ </programlisting>
+   </sect1>
  </chapter>
+ 
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index acd8232..d5fb728 100644
*** a/doc/src/sgml/ref/begin.sgml
--- b/doc/src/sgml/ref/begin.sgml
*************** BEGIN [ WORK | TRANSACTION ] [ <replacea
*** 28,33 ****
--- 28,34 ----
      ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
      READ WRITE | READ ONLY
      [ NOT ] DEFERRABLE
+     ( SNAPSHOT = snapshot_id )
  </synopsis>
   </refsynopsisdiv>
  
*************** BEGIN [ WORK | TRANSACTION ] [ <replacea
*** 78,83 ****
--- 79,100 ----
       </para>
      </listitem>
     </varlistentry>
+ 
+    <varlistentry>
+     <term><literal>BEGIN ... (SNAPSHOT = snapshot_id )</literal></term>
+     <listitem>
+      <para>
+ 	  Use the <literal>(SNAPSHOT = snapshot_id)</literal> parameter to start a
+ 	  new transaction with the same snapshot as an already running transaction.
+       A call to <literal>pg_export_snapshot</literal> (see <xref
+ 	  linkend="functions-snapshotsync">) returns a snapshot id which must be
+ 	  passed to the <literal>BEGIN</literal> command to create a second
+       transaction running with the same snapshot. You also need to make the
+       transaction <literal>ISOLATION LEVEL SERIALIZABLE</literal> or
+       <literal>ISOLATION LEVEL REPEATABLE READ</literal>.
+      </para>
+     </listitem>
+    </varlistentry>
    </variablelist>
  
    <para>
*************** BEGIN [ WORK | TRANSACTION ] [ <replacea
*** 123,128 ****
--- 140,167 ----
  <programlisting>
  BEGIN;
  </programlisting></para>
+ 
+   <para>
+    To begin a new transaction block with the same snapshot as an already
+    existing transaction, first export the snapshot from the existing
+    transaction. This will return the snapshot id:
+ 
+ <programlisting>
+ # SELECT pg_export_snapshot();
+  pg_export_snapshot
+ --------------------
+  000003A1-1
+ (1 row)
+ </programlisting>
+ 
+    Then reference this snapshot id on the BEGIN TRANSACTION command:
+ 
+ <programlisting>
+ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '000003A1-1');
+ </programlisting>
+   </para>
+ 
+ 
   </refsect1>
  
   <refsect1>
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index f25a3e9..0576faf 100644
*** a/doc/src/sgml/ref/start_transaction.sgml
--- b/doc/src/sgml/ref/start_transaction.sgml
*************** START TRANSACTION [ <replaceable class="
*** 28,33 ****
--- 28,34 ----
      ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
      READ WRITE | READ ONLY
      [ NOT ] DEFERRABLE
+     ( SNAPSHOT = snapshot_id )
  </synopsis>
   </refsynopsisdiv>
  
*************** START TRANSACTION [ <replaceable class="
*** 46,53 ****
    <title>Parameters</title>
  
    <para>
!    Refer to <xref linkend="sql-set-transaction"> for information on the meaning
!    of the parameters to this statement.
    </para>
   </refsect1>
  
--- 47,54 ----
    <title>Parameters</title>
  
    <para>
!    Refer to <xref linkend="sql-set-transaction"> and <xref linkend="sql-begin">
!    for information on the meaning of the parameters to this statement.
    </para>
   </refsect1>
  
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e8821f7..e372c80 100644
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*************** CommitTransaction(void)
*** 1855,1860 ****
--- 1855,1866 ----
  	 */
  	PreCommit_Notify();
  
+ 	/*
+ 	 * Cleans up exported snapshots (this needs to happen before we update
+ 	 * our MyProc entry).
+ 	 */
+ 	PreCommit_Snapshot();
+ 
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
  
*************** PrepareTransaction(void)
*** 2073,2078 ****
--- 2079,2089 ----
  				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  				 errmsg("cannot PREPARE a transaction that has operated on temporary tables")));
  
+ 	if (exportedSnapshots)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				 errmsg("cannot PREPARE a transaction that has exported snapshots")));
+ 
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
  
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 11035e6..611ac81 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 58,63 ****
--- 58,64 ----
  #include "utils/guc.h"
  #include "utils/ps_status.h"
  #include "utils/relmapper.h"
+ #include "utils/snapmgr.h"
  #include "pg_trace.h"
  
  
*************** StartupXLOG(void)
*** 6368,6373 ****
--- 6369,6379 ----
  		CheckRequiredParameterValues();
  
  		/*
+ 		 * We can delete any saved transaction snapshots that still exist
+ 		 */
+ 		DeleteAllExportedSnapshotFiles();
+ 
+ 		/*
  		 * We're in recovery, so unlogged relations relations may be trashed
  		 * and must be reset.  This should be done BEFORE allowing Hot Standby
  		 * connections, so that read-only backends don't try to read whatever
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index e9f3896..8bdf096 100644
*** a/src/backend/parser/gram.y
--- b/src/backend/parser/gram.y
*************** transaction_mode_item:
*** 7252,7257 ****
--- 7252,7260 ----
  			| NOT DEFERRABLE
  					{ $$ = makeDefElem("transaction_deferrable",
  									   makeIntConst(FALSE, @1)); }
+ 			| '(' ColId '=' Sconst ')'
+ 					{ $$ = makeDefElem($2,
+ 									   makeStringConst($4, @4)); }
  		;
  
  /* Syntax with commas is SQL-spec, without commas is Postgres historical */
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e7593fa..8b98e4e 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
*************** ProcArrayShmemSize(void)
*** 167,175 ****
  {
  	Size		size;
  
- 	/* Size of the ProcArray structure itself */
- #define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
- 
  	size = offsetof(ProcArrayStruct, procs);
  	size = add_size(size, mul_size(sizeof(PGPROC *), PROCARRAY_MAXPROCS));
  
--- 167,172 ----
*************** ProcArrayShmemSize(void)
*** 180,193 ****
  	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
  	 * main structures created in those functions must be identically sized,
  	 * since we may at times copy the whole of the data structures around. We
! 	 * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
  	 *
  	 * Ideally we'd only create this structure if we were actually doing hot
  	 * standby in the current run, but we don't know that yet at the time
  	 * shared memory is being set up.
  	 */
- #define TOTAL_MAX_CACHED_SUBXIDS \
- 	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
  
  	if (EnableHotStandby)
  	{
--- 177,188 ----
  	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
  	 * main structures created in those functions must be identically sized,
  	 * since we may at times copy the whole of the data structures around. We
! 	 * refer to this size as TOTAL_MAX_CACHED_SUBXIDS, defined in procarray.h.
  	 *
  	 * Ideally we'd only create this structure if we were actually doing hot
  	 * standby in the current run, but we don't know that yet at the time
  	 * shared memory is being set up.
  	 */
  
  	if (EnableHotStandby)
  	{
*************** GetOldestXmin(bool allDbs, bool ignoreVa
*** 1145,1151 ****
   * not statically allocated (see xip allocation below).
   */
  Snapshot
! GetSnapshotData(Snapshot snapshot)
  {
  	ProcArrayStruct *arrayP = procArray;
  	TransactionId xmin;
--- 1140,1146 ----
   * not statically allocated (see xip allocation below).
   */
  Snapshot
! GetSnapshotData(Snapshot snapshot, Snapshot stemplate)
  {
  	ProcArrayStruct *arrayP = procArray;
  	TransactionId xmin;
*************** GetSnapshotData(Snapshot snapshot)
*** 1159,1164 ****
--- 1154,1182 ----
  	Assert(snapshot != NULL);
  
  	/*
+ 	 * We only get a valid snapshot in stemplate if the snapshot
+ 	 * synchronization feature used. In that case we just need to copy the
+ 	 * values that we get onto the snapshot we return.
+ 	 * Note that in this case we always duplicate an existing snapshot, that is
+ 	 * currently held by another active transaction. That's why we do not need
+ 	 * to update any { RecentGlobalXmin, RecentXmin, globalxmin }.
+ 	 */
+ 	if (stemplate != InvalidSnapshot)
+ 	{
+ 		/*
+ 		 * 'stemplate' is only read and its values are copied onto 'snapshot'.
+ 		 */
+ 		CopySnapshotOnto(stemplate, snapshot);
+ 
+ 		/*
+ 		 * We can use the result of the copy except for that this snapshot
+ 		 * should look like new and not copied.
+ 		 */
+ 		snapshot->copied = false;
+ 		return snapshot;
+ 	}
+ 
+ 	/*
  	 * Allocating space for maxProcs xids is usually overkill; numProcs would
  	 * be sufficient.  But it seems better to do the malloc while not holding
  	 * the lock, so we can't look at numProcs.  Likewise, we allocate much
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 8e7a7f0..cca0bef 100644
*** a/src/backend/storage/lmgr/predicate.c
--- b/src/backend/storage/lmgr/predicate.c
*************** static void OldSerXidSetActiveSerXmin(Tr
*** 416,423 ****
  
  static uint32 predicatelock_hash(const void *key, Size keysize);
  static void SummarizeOldestCommittedSxact(void);
! static Snapshot GetSafeSnapshot(Snapshot snapshot);
! static Snapshot RegisterSerializableTransactionInt(Snapshot snapshot);
  static bool PredicateLockExists(const PREDICATELOCKTARGETTAG *targettag);
  static bool GetParentPredicateLockTag(const PREDICATELOCKTARGETTAG *tag,
  						  PREDICATELOCKTARGETTAG *parent);
--- 416,423 ----
  
  static uint32 predicatelock_hash(const void *key, Size keysize);
  static void SummarizeOldestCommittedSxact(void);
! static Snapshot GetSafeSnapshot(Snapshot snapshot, Snapshot stemplate);
! static Snapshot RegisterSerializableTransactionInt(Snapshot snapshot, Snapshot stemplate);
  static bool PredicateLockExists(const PREDICATELOCKTARGETTAG *targettag);
  static bool GetParentPredicateLockTag(const PREDICATELOCKTARGETTAG *tag,
  						  PREDICATELOCKTARGETTAG *parent);
*************** SummarizeOldestCommittedSxact(void)
*** 1487,1493 ****
   *		one of them could possibly create a conflict.
   */
  static Snapshot
! GetSafeSnapshot(Snapshot origSnapshot)
  {
  	Snapshot	snapshot;
  
--- 1487,1493 ----
   *		one of them could possibly create a conflict.
   */
  static Snapshot
! GetSafeSnapshot(Snapshot origSnapshot, Snapshot stemplate)
  {
  	Snapshot	snapshot;
  
*************** GetSafeSnapshot(Snapshot origSnapshot)
*** 1501,1507 ****
  		 * caller passed to us. It returns a copy of that snapshot and
  		 * registers it on TopTransactionResourceOwner.
  		 */
! 		snapshot = RegisterSerializableTransactionInt(origSnapshot);
  
  		if (MySerializableXact == InvalidSerializableXact)
  			return snapshot;	/* no concurrent r/w xacts; it's safe */
--- 1501,1507 ----
  		 * caller passed to us. It returns a copy of that snapshot and
  		 * registers it on TopTransactionResourceOwner.
  		 */
! 		snapshot = RegisterSerializableTransactionInt(origSnapshot, stemplate);
  
  		if (MySerializableXact == InvalidSerializableXact)
  			return snapshot;	/* no concurrent r/w xacts; it's safe */
*************** GetSafeSnapshot(Snapshot origSnapshot)
*** 1554,1560 ****
   * It should be current for this process and be contained in PredXact.
   */
  Snapshot
! RegisterSerializableTransaction(Snapshot snapshot)
  {
  	Assert(IsolationIsSerializable());
  
--- 1554,1560 ----
   * It should be current for this process and be contained in PredXact.
   */
  Snapshot
! RegisterSerializableTransaction(Snapshot snapshot, Snapshot stemplate)
  {
  	Assert(IsolationIsSerializable());
  
*************** RegisterSerializableTransaction(Snapshot
*** 1564,1576 ****
  	 * thereby avoid all SSI overhead once it's running..
  	 */
  	if (XactReadOnly && XactDeferrable)
! 		return GetSafeSnapshot(snapshot);
  
! 	return RegisterSerializableTransactionInt(snapshot);
  }
  
  static Snapshot
! RegisterSerializableTransactionInt(Snapshot snapshot)
  {
  	PGPROC	   *proc;
  	VirtualTransactionId vxid;
--- 1564,1576 ----
  	 * thereby avoid all SSI overhead once it's running..
  	 */
  	if (XactReadOnly && XactDeferrable)
! 		return GetSafeSnapshot(snapshot, stemplate);
  
! 	return RegisterSerializableTransactionInt(snapshot, stemplate);
  }
  
  static Snapshot
! RegisterSerializableTransactionInt(Snapshot snapshot, Snapshot stemplate)
  {
  	PGPROC	   *proc;
  	VirtualTransactionId vxid;
*************** RegisterSerializableTransactionInt(Snaps
*** 1608,1614 ****
  	} while (!sxact);
  
  	/* Get and register a snapshot */
! 	snapshot = GetSnapshotData(snapshot);
  	snapshot = RegisterSnapshotOnOwner(snapshot, TopTransactionResourceOwner);
  
  	/*
--- 1608,1614 ----
  	} while (!sxact);
  
  	/* Get and register a snapshot */
! 	snapshot = GetSnapshotData(snapshot, stemplate);
  	snapshot = RegisterSnapshotOnOwner(snapshot, TopTransactionResourceOwner);
  
  	/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 0749227..8dbcc40 100644
*** a/src/backend/tcop/utility.c
--- b/src/backend/tcop/utility.c
***************
*** 58,63 ****
--- 58,64 ----
  #include "tcop/utility.h"
  #include "utils/acl.h"
  #include "utils/guc.h"
+ #include "utils/snapmgr.h"
  #include "utils/syscache.h"
  
  
*************** standard_ProcessUtility(Node *parsetree,
*** 375,380 ****
--- 376,382 ----
  					case TRANS_STMT_START:
  						{
  							ListCell   *lc;
+ 							char	   *snapshotId = NULL;
  
  							BeginTransactionBlock();
  							foreach(lc, stmt->options)
*************** standard_ProcessUtility(Node *parsetree,
*** 393,398 ****
--- 395,411 ----
  									SetPGVariable("transaction_deferrable",
  												  list_make1(item->arg),
  												  true);
+ 								else if (strcmp(item->defname, "snapshot") == 0)
+ 									/*
+ 									 * Only save the snapshot's id for now, so that we
+ 									 * process any other modifiers first.
+ 									 */
+ 									snapshotId = ((A_Const *) item->arg)->val.val.str;
+ 							}
+ 							if (snapshotId)
+ 							{
+ 								if (!ImportSnapshot(snapshotId))
+ 									elog(ERROR, "Could not import the requested snapshot");
  							}
  						}
  						break;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6670997..6c583bb 100644
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** ExecSetVariableStmt(VariableSetStmt *stm
*** 6115,6120 ****
--- 6115,6125 ----
  					else if (strcmp(item->defname, "transaction_deferrable") == 0)
  						SetPGVariable("transaction_deferrable",
  									  list_make1(item->arg), stmt->is_local);
+ 					else if (strcmp(item->defname, "snapshot") == 0)
+ 						ereport(ERROR,
+ 							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 							 errmsg("cannot set a snapshot with SET TRANSACTION"),
+ 							 errhint("use BEGIN/START TRANSACTION instead")));
  					else
  						elog(ERROR, "unexpected SET TRANSACTION element: %s",
  							 item->defname);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ef66466..52ded72 100644
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
***************
*** 25,36 ****
   */
  #include "postgres.h"
  
  #include "access/transam.h"
  #include "access/xact.h"
  #include "storage/predicate.h"
  #include "storage/proc.h"
  #include "storage/procarray.h"
! #include "utils/memutils.h"
  #include "utils/memutils.h"
  #include "utils/resowner.h"
  #include "utils/snapmgr.h"
--- 25,42 ----
   */
  #include "postgres.h"
  
+ #include <sys/types.h>
+ #include <sys/stat.h>
+ #include <unistd.h>
+ 
  #include "access/transam.h"
  #include "access/xact.h"
+ #include "miscadmin.h"
+ #include "storage/fd.h"
  #include "storage/predicate.h"
  #include "storage/proc.h"
  #include "storage/procarray.h"
! #include "utils/builtins.h"
  #include "utils/memutils.h"
  #include "utils/resowner.h"
  #include "utils/snapmgr.h"
*************** static Snapshot CopySnapshot(Snapshot sn
*** 109,126 ****
  static void FreeSnapshot(Snapshot snapshot);
  static void SnapshotResetXmin(void);
  
  
  /*
!  * GetTransactionSnapshot
   *		Get the appropriate snapshot for a new query in a transaction.
   *
!  * Note that the return value may point at static storage that will be modified
!  * by future calls and by CommandCounterIncrement().  Callers should call
!  * RegisterSnapshot or PushActiveSnapshot on the returned snap if it is to be
!  * used very long.
   */
! Snapshot
! GetTransactionSnapshot(void)
  {
  	/* First call in transaction? */
  	if (!FirstSnapshotSet)
--- 115,138 ----
  static void FreeSnapshot(Snapshot snapshot);
  static void SnapshotResetXmin(void);
  
+ /* What we need for exporting snapshots */
+ #define SNAPSHOT_EXPORT_DIR "pg_snapshots"
+ #define XactExportFilePath(path, xid, num, suffix) \
+     snprintf(path, MAXPGPATH, SNAPSHOT_EXPORT_DIR "/%08X-%d%s", xid, num, suffix)
+ 
+ List *exportedSnapshots = NIL;
  
  /*
!  * GetTransactionSnapshotFromTemplate
   *		Get the appropriate snapshot for a new query in a transaction.
   *
!  * A template snapshot is passed for the synchronized snapshots feature.
!  * In that case we want to have a snapshot back that has the template's
!  * values. We just pass it along and the lower level functions take care
!  * of it.
   */
! static Snapshot
! GetTransactionSnapshotFromTemplate(Snapshot stemplate)
  {
  	/* First call in transaction? */
  	if (!FirstSnapshotSet)
*************** GetTransactionSnapshot(void)
*** 135,151 ****
  		if (IsolationUsesXactSnapshot())
  		{
  			if (IsolationIsSerializable())
! 				CurrentSnapshot = RegisterSerializableTransaction(&CurrentSnapshotData);
  			else
  			{
! 				CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
  				CurrentSnapshot = RegisterSnapshotOnOwner(CurrentSnapshot,
  												TopTransactionResourceOwner);
  			}
  			registered_xact_snapshot = true;
  		}
  		else
! 			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
  
  		FirstSnapshotSet = true;
  		return CurrentSnapshot;
--- 147,171 ----
  		if (IsolationUsesXactSnapshot())
  		{
  			if (IsolationIsSerializable())
! 				CurrentSnapshot = RegisterSerializableTransaction(&CurrentSnapshotData,
! 																  stemplate);
  			else
  			{
! 				CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData, stemplate);
  				CurrentSnapshot = RegisterSnapshotOnOwner(CurrentSnapshot,
  												TopTransactionResourceOwner);
  			}
  			registered_xact_snapshot = true;
  		}
  		else
! 		{
! 			/*
! 			 * template is only used for the synchronized snapshot feature. Which in
! 			 * turn is only allowed for IsolationUsesXactSnapshot() == true transactions
! 			 */
! 			Assert(stemplate == InvalidSnapshot);
! 			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData, InvalidSnapshot);
! 		}
  
  		FirstSnapshotSet = true;
  		return CurrentSnapshot;
*************** GetTransactionSnapshot(void)
*** 154,165 ****
  	if (IsolationUsesXactSnapshot())
  		return CurrentSnapshot;
  
! 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
  
  	return CurrentSnapshot;
  }
  
  /*
   * GetLatestSnapshot
   *		Get a snapshot that is up-to-date as of the current instant,
   *		even if we are executing in transaction-snapshot mode.
--- 174,206 ----
  	if (IsolationUsesXactSnapshot())
  		return CurrentSnapshot;
  
! 	/* see comment above */
! 	Assert(stemplate == InvalidSnapshot);
! 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData, InvalidSnapshot);
  
  	return CurrentSnapshot;
  }
  
  /*
+  * GetTransactionSnapshot
+  *		Get the appropriate snapshot for a new query in a transaction.
+  *
+  * This is the public interface for anything different than the snapshot
+  * synchronization feature.
+  *
+  * Note that the return value may point at static storage that will be modified
+  * by future calls and by CommandCounterIncrement().  Callers should call
+  * RegisterSnapshot or PushActiveSnapshot on the returned snap if it is to be
+  * used very long.
+  */
+ Snapshot
+ GetTransactionSnapshot(void)
+ {
+ 	return GetTransactionSnapshotFromTemplate(InvalidSnapshot);
+ }
+ 
+ 
+ /*
   * GetLatestSnapshot
   *		Get a snapshot that is up-to-date as of the current instant,
   *		even if we are executing in transaction-snapshot mode.
*************** GetLatestSnapshot(void)
*** 171,177 ****
  	if (!FirstSnapshotSet)
  		elog(ERROR, "no snapshot has been set");
  
! 	SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData);
  
  	return SecondarySnapshot;
  }
--- 212,218 ----
  	if (!FirstSnapshotSet)
  		elog(ERROR, "no snapshot has been set");
  
! 	SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData, InvalidSnapshot);
  
  	return SecondarySnapshot;
  }
*************** SnapshotSetCommandId(CommandId curcid)
*** 193,235 ****
  }
  
  /*
!  * CopySnapshot
!  *		Copy the given snapshot.
   *
!  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
!  * to 0.  The returned snapshot has the copied flag set.
   */
! static Snapshot
! CopySnapshot(Snapshot snapshot)
  {
- 	Snapshot	newsnap;
  	Size		subxipoff;
- 	Size		size;
- 
- 	Assert(snapshot != InvalidSnapshot);
  
! 	/* We allocate any XID arrays needed in the same palloc block. */
! 	size = subxipoff = sizeof(SnapshotData) +
! 		snapshot->xcnt * sizeof(TransactionId);
! 	if (snapshot->subxcnt > 0)
! 		size += snapshot->subxcnt * sizeof(TransactionId);
! 
! 	newsnap = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
! 	memcpy(newsnap, snapshot, sizeof(SnapshotData));
  
! 	newsnap->regd_count = 0;
! 	newsnap->active_count = 0;
! 	newsnap->copied = true;
  
  	/* setup XID array */
  	if (snapshot->xcnt > 0)
  	{
! 		newsnap->xip = (TransactionId *) (newsnap + 1);
! 		memcpy(newsnap->xip, snapshot->xip,
  			   snapshot->xcnt * sizeof(TransactionId));
  	}
  	else
! 		newsnap->xip = NULL;
  
  	/*
  	 * Setup subXID array. Don't bother to copy it if it had overflowed,
--- 234,266 ----
  }
  
  /*
!  * CopySnapshotOnto
!  *      Copy the given snapshot onto an already sufficiently allocated other
!  *      snapshot.
   *
!  * Return the modified snapshot (onto).
   */
! Snapshot
! CopySnapshotOnto(Snapshot snapshot, Snapshot onto)
  {
  	Size		subxipoff;
  
! 	subxipoff = sizeof(SnapshotData) + snapshot->xcnt * sizeof(TransactionId);
! 	memcpy(onto, snapshot, sizeof(SnapshotData));
  
! 	onto->regd_count = 0;
! 	onto->active_count = 0;
! 	onto->copied = true;
  
  	/* setup XID array */
  	if (snapshot->xcnt > 0)
  	{
! 		onto->xip = (TransactionId *) (onto + 1);
! 		memcpy(onto->xip, snapshot->xip,
  			   snapshot->xcnt * sizeof(TransactionId));
  	}
  	else
! 		onto->xip = NULL;
  
  	/*
  	 * Setup subXID array. Don't bother to copy it if it had overflowed,
*************** CopySnapshot(Snapshot snapshot)
*** 240,253 ****
  	if (snapshot->subxcnt > 0 &&
  		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
  	{
! 		newsnap->subxip = (TransactionId *) ((char *) newsnap + subxipoff);
! 		memcpy(newsnap->subxip, snapshot->subxip,
  			   snapshot->subxcnt * sizeof(TransactionId));
  	}
  	else
! 		newsnap->subxip = NULL;
  
! 	return newsnap;
  }
  
  /*
--- 271,309 ----
  	if (snapshot->subxcnt > 0 &&
  		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
  	{
! 		onto->subxip = (TransactionId *) ((char *) onto + subxipoff);
! 		memcpy(onto->subxip, snapshot->subxip,
  			   snapshot->subxcnt * sizeof(TransactionId));
  	}
  	else
! 		onto->subxip = NULL;
  
! 	return onto;
! }
! 
! /*
!  * CopySnapshot
!  *		Copy the given snapshot.
!  *
!  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
!  * to 0.  The returned snapshot has the copied flag set.
!  */
! static Snapshot
! CopySnapshot(Snapshot snapshot)
! {
! 	Snapshot	newsnap;
! 	Size		size;
! 
! 	Assert(snapshot != InvalidSnapshot);
! 
! 	/* We allocate any XID arrays needed in the same palloc block. */
! 	size = sizeof(SnapshotData) + snapshot->xcnt * sizeof(TransactionId);
! 	if (snapshot->subxcnt > 0)
! 		size += snapshot->subxcnt * sizeof(TransactionId);
! 
! 	newsnap = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
! 
! 	return CopySnapshotOnto(snapshot, newsnap);
  }
  
  /*
*************** AtEOXact_Snapshot(bool isCommit)
*** 577,579 ****
--- 633,1021 ----
  	FirstSnapshotSet = false;
  	registered_xact_snapshot = false;
  }
+ 
+ /*
+  * PreCommit_Snapshot
+  *		Cleans up exported snapshots (this needs to happen before we update
+  *		our MyProc entry, hence it is in PreCommit).
+  */
+ void
+ PreCommit_Snapshot(void)
+ {
+ 	ListCell   *snapshot;
+ 	int			i;
+ 	char		buf[MAXPGPATH];
+ 
+ 	if (exportedSnapshots == NIL)
+ 		return;
+ 
+ 	Assert(list_length(exportedSnapshots) > 0);
+ 	Assert(TransactionIdIsValid(GetTopTransactionIdIfAny()));
+ 
+ 	for(i = 1; i <= list_length(exportedSnapshots); i++)
+ 	{
+ 		XactExportFilePath(buf, GetTopTransactionId(), i, "");
+ 		unlink(buf);
+ 	}
+ 
+ 	foreach(snapshot, exportedSnapshots)
+ 		UnregisterSnapshotFromOwner(lfirst(snapshot), TopTransactionResourceOwner);
+ 
+ 	exportedSnapshots = NIL;
+ }
+ 
+ /*
+  * DeleteAllExportedSnapshotFiles
+  *		Cleans up any files that have been left behind by a crashed backend
+  *		that had exported snapshots before it died.
+  */
+ void
+ DeleteAllExportedSnapshotFiles(void)
+ {
+ 	char		buf[MAXPGPATH];
+ 	DIR		   *s_dir;
+ 	struct dirent *s_de;
+ 
+ 	if (!(s_dir = AllocateDir(SNAPSHOT_EXPORT_DIR)))
+ 	{
+ 		/*
+ 		 * We really should have that directory in a sane cluster setup. But
+ 		 * then again if we don't it's not fatal enough to make it FATAL.
+ 		 */
+ 		elog(WARNING,
+ 			"could not open directory \"%s\": %m",
+ 			SNAPSHOT_EXPORT_DIR);
+ 		return;
+ 	}
+ 
+ 	while ((s_de = ReadDir(s_dir, SNAPSHOT_EXPORT_DIR)) != NULL)
+ 	{
+ 		if (strcmp(s_de->d_name, ".") == 0 ||
+ 			strcmp(s_de->d_name, "..") == 0)
+ 			continue;
+ 
+ 		snprintf(buf, MAXPGPATH, SNAPSHOT_EXPORT_DIR "/%s", s_de->d_name);
+ 		unlink(buf);
+ 	}
+ 	FreeDir(s_dir);
+ }
+ 
+ /*
+  * ExportSnapshot
+  *		Export the snapshot to a file so that other backends can import the same
+  *		snapshot.
+  *		Returns the token (the file name) that can be used to import this
+  *		snapshot.
+  */
+ static char *
+ ExportSnapshot(Snapshot snapshot)
+ {
+ #define SNAPSHOT_APPEND(x, y) (appendStringInfo(&buf, (x), (y)))
+ 	TransactionId *children, topXid;
+ 	FILE	   *f;
+ 	int			i;
+ 	int			nchildren;
+ 	MemoryContext oldcxt;
+ 	char		path[MAXPGPATH];
+ 	char		pathtmp[MAXPGPATH];
+ 	StringInfoData buf;
+ 
+ 	Assert(IsTransactionState());
+ 
+ 	/*
+ 	 * This will also assign a transaction id if we do not yet have one.
+ 	 */
+ 	topXid = GetTopTransactionId();
+ 
+ 	Assert(TransactionIdIsValid(GetTopTransactionIdIfAny()));
+ 
+ 	/*
+ 	 * We cannot export a snapshot from a subtransaction because in a
+ 	 * subtransaction we don't see our open subxip values in the snapshot so
+ 	 * they would be missing in the backend applying it.
+ 	 */
+ 	if (GetCurrentTransactionNestLevel() != 1)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+ 				 errmsg("cannot export a snapshot from a subtransaction")));
+ 
+ 	/*
+ 	 * We do however see our already committed subxip values and add them to
+ 	 * the subxip array.
+ 	 */
+ 	nchildren = xactGetCommittedChildren(&children);
+ 
+ 	initStringInfo(&buf);
+ 
+ 	/* Write up all the data that we return */
+ 	SNAPSHOT_APPEND("xid:%d ", topXid);
+ 	SNAPSHOT_APPEND("xmi:%d ", snapshot->xmin);
+ 	SNAPSHOT_APPEND("xma:%d ", snapshot->xmax);
+ 	/* Include our own transaction ID into the count. */
+ 	SNAPSHOT_APPEND("xcnt:%d ", snapshot->xcnt + 1);
+ 	for (i = 0; i < snapshot->xcnt; i++)
+ 		SNAPSHOT_APPEND("xip:%d ", snapshot->xip[i]);
+ 	/*
+ 	 * Finally add our own XID, since by definition we will still be running
+ 	 * when the other transaction takes over the snapshot.
+ 	 */
+ 	SNAPSHOT_APPEND("xip:%d ", topXid);
+ 	if (snapshot->suboverflowed || snapshot->subxcnt + nchildren > TOTAL_MAX_CACHED_SUBXIDS)
+ 		SNAPSHOT_APPEND("sof:%d ", 1);
+ 	else
+ 	{
+ 		SNAPSHOT_APPEND("sxcnt:%d ", snapshot->subxcnt + nchildren);
+ 		for (i = 0; i < snapshot->subxcnt; i++)
+ 			SNAPSHOT_APPEND("sxp:%d ", snapshot->subxip[i]);
+ 		/* Add already committed subtransactions. */
+ 		for (i = 0; i < nchildren; i++)
+ 			SNAPSHOT_APPEND("sxp:%d ", children[i]);
+ 	}
+ 
+ 	/*
+ 	 * buf ends with a trailing space but we leave it in for simplicity. The
+ 	 * parsing routines also depend on it.
+ 	 */
+ 
+ 	/* Register the snapshot and add it to the list of exported snapshots */
+ 	snapshot = RegisterSnapshotOnOwner(snapshot, TopTransactionResourceOwner);
+ 
+ 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+ 	exportedSnapshots = lappend(exportedSnapshots, snapshot);
+ 	MemoryContextSwitchTo(oldcxt);
+ 
+ 	XactExportFilePath(pathtmp, topXid, list_length(exportedSnapshots), ".tmp");
+ 	if (!(f = AllocateFile(pathtmp, PG_BINARY_W)))
+ 		ereport(ERROR,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not create file \"%s\": %m", pathtmp)));
+ 
+ 	if (fwrite(buf.data, buf.len, 1, f) != 1)
+ 		/* Aborting the transaction will also call FreeFile() */
+ 		ereport(ERROR,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not write to file \"%s\": %m", pathtmp)));
+ 
+ 	if (FreeFile(f))
+ 		ereport(ERROR,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not write to file \"%s\": %m", pathtmp)));
+ 
+ 	/*
+ 	 * Now that we have written everything into a .tmp file we rename the file
+ 	 * and remove the .tmp suffix. Our filename is predictable and we're
+ 	 * paranoid enough to not let us read a partially written file (we can't
+ 	 * read a .tmp file because this would fail the valid characters check in
+ 	 * ImportSnapshot).
+ 	 */
+ 	XactExportFilePath(path, topXid, list_length(exportedSnapshots), "");
+ 
+ 	if (rename(pathtmp, path) < 0)
+ 		ereport(WARNING,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
+ 						pathtmp, path)));
+ 
+ 	/*
+ 	 * The basename of the file is what we return from pg_export_snapshot().
+ 	 * It's already in path in a textual format and we know that the path
+ 	 * starts with SNAPSHOT_EXPORT_DIR. Skip over the prefix and over the
+ 	 * slash and pstrdup it to not return a local variable.
+ 	 */
+ 	return pstrdup(path + strlen(SNAPSHOT_EXPORT_DIR) + 1);
+ #undef SNAPSHOT_APPEND
+ }
+ 
+ /*
+  * Poor man's type independent parser. We only use it in the three functions
+  * below so there's no need to get ambitious about putting extra (x) around the
+  * arguments.
+  */
+ #define SNAPSHOT_PARSE(valFunc, inFunc, type, strpp, prfx, notfound)		\
+ 	do {																	\
+ 		char	   *n, *p = strstr(*strpp, prfx);							\
+ 		type		v;														\
+ 																			\
+ 		if (!p)																\
+ 			return notfound;												\
+ 		p += strlen(prfx);													\
+ 		n = strchr(p, ' ');													\
+ 		if (!n)																\
+ 			return notfound;												\
+ 		*n = '\0';															\
+ 		v = valFunc(DirectFunctionCall1(inFunc, CStringGetDatum(p)));		\
+ 		*strpp = n + 1;														\
+ 		return v;															\
+ 	} while (0);
+ 
+ static int
+ parseIntFromText(char **s, const char *prefix)
+ {
+ 	SNAPSHOT_PARSE(DatumGetInt32, int4in, int, s, prefix, 0);
+ }
+ 
+ static bool
+ parseBoolFromText(char **s, const char *prefix)
+ {
+ 	SNAPSHOT_PARSE(DatumGetInt32, int4in, bool, s, prefix, false);
+ }
+ 
+ static TransactionId
+ parseXactFromText(char **s, const char *prefix)
+ {
+ 	SNAPSHOT_PARSE(DatumGetTransactionId, xidin, TransactionId,
+ 				   s, prefix, InvalidTransactionId);
+ }
+ 
+ #undef SNAPSHOT_PARSE
+ 
+ /*
+  * ImportSnapshot
+  *      Import a previously exported snapshot. We expect that whatever we get
+  *      is a filename in SNAPSHOT_EXPORT_DIR. Load the snapshot from that file.
+  *      This is called from "START TRANSACTION (SNAPSHOT = 'foo')" so we always
+  *      start fresh from zero with respect to the transaction state that we
+  *      work on.  Returns true on success and false on failure.
+  */
+ bool
+ ImportSnapshot(char *idstr)
+ {
+ 	char		path[MAXPGPATH];
+ 	FILE	   *f;
+ 	int			i;
+ 	char	   *s;
+ 	struct stat	stat_buf;
+ 	int			sxcnt, xcnt;
+ 	TransactionId xid, origXid, myXid;
+ 	SnapshotData snapshot = {HeapTupleSatisfiesMVCC};
+ 
+ 	/*
+ 	 * If we were in read committed mode then the next query would execute with a
+ 	 * new snapshot thus making this function call quite useless.
+ 	 */
+ 	if (!IsolationUsesXactSnapshot())
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 			 errmsg("A snapshot importing transaction must have ISOLATION "
+ 					"LEVEL SERIALIZABLE or ISOLATION LEVEL REPEATABLE READ")));
+ 
+ 	/* We're lucky to always start off from a pretty clean state */
+ 	Assert(IsTransactionState());
+ 	Assert(GetCurrentTransactionNestLevel() == 1);
+ 	Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+ 	Assert(CurrentSnapshot == NULL);
+ 	Assert(SecondarySnapshot == NULL);
+ 
+ 	/* verify the identifier, only 0-9,A-F and a hyphen are allowed... */
+ 	s = idstr;
+ 	while (*s)
+ 	{
+ 		if (!isdigit(*s) && !(*s >= 'A' && *s <= 'F') && *s != '-')
+ 			return false;
+ 		s++;
+ 	}
+ 
+ 	/*
+ 	 * Assign a transaction id. We only do this to detect a possible
+ 	 * transaction id wraparound which is somewhere between unlikely
+ 	 * and impossible...
+ 	 */
+ 	myXid = GetTopTransactionId();
+ 
+ 	snprintf(path, MAXPGPATH, SNAPSHOT_EXPORT_DIR "/%s", idstr);
+ 
+ 	/* get the size of the file so that we know how much memory we need */
+ 	if (stat(path, &stat_buf) != 0)
+ 		return false;
+ 
+ 	if (!(f = AllocateFile(path, PG_BINARY_R)))
+ 		return false;
+ 
+ 	s = palloc(stat_buf.st_size + 1);
+ 	if (fread(s, stat_buf.st_size, 1, f) != 1)
+ 		return false;
+ 
+ 	s[stat_buf.st_size] = '\0';
+ 
+ 	FreeFile(f);
+ 
+ 	origXid = parseXactFromText(&s, "xid:");
+ 
+ 	snapshot.xmin = parseXactFromText(&s, "xmi:");
+ 	Assert(snapshot.xmin != InvalidTransactionId);
+ 	snapshot.xmax = parseXactFromText(&s, "xma:");
+ 	Assert(snapshot.xmax != InvalidTransactionId);
+ 
+ 	xcnt = parseIntFromText(&s, "xcnt:");
+ 	/*
+ 	 * This snapshot only serves as a template, there is no need for it to have
+ 	 * maxProcs entries, so let's make it just as large as we need it.
+ 	 */
+ 	snapshot.xip = palloc(xcnt * sizeof(TransactionId));
+ 
+ 	i = 0;
+ 	while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
+ 		snapshot.xip[i++] = xid;
+ 	snapshot.xcnt = i;
+ 	Assert(snapshot.xcnt == xcnt);
+ 
+ 	/*
+ 	 * We only write "sof:1" if the snapshot overflowed. If not, then there is
+ 	 * no "sof:x" entry at all and parseBoolFromText() will return false.
+ 	 */
+ 	snapshot.suboverflowed = parseBoolFromText(&s, "sof:");
+ 
+ 	if (!snapshot.suboverflowed)
+ 	{
+ 		sxcnt = parseIntFromText(&s, "sxcnt:");
+ 		snapshot.subxip = palloc(sxcnt * sizeof(TransactionId));
+ 
+ 		i = 0;
+ 		while ((xid = parseXactFromText(&s, "sxp:")) != InvalidTransactionId)
+ 			snapshot.subxip[i++] = xid;
+ 		snapshot.subxcnt = i;
+ 		Assert(snapshot.subxcnt == sxcnt);
+ 	} else {
+ 		snapshot.subxip = NULL;
+ 		snapshot.subxcnt = 0;
+ 	}
+ 
+ 	/* complete the snapshot data structure */
+ 	snapshot.curcid = 0;
+ 	snapshot.takenDuringRecovery = RecoveryInProgress();
+ 
+ 	/*
+ 	 * Note that MyProc->xmin can go backwards here. However this is safe
+ 	 * because the xmin we set here is the same as in the backend's proc->xmin
+ 	 * whose snapshot we are copying. At this very moment, anybody computing a
+ 	 * minimum will calculate at least this xmin as the overall xmin with or
+ 	 * without us setting MyProc->xmin to this value.
+ 	 */
+ 	LWLockAcquire(ProcArrayLock, LW_SHARED);
+ 	MyProc->xmin = snapshot.xmin;
+ 	LWLockRelease(ProcArrayLock);
+ 
+ 	/* bail out if the original transaction is not running anymore... */
+ 	if (!TransactionIdIsInProgress(origXid) || TransactionIdPrecedes(myXid, origXid))
+ 		return false;
+ 
+ 	/*
+ 	 * Install the snapshot as if we got it through GetTransactionSnapshot().
+ 	 * This will set up CurrentSnapshot and also set up the predicate locks for a
+ 	 * serializable transaction.
+ 	 */
+ 	GetTransactionSnapshotFromTemplate(&snapshot);
+ 	return true;
+ }
+ 
+ Datum
+ pg_export_snapshot(PG_FUNCTION_ARGS)
+ {
+ 	char	   *snapshotData;
+ 
+ 	RequireTransactionChain(true, "pg_export_snapshot()");
+ 
+ 	snapshotData = ExportSnapshot(GetTransactionSnapshot());
+ 	PG_RETURN_TEXT_P(cstring_to_text(snapshotData));
+ }
+ 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 399c734..c1b80da 100644
*** a/src/bin/initdb/initdb.c
--- b/src/bin/initdb/initdb.c
*************** main(int argc, char *argv[])
*** 2598,2603 ****
--- 2598,2604 ----
  		"pg_serial",
  		"pg_subtrans",
  		"pg_twophase",
+ 		"pg_snapshots",
  		"pg_multixact/members",
  		"pg_multixact/offsets",
  		"base",
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 799bf8b..bd2ae80 100644
*** a/src/include/access/twophase.h
--- b/src/include/access/twophase.h
***************
*** 25,33 ****
   */
  typedef struct GlobalTransactionData *GlobalTransaction;
  
- /* GUC variable */
- extern int	max_prepared_xacts;
- 
  extern Size TwoPhaseShmemSize(void);
  extern void TwoPhaseShmemInit(void);
  
--- 25,30 ----
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 96f43fe..a4e0387 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 2171 ( pg_cancel_backe
*** 2853,2858 ****
--- 2853,2860 ----
  DESCR("cancel a server process' current query");
  DATA(insert OID = 2096 ( pg_terminate_backend		PGNSP PGUID 12 1 0 0 0 f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
  DESCR("terminate a server process");
+ DATA(insert OID = 3122 ( pg_export_snapshot		PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 25 "" _null_ _null_ _null_ _null_ pg_export_snapshot _null_ _null_ _null_ ));
+ DESCR("export a snapshot");
  DATA(insert OID = 2172 ( pg_start_backup		PGNSP PGUID 12 1 0 0 0 f f f t f v 2 0 25 "25 16" _null_ _null_ _null_ _null_ pg_start_backup _null_ _null_ _null_ ));
  DESCR("prepare for taking an online backup");
  DATA(insert OID = 2173 ( pg_stop_backup			PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 25 "" _null_ _null_ _null_ _null_ pg_stop_backup _null_ _null_ _null_ ));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 9d19417..15326cf 100644
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
*************** extern PGDLLIMPORT int NBuffers;
*** 134,139 ****
--- 134,142 ----
  extern int	MaxBackends;
  extern int	MaxConnections;
  
+ /* GUC variable */
+ extern int	max_prepared_xacts;
+ 
  extern PGDLLIMPORT int MyProcPid;
  extern PGDLLIMPORT pg_time_t MyStartTime;
  extern PGDLLIMPORT struct Port *MyProcPort;
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 5ddbc1d..f4f0303 100644
*** a/src/include/storage/predicate.h
--- b/src/include/storage/predicate.h
*************** extern void CheckPointPredicate(void);
*** 42,48 ****
  extern bool PageIsPredicateLocked(Relation relation, BlockNumber blkno);
  
  /* predicate lock maintenance */
! extern Snapshot RegisterSerializableTransaction(Snapshot snapshot);
  extern void RegisterPredicateLockingXid(TransactionId xid);
  extern void PredicateLockRelation(Relation relation, Snapshot snapshot);
  extern void PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot);
--- 42,48 ----
  extern bool PageIsPredicateLocked(Relation relation, BlockNumber blkno);
  
  /* predicate lock maintenance */
! extern Snapshot RegisterSerializableTransaction(Snapshot snapshot, Snapshot stemplate);
  extern void RegisterPredicateLockingXid(TransactionId xid);
  extern void PredicateLockRelation(Relation relation, Snapshot snapshot);
  extern void PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot);
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 3c20fc4..a2440e4 100644
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
*************** extern void ExpireOldKnownAssignedTransa
*** 41,47 ****
  
  extern RunningTransactions GetRunningTransactionData(void);
  
! extern Snapshot GetSnapshotData(Snapshot snapshot);
  
  extern bool TransactionIdIsInProgress(TransactionId xid);
  extern bool TransactionIdIsActive(TransactionId xid);
--- 41,47 ----
  
  extern RunningTransactions GetRunningTransactionData(void);
  
! extern Snapshot GetSnapshotData(Snapshot snapshot, Snapshot stemplate);
  
  extern bool TransactionIdIsInProgress(TransactionId xid);
  extern bool TransactionIdIsActive(TransactionId xid);
*************** extern void XidCacheRemoveRunningXids(Tr
*** 71,74 ****
--- 71,80 ----
  						  int nxids, const TransactionId *xids,
  						  TransactionId latestXid);
  
+ /* Size of the ProcArray structure itself */
+ #define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
+ 
+ #define TOTAL_MAX_CACHED_SUBXIDS \
+ 	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
+ 
  #endif   /* PROCARRAY_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index a7e7d3d..47f62ab 100644
*** a/src/include/utils/snapmgr.h
--- b/src/include/utils/snapmgr.h
*************** extern TransactionId TransactionXmin;
*** 23,28 ****
--- 23,30 ----
  extern TransactionId RecentXmin;
  extern TransactionId RecentGlobalXmin;
  
+ extern List *exportedSnapshots;
+ 
  extern Snapshot GetTransactionSnapshot(void);
  extern Snapshot GetLatestSnapshot(void);
  extern void SnapshotSetCommandId(CommandId curcid);
*************** extern void UpdateActiveSnapshotCommandI
*** 33,47 ****
--- 35,55 ----
  extern void PopActiveSnapshot(void);
  extern Snapshot GetActiveSnapshot(void);
  extern bool ActiveSnapshotSet(void);
+ extern Snapshot CopySnapshotOnto(Snapshot onto, Snapshot snapshot);
  
  extern Snapshot RegisterSnapshot(Snapshot snapshot);
  extern void UnregisterSnapshot(Snapshot snapshot);
  extern Snapshot RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner);
  extern void UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner);
  
+ extern void PreCommit_Snapshot(void);
  extern void AtSubCommit_Snapshot(int level);
  extern void AtSubAbort_Snapshot(int level);
  extern void AtEarlyCommit_Snapshot(void);
  extern void AtEOXact_Snapshot(bool isCommit);
  
+ extern Datum pg_export_snapshot(PG_FUNCTION_ARGS);
+ extern bool ImportSnapshot(char *idstr);
+ extern void DeleteAllExportedSnapshotFiles(void);
+ 
  #endif   /* SNAPMGR_H */

Simon Riggs

simon@2ndQuadrant.com

over 14 years ago

In reply to: Joachim Wieland (#1)

Re: synchronized snapshots

On Mon, Aug 15, 2011 at 2:31 AM, Joachim Wieland <joe@mcknight.de> wrote:

In short, this is how it works:

SELECT pg_export_snapshot();
pg_export_snapshot
--------------------
000003A1-1
(1 row)

(and then in a different session)

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '000003A1-1');

I don't see the need to change the BEGIN command, which is SQL
Standard. We don't normally do that.

If we have pg_export_snapshot() why not pg_import_snapshot() as well?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 14 years ago

In reply to: Joachim Wieland (#1)

Re: synchronized snapshots

On 15.08.2011 04:31, Joachim Wieland wrote:

The one thing that it does not implement is leaving the transaction in
an aborted state if the BEGIN TRANSACTION command failed for an
invalid snapshot identifier.

So what if the snapshot is invalid, the SNAPSHOT clause silently
ignored? That sounds really bad.

I can certainly see that this would be
useful but I am not sure if it justifies introducing this
inconsistency. We would have a BEGIN TRANSACTION command that left the
session in a different state depending on why it failed...

I don't understand what inconsistency you're talking about. What else
can cause BEGIN TRANSACTION to fail? Is there currently any failure mode
that doesn't leave the transaction in aborted state?

I am wondering if pg_export_snapshot() is still the right name, since
the snapshot is no longer exported to the user. It is exported to a
file but that's an implementation detail.

It's still exporting the snapshot to other sessions, that name still
seems appropriate to me.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 14 years ago

In reply to: Simon Riggs (#2)

Re: synchronized snapshots

On 15.08.2011 10:40, Simon Riggs wrote:

On Mon, Aug 15, 2011 at 2:31 AM, Joachim Wieland<joe@mcknight.de> wrote:

In short, this is how it works:

SELECT pg_export_snapshot();
pg_export_snapshot
--------------------
000003A1-1
(1 row)

(and then in a different session)

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '000003A1-1');

I don't see the need to change the BEGIN command, which is SQL
Standard. We don't normally do that.

If we have pg_export_snapshot() why not pg_import_snapshot() as well?

It would be nice a symmetry, but you'd need a limitation that
pg_import_snapshot() must be the first thing you do in the session. And
it might be hard to enforce that, as once you get control into the
function, you've already acquired another snapshot in the transaction to
run the "SELECT pg_import_snapshot()" query with. Specifying the
snapshot in the BEGIN command makes sense.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Andres Freund

andres@anarazel.de

over 14 years ago

In reply to: Simon Riggs (#2)

Re: synchronized snapshots

On Monday, August 15, 2011 08:40:34 Simon Riggs wrote:

On Mon, Aug 15, 2011 at 2:31 AM, Joachim Wieland <joe@mcknight.de> wrote:

In short, this is how it works:

SELECT pg_export_snapshot();
pg_export_snapshot
--------------------
000003A1-1
(1 row)

(and then in a different session)

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT =
'000003A1-1');

I don't see the need to change the BEGIN command, which is SQL
Standard. We don't normally do that.

Uhm. There already are several extensions to begin transaction. Like the just
added "DEFERRABLE".

If we have pg_export_snapshot() why not pg_import_snapshot() as well?

Using BEGIN has the advantage of making it explicit that it cannot be used
inside an existing transaction. Which I do find advantageous.

Andres

PostgreSQL - Hans-Jürgen Schönig

postgres@cybertec.at

over 14 years ago

In reply to: Simon Riggs (#2)

Re: synchronized snapshots

On Aug 15, 2011, at 9:40 AM, Simon Riggs wrote:

On Mon, Aug 15, 2011 at 2:31 AM, Joachim Wieland <joe@mcknight.de> wrote:

In short, this is how it works:

SELECT pg_export_snapshot();
pg_export_snapshot
--------------------
000003A1-1
(1 row)

(and then in a different session)

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '000003A1-1');

I don't see the need to change the BEGIN command, which is SQL
Standard. We don't normally do that.

If we have pg_export_snapshot() why not pg_import_snapshot() as well?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

i would definitely argue for a syntax like the one proposed by Joachim.. i could stay the same if this is turned into some sort of flashback implementation some day.

regards,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de

Florian Weimer

fweimer@bfk.de

over 14 years ago

In reply to: Simon Riggs (#2)

Re: synchronized snapshots

* Simon Riggs:

I don't see the need to change the BEGIN command, which is SQL
Standard. We don't normally do that.

Some language bindings treat BEGIN specially, so it might be difficult
to use this feature.

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

Joachim Wieland

joe@mcknight.de

over 14 years ago

In reply to: Heikki Linnakangas (#3)

Re: synchronized snapshots

On Mon, Aug 15, 2011 at 3:47 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 15.08.2011 04:31, Joachim Wieland wrote:

The one thing that it does not implement is leaving the transaction in
an aborted state if the BEGIN TRANSACTION command failed for an
invalid snapshot identifier.

So what if the snapshot is invalid, the SNAPSHOT clause silently ignored?
That sounds really bad.

No, the command would fail, but since it fails, it doesn't change the
transaction state.

What was proposed originally was to start a transaction but throw an
error that leaves the transaction in the aborted state. But then the
command had some effect because it started a transaction block, even
though it failed.

I can certainly see that this would be
useful but I am not sure if it justifies introducing this
inconsistency. We would have a BEGIN TRANSACTION command that left the
session in a different state depending on why it failed...

I don't understand what inconsistency you're talking about. What else can
cause BEGIN TRANSACTION to fail? Is there currently any failure mode that
doesn't leave the transaction in aborted state?

Granted, it might only fail for parse errors so far, but that would
include for example sending BEGIN DEFERRABLE to a pre-9.1 server. It
wouldn't start a transaction and leave it in an aborted state, but it
would just fail.

I am wondering if pg_export_snapshot() is still the right name, since
the snapshot is no longer exported to the user. It is exported to a
file but that's an implementation detail.

It's still exporting the snapshot to other sessions, that name still seems
appropriate to me.

ok.

Joachim

Joachim Wieland

joe@mcknight.de

over 14 years ago

In reply to: Florian Weimer (#7)

Re: synchronized snapshots

On Mon, Aug 15, 2011 at 6:41 AM, Florian Weimer <fweimer@bfk.de> wrote:

* Simon Riggs:

I don't see the need to change the BEGIN command, which is SQL
Standard. We don't normally do that.

Some language bindings treat BEGIN specially, so it might be difficult
to use this feature.

It's true, the command might require explicit support from language
bindings. However I used some perl test scripts, where you can also
send a START TRANSACTION command in an $dbh->do(...).

The intended use case of this feature is still pg_dump btw...

Joachim

#10

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Heikki Linnakangas (#4)

Re: synchronized snapshots

On Mon, Aug 15, 2011 at 3:51 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

It would be nice a symmetry, but you'd need a limitation that
pg_import_snapshot() must be the first thing you do in the session. And it
might be hard to enforce that, as once you get control into the function,
you've already acquired another snapshot in the transaction to run the
"SELECT pg_import_snapshot()" query with. Specifying the snapshot in the
BEGIN command makes sense.

+1. Also, I am pretty sure that there are drivers out there, and
connection poolers, that keep track of the transaction state by
watching commands go by. Right now you can tell by the first word of
the command whether it's something that might change the transaction
state; I wouldn't like to make that harder.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11

Kevin Grittner

Kevin.Grittner@wicourts.gov

over 14 years ago

In reply to: Simon Riggs (#2)

Re: synchronized snapshots

Simon Riggs <simon@2ndQuadrant.com> wrote:

Joachim Wieland <joe@mcknight.de> wrote:

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT =
'000003A1-1');

I don't see the need to change the BEGIN command, which is SQL
Standard.

No, it's not standard.

To quote from our docs at:

http://www.postgresql.org/docs/9.0/interactive/sql-begin.html#AEN58214

| BEGIN is a PostgreSQL language extension. It is equivalent to the
| SQL-standard command START TRANSACTION, whose reference page
| contains additional compatibility information.
|
| Incidentally, the BEGIN key word is used for a different purpose
| in embedded SQL. You are advised to be careful about the
| transaction semantics when porting database applications.

In checking the most recent standards draft I have available, it
appears that besides embedded SQL, this keyword is also used in the
standard trigger declaration syntax. Using BEGIN to start a
transaction is a PostgreSQL extension to the standard. That said,
if we support a feature on the nonstandard BEGIN statement, we
typically add it as an extension to the standard START TRANSACTION
and SET TRANSACTION statements. Through 9.0 that consisted of
having a non-standard default for isolation level and the ability to
omit commas required by the standard. In 9.1 we added another
optional transaction property which defaults to standard behavior:
DEFERRABLE.

If we're talking about a property of a transaction, like the
transaction snapshot, it seems to me to be best to support it using
the same statements we use for other transaction properties.

-Kevin

#12

Jim Nasby

jim@nasby.net

over 14 years ago

In reply to: Joachim Wieland (#8)

Re: synchronized snapshots

On Aug 15, 2011, at 6:23 AM, Joachim Wieland wrote:

On Mon, Aug 15, 2011 at 3:47 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 15.08.2011 04:31, Joachim Wieland wrote:

The one thing that it does not implement is leaving the transaction in
an aborted state if the BEGIN TRANSACTION command failed for an
invalid snapshot identifier.

So what if the snapshot is invalid, the SNAPSHOT clause silently ignored?
That sounds really bad.

No, the command would fail, but since it fails, it doesn't change the
transaction state.

What was proposed originally was to start a transaction but throw an
error that leaves the transaction in the aborted state. But then the
command had some effect because it started a transaction block, even
though it failed.

It certainly seems safer to me to set the transaction to an aborted state; you were expecting a set of commands to run with one snapshot, but if we don't abort the transaction they'll end up running anyway and doing so with the *wrong* snapshot. That could certainly lead to data corruption.

I suspect that all the other cases of BEGIN failing would be syntax errors, so you would immediately know in testing that something was wrong. A missing file is definitely not a syntax error, so we can't really depend on user testing to ensure this is handled correctly. IMO, that makes it critical that that error puts us in an aborted transaction.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#13

Joachim Wieland

joe@mcknight.de

over 14 years ago

In reply to: Jim Nasby (#12)

Re: synchronized snapshots

On Mon, Aug 15, 2011 at 6:09 PM, Jim Nasby <jim@nasby.net> wrote:

I suspect that all the other cases of BEGIN failing would be syntax errors, so
you would immediately know in testing that something was wrong. A missing file
is definitely not a syntax error, so we can't really depend on user testing to ensure
this is handled correctly. IMO, that makes it critical that that error puts us in an
aborted transaction.

Why can we not just require the user to verify if his BEGIN query
failed or succeeded?
Is that really too much to ask for?

Also see what Robert wrote about proxies in between that keep track of
the transaction
state. Consider they see a BEGIN query that fails. How would they know
if the session
is now in an aborted transaction or not in a transaction at all?

Joachim

#14

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Joachim Wieland (#13)

Re: synchronized snapshots

On Mon, Aug 15, 2011 at 6:46 PM, Joachim Wieland <joe@mcknight.de> wrote:

On Mon, Aug 15, 2011 at 6:09 PM, Jim Nasby <jim@nasby.net> wrote:

I suspect that all the other cases of BEGIN failing would be syntax errors, so
you would immediately know in testing that something was wrong. A missing file
is definitely not a syntax error, so we can't really depend on user testing to ensure
this is handled correctly. IMO, that makes it critical that that error puts us in an
aborted transaction.

Why can we not just require the user to verify if his BEGIN query
failed or succeeded?
Is that really too much to ask for?

Also see what Robert wrote about proxies in between that keep track of
the transaction
state. Consider they see a BEGIN query that fails. How would they know
if the session
is now in an aborted transaction or not in a transaction at all?

I think the point here is that we should be consistent. Currently,
you can make BEGIN fail by doing it on the standby, and asking for
READ WRITE mode:

rhaas=# begin transaction read write;
ERROR: cannot set transaction read-write mode during recovery

After doing that, you are NOT in a transaction context:

rhaas=# select 1;
?column?
----------
1
(1 row)

So whatever this does should be consistent with that, at least IMHO.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15

Alvaro Herrera

alvherre@commandprompt.com

over 14 years ago

In reply to: Robert Haas (#14)

Re: synchronized snapshots

Excerpts from Robert Haas's message of mar ago 16 09:59:04 -0400 2011:

On Mon, Aug 15, 2011 at 6:46 PM, Joachim Wieland <joe@mcknight.de> wrote:

Also see what Robert wrote about proxies in between that keep track
of the transaction state. Consider they see a BEGIN query that
fails. How would they know if the session is now in an aborted
transaction or not in a transaction at all?

I think the point here is that we should be consistent. Currently,
you can make BEGIN fail by doing it on the standby, and asking for
READ WRITE mode:

rhaas=# begin transaction read write;
ERROR: cannot set transaction read-write mode during recovery

After doing that, you are NOT in a transaction context:

rhaas=# select 1;
?column?
----------
1
(1 row)

So whatever this does should be consistent with that, at least IMHO.

I think we argued about a very similar problem years ago and the outcome
was that you should be left in an aborted transaction block; otherwise
running a dumb SQL script (which has no way to "abort if it fails")
could wreak serious havoc (?). I think this failure to behave in that
fashion on the standby is something to be fixed, not imitated.

What this says is that a driver or app seeing BEGIN fail should issue
ROLLBACK before going further -- which seems the intuitive way to behave
to me. No?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#16

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Alvaro Herrera (#15)

Re: synchronized snapshots

On Tue, Aug 16, 2011 at 10:43 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

Excerpts from Robert Haas's message of mar ago 16 09:59:04 -0400 2011:

On Mon, Aug 15, 2011 at 6:46 PM, Joachim Wieland <joe@mcknight.de> wrote:

Also see what Robert wrote about proxies in between that keep track
of the transaction state. Consider they see a BEGIN query that
fails. How would they know if the session is now in an aborted
transaction or not in a transaction at all?

I think the point here is that we should be consistent. Currently,
you can make BEGIN fail by doing it on the standby, and asking for
READ WRITE mode:

rhaas=# begin transaction read write;
ERROR: cannot set transaction read-write mode during recovery

After doing that, you are NOT in a transaction context:

rhaas=# select 1;
?column?
----------
1
(1 row)

So whatever this does should be consistent with that, at least IMHO.

I think we argued about a very similar problem years ago and the outcome
was that you should be left in an aborted transaction block; otherwise
running a dumb SQL script (which has no way to "abort if it fails")
could wreak serious havoc (?). I think this failure to behave in that
fashion on the standby is something to be fixed, not imitated.

What this says is that a driver or app seeing BEGIN fail should issue
ROLLBACK before going further -- which seems the intuitive way to behave
to me. No?

Maybe. But if we're going to change the behavior of BEGIN, then (1)
we need to think about backward compatibility and (2) we should change
it across the board. It's not for this patch to go invent something
that's inconsistent with what we're already doing elsewhere.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17

Jim Nasby

jim@nasby.net

over 14 years ago

In reply to: Joachim Wieland (#13)

Re: synchronized snapshots

On Aug 15, 2011, at 5:46 PM, Joachim Wieland wrote:

On Mon, Aug 15, 2011 at 6:09 PM, Jim Nasby <jim@nasby.net> wrote:

I suspect that all the other cases of BEGIN failing would be syntax errors, so
you would immediately know in testing that something was wrong. A missing file
is definitely not a syntax error, so we can't really depend on user testing to ensure
this is handled correctly. IMO, that makes it critical that that error puts us in an
aborted transaction.

Why can we not just require the user to verify if his BEGIN query
failed or succeeded?
Is that really too much to ask for?

It's something else that you have to remember to get right. psql, for example, will blindly continue on unless you remembered to tell it to exit on an error.

Also, an invalid transaction seems to be the result of least surprise... if you cared enough to begin a transaction, you're going to expect that either everything between that and the COMMIT succeeds or fails, not something in-between.

Also see what Robert wrote about proxies in between that keep track of
the transaction
state. Consider they see a BEGIN query that fails. How would they know
if the session
is now in an aborted transaction or not in a transaction at all?

AFAIK a proxy can tell if a transaction is in progress or not via libpq. Worst-case, it just needs to send an extra ROLLBACK.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#18

Jeff Davis

pgsql@j-davis.com

over 14 years ago

In reply to: Jim Nasby (#17)

Re: synchronized snapshots

On Tue, 2011-08-16 at 11:01 -0500, Jim Nasby wrote:

Also, an invalid transaction seems to be the result of least
surprise... if you cared enough to begin a transaction, you're going
to expect that either everything between that and the COMMIT succeeds
or fails, not something in-between.

Agreed.

Perhaps we need a new utility command to set the snapshot to make the
error handling a little more obvious?

Regards,
Jeff Davis

#19

Jim Nasby

jim@nasby.net

over 14 years ago

In reply to: Jeff Davis (#18)

Re: synchronized snapshots

On Aug 16, 2011, at 5:40 PM, Jeff Davis wrote:

On Tue, 2011-08-16 at 11:01 -0500, Jim Nasby wrote:

Also, an invalid transaction seems to be the result of least
surprise... if you cared enough to begin a transaction, you're going
to expect that either everything between that and the COMMIT succeeds
or fails, not something in-between.

Agreed.

Perhaps we need a new utility command to set the snapshot to make the
error handling a little more obvious?

Well, it appears we have a larger problem, as Robert pointed out that trying to start a writable transaction on a hot standby leaves you not in a transaction (which I feel is a problem).

So IMHO the right thing to do here is make it so that runtime errors in BEGIN leave you in an invalid transaction. Then we can decide on the API for synchronized snapshots that makes sense instead of working around the behavior of BEGIN.

I guess the big question to answer now is: what's the backwards compatibility impact of changing how BEGIN deals with runtime errors?
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#20

Tom Lane

tgl@sss.pgh.pa.us

over 14 years ago

In reply to: Jim Nasby (#19)

Re: synchronized snapshots

Jim Nasby <jim@nasby.net> writes:

Well, it appears we have a larger problem, as Robert pointed out that trying to start a writable transaction on a hot standby leaves you not in a transaction (which I feel is a problem).

So IMHO the right thing to do here is make it so that runtime errors in BEGIN leave you in an invalid transaction. Then we can decide on the API for synchronized snapshots that makes sense instead of working around the behavior of BEGIN.

I'm not convinced by the above argument, because it requires that
you pretend there's a significant difference between syntax errors and
"run time" errors (whatever those are). Syntax errors in a BEGIN
command are not going to leave you in an aborted transaction, because
the backend is not going to recognize the command as a BEGIN at all.
This means that frontends *must* be capable of dealing with the case
that a failed BEGIN didn't start a transaction. (Either that, or
they just assume their commands are always syntactically perfect,
which seems like pretty fragile programming to me; and the more weird
nonstandard options we load onto BEGIN, the less tenable the position
becomes. For example, if you feed BEGIN option-foo to a server that's
a bit older than you thought it was, you will get a syntax error.)
If we have some failure cases that start a transaction and some that do
not, we just have a mess, IMO.

I think we'd be far better off to maintain the position that a failed
BEGIN does not start a transaction, under any circumstances. To do
that, we cannot have this new option attached to the BEGIN, which is a
good thing anyway IMO from a standards compatibility point of view.
It'd be better to make it a separate utility statement. There is no
logical problem in doing that, and we already have a precedent for
utility statements that do something special before the transaction
snapshot is taken: see LOCK.

In fact, now that I think about it, setting the transaction snapshot
from a utility statement would be functionally useful because then you
could take locks beforehand.

And as a bonus, we don't have a backwards compatibility problem to solve.

regards, tom lane

#21

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Tom Lane (#20)

Re: synchronized snapshots

On Tue, Aug 16, 2011 at 8:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jim Nasby <jim@nasby.net> writes:

Well, it appears we have a larger problem, as Robert pointed out that trying to start a writable transaction on a hot standby leaves you not in a transaction (which I feel is a problem).

So IMHO the right thing to do here is make it so that runtime errors in BEGIN leave you in an invalid transaction. Then we can decide on the API for synchronized snapshots that makes sense instead of working around the behavior of BEGIN.

I'm not convinced by the above argument, because it requires that
you pretend there's a significant difference between syntax errors and
"run time" errors (whatever those are). Syntax errors in a BEGIN
command are not going to leave you in an aborted transaction, because
the backend is not going to recognize the command as a BEGIN at all.
This means that frontends *must* be capable of dealing with the case
that a failed BEGIN didn't start a transaction. (Either that, or
they just assume their commands are always syntactically perfect,
which seems like pretty fragile programming to me; and the more weird
nonstandard options we load onto BEGIN, the less tenable the position
becomes. For example, if you feed BEGIN option-foo to a server that's
a bit older than you thought it was, you will get a syntax error.)
If we have some failure cases that start a transaction and some that do
not, we just have a mess, IMO.

More or less agreed.

I think we'd be far better off to maintain the position that a failed
BEGIN does not start a transaction, under any circumstances.

Also agreed.

To do
that, we cannot have this new option attached to the BEGIN, ...

Eh, why not?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22

Tom Lane

tgl@sss.pgh.pa.us

over 14 years ago

In reply to: Robert Haas (#21)

Re: synchronized snapshots

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Aug 16, 2011 at 8:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think we'd be far better off to maintain the position that a failed
BEGIN does not start a transaction, under any circumstances.

Also agreed.

To do
that, we cannot have this new option attached to the BEGIN, ...

Eh, why not?

Maybe I wasn't paying close enough attention to the thread, but I had
the idea that there was some implementation reason why not. If not,
we could still load the option onto BEGIN ... but I still find myself
liking the idea of a separate command better, because of the locking
issue.

regards, tom lane

#23

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Tom Lane (#22)

Re: synchronized snapshots

On Tue, Aug 16, 2011 at 8:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Aug 16, 2011 at 8:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think we'd be far better off to maintain the position that a failed
BEGIN does not start a transaction, under any circumstances.

Also agreed.

To do
that, we cannot have this new option attached to the BEGIN, ...

Eh, why not?

Maybe I wasn't paying close enough attention to the thread, but I had
the idea that there was some implementation reason why not. If not,
we could still load the option onto BEGIN ... but I still find myself
liking the idea of a separate command better, because of the locking
issue.

Why does it matter whether you take the locks before or after the snapshot?

If you're concerned with minimizing the race, what you should do is
take all relevant locks in the parent before exporting the snapshot.

I am not wild about adding another toplevel command for this. It
seems a rather narrow use case, and attaching it to BEGIN feels
natural to me. There may be some small benefit also in terms of
minimizing the amount of sanity checking that must be done - for
example, at BEGIN time, you don't have to check for the case where a
snapshot has already been set.

If we did add another toplevel command, what would we call it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#24

Jeff Davis

pgsql@j-davis.com

over 14 years ago

In reply to: Tom Lane (#20)

Re: synchronized snapshots

On Tue, 2011-08-16 at 20:35 -0400, Tom Lane wrote:

I'm not convinced by the above argument, because it requires that
you pretend there's a significant difference between syntax errors and
"run time" errors (whatever those are).

After a syntax error like "COMMMIT" the transaction will remain inside
the failed transaction block, but an error during COMMIT (e.g. deferred
constraint check failure) will exit the transaction block.

I think we'd be far better off to maintain the position that a failed
BEGIN does not start a transaction, under any circumstances. To do
that, we cannot have this new option attached to the BEGIN, which is a
good thing anyway IMO from a standards compatibility point of view.
It'd be better to make it a separate utility statement.

+1 for a utility statement. Much clearer from the user's standpoint what
kind of errors they might expect, and whether the session will remain in
a transaction block.

Regards,
Jeff Davis

#25

Jeff Davis

pgsql@j-davis.com

over 14 years ago

In reply to: Robert Haas (#23)

Re: synchronized snapshots

On Tue, 2011-08-16 at 21:08 -0400, Robert Haas wrote:

attaching it to BEGIN feels natural to me.

My only objection is that people have different expectations about
whether the session will remain in a transaction block when they
encounter an error. So, it's hard to make this work without surprising
about half the users.

And there are some fairly significant consequences to users who guess
that they will remain in a transaction block in case of an error; or who
are just careless and don't consider that an error may occur.

If we did add another toplevel command, what would we call it?

"SET TRANSACTION SNAPSHOT" perhaps?

Regards,
Jeff Davis

#26

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Jeff Davis (#25)

Re: synchronized snapshots

On Tue, Aug 16, 2011 at 9:54 PM, Jeff Davis <pgsql@j-davis.com> wrote:

If we did add another toplevel command, what would we call it?

"SET TRANSACTION SNAPSHOT" perhaps?

Hmm, that's not bad, but I think we'd have to partially reserve
TRANSACTION to make it work, which doesn't seem worth the pain it
would cause.

We could do something like TRANSACTION SNAPSHOT IS 'xyz', but that's a
bit awkard.

I still like BEGIN SNAPSHOT 'xyz' -- or even adding a generic options
list like we use for some other commands, i.e. BEGIN (snapshot 'xyz'),
which would leave the door open to arbitrary amounts of future
nonstandard fiddling without the need for any more keywords. I
understand the point about the results of a BEGIN failure leaving you
outside a transaction, but that really only matters if you're doing
"psql < dumb_script.sql", which is impractical for this feature
anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#27

Peter Eisentraut

peter_e@gmx.net

over 14 years ago

In reply to: Tom Lane (#20)

Re: synchronized snapshots

On tis, 2011-08-16 at 20:35 -0400, Tom Lane wrote:

In fact, now that I think about it, setting the transaction snapshot
from a utility statement would be functionally useful because then you
could take locks beforehand.

Another issue is that in some client interfaces, BEGIN and COMMIT are
hidden behind API calls, which cannot easily be changed or equipped with
new parameters. So in order to have this functionality available
through those interfaces, we'd need a separately callable command.

#28

Bruce Momjian

bruce@momjian.us

over 14 years ago

In reply to: Peter Eisentraut (#27)

Re: synchronized snapshots

Peter Eisentraut wrote:

On tis, 2011-08-16 at 20:35 -0400, Tom Lane wrote:

In fact, now that I think about it, setting the transaction snapshot
from a utility statement would be functionally useful because then you
could take locks beforehand.

Another issue is that in some client interfaces, BEGIN and COMMIT are
hidden behind API calls, which cannot easily be changed or equipped with
new parameters. So in order to have this functionality available
through those interfaces, we'd need a separately callable command.

How do they set a transaction to SERIALIZABLE? Seem the same syntax
should be used here.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#29

Peter Eisentraut

peter_e@gmx.net

over 14 years ago

In reply to: Bruce Momjian (#28)

Re: synchronized snapshots

On lör, 2011-08-20 at 09:56 -0400, Bruce Momjian wrote:

Peter Eisentraut wrote:

On tis, 2011-08-16 at 20:35 -0400, Tom Lane wrote:

In fact, now that I think about it, setting the transaction snapshot
from a utility statement would be functionally useful because then you
could take locks beforehand.

Another issue is that in some client interfaces, BEGIN and COMMIT are
hidden behind API calls, which cannot easily be changed or equipped with
new parameters. So in order to have this functionality available
through those interfaces, we'd need a separately callable command.

How do they set a transaction to SERIALIZABLE? Seem the same syntax
should be used here.

The API typically has parameters to set the isolation level, since
that's a standardized property.

#30

Joachim Wieland

joe@mcknight.de

over 14 years ago

In reply to: Joachim Wieland (#1)

2 attachment(s)

Re: synchronized snapshots

On Sun, Aug 14, 2011 at 9:31 PM, Joachim Wieland <joe@mcknight.de> wrote:

This is a patch to implement synchronized snapshots. It is based on
Alvaro's specifications in:

http://archives.postgresql.org/pgsql-hackers/2011-02/msg02074.php

Here is a new version of this patch, what has changed is that the
snapshot is now imported via:

BEGIN
[... set serializable or read committed on the BEGIN or via SET TRANSACTION ...]
SET TRANSACTION SNAPSHOT '00000801-1'

This way any failure importing the snapshot leaves the transaction in
the aborted state.

I am also attaching a small perl script that demonstrates a
serialization failure with an imported snapshot.

This is the link to the previous patch:
http://archives.postgresql.org/pgsql-hackers/2011-08/msg00684.php

Joachim

Attachments:

syncSnapshots.2.difftext/x-patch; charset=US-ASCII; name=syncSnapshots.2.diffDownload

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0b6a109..ce8653e 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*************** FOR EACH ROW EXECUTE PROCEDURE suppress_
*** 15023,15026 ****
--- 15023,15100 ----
          <xref linkend="SQL-CREATETRIGGER">.
      </para>
    </sect1>
+ 
+   <sect1 id="functions-snapshotsync">
+    <title>Snapshot Synchronization Functions</title>
+ 
+    <indexterm>
+      <primary>pg_export_snapshot</primary>
+    </indexterm>
+ 
+    <para>
+      <productname>PostgreSQL</> allows different sessions to synchronize their
+      snapshots. A database snapshot determines which data is visible to
+      the client that is using this snapshot. Synchronized snapshots are necessary when
+      two clients need to see the same content in the database. If these two clients
+      just connected to the database and opened their transactions, then they could
+      never be sure that there was no data modification right between both
+      connections.
+    </para>
+    <para>
+      As a solution, <productname>PostgreSQL</> offers the function
+      <function>pg_export_snapshot</> which saves the snapshot internally and
+      from then on until the end of the saving transaction, the snapshot can be
+      copied into a a new transaction with the <xref linkend="sql-set-transaction">
+      command to open a second transaction with the
+      exact same snapshot. Now both transactions are guaranteed to see the exact same
+      data even though they might have connected at different times.
+    </para>
+    <para>
+      Note that a snapshot can only be used to start a new transaction as long
+      as the transaction that originally saved it is held open. Also note that even
+      after the synchronization both clients still run their own independent
+      transactions. As a consequence, even though synchronized with respect to
+      reading pre-existing data, both transactions won't be able to see each other's
+      uncommitted data.
+    </para>
+    <table id="functions-snapshot-synchronization">
+     <title>Snapshot Synchronization Functions</title>
+     <tgroup cols="3">
+      <thead>
+       <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+       </row>
+      </thead>
+ 
+      <tbody>
+       <row>
+        <entry>
+         <literal><function>pg_export_snapshot()</function></literal>
+        </entry>
+        <entry><type>text</type></entry>
+        <entry>Save the snapshot and return its identifier</entry>
+       </row>
+      </tbody>
+     </tgroup>
+    </table>
+ 
+    <para>
+       The function <function>pg_export_snapshot</> does not take an argument
+       and returns the snapshot's identifier as <type>text</type> data. Internally the
+       function will save the snapshot data to a file so that it can be retrieved
+       from a different backend process later on. Note that as soon as the
+       transaction ends, any saved snapshots become invalid and their
+       identifiers cannot be used in other transactions anymore. If the function
+       has been executed, the transaction cannot be prepared anymore with <xref
+       linkend="sql-prepare-transaction">.
+    </para>
+ <programlisting>
+ SELECT pg_export_snapshot();
+ 
+  pg_export_snapshot
+ --------------------
+  000003A1-1
+ (1 row)
+ </programlisting>
+   </sect1>
  </chapter>
+ 
diff --git a/doc/src/sgml/ref/set_transaction.sgml b/doc/src/sgml/ref/set_transaction.sgml
index e28a7e1..059b94b 100644
*** a/doc/src/sgml/ref/set_transaction.sgml
--- b/doc/src/sgml/ref/set_transaction.sgml
*************** SET SESSION CHARACTERISTICS AS TRANSACTI
*** 40,45 ****
--- 40,46 ----
      ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
      READ WRITE | READ ONLY
      [ NOT ] DEFERRABLE
+     SNAPSHOT <replaceable class="parameter">snapshot-id</replaceable>
  </synopsis>
   </refsynopsisdiv>
  
*************** SET SESSION CHARACTERISTICS AS TRANSACTI
*** 146,151 ****
--- 147,167 ----
     contributing to or being canceled by a serialization failure.  This mode
     is well suited for long-running reports or backups.
    </para>
+ 
+   <para>
+    The <literal>SNAPSHOT</literal> property allows a new transaction to run
+    with the same snapshot as an already running transaction.  A call to
+    <literal>pg_export_snapshot</literal> (see <xref
+    linkend="functions-snapshotsync">) must have been executed in the other
+    transaction. This function returns a snapshot id which must be passed as a
+    parameter to this command to create a second transaction running with the same
+    snapshot. You also need to make the transaction <literal>ISOLATION LEVEL
+    SERIALIZABLE</literal> or <literal>ISOLATION LEVEL REPEATABLE READ</literal>.
+    If the transaction has already executed a query, started a subtransaction or
+    assigned a snapshot, no further snapshot assignment is possible in this
+    transaction.
+   </para>
+ 
   </refsect1>
  
   <refsect1>
*************** SET SESSION CHARACTERISTICS AS TRANSACTI
*** 178,183 ****
--- 194,225 ----
    </para>
   </refsect1>
  
+  <refsect1>
+   <title>Examples</title>
+ 
+   <para>
+    To begin a new transaction block with the same snapshot as an already
+    existing transaction, first export the snapshot from the existing
+    transaction. This will return the snapshot id:
+ 
+ <programlisting>
+ # SELECT pg_export_snapshot();
+  pg_export_snapshot
+ --------------------
+  000003A1-1
+ (1 row)
+ </programlisting>
+ 
+    Then reference this snapshot id as the first command in the newly opened
+    transaction:
+ 
+ <programlisting>
+ BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ SET TRANSACTION SNAPSHOT '000003A1-1';
+ </programlisting>
+   </para>
+  </refsect1>
+ 
   <refsect1 id="R1-SQL-SET-TRANSACTION-3">
    <title>Compatibility</title>
  
*************** SET SESSION CHARACTERISTICS AS TRANSACTI
*** 198,206 ****
    </para>
  
    <para>
!    The <literal>DEFERRABLE</literal>
     <replaceable class="parameter">transaction_mode</replaceable>
!    is a <productname>PostgreSQL</productname> language extension.
    </para>
  
    <para>
--- 240,250 ----
    </para>
  
    <para>
!    <literal>DEFERRABLE</literal>
     <replaceable class="parameter">transaction_mode</replaceable>
!    and <literal>SNAPSHOT</literal> <replaceable
!    class="parameter">snapshot-id</replaceable> are
!    <productname>PostgreSQL</productname> language extensions.
    </para>
  
    <para>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index de3f965..8b58cb5 100644
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*************** CommitTransaction(void)
*** 1852,1857 ****
--- 1852,1863 ----
  	 */
  	PreCommit_Notify();
  
+ 	/*
+ 	 * Cleans up exported snapshots (this needs to happen before we update
+ 	 * our MyProc entry).
+ 	 */
+ 	PreCommit_Snapshot();
+ 
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
  
*************** PrepareTransaction(void)
*** 2070,2075 ****
--- 2076,2086 ----
  				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  				 errmsg("cannot PREPARE a transaction that has operated on temporary tables")));
  
+ 	if (exportedSnapshots)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				 errmsg("cannot PREPARE a transaction that has exported snapshots")));
+ 
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
  
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f65ddc..331a564 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 58,63 ****
--- 58,64 ----
  #include "utils/guc.h"
  #include "utils/ps_status.h"
  #include "utils/relmapper.h"
+ #include "utils/snapmgr.h"
  #include "utils/timestamp.h"
  #include "pg_trace.h"
  
*************** StartupXLOG(void)
*** 6373,6378 ****
--- 6374,6384 ----
  		CheckRequiredParameterValues();
  
  		/*
+ 		 * We can delete any saved transaction snapshots that still exist
+ 		 */
+ 		DeleteAllExportedSnapshotFiles();
+ 
+ 		/*
  		 * We're in recovery, so unlogged relations relations may be trashed
  		 * and must be reset.  This should be done BEFORE allowing Hot Standby
  		 * connections, so that read-only backends don't try to read whatever
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index e9f3896..e1ab161 100644
*** a/src/backend/parser/gram.y
--- b/src/backend/parser/gram.y
*************** static void processCASbits(int cas_bits,
*** 553,560 ****
  
  	SAVEPOINT SCHEMA SCROLL SEARCH SECOND_P SECURITY SELECT SEQUENCE SEQUENCES
  	SERIALIZABLE SERVER SESSION SESSION_USER SET SETOF SHARE
! 	SHOW SIMILAR SIMPLE SMALLINT SOME STABLE STANDALONE_P START STATEMENT
! 	STATISTICS STDIN STDOUT STORAGE STRICT_P STRIP_P SUBSTRING
  	SYMMETRIC SYSID SYSTEM_P
  
  	TABLE TABLES TABLESPACE TEMP TEMPLATE TEMPORARY TEXT_P THEN TIME TIMESTAMP
--- 553,560 ----
  
  	SAVEPOINT SCHEMA SCROLL SEARCH SECOND_P SECURITY SELECT SEQUENCE SEQUENCES
  	SERIALIZABLE SERVER SESSION SESSION_USER SET SETOF SHARE
! 	SHOW SIMILAR SIMPLE SMALLINT SNAPSHOT SOME STABLE STANDALONE_P START
! 	STATEMENT STATISTICS STDIN STDOUT STORAGE STRICT_P STRIP_P SUBSTRING
  	SYMMETRIC SYSID SYSTEM_P
  
  	TABLE TABLES TABLESPACE TEMP TEMPLATE TEMPORARY TEXT_P THEN TIME TIMESTAMP
*************** set_rest:	/* Generic SET syntaxes: */
*** 1286,1291 ****
--- 1286,1299 ----
  					n->args = $2;
  					$$ = n;
  				}
+ 			| TRANSACTION SNAPSHOT Sconst
+ 				{
+ 					VariableSetStmt *n = makeNode(VariableSetStmt);
+ 					n->kind = VAR_SET_VALUE;
+ 					n->name = "TRANSACTION SNAPSHOT";
+ 					n->args = list_make1(makeStringConst($3, @3));
+ 					$$ = n;
+ 				}
  			| SESSION CHARACTERISTICS AS TRANSACTION transaction_mode_list
  				{
  					VariableSetStmt *n = makeNode(VariableSetStmt);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 9489012..b665d75 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
*************** ProcArrayShmemSize(void)
*** 166,174 ****
  {
  	Size		size;
  
- 	/* Size of the ProcArray structure itself */
- #define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
- 
  	size = offsetof(ProcArrayStruct, procs);
  	size = add_size(size, mul_size(sizeof(PGPROC *), PROCARRAY_MAXPROCS));
  
--- 166,171 ----
*************** ProcArrayShmemSize(void)
*** 179,192 ****
  	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
  	 * main structures created in those functions must be identically sized,
  	 * since we may at times copy the whole of the data structures around. We
! 	 * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
  	 *
  	 * Ideally we'd only create this structure if we were actually doing hot
  	 * standby in the current run, but we don't know that yet at the time
  	 * shared memory is being set up.
  	 */
- #define TOTAL_MAX_CACHED_SUBXIDS \
- 	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
  
  	if (EnableHotStandby)
  	{
--- 176,187 ----
  	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
  	 * main structures created in those functions must be identically sized,
  	 * since we may at times copy the whole of the data structures around. We
! 	 * refer to this size as TOTAL_MAX_CACHED_SUBXIDS, defined in procarray.h.
  	 *
  	 * Ideally we'd only create this structure if we were actually doing hot
  	 * standby in the current run, but we don't know that yet at the time
  	 * shared memory is being set up.
  	 */
  
  	if (EnableHotStandby)
  	{
*************** GetOldestXmin(bool allDbs, bool ignoreVa
*** 1144,1150 ****
   * not statically allocated (see xip allocation below).
   */
  Snapshot
! GetSnapshotData(Snapshot snapshot)
  {
  	ProcArrayStruct *arrayP = procArray;
  	TransactionId xmin;
--- 1139,1145 ----
   * not statically allocated (see xip allocation below).
   */
  Snapshot
! GetSnapshotData(Snapshot snapshot, Snapshot stemplate)
  {
  	ProcArrayStruct *arrayP = procArray;
  	TransactionId xmin;
*************** GetSnapshotData(Snapshot snapshot)
*** 1158,1163 ****
--- 1153,1181 ----
  	Assert(snapshot != NULL);
  
  	/*
+ 	 * We only get a valid snapshot in stemplate if the snapshot
+ 	 * synchronization feature used. In that case we just need to copy the
+ 	 * values that we get onto the snapshot we return.
+ 	 * Note that in this case we always duplicate an existing snapshot, that is
+ 	 * currently held by another active transaction. That's why we do not need
+ 	 * to update any { RecentGlobalXmin, RecentXmin, globalxmin }.
+ 	 */
+ 	if (stemplate != InvalidSnapshot)
+ 	{
+ 		/*
+ 		 * 'stemplate' is only read and its values are copied onto 'snapshot'.
+ 		 */
+ 		CopySnapshotOnto(stemplate, snapshot);
+ 
+ 		/*
+ 		 * We can use the result of the copy except for that this snapshot
+ 		 * should look like new and not copied.
+ 		 */
+ 		snapshot->copied = false;
+ 		return snapshot;
+ 	}
+ 
+ 	/*
  	 * Allocating space for maxProcs xids is usually overkill; numProcs would
  	 * be sufficient.  But it seems better to do the malloc while not holding
  	 * the lock, so we can't look at numProcs.  Likewise, we allocate much
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 1754180..6df30d8 100644
*** a/src/backend/storage/lmgr/predicate.c
--- b/src/backend/storage/lmgr/predicate.c
*************** static void OldSerXidSetActiveSerXmin(Tr
*** 416,423 ****
  
  static uint32 predicatelock_hash(const void *key, Size keysize);
  static void SummarizeOldestCommittedSxact(void);
! static Snapshot GetSafeSnapshot(Snapshot snapshot);
! static Snapshot RegisterSerializableTransactionInt(Snapshot snapshot);
  static bool PredicateLockExists(const PREDICATELOCKTARGETTAG *targettag);
  static bool GetParentPredicateLockTag(const PREDICATELOCKTARGETTAG *tag,
  						  PREDICATELOCKTARGETTAG *parent);
--- 416,423 ----
  
  static uint32 predicatelock_hash(const void *key, Size keysize);
  static void SummarizeOldestCommittedSxact(void);
! static Snapshot GetSafeSnapshot(Snapshot snapshot, Snapshot stemplate);
! static Snapshot RegisterSerializableTransactionInt(Snapshot snapshot, Snapshot stemplate);
  static bool PredicateLockExists(const PREDICATELOCKTARGETTAG *targettag);
  static bool GetParentPredicateLockTag(const PREDICATELOCKTARGETTAG *tag,
  						  PREDICATELOCKTARGETTAG *parent);
*************** SummarizeOldestCommittedSxact(void)
*** 1487,1493 ****
   *		one of them could possibly create a conflict.
   */
  static Snapshot
! GetSafeSnapshot(Snapshot origSnapshot)
  {
  	Snapshot	snapshot;
  
--- 1487,1493 ----
   *		one of them could possibly create a conflict.
   */
  static Snapshot
! GetSafeSnapshot(Snapshot origSnapshot, Snapshot stemplate)
  {
  	Snapshot	snapshot;
  
*************** GetSafeSnapshot(Snapshot origSnapshot)
*** 1501,1507 ****
  		 * caller passed to us. It returns a copy of that snapshot and
  		 * registers it on TopTransactionResourceOwner.
  		 */
! 		snapshot = RegisterSerializableTransactionInt(origSnapshot);
  
  		if (MySerializableXact == InvalidSerializableXact)
  			return snapshot;	/* no concurrent r/w xacts; it's safe */
--- 1501,1507 ----
  		 * caller passed to us. It returns a copy of that snapshot and
  		 * registers it on TopTransactionResourceOwner.
  		 */
! 		snapshot = RegisterSerializableTransactionInt(origSnapshot, stemplate);
  
  		if (MySerializableXact == InvalidSerializableXact)
  			return snapshot;	/* no concurrent r/w xacts; it's safe */
*************** GetSafeSnapshot(Snapshot origSnapshot)
*** 1554,1560 ****
   * It should be current for this process and be contained in PredXact.
   */
  Snapshot
! RegisterSerializableTransaction(Snapshot snapshot)
  {
  	Assert(IsolationIsSerializable());
  
--- 1554,1560 ----
   * It should be current for this process and be contained in PredXact.
   */
  Snapshot
! RegisterSerializableTransaction(Snapshot snapshot, Snapshot stemplate)
  {
  	Assert(IsolationIsSerializable());
  
*************** RegisterSerializableTransaction(Snapshot
*** 1564,1576 ****
  	 * thereby avoid all SSI overhead once it's running..
  	 */
  	if (XactReadOnly && XactDeferrable)
! 		return GetSafeSnapshot(snapshot);
  
! 	return RegisterSerializableTransactionInt(snapshot);
  }
  
  static Snapshot
! RegisterSerializableTransactionInt(Snapshot snapshot)
  {
  	PGPROC	   *proc;
  	VirtualTransactionId vxid;
--- 1564,1576 ----
  	 * thereby avoid all SSI overhead once it's running..
  	 */
  	if (XactReadOnly && XactDeferrable)
! 		return GetSafeSnapshot(snapshot, stemplate);
  
! 	return RegisterSerializableTransactionInt(snapshot, stemplate);
  }
  
  static Snapshot
! RegisterSerializableTransactionInt(Snapshot snapshot, Snapshot stemplate)
  {
  	PGPROC	   *proc;
  	VirtualTransactionId vxid;
*************** RegisterSerializableTransactionInt(Snaps
*** 1608,1614 ****
  	} while (!sxact);
  
  	/* Get and register a snapshot */
! 	snapshot = GetSnapshotData(snapshot);
  	snapshot = RegisterSnapshotOnOwner(snapshot, TopTransactionResourceOwner);
  
  	/*
--- 1608,1614 ----
  	} while (!sxact);
  
  	/* Get and register a snapshot */
! 	snapshot = GetSnapshotData(snapshot, stemplate);
  	snapshot = RegisterSnapshotOnOwner(snapshot, TopTransactionResourceOwner);
  
  	/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a71729c..9ffe3a1 100644
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 72,77 ****
--- 72,78 ----
  #include "utils/plancache.h"
  #include "utils/portal.h"
  #include "utils/ps_status.h"
+ #include "utils/snapmgr.h"
  #include "utils/tzparser.h"
  #include "utils/xml.h"
  
*************** ExecSetVariableStmt(VariableSetStmt *stm
*** 6094,6099 ****
--- 6095,6109 ----
  	switch (stmt->kind)
  	{
  		case VAR_SET_VALUE:
+ 			if (strcmp(stmt->name, "TRANSACTION SNAPSHOT") == 0)
+ 			{
+ 				if (!ImportSnapshot(ExtractSetVariableArgs(stmt)))
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 							 errmsg("could not import the requested snapshot")));
+ 				break;
+ 			}
+ 			/* fallthrough */
  		case VAR_SET_CURRENT:
  			set_config_option(stmt->name,
  							  ExtractSetVariableArgs(stmt),
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index bb25ac6..5d3f492 100644
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
***************
*** 25,36 ****
   */
  #include "postgres.h"
  
  #include "access/transam.h"
  #include "access/xact.h"
  #include "storage/predicate.h"
  #include "storage/proc.h"
  #include "storage/procarray.h"
! #include "utils/memutils.h"
  #include "utils/memutils.h"
  #include "utils/snapmgr.h"
  #include "utils/tqual.h"
--- 25,42 ----
   */
  #include "postgres.h"
  
+ #include <sys/types.h>
+ #include <sys/stat.h>
+ #include <unistd.h>
+ 
  #include "access/transam.h"
  #include "access/xact.h"
+ #include "miscadmin.h"
+ #include "storage/fd.h"
  #include "storage/predicate.h"
  #include "storage/proc.h"
  #include "storage/procarray.h"
! #include "utils/builtins.h"
  #include "utils/memutils.h"
  #include "utils/snapmgr.h"
  #include "utils/tqual.h"
*************** static Snapshot CopySnapshot(Snapshot sn
*** 108,125 ****
  static void FreeSnapshot(Snapshot snapshot);
  static void SnapshotResetXmin(void);
  
  
  /*
!  * GetTransactionSnapshot
   *		Get the appropriate snapshot for a new query in a transaction.
   *
!  * Note that the return value may point at static storage that will be modified
!  * by future calls and by CommandCounterIncrement().  Callers should call
!  * RegisterSnapshot or PushActiveSnapshot on the returned snap if it is to be
!  * used very long.
   */
! Snapshot
! GetTransactionSnapshot(void)
  {
  	/* First call in transaction? */
  	if (!FirstSnapshotSet)
--- 114,137 ----
  static void FreeSnapshot(Snapshot snapshot);
  static void SnapshotResetXmin(void);
  
+ /* What we need for exporting snapshots */
+ #define SNAPSHOT_EXPORT_DIR "pg_snapshots"
+ #define XactExportFilePath(path, xid, num, suffix) \
+     snprintf(path, MAXPGPATH, SNAPSHOT_EXPORT_DIR "/%08X-%d%s", xid, num, suffix)
+ 
+ List *exportedSnapshots = NIL;
  
  /*
!  * GetTransactionSnapshotFromTemplate
   *		Get the appropriate snapshot for a new query in a transaction.
   *
!  * A template snapshot is passed for the synchronized snapshots feature.
!  * In that case we want to have a snapshot back that has the template's
!  * values. We just pass it along and the lower level functions take care
!  * of it.
   */
! static Snapshot
! GetTransactionSnapshotFromTemplate(Snapshot stemplate)
  {
  	/* First call in transaction? */
  	if (!FirstSnapshotSet)
*************** GetTransactionSnapshot(void)
*** 134,150 ****
  		if (IsolationUsesXactSnapshot())
  		{
  			if (IsolationIsSerializable())
! 				CurrentSnapshot = RegisterSerializableTransaction(&CurrentSnapshotData);
  			else
  			{
! 				CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
  				CurrentSnapshot = RegisterSnapshotOnOwner(CurrentSnapshot,
  												TopTransactionResourceOwner);
  			}
  			registered_xact_snapshot = true;
  		}
  		else
! 			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
  
  		FirstSnapshotSet = true;
  		return CurrentSnapshot;
--- 146,170 ----
  		if (IsolationUsesXactSnapshot())
  		{
  			if (IsolationIsSerializable())
! 				CurrentSnapshot = RegisterSerializableTransaction(&CurrentSnapshotData,
! 																  stemplate);
  			else
  			{
! 				CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData, stemplate);
  				CurrentSnapshot = RegisterSnapshotOnOwner(CurrentSnapshot,
  												TopTransactionResourceOwner);
  			}
  			registered_xact_snapshot = true;
  		}
  		else
! 		{
! 			/*
! 			 * template is only used for the synchronized snapshot feature. Which in
! 			 * turn is only allowed for IsolationUsesXactSnapshot() == true transactions
! 			 */
! 			Assert(stemplate == InvalidSnapshot);
! 			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData, InvalidSnapshot);
! 		}
  
  		FirstSnapshotSet = true;
  		return CurrentSnapshot;
*************** GetTransactionSnapshot(void)
*** 153,164 ****
  	if (IsolationUsesXactSnapshot())
  		return CurrentSnapshot;
  
! 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
  
  	return CurrentSnapshot;
  }
  
  /*
   * GetLatestSnapshot
   *		Get a snapshot that is up-to-date as of the current instant,
   *		even if we are executing in transaction-snapshot mode.
--- 173,205 ----
  	if (IsolationUsesXactSnapshot())
  		return CurrentSnapshot;
  
! 	/* see comment above */
! 	Assert(stemplate == InvalidSnapshot);
! 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData, InvalidSnapshot);
  
  	return CurrentSnapshot;
  }
  
  /*
+  * GetTransactionSnapshot
+  *		Get the appropriate snapshot for a new query in a transaction.
+  *
+  * This is the public interface for anything different than the snapshot
+  * synchronization feature.
+  *
+  * Note that the return value may point at static storage that will be modified
+  * by future calls and by CommandCounterIncrement().  Callers should call
+  * RegisterSnapshot or PushActiveSnapshot on the returned snap if it is to be
+  * used very long.
+  */
+ Snapshot
+ GetTransactionSnapshot(void)
+ {
+ 	return GetTransactionSnapshotFromTemplate(InvalidSnapshot);
+ }
+ 
+ 
+ /*
   * GetLatestSnapshot
   *		Get a snapshot that is up-to-date as of the current instant,
   *		even if we are executing in transaction-snapshot mode.
*************** GetLatestSnapshot(void)
*** 170,176 ****
  	if (!FirstSnapshotSet)
  		elog(ERROR, "no snapshot has been set");
  
! 	SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData);
  
  	return SecondarySnapshot;
  }
--- 211,217 ----
  	if (!FirstSnapshotSet)
  		elog(ERROR, "no snapshot has been set");
  
! 	SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData, InvalidSnapshot);
  
  	return SecondarySnapshot;
  }
*************** SnapshotSetCommandId(CommandId curcid)
*** 192,234 ****
  }
  
  /*
!  * CopySnapshot
!  *		Copy the given snapshot.
   *
!  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
!  * to 0.  The returned snapshot has the copied flag set.
   */
! static Snapshot
! CopySnapshot(Snapshot snapshot)
  {
- 	Snapshot	newsnap;
  	Size		subxipoff;
- 	Size		size;
- 
- 	Assert(snapshot != InvalidSnapshot);
  
! 	/* We allocate any XID arrays needed in the same palloc block. */
! 	size = subxipoff = sizeof(SnapshotData) +
! 		snapshot->xcnt * sizeof(TransactionId);
! 	if (snapshot->subxcnt > 0)
! 		size += snapshot->subxcnt * sizeof(TransactionId);
! 
! 	newsnap = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
! 	memcpy(newsnap, snapshot, sizeof(SnapshotData));
  
! 	newsnap->regd_count = 0;
! 	newsnap->active_count = 0;
! 	newsnap->copied = true;
  
  	/* setup XID array */
  	if (snapshot->xcnt > 0)
  	{
! 		newsnap->xip = (TransactionId *) (newsnap + 1);
! 		memcpy(newsnap->xip, snapshot->xip,
  			   snapshot->xcnt * sizeof(TransactionId));
  	}
  	else
! 		newsnap->xip = NULL;
  
  	/*
  	 * Setup subXID array. Don't bother to copy it if it had overflowed,
--- 233,265 ----
  }
  
  /*
!  * CopySnapshotOnto
!  *      Copy the given snapshot onto an already sufficiently allocated other
!  *      snapshot.
   *
!  * Return the modified snapshot (onto).
   */
! Snapshot
! CopySnapshotOnto(Snapshot snapshot, Snapshot onto)
  {
  	Size		subxipoff;
  
! 	subxipoff = sizeof(SnapshotData) + snapshot->xcnt * sizeof(TransactionId);
! 	memcpy(onto, snapshot, sizeof(SnapshotData));
  
! 	onto->regd_count = 0;
! 	onto->active_count = 0;
! 	onto->copied = true;
  
  	/* setup XID array */
  	if (snapshot->xcnt > 0)
  	{
! 		onto->xip = (TransactionId *) (onto + 1);
! 		memcpy(onto->xip, snapshot->xip,
  			   snapshot->xcnt * sizeof(TransactionId));
  	}
  	else
! 		onto->xip = NULL;
  
  	/*
  	 * Setup subXID array. Don't bother to copy it if it had overflowed,
*************** CopySnapshot(Snapshot snapshot)
*** 239,252 ****
  	if (snapshot->subxcnt > 0 &&
  		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
  	{
! 		newsnap->subxip = (TransactionId *) ((char *) newsnap + subxipoff);
! 		memcpy(newsnap->subxip, snapshot->subxip,
  			   snapshot->subxcnt * sizeof(TransactionId));
  	}
  	else
! 		newsnap->subxip = NULL;
  
! 	return newsnap;
  }
  
  /*
--- 270,308 ----
  	if (snapshot->subxcnt > 0 &&
  		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
  	{
! 		onto->subxip = (TransactionId *) ((char *) onto + subxipoff);
! 		memcpy(onto->subxip, snapshot->subxip,
  			   snapshot->subxcnt * sizeof(TransactionId));
  	}
  	else
! 		onto->subxip = NULL;
  
! 	return onto;
! }
! 
! /*
!  * CopySnapshot
!  *		Copy the given snapshot.
!  *
!  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
!  * to 0.  The returned snapshot has the copied flag set.
!  */
! static Snapshot
! CopySnapshot(Snapshot snapshot)
! {
! 	Snapshot	newsnap;
! 	Size		size;
! 
! 	Assert(snapshot != InvalidSnapshot);
! 
! 	/* We allocate any XID arrays needed in the same palloc block. */
! 	size = sizeof(SnapshotData) + snapshot->xcnt * sizeof(TransactionId);
! 	if (snapshot->subxcnt > 0)
! 		size += snapshot->subxcnt * sizeof(TransactionId);
! 
! 	newsnap = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
! 
! 	return CopySnapshotOnto(snapshot, newsnap);
  }
  
  /*
*************** AtEOXact_Snapshot(bool isCommit)
*** 576,578 ****
--- 632,1026 ----
  	FirstSnapshotSet = false;
  	registered_xact_snapshot = false;
  }
+ 
+ /*
+  * PreCommit_Snapshot
+  *		Cleans up exported snapshots (this needs to happen before we update
+  *		our MyProc entry, hence it is in PreCommit).
+  */
+ void
+ PreCommit_Snapshot(void)
+ {
+ 	ListCell   *snapshot;
+ 	int			i;
+ 	char		buf[MAXPGPATH];
+ 
+ 	if (exportedSnapshots == NIL)
+ 		return;
+ 
+ 	Assert(list_length(exportedSnapshots) > 0);
+ 	Assert(TransactionIdIsValid(GetTopTransactionIdIfAny()));
+ 
+ 	for(i = 1; i <= list_length(exportedSnapshots); i++)
+ 	{
+ 		XactExportFilePath(buf, GetTopTransactionId(), i, "");
+ 		unlink(buf);
+ 	}
+ 
+ 	foreach(snapshot, exportedSnapshots)
+ 		UnregisterSnapshotFromOwner(lfirst(snapshot), TopTransactionResourceOwner);
+ 
+ 	exportedSnapshots = NIL;
+ }
+ 
+ /*
+  * DeleteAllExportedSnapshotFiles
+  *		Cleans up any files that have been left behind by a crashed backend
+  *		that had exported snapshots before it died.
+  */
+ void
+ DeleteAllExportedSnapshotFiles(void)
+ {
+ 	char		buf[MAXPGPATH];
+ 	DIR		   *s_dir;
+ 	struct dirent *s_de;
+ 
+ 	if (!(s_dir = AllocateDir(SNAPSHOT_EXPORT_DIR)))
+ 	{
+ 		/*
+ 		 * We really should have that directory in a sane cluster setup. But
+ 		 * then again if we don't it's not fatal enough to make it FATAL.
+ 		 */
+ 		elog(WARNING,
+ 			"could not open directory \"%s\": %m",
+ 			SNAPSHOT_EXPORT_DIR);
+ 		return;
+ 	}
+ 
+ 	while ((s_de = ReadDir(s_dir, SNAPSHOT_EXPORT_DIR)) != NULL)
+ 	{
+ 		if (strcmp(s_de->d_name, ".") == 0 ||
+ 			strcmp(s_de->d_name, "..") == 0)
+ 			continue;
+ 
+ 		snprintf(buf, MAXPGPATH, SNAPSHOT_EXPORT_DIR "/%s", s_de->d_name);
+ 		unlink(buf);
+ 	}
+ 	FreeDir(s_dir);
+ }
+ 
+ /*
+  * ExportSnapshot
+  *		Export the snapshot to a file so that other backends can import the same
+  *		snapshot.
+  *		Returns the token (the file name) that can be used to import this
+  *		snapshot.
+  */
+ static char *
+ ExportSnapshot(Snapshot snapshot)
+ {
+ #define SNAPSHOT_APPEND(x, y) (appendStringInfo(&buf, (x), (y)))
+ 	TransactionId *children, topXid;
+ 	FILE	   *f;
+ 	int			i;
+ 	int			nchildren;
+ 	MemoryContext oldcxt;
+ 	char		path[MAXPGPATH];
+ 	char		pathtmp[MAXPGPATH];
+ 	StringInfoData buf;
+ 
+ 	Assert(IsTransactionState());
+ 
+ 	/*
+ 	 * This will also assign a transaction id if we do not yet have one.
+ 	 */
+ 	topXid = GetTopTransactionId();
+ 
+ 	Assert(TransactionIdIsValid(GetTopTransactionIdIfAny()));
+ 
+ 	/*
+ 	 * We cannot export a snapshot from a subtransaction because in a
+ 	 * subtransaction we don't see our open subxip values in the snapshot so
+ 	 * they would be missing in the backend applying it.
+ 	 */
+ 	if (GetCurrentTransactionNestLevel() != 1)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+ 				 errmsg("cannot export a snapshot from a subtransaction")));
+ 
+ 	/*
+ 	 * We do however see our already committed subxip values and add them to
+ 	 * the subxip array.
+ 	 */
+ 	nchildren = xactGetCommittedChildren(&children);
+ 
+ 	initStringInfo(&buf);
+ 
+ 	/* Write up all the data that we return */
+ 	SNAPSHOT_APPEND("xid:%d ", topXid);
+ 	SNAPSHOT_APPEND("xmi:%d ", snapshot->xmin);
+ 	SNAPSHOT_APPEND("xma:%d ", snapshot->xmax);
+ 	/* Include our own transaction ID into the count. */
+ 	SNAPSHOT_APPEND("xcnt:%d ", snapshot->xcnt + 1);
+ 	for (i = 0; i < snapshot->xcnt; i++)
+ 		SNAPSHOT_APPEND("xip:%d ", snapshot->xip[i]);
+ 	/*
+ 	 * Finally add our own XID, since by definition we will still be running
+ 	 * when the other transaction takes over the snapshot.
+ 	 */
+ 	SNAPSHOT_APPEND("xip:%d ", topXid);
+ 	if (snapshot->suboverflowed || snapshot->subxcnt + nchildren > TOTAL_MAX_CACHED_SUBXIDS)
+ 		SNAPSHOT_APPEND("sof:%d ", 1);
+ 	else
+ 	{
+ 		SNAPSHOT_APPEND("sxcnt:%d ", snapshot->subxcnt + nchildren);
+ 		for (i = 0; i < snapshot->subxcnt; i++)
+ 			SNAPSHOT_APPEND("sxp:%d ", snapshot->subxip[i]);
+ 		/* Add already committed subtransactions. */
+ 		for (i = 0; i < nchildren; i++)
+ 			SNAPSHOT_APPEND("sxp:%d ", children[i]);
+ 	}
+ 
+ 	/*
+ 	 * buf ends with a trailing space but we leave it in for simplicity. The
+ 	 * parsing routines also depend on it.
+ 	 */
+ 
+ 	/* Register the snapshot and add it to the list of exported snapshots */
+ 	snapshot = RegisterSnapshotOnOwner(snapshot, TopTransactionResourceOwner);
+ 
+ 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+ 	exportedSnapshots = lappend(exportedSnapshots, snapshot);
+ 	MemoryContextSwitchTo(oldcxt);
+ 
+ 	XactExportFilePath(pathtmp, topXid, list_length(exportedSnapshots), ".tmp");
+ 	if (!(f = AllocateFile(pathtmp, PG_BINARY_W)))
+ 		ereport(ERROR,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not create file \"%s\": %m", pathtmp)));
+ 
+ 	if (fwrite(buf.data, buf.len, 1, f) != 1)
+ 		/* Aborting the transaction will also call FreeFile() */
+ 		ereport(ERROR,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not write to file \"%s\": %m", pathtmp)));
+ 
+ 	if (FreeFile(f))
+ 		ereport(ERROR,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not write to file \"%s\": %m", pathtmp)));
+ 
+ 	/*
+ 	 * Now that we have written everything into a .tmp file we rename the file
+ 	 * and remove the .tmp suffix. Our filename is predictable and we're
+ 	 * paranoid enough to not let us read a partially written file (we can't
+ 	 * read a .tmp file because this would fail the valid characters check in
+ 	 * ImportSnapshot).
+ 	 */
+ 	XactExportFilePath(path, topXid, list_length(exportedSnapshots), "");
+ 
+ 	if (rename(pathtmp, path) < 0)
+ 		ereport(WARNING,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
+ 						pathtmp, path)));
+ 
+ 	/*
+ 	 * The basename of the file is what we return from pg_export_snapshot().
+ 	 * It's already in path in a textual format and we know that the path
+ 	 * starts with SNAPSHOT_EXPORT_DIR. Skip over the prefix and over the
+ 	 * slash and pstrdup it to not return a local variable.
+ 	 */
+ 	return pstrdup(path + strlen(SNAPSHOT_EXPORT_DIR) + 1);
+ #undef SNAPSHOT_APPEND
+ }
+ 
+ /*
+  * Poor man's type independent parser. We only use it in the three functions
+  * below so there's no need to get ambitious about putting extra (x) around the
+  * arguments.
+  */
+ #define SNAPSHOT_PARSE(valFunc, inFunc, type, strpp, prfx, notfound)		\
+ 	do {																	\
+ 		char	   *n, *p = strstr(*strpp, prfx);							\
+ 		type		v;														\
+ 																			\
+ 		if (!p)																\
+ 			return notfound;												\
+ 		p += strlen(prfx);													\
+ 		n = strchr(p, ' ');													\
+ 		if (!n)																\
+ 			return notfound;												\
+ 		*n = '\0';															\
+ 		v = valFunc(DirectFunctionCall1(inFunc, CStringGetDatum(p)));		\
+ 		*strpp = n + 1;														\
+ 		return v;															\
+ 	} while (0);
+ 
+ static int
+ parseIntFromText(char **s, const char *prefix)
+ {
+ 	SNAPSHOT_PARSE(DatumGetInt32, int4in, int, s, prefix, 0);
+ }
+ 
+ static bool
+ parseBoolFromText(char **s, const char *prefix)
+ {
+ 	SNAPSHOT_PARSE(DatumGetInt32, int4in, bool, s, prefix, false);
+ }
+ 
+ static TransactionId
+ parseXactFromText(char **s, const char *prefix)
+ {
+ 	SNAPSHOT_PARSE(DatumGetTransactionId, xidin, TransactionId,
+ 				   s, prefix, InvalidTransactionId);
+ }
+ 
+ #undef SNAPSHOT_PARSE
+ 
+ /*
+  * ImportSnapshot
+  *      Import a previously exported snapshot. We expect that whatever we get
+  *      is a filename in SNAPSHOT_EXPORT_DIR. Load the snapshot from that file.
+  *      This is called from "START TRANSACTION (SNAPSHOT = 'foo')" so we always
+  *      start fresh from zero with respect to the transaction state that we
+  *      work on.  Returns true on success and false on failure.
+  */
+ bool
+ ImportSnapshot(char *idstr)
+ {
+ 	char		path[MAXPGPATH];
+ 	FILE	   *f;
+ 	int			i;
+ 	char	   *s;
+ 	struct stat	stat_buf;
+ 	int			sxcnt, xcnt;
+ 	TransactionId xid, origXid, myXid;
+ 	SnapshotData snapshot = {HeapTupleSatisfiesMVCC};
+ 
+ 	if (FirstSnapshotSet || GetCurrentTransactionNestLevel() != 1)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+ 			 errmsg("SET TRANSACTION SNAPSHOT must be called before any query")));
+ 
+ 	/*
+ 	 * If we were in read committed mode then the next query would execute with a
+ 	 * new snapshot thus making this function call quite useless.
+ 	 */
+ 	if (!IsolationUsesXactSnapshot())
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 			 errmsg("A snapshot importing transaction must have ISOLATION "
+ 					"LEVEL SERIALIZABLE or ISOLATION LEVEL REPEATABLE READ")));
+ 
+ 	/* We're lucky to always start off from a pretty clean state */
+ 	Assert(IsTransactionState());
+ 	Assert(GetCurrentTransactionNestLevel() == 1);
+ 	Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+ 	Assert(CurrentSnapshot == NULL);
+ 	Assert(SecondarySnapshot == NULL);
+ 	Assert(registered_xact_snapshot == false);
+ 
+ 	/* verify the identifier, only 0-9,A-F and a hyphen are allowed... */
+ 	s = idstr;
+ 	while (*s)
+ 	{
+ 		if (!isdigit(*s) && !(*s >= 'A' && *s <= 'F') && *s != '-')
+ 			return false;
+ 		s++;
+ 	}
+ 
+ 	/*
+ 	 * Assign a transaction id. We only do this to detect a possible
+ 	 * transaction id wraparound which is somewhere between unlikely
+ 	 * and impossible...
+ 	 */
+ 	myXid = GetTopTransactionId();
+ 
+ 	snprintf(path, MAXPGPATH, SNAPSHOT_EXPORT_DIR "/%s", idstr);
+ 
+ 	/* get the size of the file so that we know how much memory we need */
+ 	if (stat(path, &stat_buf) != 0)
+ 		return false;
+ 
+ 	if (!(f = AllocateFile(path, PG_BINARY_R)))
+ 		return false;
+ 
+ 	s = palloc(stat_buf.st_size + 1);
+ 	if (fread(s, stat_buf.st_size, 1, f) != 1)
+ 		return false;
+ 
+ 	s[stat_buf.st_size] = '\0';
+ 
+ 	FreeFile(f);
+ 
+ 	origXid = parseXactFromText(&s, "xid:");
+ 
+ 	snapshot.xmin = parseXactFromText(&s, "xmi:");
+ 	Assert(snapshot.xmin != InvalidTransactionId);
+ 	snapshot.xmax = parseXactFromText(&s, "xma:");
+ 	Assert(snapshot.xmax != InvalidTransactionId);
+ 
+ 	xcnt = parseIntFromText(&s, "xcnt:");
+ 	/*
+ 	 * This snapshot only serves as a template, there is no need for it to have
+ 	 * maxProcs entries, so let's make it just as large as we need it.
+ 	 */
+ 	snapshot.xip = palloc(xcnt * sizeof(TransactionId));
+ 
+ 	i = 0;
+ 	while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
+ 		snapshot.xip[i++] = xid;
+ 	snapshot.xcnt = i;
+ 	Assert(snapshot.xcnt == xcnt);
+ 
+ 	/*
+ 	 * We only write "sof:1" if the snapshot overflowed. If not, then there is
+ 	 * no "sof:x" entry at all and parseBoolFromText() will return false.
+ 	 */
+ 	snapshot.suboverflowed = parseBoolFromText(&s, "sof:");
+ 
+ 	if (!snapshot.suboverflowed)
+ 	{
+ 		sxcnt = parseIntFromText(&s, "sxcnt:");
+ 		snapshot.subxip = palloc(sxcnt * sizeof(TransactionId));
+ 
+ 		i = 0;
+ 		while ((xid = parseXactFromText(&s, "sxp:")) != InvalidTransactionId)
+ 			snapshot.subxip[i++] = xid;
+ 		snapshot.subxcnt = i;
+ 		Assert(snapshot.subxcnt == sxcnt);
+ 	} else {
+ 		snapshot.subxip = NULL;
+ 		snapshot.subxcnt = 0;
+ 	}
+ 
+ 	/* complete the snapshot data structure */
+ 	snapshot.curcid = 0;
+ 	snapshot.takenDuringRecovery = RecoveryInProgress();
+ 
+ 	/*
+ 	 * Note that MyProc->xmin can go backwards here. However this is safe
+ 	 * because the xmin we set here is the same as in the backend's proc->xmin
+ 	 * whose snapshot we are copying. At this very moment, anybody computing a
+ 	 * minimum will calculate at least this xmin as the overall xmin with or
+ 	 * without us setting MyProc->xmin to this value.
+ 	 */
+ 	LWLockAcquire(ProcArrayLock, LW_SHARED);
+ 	MyProc->xmin = snapshot.xmin;
+ 	LWLockRelease(ProcArrayLock);
+ 
+ 	/* bail out if the original transaction is not running anymore... */
+ 	if (!TransactionIdIsInProgress(origXid) || TransactionIdPrecedes(myXid, origXid))
+ 		return false;
+ 
+ 	/*
+ 	 * Install the snapshot as if we got it through GetTransactionSnapshot().
+ 	 * This will set up CurrentSnapshot and also set up the predicate locks for a
+ 	 * serializable transaction.
+ 	 */
+ 	GetTransactionSnapshotFromTemplate(&snapshot);
+ 	return true;
+ }
+ 
+ Datum
+ pg_export_snapshot(PG_FUNCTION_ARGS)
+ {
+ 	char	   *snapshotData;
+ 
+ 	RequireTransactionChain(true, "pg_export_snapshot()");
+ 
+ 	snapshotData = ExportSnapshot(GetTransactionSnapshot());
+ 	PG_RETURN_TEXT_P(cstring_to_text(snapshotData));
+ }
+ 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index e535fda..0e981cd 100644
*** a/src/bin/initdb/initdb.c
--- b/src/bin/initdb/initdb.c
*************** main(int argc, char *argv[])
*** 2557,2562 ****
--- 2557,2563 ----
  		"pg_serial",
  		"pg_subtrans",
  		"pg_twophase",
+ 		"pg_snapshots",
  		"pg_multixact/members",
  		"pg_multixact/offsets",
  		"base",
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 0019df5..4706f10 100644
*** a/src/include/access/twophase.h
--- b/src/include/access/twophase.h
***************
*** 22,30 ****
   */
  typedef struct GlobalTransactionData *GlobalTransaction;
  
- /* GUC variable */
- extern int	max_prepared_xacts;
- 
  extern Size TwoPhaseShmemSize(void);
  extern void TwoPhaseShmemInit(void);
  
--- 22,27 ----
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 96f43fe..a4e0387 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 2171 ( pg_cancel_backe
*** 2853,2858 ****
--- 2853,2860 ----
  DESCR("cancel a server process' current query");
  DATA(insert OID = 2096 ( pg_terminate_backend		PGNSP PGUID 12 1 0 0 0 f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
  DESCR("terminate a server process");
+ DATA(insert OID = 3122 ( pg_export_snapshot		PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 25 "" _null_ _null_ _null_ _null_ pg_export_snapshot _null_ _null_ _null_ ));
+ DESCR("export a snapshot");
  DATA(insert OID = 2172 ( pg_start_backup		PGNSP PGUID 12 1 0 0 0 f f f t f v 2 0 25 "25 16" _null_ _null_ _null_ _null_ pg_start_backup _null_ _null_ _null_ ));
  DESCR("prepare for taking an online backup");
  DATA(insert OID = 2173 ( pg_stop_backup			PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 25 "" _null_ _null_ _null_ _null_ pg_stop_backup _null_ _null_ _null_ ));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 9d19417..15326cf 100644
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
*************** extern PGDLLIMPORT int NBuffers;
*** 134,139 ****
--- 134,142 ----
  extern int	MaxBackends;
  extern int	MaxConnections;
  
+ /* GUC variable */
+ extern int	max_prepared_xacts;
+ 
  extern PGDLLIMPORT int MyProcPid;
  extern PGDLLIMPORT pg_time_t MyStartTime;
  extern PGDLLIMPORT struct Port *MyProcPort;
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 12c2faf..3d170bc 100644
*** a/src/include/parser/kwlist.h
--- b/src/include/parser/kwlist.h
*************** PG_KEYWORD("show", SHOW, UNRESERVED_KEYW
*** 337,342 ****
--- 337,343 ----
  PG_KEYWORD("similar", SIMILAR, TYPE_FUNC_NAME_KEYWORD)
  PG_KEYWORD("simple", SIMPLE, UNRESERVED_KEYWORD)
  PG_KEYWORD("smallint", SMALLINT, COL_NAME_KEYWORD)
+ PG_KEYWORD("snapshot", SNAPSHOT, UNRESERVED_KEYWORD)
  PG_KEYWORD("some", SOME, RESERVED_KEYWORD)
  PG_KEYWORD("stable", STABLE, UNRESERVED_KEYWORD)
  PG_KEYWORD("standalone", STANDALONE_P, UNRESERVED_KEYWORD)
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 5ddbc1d..f4f0303 100644
*** a/src/include/storage/predicate.h
--- b/src/include/storage/predicate.h
*************** extern void CheckPointPredicate(void);
*** 42,48 ****
  extern bool PageIsPredicateLocked(Relation relation, BlockNumber blkno);
  
  /* predicate lock maintenance */
! extern Snapshot RegisterSerializableTransaction(Snapshot snapshot);
  extern void RegisterPredicateLockingXid(TransactionId xid);
  extern void PredicateLockRelation(Relation relation, Snapshot snapshot);
  extern void PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot);
--- 42,48 ----
  extern bool PageIsPredicateLocked(Relation relation, BlockNumber blkno);
  
  /* predicate lock maintenance */
! extern Snapshot RegisterSerializableTransaction(Snapshot snapshot, Snapshot stemplate);
  extern void RegisterPredicateLockingXid(TransactionId xid);
  extern void PredicateLockRelation(Relation relation, Snapshot snapshot);
  extern void PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot);
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a11d438..dfe57aa 100644
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
*************** extern void ExpireOldKnownAssignedTransa
*** 39,45 ****
  
  extern RunningTransactions GetRunningTransactionData(void);
  
! extern Snapshot GetSnapshotData(Snapshot snapshot);
  
  extern bool TransactionIdIsInProgress(TransactionId xid);
  extern bool TransactionIdIsActive(TransactionId xid);
--- 39,45 ----
  
  extern RunningTransactions GetRunningTransactionData(void);
  
! extern Snapshot GetSnapshotData(Snapshot snapshot, Snapshot stemplate);
  
  extern bool TransactionIdIsInProgress(TransactionId xid);
  extern bool TransactionIdIsActive(TransactionId xid);
*************** extern void XidCacheRemoveRunningXids(Tr
*** 69,72 ****
--- 69,78 ----
  						  int nxids, const TransactionId *xids,
  						  TransactionId latestXid);
  
+ /* Size of the ProcArray structure itself */
+ #define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
+ 
+ #define TOTAL_MAX_CACHED_SUBXIDS \
+ 	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
+ 
  #endif   /* PROCARRAY_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index c969a37..d25b1cd 100644
*** a/src/include/utils/snapmgr.h
--- b/src/include/utils/snapmgr.h
*************** extern TransactionId TransactionXmin;
*** 22,27 ****
--- 22,29 ----
  extern TransactionId RecentXmin;
  extern TransactionId RecentGlobalXmin;
  
+ extern List *exportedSnapshots;
+ 
  extern Snapshot GetTransactionSnapshot(void);
  extern Snapshot GetLatestSnapshot(void);
  extern void SnapshotSetCommandId(CommandId curcid);
*************** extern void UpdateActiveSnapshotCommandI
*** 32,46 ****
--- 34,54 ----
  extern void PopActiveSnapshot(void);
  extern Snapshot GetActiveSnapshot(void);
  extern bool ActiveSnapshotSet(void);
+ extern Snapshot CopySnapshotOnto(Snapshot onto, Snapshot snapshot);
  
  extern Snapshot RegisterSnapshot(Snapshot snapshot);
  extern void UnregisterSnapshot(Snapshot snapshot);
  extern Snapshot RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner);
  extern void UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner);
  
+ extern void PreCommit_Snapshot(void);
  extern void AtSubCommit_Snapshot(int level);
  extern void AtSubAbort_Snapshot(int level);
  extern void AtEarlyCommit_Snapshot(void);
  extern void AtEOXact_Snapshot(bool isCommit);
  
+ extern Datum pg_export_snapshot(PG_FUNCTION_ARGS);
+ extern bool ImportSnapshot(char *idstr);
+ extern void DeleteAllExportedSnapshotFiles(void);
+ 
  #endif   /* SNAPMGR_H */

test-ssi.pltext/x-perl; charset=US-ASCII; name=test-ssi.plDownload

#31

Marko Tiikkaja

marko.tiikkaja@2ndquadrant.com

over 14 years ago

In reply to: Joachim Wieland (#30)

Re: synchronized snapshots

Hi Joachim,

On 14/09/2011 05:37, Joachim Wieland wrote:

Here is a new version of this patch

In a sequence such as this:

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
INSERT INTO foo VALUES (-1);
SELECT pg_export_snapshot();

the row added to "foo" is not visible in the exported snapshot. If
that's the desired behaviour, I think it should be mentioned in the
documentation.

I can make a patched backend die with an assertion failure by trying to
export a snapshot after rolling back a transaction which exported a
snapshot. Seems like no cleanup is done at transaction abort.

I think that trying to import a snapshot that doesn't exist deserves a
better error message. There's currently no way for the user to know
that the snapshot didn't exist, other than looking at the SQLSTATE
(22023), and even that doesn't tell me a whole lot without looking at
the manual.

Finally, the comment in ImportSnapshot() still mentions the old syntax.

Other than these four problems, the patch looks good.

--
Marko Tiikkaja http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#32

Joachim Wieland

joe@mcknight.de

over 14 years ago

In reply to: Marko Tiikkaja (#31)

Re: synchronized snapshots

Hi Marko,

On Wed, Sep 28, 2011 at 2:29 AM, Marko Tiikkaja
<marko.tiikkaja@2ndquadrant.com> wrote:

In a sequence such as this:

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
INSERT INTO foo VALUES (-1);
SELECT pg_export_snapshot();

the row added to "foo" is not visible in the exported snapshot. If that's
the desired behaviour, I think it should be mentioned in the documentation.

Yes, that's the desired behaviour, the patch add this paragraph to the
documentation already:

"Also note that even after the synchronization both clients still run
their own independent transactions. As a consequence, even though
synchronized with respect to reading pre-existing data, both
transactions won't be able to see each other's uncommitted data."

I'll take a look at the other issues and update the patch either
tonight or tomorrow.

Thank you,
Joachim

#33

Joachim Wieland

joe@mcknight.de

over 14 years ago

In reply to: Marko Tiikkaja (#31)

1 attachment(s)

Re: synchronized snapshots

On Wed, Sep 28, 2011 at 2:29 AM, Marko Tiikkaja
<marko.tiikkaja@2ndquadrant.com> wrote:

Other than these four problems, the patch looks good.

The attached patch addresses the reported issues.

Thanks for the review,
Joachim

Attachments:

syncSnapshots.3.difftext/x-patch; charset=US-ASCII; name=syncSnapshots.3.diffDownload

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0b6a109..ce8653e 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*************** FOR EACH ROW EXECUTE PROCEDURE suppress_
*** 15023,15026 ****
--- 15023,15100 ----
          <xref linkend="SQL-CREATETRIGGER">.
      </para>
    </sect1>
+ 
+   <sect1 id="functions-snapshotsync">
+    <title>Snapshot Synchronization Functions</title>
+ 
+    <indexterm>
+      <primary>pg_export_snapshot</primary>
+    </indexterm>
+ 
+    <para>
+      <productname>PostgreSQL</> allows different sessions to synchronize their
+      snapshots. A database snapshot determines which data is visible to
+      the client that is using this snapshot. Synchronized snapshots are necessary when
+      two clients need to see the same content in the database. If these two clients
+      just connected to the database and opened their transactions, then they could
+      never be sure that there was no data modification right between both
+      connections.
+    </para>
+    <para>
+      As a solution, <productname>PostgreSQL</> offers the function
+      <function>pg_export_snapshot</> which saves the snapshot internally and
+      from then on until the end of the saving transaction, the snapshot can be
+      copied into a a new transaction with the <xref linkend="sql-set-transaction">
+      command to open a second transaction with the
+      exact same snapshot. Now both transactions are guaranteed to see the exact same
+      data even though they might have connected at different times.
+    </para>
+    <para>
+      Note that a snapshot can only be used to start a new transaction as long
+      as the transaction that originally saved it is held open. Also note that even
+      after the synchronization both clients still run their own independent
+      transactions. As a consequence, even though synchronized with respect to
+      reading pre-existing data, both transactions won't be able to see each other's
+      uncommitted data.
+    </para>
+    <table id="functions-snapshot-synchronization">
+     <title>Snapshot Synchronization Functions</title>
+     <tgroup cols="3">
+      <thead>
+       <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+       </row>
+      </thead>
+ 
+      <tbody>
+       <row>
+        <entry>
+         <literal><function>pg_export_snapshot()</function></literal>
+        </entry>
+        <entry><type>text</type></entry>
+        <entry>Save the snapshot and return its identifier</entry>
+       </row>
+      </tbody>
+     </tgroup>
+    </table>
+ 
+    <para>
+       The function <function>pg_export_snapshot</> does not take an argument
+       and returns the snapshot's identifier as <type>text</type> data. Internally the
+       function will save the snapshot data to a file so that it can be retrieved
+       from a different backend process later on. Note that as soon as the
+       transaction ends, any saved snapshots become invalid and their
+       identifiers cannot be used in other transactions anymore. If the function
+       has been executed, the transaction cannot be prepared anymore with <xref
+       linkend="sql-prepare-transaction">.
+    </para>
+ <programlisting>
+ SELECT pg_export_snapshot();
+ 
+  pg_export_snapshot
+ --------------------
+  000003A1-1
+ (1 row)
+ </programlisting>
+   </sect1>
  </chapter>
+ 
diff --git a/doc/src/sgml/ref/set_transaction.sgml b/doc/src/sgml/ref/set_transaction.sgml
index e28a7e1..059b94b 100644
*** a/doc/src/sgml/ref/set_transaction.sgml
--- b/doc/src/sgml/ref/set_transaction.sgml
*************** SET SESSION CHARACTERISTICS AS TRANSACTI
*** 40,45 ****
--- 40,46 ----
      ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
      READ WRITE | READ ONLY
      [ NOT ] DEFERRABLE
+     SNAPSHOT <replaceable class="parameter">snapshot-id</replaceable>
  </synopsis>
   </refsynopsisdiv>
  
*************** SET SESSION CHARACTERISTICS AS TRANSACTI
*** 146,151 ****
--- 147,167 ----
     contributing to or being canceled by a serialization failure.  This mode
     is well suited for long-running reports or backups.
    </para>
+ 
+   <para>
+    The <literal>SNAPSHOT</literal> property allows a new transaction to run
+    with the same snapshot as an already running transaction.  A call to
+    <literal>pg_export_snapshot</literal> (see <xref
+    linkend="functions-snapshotsync">) must have been executed in the other
+    transaction. This function returns a snapshot id which must be passed as a
+    parameter to this command to create a second transaction running with the same
+    snapshot. You also need to make the transaction <literal>ISOLATION LEVEL
+    SERIALIZABLE</literal> or <literal>ISOLATION LEVEL REPEATABLE READ</literal>.
+    If the transaction has already executed a query, started a subtransaction or
+    assigned a snapshot, no further snapshot assignment is possible in this
+    transaction.
+   </para>
+ 
   </refsect1>
  
   <refsect1>
*************** SET SESSION CHARACTERISTICS AS TRANSACTI
*** 178,183 ****
--- 194,225 ----
    </para>
   </refsect1>
  
+  <refsect1>
+   <title>Examples</title>
+ 
+   <para>
+    To begin a new transaction block with the same snapshot as an already
+    existing transaction, first export the snapshot from the existing
+    transaction. This will return the snapshot id:
+ 
+ <programlisting>
+ # SELECT pg_export_snapshot();
+  pg_export_snapshot
+ --------------------
+  000003A1-1
+ (1 row)
+ </programlisting>
+ 
+    Then reference this snapshot id as the first command in the newly opened
+    transaction:
+ 
+ <programlisting>
+ BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ SET TRANSACTION SNAPSHOT '000003A1-1';
+ </programlisting>
+   </para>
+  </refsect1>
+ 
   <refsect1 id="R1-SQL-SET-TRANSACTION-3">
    <title>Compatibility</title>
  
*************** SET SESSION CHARACTERISTICS AS TRANSACTI
*** 198,206 ****
    </para>
  
    <para>
!    The <literal>DEFERRABLE</literal>
     <replaceable class="parameter">transaction_mode</replaceable>
!    is a <productname>PostgreSQL</productname> language extension.
    </para>
  
    <para>
--- 240,250 ----
    </para>
  
    <para>
!    <literal>DEFERRABLE</literal>
     <replaceable class="parameter">transaction_mode</replaceable>
!    and <literal>SNAPSHOT</literal> <replaceable
!    class="parameter">snapshot-id</replaceable> are
!    <productname>PostgreSQL</productname> language extensions.
    </para>
  
    <para>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3dab45c..7ce516e 100644
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*************** CommitTransaction(void)
*** 1852,1857 ****
--- 1852,1863 ----
  	 */
  	PreCommit_Notify();
  
+ 	/*
+ 	 * Cleans up exported snapshots (this needs to happen before we update
+ 	 * our MyProc entry).
+ 	 */
+ 	InvalidateExportedSnapshots();
+ 
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
  
*************** PrepareTransaction(void)
*** 2067,2072 ****
--- 2073,2083 ----
  				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  				 errmsg("cannot PREPARE a transaction that has operated on temporary tables")));
  
+ 	if (exportedSnapshots)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				 errmsg("cannot PREPARE a transaction that has exported snapshots")));
+ 
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
  
*************** AbortTransaction(void)
*** 2228,2233 ****
--- 2239,2247 ----
  	/* Prevent cancel/die interrupt while cleaning up */
  	HOLD_INTERRUPTS();
  
+ 	/* Invalidate any exported snapshots */
+ 	InvalidateExportedSnapshots();
+ 
  	/* Make sure we have a valid memory context and resource owner */
  	AtAbort_Memory();
  	AtAbort_ResourceOwner();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f65ddc..331a564 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 58,63 ****
--- 58,64 ----
  #include "utils/guc.h"
  #include "utils/ps_status.h"
  #include "utils/relmapper.h"
+ #include "utils/snapmgr.h"
  #include "utils/timestamp.h"
  #include "pg_trace.h"
  
*************** StartupXLOG(void)
*** 6373,6378 ****
--- 6374,6384 ----
  		CheckRequiredParameterValues();
  
  		/*
+ 		 * We can delete any saved transaction snapshots that still exist
+ 		 */
+ 		DeleteAllExportedSnapshotFiles();
+ 
+ 		/*
  		 * We're in recovery, so unlogged relations relations may be trashed
  		 * and must be reset.  This should be done BEFORE allowing Hot Standby
  		 * connections, so that read-only backends don't try to read whatever
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index e9f3896..e1ab161 100644
*** a/src/backend/parser/gram.y
--- b/src/backend/parser/gram.y
*************** static void processCASbits(int cas_bits,
*** 553,560 ****
  
  	SAVEPOINT SCHEMA SCROLL SEARCH SECOND_P SECURITY SELECT SEQUENCE SEQUENCES
  	SERIALIZABLE SERVER SESSION SESSION_USER SET SETOF SHARE
! 	SHOW SIMILAR SIMPLE SMALLINT SOME STABLE STANDALONE_P START STATEMENT
! 	STATISTICS STDIN STDOUT STORAGE STRICT_P STRIP_P SUBSTRING
  	SYMMETRIC SYSID SYSTEM_P
  
  	TABLE TABLES TABLESPACE TEMP TEMPLATE TEMPORARY TEXT_P THEN TIME TIMESTAMP
--- 553,560 ----
  
  	SAVEPOINT SCHEMA SCROLL SEARCH SECOND_P SECURITY SELECT SEQUENCE SEQUENCES
  	SERIALIZABLE SERVER SESSION SESSION_USER SET SETOF SHARE
! 	SHOW SIMILAR SIMPLE SMALLINT SNAPSHOT SOME STABLE STANDALONE_P START
! 	STATEMENT STATISTICS STDIN STDOUT STORAGE STRICT_P STRIP_P SUBSTRING
  	SYMMETRIC SYSID SYSTEM_P
  
  	TABLE TABLES TABLESPACE TEMP TEMPLATE TEMPORARY TEXT_P THEN TIME TIMESTAMP
*************** set_rest:	/* Generic SET syntaxes: */
*** 1286,1291 ****
--- 1286,1299 ----
  					n->args = $2;
  					$$ = n;
  				}
+ 			| TRANSACTION SNAPSHOT Sconst
+ 				{
+ 					VariableSetStmt *n = makeNode(VariableSetStmt);
+ 					n->kind = VAR_SET_VALUE;
+ 					n->name = "TRANSACTION SNAPSHOT";
+ 					n->args = list_make1(makeStringConst($3, @3));
+ 					$$ = n;
+ 				}
  			| SESSION CHARACTERISTICS AS TRANSACTION transaction_mode_list
  				{
  					VariableSetStmt *n = makeNode(VariableSetStmt);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 9489012..b665d75 100644
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
*************** ProcArrayShmemSize(void)
*** 166,174 ****
  {
  	Size		size;
  
- 	/* Size of the ProcArray structure itself */
- #define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
- 
  	size = offsetof(ProcArrayStruct, procs);
  	size = add_size(size, mul_size(sizeof(PGPROC *), PROCARRAY_MAXPROCS));
  
--- 166,171 ----
*************** ProcArrayShmemSize(void)
*** 179,192 ****
  	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
  	 * main structures created in those functions must be identically sized,
  	 * since we may at times copy the whole of the data structures around. We
! 	 * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
  	 *
  	 * Ideally we'd only create this structure if we were actually doing hot
  	 * standby in the current run, but we don't know that yet at the time
  	 * shared memory is being set up.
  	 */
- #define TOTAL_MAX_CACHED_SUBXIDS \
- 	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
  
  	if (EnableHotStandby)
  	{
--- 176,187 ----
  	 * TransactionIdIsInProgress() and GetRunningTransactionData(). All of the
  	 * main structures created in those functions must be identically sized,
  	 * since we may at times copy the whole of the data structures around. We
! 	 * refer to this size as TOTAL_MAX_CACHED_SUBXIDS, defined in procarray.h.
  	 *
  	 * Ideally we'd only create this structure if we were actually doing hot
  	 * standby in the current run, but we don't know that yet at the time
  	 * shared memory is being set up.
  	 */
  
  	if (EnableHotStandby)
  	{
*************** GetOldestXmin(bool allDbs, bool ignoreVa
*** 1144,1150 ****
   * not statically allocated (see xip allocation below).
   */
  Snapshot
! GetSnapshotData(Snapshot snapshot)
  {
  	ProcArrayStruct *arrayP = procArray;
  	TransactionId xmin;
--- 1139,1145 ----
   * not statically allocated (see xip allocation below).
   */
  Snapshot
! GetSnapshotData(Snapshot snapshot, Snapshot stemplate)
  {
  	ProcArrayStruct *arrayP = procArray;
  	TransactionId xmin;
*************** GetSnapshotData(Snapshot snapshot)
*** 1158,1163 ****
--- 1153,1181 ----
  	Assert(snapshot != NULL);
  
  	/*
+ 	 * We only get a valid snapshot in stemplate if the snapshot
+ 	 * synchronization feature used. In that case we just need to copy the
+ 	 * values that we get onto the snapshot we return.
+ 	 * Note that in this case we always duplicate an existing snapshot, that is
+ 	 * currently held by another active transaction. That's why we do not need
+ 	 * to update any { RecentGlobalXmin, RecentXmin, globalxmin }.
+ 	 */
+ 	if (stemplate != InvalidSnapshot)
+ 	{
+ 		/*
+ 		 * 'stemplate' is only read and its values are copied onto 'snapshot'.
+ 		 */
+ 		CopySnapshotOnto(stemplate, snapshot);
+ 
+ 		/*
+ 		 * We can use the result of the copy except for that this snapshot
+ 		 * should look like new and not copied.
+ 		 */
+ 		snapshot->copied = false;
+ 		return snapshot;
+ 	}
+ 
+ 	/*
  	 * Allocating space for maxProcs xids is usually overkill; numProcs would
  	 * be sufficient.  But it seems better to do the malloc while not holding
  	 * the lock, so we can't look at numProcs.  Likewise, we allocate much
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d39f897..462cc59 100644
*** a/src/backend/storage/lmgr/predicate.c
--- b/src/backend/storage/lmgr/predicate.c
*************** static void OldSerXidSetActiveSerXmin(Tr
*** 416,423 ****
  
  static uint32 predicatelock_hash(const void *key, Size keysize);
  static void SummarizeOldestCommittedSxact(void);
! static Snapshot GetSafeSnapshot(Snapshot snapshot);
! static Snapshot GetSerializableTransactionSnapshotInt(Snapshot snapshot);
  static bool PredicateLockExists(const PREDICATELOCKTARGETTAG *targettag);
  static bool GetParentPredicateLockTag(const PREDICATELOCKTARGETTAG *tag,
  						  PREDICATELOCKTARGETTAG *parent);
--- 416,423 ----
  
  static uint32 predicatelock_hash(const void *key, Size keysize);
  static void SummarizeOldestCommittedSxact(void);
! static Snapshot GetSafeSnapshot(Snapshot snapshot, Snapshot stemplate);
! static Snapshot GetSerializableTransactionSnapshotInt(Snapshot snapshot, Snapshot stemplate);
  static bool PredicateLockExists(const PREDICATELOCKTARGETTAG *targettag);
  static bool GetParentPredicateLockTag(const PREDICATELOCKTARGETTAG *tag,
  						  PREDICATELOCKTARGETTAG *parent);
*************** SummarizeOldestCommittedSxact(void)
*** 1491,1497 ****
   *		area that can safely be passed to GetSnapshotData.
   */
  static Snapshot
! GetSafeSnapshot(Snapshot origSnapshot)
  {
  	Snapshot	snapshot;
  
--- 1491,1497 ----
   *		area that can safely be passed to GetSnapshotData.
   */
  static Snapshot
! GetSafeSnapshot(Snapshot origSnapshot, Snapshot stemplate)
  {
  	Snapshot	snapshot;
  
*************** GetSafeSnapshot(Snapshot origSnapshot)
*** 1505,1511 ****
  		 * our caller passed to us.  The pointer returned is actually the same
  		 * one passed to it, but we avoid assuming that here.
  		 */
! 		snapshot = GetSerializableTransactionSnapshotInt(origSnapshot);
  
  		if (MySerializableXact == InvalidSerializableXact)
  			return snapshot;	/* no concurrent r/w xacts; it's safe */
--- 1505,1511 ----
  		 * our caller passed to us.  The pointer returned is actually the same
  		 * one passed to it, but we avoid assuming that here.
  		 */
! 		snapshot = GetSerializableTransactionSnapshotInt(origSnapshot, stemplate);
  
  		if (MySerializableXact == InvalidSerializableXact)
  			return snapshot;	/* no concurrent r/w xacts; it's safe */
*************** GetSafeSnapshot(Snapshot origSnapshot)
*** 1562,1568 ****
   * within this function.
   */
  Snapshot
! GetSerializableTransactionSnapshot(Snapshot snapshot)
  {
  	Assert(IsolationIsSerializable());
  
--- 1562,1568 ----
   * within this function.
   */
  Snapshot
! GetSerializableTransactionSnapshot(Snapshot snapshot, Snapshot stemplate)
  {
  	Assert(IsolationIsSerializable());
  
*************** GetSerializableTransactionSnapshot(Snaps
*** 1572,1584 ****
  	 * thereby avoid all SSI overhead once it's running.
  	 */
  	if (XactReadOnly && XactDeferrable)
! 		return GetSafeSnapshot(snapshot);
  
! 	return GetSerializableTransactionSnapshotInt(snapshot);
  }
  
  static Snapshot
! GetSerializableTransactionSnapshotInt(Snapshot snapshot)
  {
  	PGPROC	   *proc;
  	VirtualTransactionId vxid;
--- 1572,1584 ----
  	 * thereby avoid all SSI overhead once it's running.
  	 */
  	if (XactReadOnly && XactDeferrable)
! 		return GetSafeSnapshot(snapshot, stemplate);
  
! 	return GetSerializableTransactionSnapshotInt(snapshot, stemplate);
  }
  
  static Snapshot
! GetSerializableTransactionSnapshotInt(Snapshot snapshot, Snapshot stemplate)
  {
  	PGPROC	   *proc;
  	VirtualTransactionId vxid;
*************** GetSerializableTransactionSnapshotInt(Sn
*** 1616,1622 ****
  	} while (!sxact);
  
  	/* Get the snapshot */
! 	snapshot = GetSnapshotData(snapshot);
  
  	/*
  	 * If there are no serializable transactions which are not read-only, we
--- 1616,1622 ----
  	} while (!sxact);
  
  	/* Get the snapshot */
! 	snapshot = GetSnapshotData(snapshot, stemplate);
  
  	/*
  	 * If there are no serializable transactions which are not read-only, we
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a71729c..9ffe3a1 100644
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 72,77 ****
--- 72,78 ----
  #include "utils/plancache.h"
  #include "utils/portal.h"
  #include "utils/ps_status.h"
+ #include "utils/snapmgr.h"
  #include "utils/tzparser.h"
  #include "utils/xml.h"
  
*************** ExecSetVariableStmt(VariableSetStmt *stm
*** 6094,6099 ****
--- 6095,6109 ----
  	switch (stmt->kind)
  	{
  		case VAR_SET_VALUE:
+ 			if (strcmp(stmt->name, "TRANSACTION SNAPSHOT") == 0)
+ 			{
+ 				if (!ImportSnapshot(ExtractSetVariableArgs(stmt)))
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 							 errmsg("could not import the requested snapshot")));
+ 				break;
+ 			}
+ 			/* fallthrough */
  		case VAR_SET_CURRENT:
  			set_config_option(stmt->name,
  							  ExtractSetVariableArgs(stmt),
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 518aaf1..d047ba1 100644
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
***************
*** 33,44 ****
   */
  #include "postgres.h"
  
  #include "access/transam.h"
  #include "access/xact.h"
  #include "storage/predicate.h"
  #include "storage/proc.h"
  #include "storage/procarray.h"
! #include "utils/memutils.h"
  #include "utils/memutils.h"
  #include "utils/snapmgr.h"
  #include "utils/tqual.h"
--- 33,50 ----
   */
  #include "postgres.h"
  
+ #include <sys/types.h>
+ #include <sys/stat.h>
+ #include <unistd.h>
+ 
  #include "access/transam.h"
  #include "access/xact.h"
+ #include "miscadmin.h"
+ #include "storage/fd.h"
  #include "storage/predicate.h"
  #include "storage/proc.h"
  #include "storage/procarray.h"
! #include "utils/builtins.h"
  #include "utils/memutils.h"
  #include "utils/snapmgr.h"
  #include "utils/tqual.h"
*************** static Snapshot CopySnapshot(Snapshot sn
*** 116,133 ****
  static void FreeSnapshot(Snapshot snapshot);
  static void SnapshotResetXmin(void);
  
  
  /*
!  * GetTransactionSnapshot
   *		Get the appropriate snapshot for a new query in a transaction.
   *
!  * Note that the return value may point at static storage that will be modified
!  * by future calls and by CommandCounterIncrement().  Callers should call
!  * RegisterSnapshot or PushActiveSnapshot on the returned snap if it is to be
!  * used very long.
   */
! Snapshot
! GetTransactionSnapshot(void)
  {
  	/* First call in transaction? */
  	if (!FirstSnapshotSet)
--- 122,145 ----
  static void FreeSnapshot(Snapshot snapshot);
  static void SnapshotResetXmin(void);
  
+ /* What we need for exporting snapshots */
+ #define SNAPSHOT_EXPORT_DIR "pg_snapshots"
+ #define XactExportFilePath(path, xid, num, suffix) \
+     snprintf(path, MAXPGPATH, SNAPSHOT_EXPORT_DIR "/%08X-%d%s", xid, num, suffix)
+ 
+ List *exportedSnapshots = NIL;
  
  /*
!  * GetTransactionSnapshotFromTemplate
   *		Get the appropriate snapshot for a new query in a transaction.
   *
!  * A template snapshot is passed for the synchronized snapshots feature.
!  * In that case we want to have a snapshot back that has the template's
!  * values. We just pass it along and the lower level functions take care
!  * of it.
   */
! static Snapshot
! GetTransactionSnapshotFromTemplate(Snapshot stemplate)
  {
  	/* First call in transaction? */
  	if (!FirstSnapshotSet)
*************** GetTransactionSnapshot(void)
*** 145,153 ****
  		{
  			/* First, create the snapshot in CurrentSnapshotData */
  			if (IsolationIsSerializable())
! 				CurrentSnapshot = GetSerializableTransactionSnapshot(&CurrentSnapshotData);
  			else
! 				CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
  			/* Make a saved copy */
  			CurrentSnapshot = CopySnapshot(CurrentSnapshot);
  			FirstXactSnapshot = CurrentSnapshot;
--- 157,166 ----
  		{
  			/* First, create the snapshot in CurrentSnapshotData */
  			if (IsolationIsSerializable())
! 				CurrentSnapshot = GetSerializableTransactionSnapshot(&CurrentSnapshotData,
! 																	 stemplate);
  			else
! 				CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData, stemplate);
  			/* Make a saved copy */
  			CurrentSnapshot = CopySnapshot(CurrentSnapshot);
  			FirstXactSnapshot = CurrentSnapshot;
*************** GetTransactionSnapshot(void)
*** 156,162 ****
  			RegisteredSnapshots++;
  		}
  		else
! 			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
  
  		FirstSnapshotSet = true;
  		return CurrentSnapshot;
--- 169,182 ----
  			RegisteredSnapshots++;
  		}
  		else
! 		{
! 			/*
! 			 * template is only used for the synchronized snapshot feature. Which in
! 			 * turn is only allowed for IsolationUsesXactSnapshot() == true transactions
! 			 */
! 			Assert(stemplate == InvalidSnapshot);
! 			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData, InvalidSnapshot);
! 		}
  
  		FirstSnapshotSet = true;
  		return CurrentSnapshot;
*************** GetTransactionSnapshot(void)
*** 165,176 ****
  	if (IsolationUsesXactSnapshot())
  		return CurrentSnapshot;
  
! 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
  
  	return CurrentSnapshot;
  }
  
  /*
   * GetLatestSnapshot
   *		Get a snapshot that is up-to-date as of the current instant,
   *		even if we are executing in transaction-snapshot mode.
--- 185,217 ----
  	if (IsolationUsesXactSnapshot())
  		return CurrentSnapshot;
  
! 	/* see comment above */
! 	Assert(stemplate == InvalidSnapshot);
! 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData, InvalidSnapshot);
  
  	return CurrentSnapshot;
  }
  
  /*
+  * GetTransactionSnapshot
+  *		Get the appropriate snapshot for a new query in a transaction.
+  *
+  * This is the public interface for anything different than the snapshot
+  * synchronization feature.
+  *
+  * Note that the return value may point at static storage that will be modified
+  * by future calls and by CommandCounterIncrement().  Callers should call
+  * RegisterSnapshot or PushActiveSnapshot on the returned snap if it is to be
+  * used very long.
+  */
+ Snapshot
+ GetTransactionSnapshot(void)
+ {
+ 	return GetTransactionSnapshotFromTemplate(InvalidSnapshot);
+ }
+ 
+ 
+ /*
   * GetLatestSnapshot
   *		Get a snapshot that is up-to-date as of the current instant,
   *		even if we are executing in transaction-snapshot mode.
*************** GetLatestSnapshot(void)
*** 182,188 ****
  	if (!FirstSnapshotSet)
  		elog(ERROR, "no snapshot has been set");
  
! 	SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData);
  
  	return SecondarySnapshot;
  }
--- 223,229 ----
  	if (!FirstSnapshotSet)
  		elog(ERROR, "no snapshot has been set");
  
! 	SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData, InvalidSnapshot);
  
  	return SecondarySnapshot;
  }
*************** SnapshotSetCommandId(CommandId curcid)
*** 204,246 ****
  }
  
  /*
!  * CopySnapshot
!  *		Copy the given snapshot.
   *
!  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
!  * to 0.  The returned snapshot has the copied flag set.
   */
! static Snapshot
! CopySnapshot(Snapshot snapshot)
  {
- 	Snapshot	newsnap;
  	Size		subxipoff;
- 	Size		size;
- 
- 	Assert(snapshot != InvalidSnapshot);
- 
- 	/* We allocate any XID arrays needed in the same palloc block. */
- 	size = subxipoff = sizeof(SnapshotData) +
- 		snapshot->xcnt * sizeof(TransactionId);
- 	if (snapshot->subxcnt > 0)
- 		size += snapshot->subxcnt * sizeof(TransactionId);
  
! 	newsnap = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
! 	memcpy(newsnap, snapshot, sizeof(SnapshotData));
  
! 	newsnap->regd_count = 0;
! 	newsnap->active_count = 0;
! 	newsnap->copied = true;
  
  	/* setup XID array */
  	if (snapshot->xcnt > 0)
  	{
! 		newsnap->xip = (TransactionId *) (newsnap + 1);
! 		memcpy(newsnap->xip, snapshot->xip,
  			   snapshot->xcnt * sizeof(TransactionId));
  	}
  	else
! 		newsnap->xip = NULL;
  
  	/*
  	 * Setup subXID array. Don't bother to copy it if it had overflowed,
--- 245,277 ----
  }
  
  /*
!  * CopySnapshotOnto
!  *      Copy the given snapshot onto an already sufficiently allocated other
!  *      snapshot.
   *
!  * Return the modified snapshot (onto).
   */
! Snapshot
! CopySnapshotOnto(Snapshot snapshot, Snapshot onto)
  {
  	Size		subxipoff;
  
! 	subxipoff = sizeof(SnapshotData) + snapshot->xcnt * sizeof(TransactionId);
! 	memcpy(onto, snapshot, sizeof(SnapshotData));
  
! 	onto->regd_count = 0;
! 	onto->active_count = 0;
! 	onto->copied = true;
  
  	/* setup XID array */
  	if (snapshot->xcnt > 0)
  	{
! 		onto->xip = (TransactionId *) (onto + 1);
! 		memcpy(onto->xip, snapshot->xip,
  			   snapshot->xcnt * sizeof(TransactionId));
  	}
  	else
! 		onto->xip = NULL;
  
  	/*
  	 * Setup subXID array. Don't bother to copy it if it had overflowed,
*************** CopySnapshot(Snapshot snapshot)
*** 251,264 ****
  	if (snapshot->subxcnt > 0 &&
  		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
  	{
! 		newsnap->subxip = (TransactionId *) ((char *) newsnap + subxipoff);
! 		memcpy(newsnap->subxip, snapshot->subxip,
  			   snapshot->subxcnt * sizeof(TransactionId));
  	}
  	else
! 		newsnap->subxip = NULL;
  
! 	return newsnap;
  }
  
  /*
--- 282,320 ----
  	if (snapshot->subxcnt > 0 &&
  		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
  	{
! 		onto->subxip = (TransactionId *) ((char *) onto + subxipoff);
! 		memcpy(onto->subxip, snapshot->subxip,
  			   snapshot->subxcnt * sizeof(TransactionId));
  	}
  	else
! 		onto->subxip = NULL;
  
! 	return onto;
! }
! 
! /*
!  * CopySnapshot
!  *		Copy the given snapshot.
!  *
!  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
!  * to 0.  The returned snapshot has the copied flag set.
!  */
! static Snapshot
! CopySnapshot(Snapshot snapshot)
! {
! 	Snapshot	newsnap;
! 	Size		size;
! 
! 	Assert(snapshot != InvalidSnapshot);
! 
! 	/* We allocate any XID arrays needed in the same palloc block. */
! 	size = sizeof(SnapshotData) + snapshot->xcnt * sizeof(TransactionId);
! 	if (snapshot->subxcnt > 0)
! 		size += snapshot->subxcnt * sizeof(TransactionId);
! 
! 	newsnap = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
! 
! 	return CopySnapshotOnto(snapshot, newsnap);
  }
  
  /*
*************** AtEOXact_Snapshot(bool isCommit)
*** 586,588 ****
--- 642,1039 ----
  
  	SnapshotResetXmin();
  }
+ 
+ /*
+  * PreCommit_Snapshot
+  *		Cleans up exported snapshots (this needs to happen before we update
+  *		our MyProc entry, hence it is in PreCommit).
+  */
+ void
+ InvalidateExportedSnapshots(void)
+ {
+ 	ListCell   *snapshot;
+ 	int			i;
+ 	char		buf[MAXPGPATH];
+ 
+ 	if (exportedSnapshots == NIL)
+ 		return;
+ 
+ 	Assert(list_length(exportedSnapshots) > 0);
+ 	Assert(TransactionIdIsValid(GetTopTransactionIdIfAny()));
+ 
+ 	for(i = 1; i <= list_length(exportedSnapshots); i++)
+ 	{
+ 		XactExportFilePath(buf, GetTopTransactionId(), i, "");
+ 		unlink(buf);
+ 	}
+ 
+ 	foreach(snapshot, exportedSnapshots)
+ 		UnregisterSnapshotFromOwner(lfirst(snapshot), TopTransactionResourceOwner);
+ 
+ 	exportedSnapshots = NIL;
+ }
+ 
+ /*
+  * DeleteAllExportedSnapshotFiles
+  *		Cleans up any files that have been left behind by a crashed backend
+  *		that had exported snapshots before it died.
+  */
+ void
+ DeleteAllExportedSnapshotFiles(void)
+ {
+ 	char		buf[MAXPGPATH];
+ 	DIR		   *s_dir;
+ 	struct dirent *s_de;
+ 
+ 	if (!(s_dir = AllocateDir(SNAPSHOT_EXPORT_DIR)))
+ 	{
+ 		/*
+ 		 * We really should have that directory in a sane cluster setup. But
+ 		 * then again if we don't it's not fatal enough to make it FATAL.
+ 		 */
+ 		elog(WARNING,
+ 			"could not open directory \"%s\": %m",
+ 			SNAPSHOT_EXPORT_DIR);
+ 		return;
+ 	}
+ 
+ 	while ((s_de = ReadDir(s_dir, SNAPSHOT_EXPORT_DIR)) != NULL)
+ 	{
+ 		if (strcmp(s_de->d_name, ".") == 0 ||
+ 			strcmp(s_de->d_name, "..") == 0)
+ 			continue;
+ 
+ 		snprintf(buf, MAXPGPATH, SNAPSHOT_EXPORT_DIR "/%s", s_de->d_name);
+ 		unlink(buf);
+ 	}
+ 	FreeDir(s_dir);
+ }
+ 
+ /*
+  * ExportSnapshot
+  *		Export the snapshot to a file so that other backends can import the same
+  *		snapshot.
+  *		Returns the token (the file name) that can be used to import this
+  *		snapshot.
+  */
+ static char *
+ ExportSnapshot(Snapshot snapshot)
+ {
+ #define SNAPSHOT_APPEND(x, y) (appendStringInfo(&buf, (x), (y)))
+ 	TransactionId *children, topXid;
+ 	FILE	   *f;
+ 	int			i;
+ 	int			nchildren;
+ 	MemoryContext oldcxt;
+ 	char		path[MAXPGPATH];
+ 	char		pathtmp[MAXPGPATH];
+ 	StringInfoData buf;
+ 
+ 	Assert(IsTransactionState());
+ 
+ 	/*
+ 	 * This will also assign a transaction id if we do not yet have one.
+ 	 */
+ 	topXid = GetTopTransactionId();
+ 
+ 	Assert(TransactionIdIsValid(GetTopTransactionIdIfAny()));
+ 
+ 	/*
+ 	 * We cannot export a snapshot from a subtransaction because in a
+ 	 * subtransaction we don't see our open subxip values in the snapshot so
+ 	 * they would be missing in the backend applying it.
+ 	 */
+ 	if (GetCurrentTransactionNestLevel() != 1)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+ 				 errmsg("cannot export a snapshot from a subtransaction")));
+ 
+ 	/*
+ 	 * We do however see our already committed subxip values and add them to
+ 	 * the subxip array.
+ 	 */
+ 	nchildren = xactGetCommittedChildren(&children);
+ 
+ 	initStringInfo(&buf);
+ 
+ 	/* Write up all the data that we return */
+ 	SNAPSHOT_APPEND("xid:%d ", topXid);
+ 	SNAPSHOT_APPEND("xmi:%d ", snapshot->xmin);
+ 	SNAPSHOT_APPEND("xma:%d ", snapshot->xmax);
+ 	/* Include our own transaction ID into the count. */
+ 	SNAPSHOT_APPEND("xcnt:%d ", snapshot->xcnt + 1);
+ 	for (i = 0; i < snapshot->xcnt; i++)
+ 		SNAPSHOT_APPEND("xip:%d ", snapshot->xip[i]);
+ 	/*
+ 	 * Finally add our own XID, since by definition we will still be running
+ 	 * when the other transaction takes over the snapshot.
+ 	 */
+ 	SNAPSHOT_APPEND("xip:%d ", topXid);
+ 	if (snapshot->suboverflowed || snapshot->subxcnt + nchildren > TOTAL_MAX_CACHED_SUBXIDS)
+ 		SNAPSHOT_APPEND("sof:%d ", 1);
+ 	else
+ 	{
+ 		SNAPSHOT_APPEND("sxcnt:%d ", snapshot->subxcnt + nchildren);
+ 		for (i = 0; i < snapshot->subxcnt; i++)
+ 			SNAPSHOT_APPEND("sxp:%d ", snapshot->subxip[i]);
+ 		/* Add already committed subtransactions. */
+ 		for (i = 0; i < nchildren; i++)
+ 			SNAPSHOT_APPEND("sxp:%d ", children[i]);
+ 	}
+ 
+ 	/*
+ 	 * buf ends with a trailing space but we leave it in for simplicity. The
+ 	 * parsing routines also depend on it.
+ 	 */
+ 
+ 	/* Register the snapshot and add it to the list of exported snapshots */
+ 	snapshot = RegisterSnapshotOnOwner(snapshot, TopTransactionResourceOwner);
+ 
+ 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+ 	exportedSnapshots = lappend(exportedSnapshots, snapshot);
+ 	MemoryContextSwitchTo(oldcxt);
+ 
+ 	XactExportFilePath(pathtmp, topXid, list_length(exportedSnapshots), ".tmp");
+ 	if (!(f = AllocateFile(pathtmp, PG_BINARY_W)))
+ 		ereport(ERROR,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not create file \"%s\": %m", pathtmp)));
+ 
+ 	if (fwrite(buf.data, buf.len, 1, f) != 1)
+ 		/* Aborting the transaction will also call FreeFile() */
+ 		ereport(ERROR,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not write to file \"%s\": %m", pathtmp)));
+ 
+ 	if (FreeFile(f))
+ 		ereport(ERROR,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not write to file \"%s\": %m", pathtmp)));
+ 
+ 	/*
+ 	 * Now that we have written everything into a .tmp file we rename the file
+ 	 * and remove the .tmp suffix. Our filename is predictable and we're
+ 	 * paranoid enough to not let us read a partially written file (we can't
+ 	 * read a .tmp file because this would fail the valid characters check in
+ 	 * ImportSnapshot).
+ 	 */
+ 	XactExportFilePath(path, topXid, list_length(exportedSnapshots), "");
+ 
+ 	if (rename(pathtmp, path) < 0)
+ 		ereport(WARNING,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
+ 						pathtmp, path)));
+ 
+ 	/*
+ 	 * The basename of the file is what we return from pg_export_snapshot().
+ 	 * It's already in path in a textual format and we know that the path
+ 	 * starts with SNAPSHOT_EXPORT_DIR. Skip over the prefix and over the
+ 	 * slash and pstrdup it to not return a local variable.
+ 	 */
+ 	return pstrdup(path + strlen(SNAPSHOT_EXPORT_DIR) + 1);
+ #undef SNAPSHOT_APPEND
+ }
+ 
+ /*
+  * Poor man's type independent parser. We only use it in the three functions
+  * below so there's no need to get ambitious about putting extra (x) around the
+  * arguments.
+  */
+ #define SNAPSHOT_PARSE(valFunc, inFunc, type, strpp, prfx, notfound)		\
+ 	do {																	\
+ 		char	   *n, *p = strstr(*strpp, prfx);							\
+ 		type		v;														\
+ 																			\
+ 		if (!p)																\
+ 			return notfound;												\
+ 		p += strlen(prfx);													\
+ 		n = strchr(p, ' ');													\
+ 		if (!n)																\
+ 			return notfound;												\
+ 		*n = '\0';															\
+ 		v = valFunc(DirectFunctionCall1(inFunc, CStringGetDatum(p)));		\
+ 		*strpp = n + 1;														\
+ 		return v;															\
+ 	} while (0);
+ 
+ static int
+ parseIntFromText(char **s, const char *prefix)
+ {
+ 	SNAPSHOT_PARSE(DatumGetInt32, int4in, int, s, prefix, 0);
+ }
+ 
+ static bool
+ parseBoolFromText(char **s, const char *prefix)
+ {
+ 	SNAPSHOT_PARSE(DatumGetInt32, int4in, bool, s, prefix, false);
+ }
+ 
+ static TransactionId
+ parseXactFromText(char **s, const char *prefix)
+ {
+ 	SNAPSHOT_PARSE(DatumGetTransactionId, xidin, TransactionId,
+ 				   s, prefix, InvalidTransactionId);
+ }
+ 
+ #undef SNAPSHOT_PARSE
+ 
+ /*
+  * ImportSnapshot
+  *      Import a previously exported snapshot. We expect that whatever we get
+  *      is a filename in SNAPSHOT_EXPORT_DIR. Load the snapshot from that file.
+  *      This is called from "SET TRANSACTION SNAPSHOT 'foo'" and we always
+  *      start fresh from zero with respect to the transaction state that we
+  *      work on. Returns true on success and false on failure.
+  */
+ bool
+ ImportSnapshot(char *idstr)
+ {
+ 	char		path[MAXPGPATH];
+ 	FILE	   *f;
+ 	int			i;
+ 	char	   *s;
+ 	struct stat	stat_buf;
+ 	int			sxcnt, xcnt;
+ 	TransactionId xid, origXid, myXid;
+ 	SnapshotData snapshot = {HeapTupleSatisfiesMVCC};
+ 
+ 	if (FirstSnapshotSet || GetCurrentTransactionNestLevel() != 1)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+ 			 errmsg("SET TRANSACTION SNAPSHOT must be called before any query")));
+ 
+ 	/*
+ 	 * If we were in read committed mode then the next query would execute with a
+ 	 * new snapshot thus making this function call quite useless.
+ 	 */
+ 	if (!IsolationUsesXactSnapshot())
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 			 errmsg("A snapshot importing transaction must have ISOLATION "
+ 					"LEVEL SERIALIZABLE or ISOLATION LEVEL REPEATABLE READ")));
+ 
+ 	/* We're lucky to always start off from a pretty clean state */
+ 	Assert(IsTransactionState());
+ 	Assert(GetCurrentTransactionNestLevel() == 1);
+ 	Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+ 	Assert(CurrentSnapshot == NULL);
+ 	Assert(SecondarySnapshot == NULL);
+ 	Assert(RegisteredSnapshots == 0);
+ 
+ 	/* verify the identifier, only 0-9,A-F and a hyphen are allowed... */
+ 	s = idstr;
+ 	while (*s)
+ 	{
+ 		if (!isdigit(*s) && !(*s >= 'A' && *s <= 'F') && *s != '-')
+ 			ereport(ERROR,
+ 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 				 errmsg("could not import the requested snapshot"),
+ 				 errhint("the given snapshot identifier contains invalid characters")));
+ 		s++;
+ 	}
+ 
+ 	/*
+ 	 * Assign a transaction id. We only do this to detect a possible
+ 	 * transaction id wraparound which is somewhere between unlikely
+ 	 * and impossible...
+ 	 */
+ 	myXid = GetTopTransactionId();
+ 
+ 	snprintf(path, MAXPGPATH, SNAPSHOT_EXPORT_DIR "/%s", idstr);
+ 
+ 	/* get the size of the file so that we know how much memory we need */
+ 	if (stat(path, &stat_buf) != 0 || !(f = AllocateFile(path, PG_BINARY_R)))
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 			 errmsg("could not import the requested snapshot"),
+ 			 errhint("snapshot has not been exported or does not exist anymore")));
+ 
+ 	s = palloc(stat_buf.st_size + 1);
+ 	if (fread(s, stat_buf.st_size, 1, f) != 1)
+ 		return false;
+ 
+ 	s[stat_buf.st_size] = '\0';
+ 
+ 	FreeFile(f);
+ 
+ 	origXid = parseXactFromText(&s, "xid:");
+ 
+ 	snapshot.xmin = parseXactFromText(&s, "xmi:");
+ 	Assert(snapshot.xmin != InvalidTransactionId);
+ 	snapshot.xmax = parseXactFromText(&s, "xma:");
+ 	Assert(snapshot.xmax != InvalidTransactionId);
+ 
+ 	xcnt = parseIntFromText(&s, "xcnt:");
+ 	/*
+ 	 * This snapshot only serves as a template, there is no need for it to have
+ 	 * maxProcs entries, so let's make it just as large as we need it.
+ 	 */
+ 	snapshot.xip = palloc(xcnt * sizeof(TransactionId));
+ 
+ 	i = 0;
+ 	while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
+ 		snapshot.xip[i++] = xid;
+ 	snapshot.xcnt = i;
+ 	Assert(snapshot.xcnt == xcnt);
+ 
+ 	/*
+ 	 * We only write "sof:1" if the snapshot overflowed. If not, then there is
+ 	 * no "sof:x" entry at all and parseBoolFromText() will return false.
+ 	 */
+ 	snapshot.suboverflowed = parseBoolFromText(&s, "sof:");
+ 
+ 	if (!snapshot.suboverflowed)
+ 	{
+ 		sxcnt = parseIntFromText(&s, "sxcnt:");
+ 		snapshot.subxip = palloc(sxcnt * sizeof(TransactionId));
+ 
+ 		i = 0;
+ 		while ((xid = parseXactFromText(&s, "sxp:")) != InvalidTransactionId)
+ 			snapshot.subxip[i++] = xid;
+ 		snapshot.subxcnt = i;
+ 		Assert(snapshot.subxcnt == sxcnt);
+ 	} else {
+ 		snapshot.subxip = NULL;
+ 		snapshot.subxcnt = 0;
+ 	}
+ 
+ 	/* complete the snapshot data structure */
+ 	snapshot.curcid = 0;
+ 	snapshot.takenDuringRecovery = RecoveryInProgress();
+ 
+ 	/*
+ 	 * Note that MyProc->xmin can go backwards here. However this is safe
+ 	 * because the xmin we set here is the same as in the backend's proc->xmin
+ 	 * whose snapshot we are copying. At this very moment, anybody computing a
+ 	 * minimum will calculate at least this xmin as the overall xmin with or
+ 	 * without us setting MyProc->xmin to this value.
+ 	 */
+ 	LWLockAcquire(ProcArrayLock, LW_SHARED);
+ 	MyProc->xmin = snapshot.xmin;
+ 	LWLockRelease(ProcArrayLock);
+ 
+ 	/* bail out if the original transaction is not running anymore... */
+ 	if (!TransactionIdIsInProgress(origXid) || TransactionIdPrecedes(myXid, origXid))
+ 		return false;
+ 
+ 	/*
+ 	 * Install the snapshot as if we got it through GetTransactionSnapshot().
+ 	 * This will set up CurrentSnapshot and also set up the predicate locks for a
+ 	 * serializable transaction.
+ 	 */
+ 	GetTransactionSnapshotFromTemplate(&snapshot);
+ 	return true;
+ }
+ 
+ Datum
+ pg_export_snapshot(PG_FUNCTION_ARGS)
+ {
+ 	char	   *snapshotData;
+ 
+ 	RequireTransactionChain(true, "pg_export_snapshot()");
+ 
+ 	snapshotData = ExportSnapshot(GetTransactionSnapshot());
+ 	PG_RETURN_TEXT_P(cstring_to_text(snapshotData));
+ }
+ 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index e535fda..0e981cd 100644
*** a/src/bin/initdb/initdb.c
--- b/src/bin/initdb/initdb.c
*************** main(int argc, char *argv[])
*** 2557,2562 ****
--- 2557,2563 ----
  		"pg_serial",
  		"pg_subtrans",
  		"pg_twophase",
+ 		"pg_snapshots",
  		"pg_multixact/members",
  		"pg_multixact/offsets",
  		"base",
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 0019df5..4706f10 100644
*** a/src/include/access/twophase.h
--- b/src/include/access/twophase.h
***************
*** 22,30 ****
   */
  typedef struct GlobalTransactionData *GlobalTransaction;
  
- /* GUC variable */
- extern int	max_prepared_xacts;
- 
  extern Size TwoPhaseShmemSize(void);
  extern void TwoPhaseShmemInit(void);
  
--- 22,27 ----
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 96f43fe..a4e0387 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 2171 ( pg_cancel_backe
*** 2853,2858 ****
--- 2853,2860 ----
  DESCR("cancel a server process' current query");
  DATA(insert OID = 2096 ( pg_terminate_backend		PGNSP PGUID 12 1 0 0 0 f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
  DESCR("terminate a server process");
+ DATA(insert OID = 3122 ( pg_export_snapshot		PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 25 "" _null_ _null_ _null_ _null_ pg_export_snapshot _null_ _null_ _null_ ));
+ DESCR("export a snapshot");
  DATA(insert OID = 2172 ( pg_start_backup		PGNSP PGUID 12 1 0 0 0 f f f t f v 2 0 25 "25 16" _null_ _null_ _null_ _null_ pg_start_backup _null_ _null_ _null_ ));
  DESCR("prepare for taking an online backup");
  DATA(insert OID = 2173 ( pg_stop_backup			PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 25 "" _null_ _null_ _null_ _null_ pg_stop_backup _null_ _null_ _null_ ));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 9d19417..15326cf 100644
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
*************** extern PGDLLIMPORT int NBuffers;
*** 134,139 ****
--- 134,142 ----
  extern int	MaxBackends;
  extern int	MaxConnections;
  
+ /* GUC variable */
+ extern int	max_prepared_xacts;
+ 
  extern PGDLLIMPORT int MyProcPid;
  extern PGDLLIMPORT pg_time_t MyStartTime;
  extern PGDLLIMPORT struct Port *MyProcPort;
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 12c2faf..3d170bc 100644
*** a/src/include/parser/kwlist.h
--- b/src/include/parser/kwlist.h
*************** PG_KEYWORD("show", SHOW, UNRESERVED_KEYW
*** 337,342 ****
--- 337,343 ----
  PG_KEYWORD("similar", SIMILAR, TYPE_FUNC_NAME_KEYWORD)
  PG_KEYWORD("simple", SIMPLE, UNRESERVED_KEYWORD)
  PG_KEYWORD("smallint", SMALLINT, COL_NAME_KEYWORD)
+ PG_KEYWORD("snapshot", SNAPSHOT, UNRESERVED_KEYWORD)
  PG_KEYWORD("some", SOME, RESERVED_KEYWORD)
  PG_KEYWORD("stable", STABLE, UNRESERVED_KEYWORD)
  PG_KEYWORD("standalone", STANDALONE_P, UNRESERVED_KEYWORD)
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index 9603b10..50b046e 100644
*** a/src/include/storage/predicate.h
--- b/src/include/storage/predicate.h
*************** extern void CheckPointPredicate(void);
*** 42,48 ****
  extern bool PageIsPredicateLocked(Relation relation, BlockNumber blkno);
  
  /* predicate lock maintenance */
! extern Snapshot GetSerializableTransactionSnapshot(Snapshot snapshot);
  extern void RegisterPredicateLockingXid(TransactionId xid);
  extern void PredicateLockRelation(Relation relation, Snapshot snapshot);
  extern void PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot);
--- 42,48 ----
  extern bool PageIsPredicateLocked(Relation relation, BlockNumber blkno);
  
  /* predicate lock maintenance */
! extern Snapshot GetSerializableTransactionSnapshot(Snapshot snapshot, Snapshot stemplate);
  extern void RegisterPredicateLockingXid(TransactionId xid);
  extern void PredicateLockRelation(Relation relation, Snapshot snapshot);
  extern void PredicateLockPage(Relation relation, BlockNumber blkno, Snapshot snapshot);
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a11d438..dfe57aa 100644
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
*************** extern void ExpireOldKnownAssignedTransa
*** 39,45 ****
  
  extern RunningTransactions GetRunningTransactionData(void);
  
! extern Snapshot GetSnapshotData(Snapshot snapshot);
  
  extern bool TransactionIdIsInProgress(TransactionId xid);
  extern bool TransactionIdIsActive(TransactionId xid);
--- 39,45 ----
  
  extern RunningTransactions GetRunningTransactionData(void);
  
! extern Snapshot GetSnapshotData(Snapshot snapshot, Snapshot stemplate);
  
  extern bool TransactionIdIsInProgress(TransactionId xid);
  extern bool TransactionIdIsActive(TransactionId xid);
*************** extern void XidCacheRemoveRunningXids(Tr
*** 69,72 ****
--- 69,78 ----
  						  int nxids, const TransactionId *xids,
  						  TransactionId latestXid);
  
+ /* Size of the ProcArray structure itself */
+ #define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)
+ 
+ #define TOTAL_MAX_CACHED_SUBXIDS \
+ 	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
+ 
  #endif   /* PROCARRAY_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index e665a28..a26e2fe 100644
*** a/src/include/utils/snapmgr.h
--- b/src/include/utils/snapmgr.h
*************** extern TransactionId TransactionXmin;
*** 22,27 ****
--- 22,29 ----
  extern TransactionId RecentXmin;
  extern TransactionId RecentGlobalXmin;
  
+ extern List *exportedSnapshots;
+ 
  extern Snapshot GetTransactionSnapshot(void);
  extern Snapshot GetLatestSnapshot(void);
  extern void SnapshotSetCommandId(CommandId curcid);
*************** extern void UpdateActiveSnapshotCommandI
*** 32,37 ****
--- 34,40 ----
  extern void PopActiveSnapshot(void);
  extern Snapshot GetActiveSnapshot(void);
  extern bool ActiveSnapshotSet(void);
+ extern Snapshot CopySnapshotOnto(Snapshot onto, Snapshot snapshot);
  
  extern Snapshot RegisterSnapshot(Snapshot snapshot);
  extern void UnregisterSnapshot(Snapshot snapshot);
*************** extern void AtSubCommit_Snapshot(int lev
*** 42,45 ****
--- 45,53 ----
  extern void AtSubAbort_Snapshot(int level);
  extern void AtEOXact_Snapshot(bool isCommit);
  
+ extern Datum pg_export_snapshot(PG_FUNCTION_ARGS);
+ extern bool ImportSnapshot(char *idstr);
+ extern void InvalidateExportedSnapshots(void);
+ extern void DeleteAllExportedSnapshotFiles(void);
+ 
  #endif   /* SNAPMGR_H */

#34

Marko Tiikkaja

marko.tiikkaja@2ndquadrant.com

over 14 years ago

In reply to: Joachim Wieland (#33)

Re: synchronized snapshots

On 2011-09-28 15:25, Joachim Wieland wrote:

Yes, that's the desired behaviour, the patch add this paragraph to the
documentation already:

I can't believe I missed that. My apologies.

On 2011-09-29 05:16, Joachim Wieland wrote:

The attached patch addresses the reported issues.

Thanks, this one looks good to me. Going to mark this patch as ready
for committer.

--
Marko Tiikkaja http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#35

Simon Riggs

simon@2ndQuadrant.com

about 14 years ago

In reply to: Marko Tiikkaja (#34)

Re: synchronized snapshots

On Mon, Oct 3, 2011 at 7:09 AM, Marko Tiikkaja
<marko.tiikkaja@2ndquadrant.com> wrote:

On 2011-09-28 15:25, Joachim Wieland wrote:

Yes, that's the desired behaviour, the patch add this paragraph to the
documentation already:

I can't believe I missed that. My apologies.

On 2011-09-29 05:16, Joachim Wieland wrote:

The attached patch addresses the reported issues.

Thanks, this one looks good to me. Going to mark this patch as ready for
committer.

I don't see any tests with this patch, so I personally won't be the
committer on this just yet.

Also, not sure why the snapshot id syntax has leading zeroes on first
part of the number, but not on second part. It will still sort
incorrectly if that's what we were trying to achieve. Either leave off
the leading zeroes on first part of add them to second.

We probably need some more discussion added to the README about this.

I'm also concerned that we are adding this to the BEGIN statement as
the only option. I don't have a problem with it being there, but I do
think it is a problem to make it the *only* way to use this. Altering
BEGIN gives clear problems with any API that does the begin and commit
for you, such as perl DBI, java JDBC to name just two. I can't really
see its a good implementation if we say this won't work until client
APIs follow our new non-standard syntax. I wouldn't block commit on
this point, but I think we should work on alternative ways to invoke
this feature as well.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#36

Tom Lane

tgl@sss.pgh.pa.us

about 14 years ago

In reply to: Simon Riggs (#35)

Re: synchronized snapshots

Simon Riggs <simon@2ndQuadrant.com> writes:

On Mon, Oct 3, 2011 at 7:09 AM, Marko Tiikkaja

Thanks, this one looks good to me. �Going to mark this patch as ready for
committer.

I don't see any tests with this patch, so I personally won't be the
committer on this just yet.

I've already taken it according to the commitfest app. There's a lot of
things I don't like stylistically, but they all seem fixable, and I'm
working through them now. The only actual bug I've found so far is a
race condition while setting MyProc->xmin (you can't do that separately
from verifying that the source transaction is still running, else
somebody else could see a global xmin that's gone backwards).

Also, not sure why the snapshot id syntax has leading zeroes on first
part of the number, but not on second part. It will still sort
incorrectly if that's what we were trying to achieve. Either leave off
the leading zeroes on first part of add them to second.

The first part is of fixed length, the second not so much. I'm not
wedded to the syntax but I don't see anything wrong with it either.

I'm also concerned that we are adding this to the BEGIN statement as
the only option.

Huh? The last version of the patch has it only as SET TRANSACTION
SNAPSHOT, which I think is the right way.

regards, tom lane

#37

Simon Riggs

simon@2ndQuadrant.com

about 14 years ago

In reply to: Tom Lane (#36)

Re: synchronized snapshots

On Tue, Oct 18, 2011 at 6:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm also concerned that we are adding this to the BEGIN statement as
the only option.

Huh? The last version of the patch has it only as SET TRANSACTION
SNAPSHOT, which I think is the right way.

Sorry Tom, didn't see your name on it earlier, thats not shown on the
main CF display and I didn't think to check on the detail. Please
continue.

I misread the SET TRANSACTION docs changes. Happy with that.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#38

Tom Lane

tgl@sss.pgh.pa.us

about 14 years ago

In reply to: Joachim Wieland (#33)

Re: synchronized snapshots

Joachim Wieland <joe@mcknight.de> writes:

[ synchronized-snapshots patch ]

Looking through this code, it strikes me that SET TRANSACTION SNAPSHOT
is fundamentally incompatible with SERIALIZABLE READ ONLY DEFERRABLE
mode. That mode assumes that you should be able to just take a new
snapshot, repeatedly, until you get one that's "safe". With the patch
as written, if the supplied snapshot is "unsafe", GetSafeSnapshot()
will just go into an infinite loop.

AFAICS we should just throw an error if SET TRANSACTION SNAPSHOT is done
in a transaction with those properties. Has anyone got another
interpretation? Would it be better to silently ignore the DEFERRABLE
property?

regards, tom lane

#39

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 14 years ago

In reply to: Tom Lane (#38)

Re: synchronized snapshots

On 19.10.2011 19:17, Tom Lane wrote:

Joachim Wieland<joe@mcknight.de> writes:

[ synchronized-snapshots patch ]

Looking through this code, it strikes me that SET TRANSACTION SNAPSHOT
is fundamentally incompatible with SERIALIZABLE READ ONLY DEFERRABLE
mode. That mode assumes that you should be able to just take a new
snapshot, repeatedly, until you get one that's "safe". With the patch
as written, if the supplied snapshot is "unsafe", GetSafeSnapshot()
will just go into an infinite loop.

AFAICS we should just throw an error if SET TRANSACTION SNAPSHOT is done
in a transaction with those properties. Has anyone got another
interpretation? Would it be better to silently ignore the DEFERRABLE
property?

An error seems appropriate to me.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#40

Florian Pflug

fgp@phlo.org

about 14 years ago

In reply to: Tom Lane (#38)

Re: synchronized snapshots

On Oct19, 2011, at 18:17 , Tom Lane wrote:

Joachim Wieland <joe@mcknight.de> writes:

[ synchronized-snapshots patch ]

Looking through this code, it strikes me that SET TRANSACTION SNAPSHOT
is fundamentally incompatible with SERIALIZABLE READ ONLY DEFERRABLE
mode. That mode assumes that you should be able to just take a new
snapshot, repeatedly, until you get one that's "safe". With the patch
as written, if the supplied snapshot is "unsafe", GetSafeSnapshot()
will just go into an infinite loop.

AFAICS we should just throw an error if SET TRANSACTION SNAPSHOT is done
in a transaction with those properties. Has anyone got another
interpretation? Would it be better to silently ignore the DEFERRABLE
property?

Hm, both features are meant to be used by pg_dump, so think we should
make the combination work. It'd say SET TRANSACTION SNAPSHOT should throw
an error only if the transaction is marked READ ONLY DEFERRABLE *and*
the provided snapshot isn't "safe".

This allows a deferrable snapshot to be used on a second connection (
by e.g. pg_dump), and still be marked as DEFERRABLE. If we throw an
error unconditionally, the second connection has to import the snapshot
without marking it DEFERRABLE, which I think has consequences for
performance. AFAIR, the SERIALIZABLE implementation is able to skip
almost all (or all? Kevin?) SIREAD lock acquisitions in DEFERRABLE READ
ONLY transaction, because those cannot participate in dangerous (i.e.
non-serializable) dependency structures.

best regards,
Florian Pflug

#41

Tom Lane

tgl@sss.pgh.pa.us

about 14 years ago

In reply to: Florian Pflug (#40)

Re: synchronized snapshots

Florian Pflug <fgp@phlo.org> writes:

On Oct19, 2011, at 18:17 , Tom Lane wrote:

AFAICS we should just throw an error if SET TRANSACTION SNAPSHOT is done
in a transaction with those properties. Has anyone got another
interpretation? Would it be better to silently ignore the DEFERRABLE
property?

Hm, both features are meant to be used by pg_dump, so think we should
make the combination work. It'd say SET TRANSACTION SNAPSHOT should throw
an error only if the transaction is marked READ ONLY DEFERRABLE *and*
the provided snapshot isn't "safe".

Um, no, I don't think so. It would be sensible for the "leader"
transaction to use READ ONLY DEFERRABLE and then export the snapshot it
got (possibly after waiting). It doesn't follow that the child
transactions should use DEFERRABLE too. They're not going to wait.

This allows a deferrable snapshot to be used on a second connection (
by e.g. pg_dump), and still be marked as DEFERRABLE. If we throw an
error unconditionally, the second connection has to import the snapshot
without marking it DEFERRABLE, which I think has consequences for
performance.

No, I don't believe that either. AIUI the performance benefit comes if
the snapshot is recognized as safe. DEFERRABLE only means to keep
retrying until you get a safe one. This is nonsense when you're
importing the snapshot.

regards, tom lane

#42

Robert Haas

robertmhaas@gmail.com

about 14 years ago

In reply to: Tom Lane (#41)

Re: synchronized snapshots

On Wed, Oct 19, 2011 at 1:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Florian Pflug <fgp@phlo.org> writes:

On Oct19, 2011, at 18:17 , Tom Lane wrote:

AFAICS we should just throw an error if SET TRANSACTION SNAPSHOT is done
in a transaction with those properties. Has anyone got another
interpretation? Would it be better to silently ignore the DEFERRABLE
property?

Hm, both features are meant to be used by pg_dump, so think we should
make the combination work. It'd say SET TRANSACTION SNAPSHOT should throw
an error only if the transaction is marked READ ONLY DEFERRABLE *and*
the provided snapshot isn't "safe".

Um, no, I don't think so. It would be sensible for the "leader"
transaction to use READ ONLY DEFERRABLE and then export the snapshot it
got (possibly after waiting). It doesn't follow that the child
transactions should use DEFERRABLE too. They're not going to wait.

This allows a deferrable snapshot to be used on a second connection (
by e.g. pg_dump), and still be marked as DEFERRABLE. If we throw an
error unconditionally, the second connection has to import the snapshot
without marking it DEFERRABLE, which I think has consequences for
performance.

No, I don't believe that either. AIUI the performance benefit comes if
the snapshot is recognized as safe. DEFERRABLE only means to keep
retrying until you get a safe one. This is nonsense when you're
importing the snapshot.

I think the requirement is that we need to do the appropriate push-ups
so that the people who import the snapshot know that it's safe, and
that the SSI stuff can all be skipped.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#43

Kevin Grittner

Kevin.Grittner@wicourts.gov

about 14 years ago

In reply to: Robert Haas (#42)

Re: synchronized snapshots

Robert Haas <robertmhaas@gmail.com> wrote:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Florian Pflug <fgp@phlo.org> writes:

This allows a deferrable snapshot to be used on a second
connection (by e.g. pg_dump), and still be marked as DEFERRABLE.
If we throw an error unconditionally, the second connection has
to import the snapshot without marking it DEFERRABLE, which I
think has consequences for performance.

No, I don't believe that either. AIUI the performance benefit
comes if the snapshot is recognized as safe. DEFERRABLE only
means to keep retrying until you get a safe one.

Right, there are other circumstances in which a READ ONLY
transaction's snapshot may be recognized as safe, and it can opt out
of all the additional SSI logic. As you say, DEFERRABLE means we
*wait* for that.

This is nonsense when you're importing the snapshot.

Agreed.

I think the requirement is that we need to do the appropriate
push-ups so that the people who import the snapshot know that it's
safe, and that the SSI stuff can all be skipped.

If the snapshot was safe in the first process, it will be safe for
any others with which it is shared. Basically, a SERIALIZABLE READ
ONLY DEFERRABLE transaction waits for a snapshot which, as a READ
ONLY transaction, can't see any serialization anomalies. It can run
exactly like a REPEATABLE READ transaction. In fact, it would be OK
from a functional perspective if the first transaction in pg_dump
got a safe snapshot through DEFERRABLE techniques and then shared it
with REPEATABLE READ transactions.

I don't know which is the best way to implement this, but it should
be fine to skip the DEFERRABLE logic for secondary users, as long as
they are READ ONLY.

-Kevin

#44

Florian Pflug

fgp@phlo.org

about 14 years ago

In reply to: Kevin Grittner (#43)

Re: synchronized snapshots

On Oct19, 2011, at 19:49 , Kevin Grittner wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Florian Pflug <fgp@phlo.org> writes:

This allows a deferrable snapshot to be used on a second
connection (by e.g. pg_dump), and still be marked as DEFERRABLE.
If we throw an error unconditionally, the second connection has
to import the snapshot without marking it DEFERRABLE, which I
think has consequences for performance.

No, I don't believe that either. AIUI the performance benefit
comes if the snapshot is recognized as safe. DEFERRABLE only
means to keep retrying until you get a safe one.

Right, there are other circumstances in which a READ ONLY
transaction's snapshot may be recognized as safe, and it can opt out
of all the additional SSI logic. As you say, DEFERRABLE means we
*wait* for that.

Oh, cool. I thought the opt-out only works for explicitly DEFERRABLE
transactions.

best regards,
Florian Pflug

#45

Kevin Grittner

Kevin.Grittner@wicourts.gov

about 14 years ago

In reply to: Florian Pflug (#44)

Re: synchronized snapshots

Florian Pflug <fgp@phlo.org> wrote:

Oh, cool. I thought the opt-out only works for explicitly
DEFERRABLE transactions.

Basically, if there is no serializable read-write transaction active
which overlaps the read-only transaction and also overlaps a
serializable transaction which wrote something and committed in time
to be visible to the read-only transaction, then the read-only
transaction's snapshot is "safe" and it can stop worrying about SSI
logic. If these conditions happen to exist when a read-only
transaction is starting, it never needs to set up for SSI; it can
run just like a REPEATABLE READ transaction and still be safe from
serialization anomalies. We make some effort to spot the transition
to this state while a read-only transaction is running, allowing it
to "drop out" of SSI while running.

The fact that a read-only transaction can often skip some or all of
the SSI overhead (beyond determining that opting out is safe) is why
declaring transactions to be READ ONLY when possible is #1 on my
list of performance considerations for SSI.

-Kevin

#46

Tom Lane

tgl@sss.pgh.pa.us

about 14 years ago

In reply to: Joachim Wieland (#33)

Re: synchronized snapshots

Joachim Wieland <joe@mcknight.de> writes:

[ synchronized snapshots patch ]

Applied with, um, rather extensive editorialization.

I'm not convinced that the SSI case is bulletproof yet, but it'll be
easier to test with the code committed.

regards, tom lane

#47

Thom Brown

thom@linux.com

about 14 years ago

In reply to: Tom Lane (#46)

Re: synchronized snapshots

On 23 October 2011 00:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Joachim Wieland <joe@mcknight.de> writes:

[ synchronized snapshots patch ]

Applied with, um, rather extensive editorialization.

I'm not convinced that the SSI case is bulletproof yet, but it'll be
easier to test with the code committed.

Can I ask why it doesn't return the same snapshot ID each time?
Surely it can't change since you can only export the snapshot of a
serializable or repeatable read transaction? A "SELECT
count(pg_export_snapshot()) FROM generate_series(1,10000000);" would
quickly bork the pg_snapshots directory which any user can run.

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#48

Tom Lane

tgl@sss.pgh.pa.us

about 14 years ago

In reply to: Thom Brown (#47)

Re: synchronized snapshots

Thom Brown <thom@linux.com> writes:

Can I ask why it doesn't return the same snapshot ID each time?
Surely it can't change since you can only export the snapshot of a
serializable or repeatable read transaction?

No, that's incorrect. You can export from a READ COMMITTED transaction;
indeed, you'd more or less have to, if you want the control transaction
to be able to see what the slaves do.

A "SELECT
count(pg_export_snapshot()) FROM generate_series(1,10000000);" would
quickly bork the pg_snapshots directory which any user can run.

Shrug ... you can create a much more severe DOS problem by making
zillions of tables, if the filesystem doesn't handle lots-o-files
well.

regards, tom lane

#49

Thom Brown

thom@linux.com

about 14 years ago

In reply to: Tom Lane (#48)

Re: synchronized snapshots

On 23 October 2011 03:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thom Brown <thom@linux.com> writes:

Can I ask why it doesn't return the same snapshot ID each time?
Surely it can't change since you can only export the snapshot of a
serializable or repeatable read transaction?

No, that's incorrect. You can export from a READ COMMITTED transaction;
indeed, you'd more or less have to, if you want the control transaction
to be able to see what the slaves do.

My bad. I didn't read the documentation carefully enough. I can make
sense of it now.

Thanks

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company