RFC: Making TRUNCATE more "MVCC-safe"

Started by Marti Raudseppalmost 14 years ago37 messages
#1Marti Raudsepp
marti@juffo.org
1 attachment(s)

Hi!

I've always been a little wary of using the TRUNCATE command due to
the warning in the documentation about it not being "MVCC-safe":
queries may silently give wrong results and it's hard to tell when
they are affected.

That got me thinking, why can't we handle this like a standby server
does -- if some query has data removed from underneath it, it aborts
with a serialization failure.

Does this solution sound like a good idea?

The attached patch is a lame attempt at implementing this. I added a
new pg_class.relvalidxmin attribute which tracks the Xid of the last
TRUNCATE (maybe it should be called reltruncatexid?). Whenever
starting a relation scan with a snapshot older than relvalidxmin, an
error is thrown. This seems to work out well since TRUNCATE updates
pg_class anyway, and anyone who sees the new relfilenode automatically
knows when it was truncated.

Am I on the right track? Are there any better ways to attach this
information to a relation?
Should I also add another counter to pg_stat_database_conflicts?
Currently this table is only used on standby servers.

Since I wrote it just this afternoon, there are a few things still
wrong with the patch (it doesn't handle xid wraparound for one), so
don't be too picky about the code yet. :)

Example:
CREATE TABLE foo (i int);
Session A:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT txid_current(); -- Force snapshot open
Session B:
TRUNCATE TABLE foo;
Session A:
SELECT * FROM foo;
ERROR: canceling statement due to conflict with TRUNCATE TABLE on foo
DETAIL: Rows visible to this transaction have been removed.

Patch also available in my github 'truncate' branch:
https://github.com/intgr/postgres/commits/truncate

Regards,
Marti

Attachments:

safe-truncate.patchtext/x-patch; charset=US-ASCII; name=safe-truncate.patchDownload
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
new file mode 100644
index 99a431a..e909b17
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*************** heap_beginscan_internal(Relation relatio
*** 1175,1180 ****
--- 1175,1196 ----
  	HeapScanDesc scan;
  
  	/*
+ 	 * If the snapshot is older than relvalidxmin then that means TRUNCATE has
+ 	 * removed data that we should still see. Abort rather than giving
+ 	 * potentially bogus results.
+ 	 */
+ 	if (relation->rd_rel->relvalidxmin != InvalidTransactionId &&
+ 		snapshot->xmax != InvalidTransactionId &&
+ 		NormalTransactionIdPrecedes(snapshot->xmax, relation->rd_rel->relvalidxmin))
+ 	{
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ 				 errmsg("canceling statement due to conflict with TRUNCATE TABLE on %s",
+ 						 NameStr(relation->rd_rel->relname)),
+ 				 errdetail("Rows visible to this transaction have been removed.")));
+ 	}
+ 
+ 	/*
  	 * increment relation ref count while scanning relation
  	 *
  	 * This is just to make really sure the relcache entry won't go away while
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
new file mode 100644
index aef410a..3f9bd5d
*** a/src/backend/catalog/heap.c
--- b/src/backend/catalog/heap.c
*************** InsertPgClassTuple(Relation pg_class_des
*** 787,792 ****
--- 787,793 ----
  	values[Anum_pg_class_relhastriggers - 1] = BoolGetDatum(rd_rel->relhastriggers);
  	values[Anum_pg_class_relhassubclass - 1] = BoolGetDatum(rd_rel->relhassubclass);
  	values[Anum_pg_class_relfrozenxid - 1] = TransactionIdGetDatum(rd_rel->relfrozenxid);
+ 	values[Anum_pg_class_relvalidxmin - 1] = TransactionIdGetDatum(rd_rel->relvalidxmin);
  	if (relacl != (Datum) 0)
  		values[Anum_pg_class_relacl - 1] = relacl;
  	else
*************** AddNewRelationTuple(Relation pg_class_de
*** 884,889 ****
--- 885,891 ----
  		new_rel_reltup->relfrozenxid = InvalidTransactionId;
  	}
  
+ 	new_rel_reltup->relvalidxmin = InvalidTransactionId;
  	new_rel_reltup->relowner = relowner;
  	new_rel_reltup->reltype = new_type_oid;
  	new_rel_reltup->reloftype = reloftype;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
new file mode 100644
index bfbe642..0d96a6a
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
*************** reindex_index(Oid indexId, bool skip_con
*** 2854,2860 ****
  		}
  
  		/* We'll build a new physical relation for the index */
! 		RelationSetNewRelfilenode(iRel, InvalidTransactionId);
  
  		/* Initialize the index and rebuild */
  		/* Note: we do not need to re-establish pkey setting */
--- 2854,2860 ----
  		}
  
  		/* We'll build a new physical relation for the index */
! 		RelationSetNewRelfilenode(iRel, InvalidTransactionId, InvalidTransactionId);
  
  		/* Initialize the index and rebuild */
  		/* Note: we do not need to re-establish pkey setting */
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
new file mode 100644
index 1f95abc..ab4aec2
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
*************** ResetSequence(Oid seq_relid)
*** 287,293 ****
  	 * Create a new storage file for the sequence.	We want to keep the
  	 * sequence's relfrozenxid at 0, since it won't contain any unfrozen XIDs.
  	 */
! 	RelationSetNewRelfilenode(seq_rel, InvalidTransactionId);
  
  	/*
  	 * Insert the modified tuple into the new storage file.
--- 287,293 ----
  	 * Create a new storage file for the sequence.	We want to keep the
  	 * sequence's relfrozenxid at 0, since it won't contain any unfrozen XIDs.
  	 */
! 	RelationSetNewRelfilenode(seq_rel, InvalidTransactionId, InvalidTransactionId);
  
  	/*
  	 * Insert the modified tuple into the new storage file.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
new file mode 100644
index 07dc326..993dc41
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***************
*** 19,24 ****
--- 19,25 ----
  #include "access/reloptions.h"
  #include "access/relscan.h"
  #include "access/sysattr.h"
+ #include "access/transam.h"
  #include "access/xact.h"
  #include "catalog/catalog.h"
  #include "catalog/dependency.h"
*************** ExecuteTruncate(TruncateStmt *stmt)
*** 1118,1124 ****
  			 * as the relfilenode value. The old storage file is scheduled for
  			 * deletion at commit.
  			 */
! 			RelationSetNewRelfilenode(rel, RecentXmin);
  
  			heap_relid = RelationGetRelid(rel);
  			toast_relid = rel->rd_rel->reltoastrelid;
--- 1119,1125 ----
  			 * as the relfilenode value. The old storage file is scheduled for
  			 * deletion at commit.
  			 */
! 			RelationSetNewRelfilenode(rel, RecentXmin, GetCurrentTransactionId());
  
  			heap_relid = RelationGetRelid(rel);
  			toast_relid = rel->rd_rel->reltoastrelid;
*************** ExecuteTruncate(TruncateStmt *stmt)
*** 1129,1135 ****
  			if (OidIsValid(toast_relid))
  			{
  				rel = relation_open(toast_relid, AccessExclusiveLock);
! 				RelationSetNewRelfilenode(rel, RecentXmin);
  				heap_close(rel, NoLock);
  			}
  
--- 1130,1141 ----
  			if (OidIsValid(toast_relid))
  			{
  				rel = relation_open(toast_relid, AccessExclusiveLock);
! 				RelationSetNewRelfilenode(rel, RecentXmin, InvalidTransactionId);
! 
! 				/*
! 				 * We don't need a cmin here because there can't be any open
! 				 * cursors in the same session as the TRUNCATE command.
! 				 */
  				heap_close(rel, NoLock);
  			}
  
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
new file mode 100644
index a59950e..17303e8
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
*************** RelationBuildLocalRelation(const char *r
*** 2605,2611 ****
   * the XIDs that will be put into the new relation contents.
   */
  void
! RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid)
  {
  	Oid			newrelfilenode;
  	RelFileNodeBackend newrnode;
--- 2605,2612 ----
   * the XIDs that will be put into the new relation contents.
   */
  void
! RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid,
! 						  TransactionId validxmin)
  {
  	Oid			newrelfilenode;
  	RelFileNodeBackend newrnode;
*************** RelationSetNewRelfilenode(Relation relat
*** 2674,2679 ****
--- 2675,2684 ----
  	}
  	classform->relfrozenxid = freezeXid;
  
+ 	if (validxmin != InvalidTransactionId &&
+ 		TransactionIdPrecedes(classform->relvalidxmin, validxmin))
+ 		classform->relvalidxmin = validxmin;
+ 
  	simple_heap_update(pg_class, &tuple->t_self, tuple);
  	CatalogUpdateIndexes(pg_class, tuple);
  
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
new file mode 100644
index 78c3c9e..acd2558
*** a/src/include/catalog/catversion.h
--- b/src/include/catalog/catversion.h
***************
*** 53,58 ****
   */
  
  /*							yyyymmddN */
! #define CATALOG_VERSION_NO	201202083
  
  #endif
--- 53,58 ----
   */
  
  /*							yyyymmddN */
! #define CATALOG_VERSION_NO	201202091
  
  #endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
new file mode 100644
index 3b01bb4..8f1ed9b
*** a/src/include/catalog/pg_class.h
--- b/src/include/catalog/pg_class.h
*************** CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI
*** 67,72 ****
--- 67,73 ----
  	bool		relhastriggers; /* has (or has had) any TRIGGERs */
  	bool		relhassubclass; /* has (or has had) derived classes */
  	TransactionId relfrozenxid; /* all Xids < this are frozen in this rel */
+ 	TransactionId relvalidxmin;	/* data is only valid for snapshots > this Xid */
  
  #ifdef CATALOG_VARLEN			/* variable-length fields start here */
  	/* NOTE: These fields are not present in a relcache entry's rd_rel field. */
*************** CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI
*** 77,83 ****
  
  /* Size of fixed part of pg_class tuples, not counting var-length fields */
  #define CLASS_TUPLE_SIZE \
! 	 (offsetof(FormData_pg_class,relfrozenxid) + sizeof(TransactionId))
  
  /* ----------------
   *		Form_pg_class corresponds to a pointer to a tuple with
--- 78,84 ----
  
  /* Size of fixed part of pg_class tuples, not counting var-length fields */
  #define CLASS_TUPLE_SIZE \
! 	 (offsetof(FormData_pg_class,relvalidxmin) + sizeof(TransactionId))
  
  /* ----------------
   *		Form_pg_class corresponds to a pointer to a tuple with
*************** typedef FormData_pg_class *Form_pg_class
*** 91,97 ****
   * ----------------
   */
  
! #define Natts_pg_class					27
  #define Anum_pg_class_relname			1
  #define Anum_pg_class_relnamespace		2
  #define Anum_pg_class_reltype			3
--- 92,98 ----
   * ----------------
   */
  
! #define Natts_pg_class					28
  #define Anum_pg_class_relname			1
  #define Anum_pg_class_relnamespace		2
  #define Anum_pg_class_reltype			3
*************** typedef FormData_pg_class *Form_pg_class
*** 117,124 ****
  #define Anum_pg_class_relhastriggers	23
  #define Anum_pg_class_relhassubclass	24
  #define Anum_pg_class_relfrozenxid		25
! #define Anum_pg_class_relacl			26
! #define Anum_pg_class_reloptions		27
  
  /* ----------------
   *		initial contents of pg_class
--- 118,126 ----
  #define Anum_pg_class_relhastriggers	23
  #define Anum_pg_class_relhassubclass	24
  #define Anum_pg_class_relfrozenxid		25
! #define Anum_pg_class_relvalidxmin		26
! #define Anum_pg_class_relacl			27
! #define Anum_pg_class_reloptions		28
  
  /* ----------------
   *		initial contents of pg_class
*************** typedef FormData_pg_class *Form_pg_class
*** 130,142 ****
   */
  
  /* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId */
! DATA(insert OID = 1247 (  pg_type		PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f 3 _null_ _null_ ));
  DESCR("");
! DATA(insert OID = 1249 (  pg_attribute	PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f 3 _null_ _null_ ));
  DESCR("");
! DATA(insert OID = 1255 (  pg_proc		PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 26 0 t f f f f 3 _null_ _null_ ));
  DESCR("");
! DATA(insert OID = 1259 (  pg_class		PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f 3 _null_ _null_ ));
  DESCR("");
  
  
--- 132,144 ----
   */
  
  /* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId */
! DATA(insert OID = 1247 (  pg_type		PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f 3 0 _null_ _null_ ));
  DESCR("");
! DATA(insert OID = 1249 (  pg_attribute	PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f 3 0 _null_ _null_ ));
  DESCR("");
! DATA(insert OID = 1255 (  pg_proc		PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 26 0 t f f f f 3 0 _null_ _null_ ));
  DESCR("");
! DATA(insert OID = 1259 (  pg_class		PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f 3 0 _null_ _null_ ));
  DESCR("");
  
  
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
new file mode 100644
index 2f39486..60c2eb6
*** a/src/include/utils/relcache.h
--- b/src/include/utils/relcache.h
*************** extern Relation RelationBuildLocalRelati
*** 76,82 ****
   * Routine to manage assignment of new relfilenode to a relation
   */
  extern void RelationSetNewRelfilenode(Relation relation,
! 						  TransactionId freezeXid);
  
  /*
   * Routines for flushing/rebuilding relcache entries in various scenarios
--- 76,83 ----
   * Routine to manage assignment of new relfilenode to a relation
   */
  extern void RelationSetNewRelfilenode(Relation relation,
! 						  TransactionId freezeXid,
! 						  TransactionId validxmin);
  
  /*
   * Routines for flushing/rebuilding relcache entries in various scenarios
diff --git a/src/test/isolation/expected/truncate.out b/src/test/isolation/expected/truncate.out
new file mode 100644
index ...2b9f161
*** a/src/test/isolation/expected/truncate.out
--- b/src/test/isolation/expected/truncate.out
***************
*** 0 ****
--- 1,63 ----
+ Parsed test spec with 2 sessions
+ 
+ starting permutation: begin select roll trunc
+ step begin: 
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ 
+ foo            
+ 
+ 1              
+ step select: SELECT * FROM foo;
+ i              
+ 
+ 1              
+ step roll: ROLLBACK;
+ step trunc: TRUNCATE TABLE foo;
+ 
+ starting permutation: begin select trunc roll
+ step begin: 
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ 
+ foo            
+ 
+ 1              
+ step select: SELECT * FROM foo;
+ i              
+ 
+ 1              
+ step trunc: TRUNCATE TABLE foo; <waiting ...>
+ step roll: ROLLBACK;
+ step trunc: <... completed>
+ 
+ starting permutation: begin trunc select roll
+ step begin: 
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ 
+ foo            
+ 
+ 1              
+ step trunc: TRUNCATE TABLE foo;
+ step select: SELECT * FROM foo;
+ ERROR:  canceling statement due to conflict with TRUNCATE TABLE on foo
+ step roll: ROLLBACK;
+ 
+ starting permutation: trunc begin select roll
+ step trunc: TRUNCATE TABLE foo;
+ step begin: 
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ 
+ foo            
+ 
+ 1              
+ step select: SELECT * FROM foo;
+ i              
+ 
+ step roll: ROLLBACK;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
new file mode 100644
index 669c0f2..bba722a
*** a/src/test/isolation/isolation_schedule
--- b/src/test/isolation/isolation_schedule
*************** test: fk-contention
*** 14,16 ****
--- 14,17 ----
  test: fk-deadlock
  test: fk-deadlock2
  test: eval-plan-qual
+ test: truncate
diff --git a/src/test/isolation/specs/truncate.spec b/src/test/isolation/specs/truncate.spec
new file mode 100644
index ...2c7b707
*** a/src/test/isolation/specs/truncate.spec
--- b/src/test/isolation/specs/truncate.spec
***************
*** 0 ****
--- 1,23 ----
+ setup
+ {
+   CREATE TABLE foo (i int);
+   INSERT INTO foo VALUES (1);
+ }
+ 
+ teardown
+ {
+   DROP TABLE foo;
+ }
+ 
+ session "s1"
+ step "begin"
+ {
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ }
+ step "select"	{ SELECT * FROM foo; }
+ step "roll"		{ ROLLBACK; }
+ 
+ session "s2"
+ step "trunc"	{ TRUNCATE TABLE foo; }
#2Noah Misch
noah@leadboat.com
In reply to: Marti Raudsepp (#1)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Thu, Feb 09, 2012 at 11:11:16PM +0200, Marti Raudsepp wrote:

I've always been a little wary of using the TRUNCATE command due to
the warning in the documentation about it not being "MVCC-safe":
queries may silently give wrong results and it's hard to tell when
they are affected.

That got me thinking, why can't we handle this like a standby server
does -- if some query has data removed from underneath it, it aborts
with a serialization failure.

Does this solution sound like a good idea?

The attached patch is a lame attempt at implementing this. I added a
new pg_class.relvalidxmin attribute which tracks the Xid of the last
TRUNCATE (maybe it should be called reltruncatexid?). Whenever
starting a relation scan with a snapshot older than relvalidxmin, an
error is thrown. This seems to work out well since TRUNCATE updates
pg_class anyway, and anyone who sees the new relfilenode automatically
knows when it was truncated.

I like the design you have chosen. It would find applications beyond
TRUNCATE, so your use of non-specific naming is sound. For example, older
snapshots will see an empty table "t" after "CREATE TABLE t AS SELECT 1"
commits; that's a comparable MVCC anomaly. Some of our non-MVCC-safe commands
should perhaps just become MVCC-safe, but there will always be use cases for
operations that shortcut MVCC. When one truly does want that, your proposal
for keeping behavior consistent makes plenty of sense.

Should I also add another counter to pg_stat_database_conflicts?
Currently this table is only used on standby servers.

ERROR: canceling statement due to conflict with TRUNCATE TABLE on foo
DETAIL: Rows visible to this transaction have been removed.

My initial reaction is not to portray this like a recovery conflict, since
several aspects distinguish it from all recovery conflict types.

(I have not read your actual patch.)

Thanks,
nm

#3Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#2)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch <noah@leadboat.com> wrote:

I like the design you have chosen.  It would find applications beyond
TRUNCATE, so your use of non-specific naming is sound.  For example, older
snapshots will see an empty table "t" after "CREATE TABLE t AS SELECT 1"
commits; that's a comparable MVCC anomaly.  Some of our non-MVCC-safe commands
should perhaps just become MVCC-safe, but there will always be use cases for
operations that shortcut MVCC.  When one truly does want that, your proposal
for keeping behavior consistent makes plenty of sense.

I guess I'm not particularly excited by the idea of trying to make
TRUNCATE MVCC-safe. I notice that the example involves the REPEATABLE
READ isolation level, which is already known to be busted in a variety
of ways; that's why we now have SERIALIZABLE, and why most people use
READ COMMITTED. Are there examples of this behavior at other
isolation levels?

But I have to admit I'm intrigued by the idea of extending this to
other cases, if there's a reasonable way to do that. For example, if
we could fix things up so that we don't see a table at all if it was
created after we took our snapshot, that would remove one of the
obstacles to marking pages bulk-loaded into that table with FrozenXID
and PD_ALL_VISIBLE from the get-go. I think a lot of people would be
mighty happy about that.

But the necessary semantics seem somewhat different. For TRUNCATE,
you want to throw a serialization error; but is that also what you
want for CREATE TABLE? Or do you just want it to appear empty?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#3)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:

On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch <noah@leadboat.com> wrote:

I like the design you have chosen. ?It would find applications beyond
TRUNCATE, so your use of non-specific naming is sound. ?For example, older
snapshots will see an empty table "t" after "CREATE TABLE t AS SELECT 1"
commits; that's a comparable MVCC anomaly. ?Some of our non-MVCC-safe commands
should perhaps just become MVCC-safe, but there will always be use cases for
operations that shortcut MVCC. ?When one truly does want that, your proposal
for keeping behavior consistent makes plenty of sense.

I guess I'm not particularly excited by the idea of trying to make
TRUNCATE MVCC-safe. I notice that the example involves the REPEATABLE
READ isolation level, which is already known to be busted in a variety
of ways; that's why we now have SERIALIZABLE, and why most people use
READ COMMITTED. Are there examples of this behavior at other
isolation levels?

I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
not at READ COMMITTED. They tend to be narrow race conditions at READ
COMMITTED, yet easy to demonstrate at REPEATABLE READ. Related:
http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php

Incidentally, people use READ COMMITTED because they don't question the
default, not because they know hazards of REPEATABLE READ. I don't know the
bustedness you speak of; could we improve the documentation to inform folks?

But I have to admit I'm intrigued by the idea of extending this to
other cases, if there's a reasonable way to do that. For example, if
we could fix things up so that we don't see a table at all if it was
created after we took our snapshot, that would remove one of the
obstacles to marking pages bulk-loaded into that table with FrozenXID
and PD_ALL_VISIBLE from the get-go. I think a lot of people would be
mighty happy about that.

Exactly.

But the necessary semantics seem somewhat different. For TRUNCATE,
you want to throw a serialization error; but is that also what you
want for CREATE TABLE? Or do you just want it to appear empty?

I think an error helps just as much there. If you create a table and populate
it with data in the same transaction, letting some snapshot see an empty table
is an atomicity failure.

Your comment illustrates a helpful point: this is just another kind of
ordinary serialization failure, one that can happen at any isolation level.
No serial transaction sequence can yield one of these errors.

Thanks,
nm

#5Dan Ports
drkp@csail.mit.edu
In reply to: Robert Haas (#3)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:

I guess I'm not particularly excited by the idea of trying to make
TRUNCATE MVCC-safe. I notice that the example involves the REPEATABLE
READ isolation level, which is already known to be busted in a variety
of ways; that's why we now have SERIALIZABLE, and why most people use
READ COMMITTED. Are there examples of this behavior at other
isolation levels?

Marti's example works for SERIALIZABLE isolation too. In general, when
DDL operations weren't previously MVCC-safe under REPEATABLE READ, we
didn't change that in SERIALIZABLE.

There's some SSI code for TRUNCATE TABLE that tries to do something
reasonable, and it catches some (more subtle) anomalies involving
concurrent truncates -- but it can only do so much when TRUNCATE itself
isn't MVCC-safe. I expect that the combination of that code and this
patch would ensure full serializability for TRUNCATE operations.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/

#6Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#4)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch <noah@leadboat.com> wrote:

On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:

On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch <noah@leadboat.com> wrote:

I like the design you have chosen. ?It would find applications beyond
TRUNCATE, so your use of non-specific naming is sound. ?For example, older
snapshots will see an empty table "t" after "CREATE TABLE t AS SELECT 1"
commits; that's a comparable MVCC anomaly. ?Some of our non-MVCC-safe commands
should perhaps just become MVCC-safe, but there will always be use cases for
operations that shortcut MVCC. ?When one truly does want that, your proposal
for keeping behavior consistent makes plenty of sense.

I guess I'm not particularly excited by the idea of trying to make
TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
READ isolation level, which is already known to be busted in a variety
of ways; that's why we now have SERIALIZABLE, and why most people use
READ COMMITTED.  Are there examples of this behavior at other
isolation levels?

I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
not at READ COMMITTED.  They tend to be narrow race conditions at READ
COMMITTED, yet easy to demonstrate at REPEATABLE READ.  Related:
http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php

Yeah. Well, that's actually an interesting example, because it
illustrates how general this problem is. We could potentially get
ourselves into a situation where just about every system catalog table
needs an xmin field to store the point at which the object came into
existence - or for that matter, was updated. But it's not quite the
same as the xmin of the row itself, because some updates might be
judged not to matter. There could also be intermediate cases where
updates are invalidating for some purposes but not others. I think
we'd better get our hands around more of the problem space before we
start trying to engineer solutions.

Incidentally, people use READ COMMITTED because they don't question the
default, not because they know hazards of REPEATABLE READ.  I don't know the
bustedness you speak of; could we improve the documentation to inform folks?

The example that I remember was related to SELECT FOR UPDATE/SELECT
FOR SHARE. The idea of those statements is that you want to prevent
the row from being updated or deleted until some other concurrent
action is complete; for example, in the case of a foreign key, we'd
like to prevent the referenced row from being deleted or updated in
the relevant columns until the inserting transaction is committed.
But it doesn't work, because when the updating or deleting process
gets done with the lock wait, they are still using the same snapshot
as before, and merrily do exactly the the thing that the lock-wait was
supposed to prevent. If an actual UPDATE is used, it's safe (I
think): anyone who was going to UPDATE or DELETE the row will fail
with some kind of serialization error. But a SELECT FOR UPDATE that
commits is treated more like an UPDATE that rolls back: it's as if the
lock never existed. Someone (Florian?) proposed a patch to change
this, but it seemed problematic for reasons I no longer exactly
remember.

When using an actual foreign key, we work around this by taking a new
snapshot to cross-check that things haven't changed under us, but
user-level code can't do that. At READ COMMITTED, depending on the
situation, either the fact that we take new snapshots pretty
frequently or the EPQ machinery sometimes make things work sensibly
anyway, and at SERIALIZABLE, SSI prevents these kinds of anomalies.
But REPEATABLE READ has no protection. I wish I could find the thread
where we discussed this before.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#6)
Re: RFC: Making TRUNCATE more "MVCC-safe"

Robert Haas <robertmhaas@gmail.com> wrote:

The example that I remember was related to SELECT FOR
UPDATE/SELECT FOR SHARE. The idea of those statements is that you
want to prevent the row from being updated or deleted until some
other concurrent action is complete; for example, in the case of a
foreign key, we'd like to prevent the referenced row from being
deleted or updated in the relevant columns until the inserting
transaction is committed. But it doesn't work, because when the
updating or deleting process gets done with the lock wait, they
are still using the same snapshot as before, and merrily do
exactly the the thing that the lock-wait was supposed to prevent.

This issue is one which appears to be a problem for people trying to
migrate from Oracle, where a write conflict would be generated.

If an actual UPDATE is used, it's safe (I think): anyone who was
going to UPDATE or DELETE the row will fail with some kind of
serialization error.

Right; a write conflict.

But a SELECT FOR UPDATE that commits is treated more like an
UPDATE that rolls back: it's as if the lock never existed.
Someone (Florian?) proposed a patch to change this, but it seemed
problematic for reasons I no longer exactly remember.

It had to do with only having one xmax and how that worked with
subtransactions.

Of course, besides the technical obstacles, such a semantic change
could break existing code for PostgreSQL users. :-(

When using an actual foreign key, we work around this by taking a
new snapshot to cross-check that things haven't changed under us,
but user-level code can't do that. At READ COMMITTED, depending
on the situation, either the fact that we take new snapshots
pretty frequently or the EPQ machinery sometimes make things work
sensibly anyway, and at SERIALIZABLE, SSI prevents these kinds of
anomalies. But REPEATABLE READ has no protection.

Well, personally I have a hard time calling READ COMMITTED behavior
sensible. Consider this:

-- connection 1
test=# create table t (id int not null primary key);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
"t_pkey" for table "t"
CREATE TABLE
test=# insert into t select generate_series(1, 10);
INSERT 0 10

-- connection 2
test=# begin;
BEGIN
test=# update t set id = id - 1;
UPDATE 10

-- connection 1
test=# select * from t where id = (select min(id) from t) for
update;
[blocks]

-- connection 2
test=# commit;
COMMIT

-- connection 1
[unblocks]
id
----
(0 rows)

-Kevin

#8Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#7)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Mon, Feb 13, 2012 at 10:48 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Well, personally I have a hard time calling READ COMMITTED behavior
sensible.  Consider this:

[ gigantic pile of fail ]

Yeah, that's bad all right. I think it's hard to argue that the
current behavior is sensible; the trick is to figure out something
that's better but doesn't impose too much additional overhead. I
wonder if it's possible to use SSI (or some stripped-down mechanism
along similar lines) to track these kinds of write conflicts, rather
than cluttering the system catalogs with lots more TransactionId
fields. Like, when we TRUNCATE a table, we could essentially make a
note in memory someplace recording the write conflict.

Unfortunately, the full-blown SSI machinery seems too expensive to use
for this, especially on all-read workloads where there are no actual
conflicts but lots of conflict checks. But that could probably be
optimized. The attraction of using something like an in-memory
conflict manager is that it would probably be quite general. We could
fix problems of this type with no on-disk format changes whenever we
discover them (as we will certainly continue to do) just by adding the
appropriate flag-a-conflict calls to the right places in the code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#8)
Re: RFC: Making TRUNCATE more "MVCC-safe"

Robert Haas <robertmhaas@gmail.com> wrote:

Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

Well, personally I have a hard time calling READ COMMITTED
behavior sensible. Consider this:

[ gigantic pile of fail ]

Yeah, that's bad all right. I think it's hard to argue that the
current behavior is sensible; the trick is to figure out something
that's better but doesn't impose too much additional overhead. I
wonder if it's possible to use SSI (or some stripped-down
mechanism along similar lines) to track these kinds of write
conflicts, rather than cluttering the system catalogs with lots
more TransactionId fields. Like, when we TRUNCATE a table, we
could essentially make a note in memory someplace recording the
write conflict.

Potential additional uses of the predicate locking developed for SSI
keep surfacing. At some point we should probably pick a couple of
them and try to fashion an API for the non-blocking predicate
locking logic that serves them and SSI. Since this predicate
locking system is explicitly interested only in read-write
conflicts, it does seem like it could work for SELECT FOR UPDATE
versus writes.

As mentioned once or twice before, it was pretty clear that while
predicate locking is required for SSI and is probably 80% of the C
code in the patch, it is really a separate thing -- we just didn't
want to try to create a "generalized" API based on the one initial
usage. I think that an API based on registering and listening would
be the ticket.

Unfortunately, the full-blown SSI machinery seems too expensive to
use for this, especially on all-read workloads where there are no
actual conflicts but lots of conflict checks.

In an all-read workload, if you actually declare the transactions to
be read-only SSI should not introduce much overhead. If there's
much overhead showing up there at the moment, it would probably be
pretty easy to eliminate. When there are any read-write
transactions active, it's a different story.

But that could probably be optimized.

Undoubtedly. It's disappointing that neither Dan nor I could find
the round tuits to make the kinds of changes in the SSI locking that
you made in some other areas for 9.2. I'm not really sure how the
performance impact breaks down between predicate locking and SSI
proper, although I would tend to start from the assumption that,
like the lines of code, it's 80% in the predicate locking.

The attraction of using something like an in-memory conflict
manager is that it would probably be quite general. We could fix
problems of this type with no on-disk format changes whenever we
discover them (as we will certainly continue to do) just by adding
the appropriate flag-a-conflict calls to the right places in the
code.

Assuming that the problems could be expressed in terms of read-write
conflicts, that's probably largely true. I'm not sure that holds
for some of the funny READ COMMITTED behaviors, though. I think the
only real "cure" there would be to make subtransactions cheap enough
that we could wrap execution of each SELECT and DML statement in a
subtransaction and roll back for another try with a new snapshot on
conflict.

If you want to track something other than read-write conflicts
and/or you want blocking when a conflict is found, you might be
better off looking to bend the heavyweight locks to your purposes.

-Kevin

#10Simon Riggs
simon@2ndQuadrant.com
In reply to: Noah Misch (#4)
1 attachment(s)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Sat, Feb 11, 2012 at 4:46 AM, Noah Misch <noah@leadboat.com> wrote:

But I have to admit I'm intrigued by the idea of extending this to
other cases, if there's a reasonable way to do that.  For example, if
we could fix things up so that we don't see a table at all if it was
created after we took our snapshot, that would remove one of the
obstacles to marking pages bulk-loaded into that table with FrozenXID
and PD_ALL_VISIBLE from the get-go.  I think a lot of people would be
mighty happy about that.

Exactly.

But the necessary semantics seem somewhat different.  For TRUNCATE,
you want to throw a serialization error; but is that also what you
want for CREATE TABLE?  Or do you just want it to appear empty?

I think an error helps just as much there.  If you create a table and populate
it with data in the same transaction, letting some snapshot see an empty table
is an atomicity failure.

Your comment illustrates a helpful point: this is just another kind of
ordinary serialization failure, one that can happen at any isolation level.
No serial transaction sequence can yield one of these errors.

Thanks Noah for drawing attention to this thread. I hadn't been
watching. As you say, this work would allow me to freeze rows at load
time and avoid the overhead of hint bit setting, which avoids
performance issues from hint bit setting in checksum patch.

I've reviewed this patch and it seems OK to me. Good work Marti.

v2 patch attached, updated to latest HEAD. Patch adds
* a GUC called strict_mvcc to isolate the new behaviour if required
* relvalidxid is reset by VACUUM/ANALYZE to avoid wraparound failure

At present this lacks docs for strict_mvcc and doesn't attempt to
handle CREATE TABLE case yet, but should be straightforward to do so.

Hint bit setting is in separate patch on other thread.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

safe_truncate.v2.patchtext/x-diff; charset=US-ASCII; name=safe_truncate.v2.patchDownload
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c910863..4387f49 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -73,7 +73,7 @@
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
-
+bool		StrictMVCC = true;
 
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
@@ -1175,6 +1175,24 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	HeapScanDesc scan;
 
 	/*
+	 * If the snapshot is older than relvalidxmin then that either a table
+	 * has only recently been created or that a TRUNCATE has removed data
+	 * that we should still be able to see. Either way, we aren't allowed
+	 * to view the rows in StrictMVCC mode.
+	 */
+	if (StrictMVCC &&
+		TransactionIdIsValid(relation->rd_rel->relvalidxmin) &&
+		TransactionIdIsValid(snapshot->xmax) &&
+		NormalTransactionIdPrecedes(snapshot->xmax, relation->rd_rel->relvalidxmin))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+				 errmsg("canceling statement due to conflict with TRUNCATE TABLE on %s",
+						 NameStr(relation->rd_rel->relname)),
+				 errdetail("User query attempting to see row versions that have been removed.")));
+	}
+
+	/*
 	 * increment relation ref count while scanning relation
 	 *
 	 * This is just to make really sure the relcache entry won't go away while
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index aef410a..3f9bd5d 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -787,6 +787,7 @@ InsertPgClassTuple(Relation pg_class_desc,
 	values[Anum_pg_class_relhastriggers - 1] = BoolGetDatum(rd_rel->relhastriggers);
 	values[Anum_pg_class_relhassubclass - 1] = BoolGetDatum(rd_rel->relhassubclass);
 	values[Anum_pg_class_relfrozenxid - 1] = TransactionIdGetDatum(rd_rel->relfrozenxid);
+	values[Anum_pg_class_relvalidxmin - 1] = TransactionIdGetDatum(rd_rel->relvalidxmin);
 	if (relacl != (Datum) 0)
 		values[Anum_pg_class_relacl - 1] = relacl;
 	else
@@ -884,6 +885,7 @@ AddNewRelationTuple(Relation pg_class_desc,
 		new_rel_reltup->relfrozenxid = InvalidTransactionId;
 	}
 
+	new_rel_reltup->relvalidxmin = InvalidTransactionId;
 	new_rel_reltup->relowner = relowner;
 	new_rel_reltup->reltype = new_type_oid;
 	new_rel_reltup->reloftype = reloftype;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bfbe642..0d96a6a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2854,7 +2854,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
 		}
 
 		/* We'll build a new physical relation for the index */
-		RelationSetNewRelfilenode(iRel, InvalidTransactionId);
+		RelationSetNewRelfilenode(iRel, InvalidTransactionId, InvalidTransactionId);
 
 		/* Initialize the index and rebuild */
 		/* Note: we do not need to re-establish pkey setting */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index b40e57b..0578241 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -82,6 +82,8 @@ static int	elevel = -1;
 
 static MemoryContext anl_context = NULL;
 
+static TransactionId OldestXmin;
+
 static BufferAccessStrategy vac_strategy;
 
 
@@ -538,7 +540,8 @@ do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh)
 							totalrows,
 							visibilitymap_count(onerel),
 							hasindex,
-							InvalidTransactionId);
+							InvalidTransactionId,
+							OldestXmin);
 
 	/*
 	 * Same for indexes. Vacuum always scans all indexes, so if we're part of
@@ -558,6 +561,7 @@ do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh)
 								totalindexrows,
 								0,
 								false,
+								InvalidTransactionId,
 								InvalidTransactionId);
 		}
 	}
@@ -1025,7 +1029,6 @@ acquire_sample_rows(Relation onerel, HeapTuple *rows, int targrows,
 	double		deadrows = 0;	/* # dead rows seen */
 	double		rowstoskip = -1;	/* -1 means not set yet */
 	BlockNumber totalblocks;
-	TransactionId OldestXmin;
 	BlockSamplerData bs;
 	double		rstate;
 
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 1f95abc..ab4aec2 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -287,7 +287,7 @@ ResetSequence(Oid seq_relid)
 	 * Create a new storage file for the sequence.	We want to keep the
 	 * sequence's relfrozenxid at 0, since it won't contain any unfrozen XIDs.
 	 */
-	RelationSetNewRelfilenode(seq_rel, InvalidTransactionId);
+	RelationSetNewRelfilenode(seq_rel, InvalidTransactionId, InvalidTransactionId);
 
 	/*
 	 * Insert the modified tuple into the new storage file.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cd4490a..aa9d2e9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -19,6 +19,7 @@
 #include "access/reloptions.h"
 #include "access/relscan.h"
 #include "access/sysattr.h"
+#include "access/transam.h"
 #include "access/xact.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -1121,7 +1122,7 @@ ExecuteTruncate(TruncateStmt *stmt)
 			 * as the relfilenode value. The old storage file is scheduled for
 			 * deletion at commit.
 			 */
-			RelationSetNewRelfilenode(rel, RecentXmin);
+			RelationSetNewRelfilenode(rel, RecentXmin, GetCurrentTransactionId());
 
 			heap_relid = RelationGetRelid(rel);
 			toast_relid = rel->rd_rel->reltoastrelid;
@@ -1132,7 +1133,12 @@ ExecuteTruncate(TruncateStmt *stmt)
 			if (OidIsValid(toast_relid))
 			{
 				rel = relation_open(toast_relid, AccessExclusiveLock);
-				RelationSetNewRelfilenode(rel, RecentXmin);
+				RelationSetNewRelfilenode(rel, RecentXmin, InvalidTransactionId);
+
+				/*
+				 * We don't need a cmin here because there can't be any open
+				 * cursors in the same session as the TRUNCATE command.
+				 */
 				heap_close(rel, NoLock);
 			}
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 353af50..7609e59 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -573,7 +573,8 @@ void
 vac_update_relstats(Relation relation,
 					BlockNumber num_pages, double num_tuples,
 					BlockNumber num_all_visible_pages,
-					bool hasindex, TransactionId frozenxid)
+					bool hasindex, TransactionId frozenxid,
+					TransactionId oldestxmin)
 {
 	Oid			relid = RelationGetRelid(relation);
 	Relation	rd;
@@ -647,6 +648,17 @@ vac_update_relstats(Relation relation,
 		dirty = true;
 	}
 
+	/*
+	 * Reset relvalidxmin if its old enough
+	 */
+	if (TransactionIdIsValid(oldestxmin) &&
+		TransactionIdIsNormal(pgcform->relvalidxmin) &&
+		TransactionIdPrecedes(pgcform->relvalidxmin, oldestxmin))
+	{
+		pgcform->relvalidxmin = InvalidTransactionId;
+		dirty = true;
+	}
+
 	/* If anything changed, write out the tuple. */
 	if (dirty)
 		heap_inplace_update(rd, ctup);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2fc749e..46d6ab8 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -255,7 +255,8 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
 						new_rel_tuples,
 						new_rel_allvisible,
 						vacrelstats->hasindex,
-						new_frozen_xid);
+						new_frozen_xid,
+						OldestXmin);
 
 	/* report results to the stats collector, too */
 	pgstat_report_vacuum(RelationGetRelid(onerel),
@@ -1185,6 +1186,7 @@ lazy_cleanup_index(Relation indrel,
 							stats->num_index_tuples,
 							0,
 							false,
+							InvalidTransactionId,
 							InvalidTransactionId);
 
 	ereport(elevel,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a59950e..17303e8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2605,7 +2605,8 @@ RelationBuildLocalRelation(const char *relname,
  * the XIDs that will be put into the new relation contents.
  */
 void
-RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid)
+RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid,
+						  TransactionId validxmin)
 {
 	Oid			newrelfilenode;
 	RelFileNodeBackend newrnode;
@@ -2674,6 +2675,10 @@ RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid)
 	}
 	classform->relfrozenxid = freezeXid;
 
+	if (validxmin != InvalidTransactionId &&
+		TransactionIdPrecedes(classform->relvalidxmin, validxmin))
+		classform->relvalidxmin = validxmin;
+
 	simple_heap_update(pg_class, &tuple->t_self, tuple);
 	CatalogUpdateIndexes(pg_class, tuple);
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 486bdcd..1561889 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1430,6 +1430,19 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 
 	{
+		{"strict_mvcc", PGC_POSTMASTER, COMPAT_OPTIONS_PREVIOUS,
+			gettext_noop("Enables backward compatibility mode of strict MVCC "
+						 "for TRUNCATE and CREATE TABLE."),
+			gettext_noop("When enabled viewing a truncated or newly created table "
+						 "will throw a serialization error to prevent MVCC "
+						 "discrepancies that were possible priot to 9.2.")
+		},
+		&StrictMVCC,
+		true,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"quote_all_identifiers", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,
 			gettext_noop("When generating SQL fragments, quote all identifiers."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 96da086..f56e374 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -548,6 +548,7 @@
 #quote_all_identifiers = off
 #sql_inheritance = on
 #standard_conforming_strings = on
+#strict_mvcc = on
 #synchronize_seqscans = on
 
 # - Other Platforms and Clients -
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 0335347..223f157 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	201203021
+#define CATALOG_VERSION_NO	201203031
 
 #endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 1567206..d6ee70c 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -67,6 +67,7 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
 	bool		relhastriggers; /* has (or has had) any TRIGGERs */
 	bool		relhassubclass; /* has (or has had) derived classes */
 	TransactionId relfrozenxid; /* all Xids < this are frozen in this rel */
+	TransactionId relvalidxmin;	/* data is only valid for snapshots > this Xid */
 
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* NOTE: These fields are not present in a relcache entry's rd_rel field. */
@@ -77,7 +78,7 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
 
 /* Size of fixed part of pg_class tuples, not counting var-length fields */
 #define CLASS_TUPLE_SIZE \
-	 (offsetof(FormData_pg_class,relfrozenxid) + sizeof(TransactionId))
+	 (offsetof(FormData_pg_class,relvalidxmin) + sizeof(TransactionId))
 
 /* ----------------
  *		Form_pg_class corresponds to a pointer to a tuple with
@@ -91,7 +92,7 @@ typedef FormData_pg_class *Form_pg_class;
  * ----------------
  */
 
-#define Natts_pg_class					27
+#define Natts_pg_class					28
 #define Anum_pg_class_relname			1
 #define Anum_pg_class_relnamespace		2
 #define Anum_pg_class_reltype			3
@@ -117,8 +118,9 @@ typedef FormData_pg_class *Form_pg_class;
 #define Anum_pg_class_relhastriggers	23
 #define Anum_pg_class_relhassubclass	24
 #define Anum_pg_class_relfrozenxid		25
-#define Anum_pg_class_relacl			26
-#define Anum_pg_class_reloptions		27
+#define Anum_pg_class_relvalidxmin		26
+#define Anum_pg_class_relacl			27
+#define Anum_pg_class_reloptions		28
 
 /* ----------------
  *		initial contents of pg_class
@@ -130,13 +132,13 @@ typedef FormData_pg_class *Form_pg_class;
  */
 
 /* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId */
-DATA(insert OID = 1247 (  pg_type		PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1247 (  pg_type		PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f 3 0 _null_ _null_ ));
 DESCR("");
-DATA(insert OID = 1249 (  pg_attribute	PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1249 (  pg_attribute	PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f 3 0 _null_ _null_ ));
 DESCR("");
-DATA(insert OID = 1255 (  pg_proc		PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1255 (  pg_proc		PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f 3 0 _null_ _null_ ));
 DESCR("");
-DATA(insert OID = 1259 (  pg_class		PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1259 (  pg_class		PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f 3 0 _null_ _null_ ));
 DESCR("");
 
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4526648..fa79231 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -151,7 +151,8 @@ extern void vac_update_relstats(Relation relation,
 					double num_tuples,
 					BlockNumber num_all_visible_pages,
 					bool hasindex,
-					TransactionId frozenxid);
+					TransactionId frozenxid,
+					TransactionId oldestxmin);
 extern void vacuum_set_xid_limits(int freeze_min_age, int freeze_table_age,
 					  bool sharedRel,
 					  TransactionId *oldestXmin,
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 610cb59..9d2be31 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -342,6 +342,8 @@ extern ProcessingMode Mode;
 
 #define GetProcessingMode() Mode
 
+/* in access/heap/heapam.c */
+extern bool	StrictMVCC;
 
 /*****************************************************************************
  *	  pinit.h --															 *
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 2f39486..60c2eb6 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -76,7 +76,8 @@ extern Relation RelationBuildLocalRelation(const char *relname,
  * Routine to manage assignment of new relfilenode to a relation
  */
 extern void RelationSetNewRelfilenode(Relation relation,
-						  TransactionId freezeXid);
+						  TransactionId freezeXid,
+						  TransactionId validxmin);
 
 /*
  * Routines for flushing/rebuilding relcache entries in various scenarios
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 669c0f2..bba722a 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -14,3 +14,4 @@ test: fk-contention
 test: fk-deadlock
 test: fk-deadlock2
 test: eval-plan-qual
+test: truncate
diff --git a/src/test/isolation/expected/truncate.out b/src/test/isolation/expected/truncate.out
new file mode 100644
index ...2b9f161
*** a/src/test/isolation/expected/truncate.out
--- b/src/test/isolation/expected/truncate.out
***************
*** 0 ****
--- 1,63 ----
+ Parsed test spec with 2 sessions
+ 
+ starting permutation: begin select roll trunc
+ step begin: 
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ 
+ foo            
+ 
+ 1              
+ step select: SELECT * FROM foo;
+ i              
+ 
+ 1              
+ step roll: ROLLBACK;
+ step trunc: TRUNCATE TABLE foo;
+ 
+ starting permutation: begin select trunc roll
+ step begin: 
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ 
+ foo            
+ 
+ 1              
+ step select: SELECT * FROM foo;
+ i              
+ 
+ 1              
+ step trunc: TRUNCATE TABLE foo; <waiting ...>
+ step roll: ROLLBACK;
+ step trunc: <... completed>
+ 
+ starting permutation: begin trunc select roll
+ step begin: 
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ 
+ foo            
+ 
+ 1              
+ step trunc: TRUNCATE TABLE foo;
+ step select: SELECT * FROM foo;
+ ERROR:  canceling statement due to conflict with TRUNCATE TABLE on foo
+ step roll: ROLLBACK;
+ 
+ starting permutation: trunc begin select roll
+ step trunc: TRUNCATE TABLE foo;
+ step begin: 
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ 
+ foo            
+ 
+ 1              
+ step select: SELECT * FROM foo;
+ i              
+ 
+ step roll: ROLLBACK;
diff --git a/src/test/isolation/specs/truncate.spec b/src/test/isolation/specs/truncate.spec
new file mode 100644
index ...2c7b707
*** a/src/test/isolation/specs/truncate.spec
--- b/src/test/isolation/specs/truncate.spec
***************
*** 0 ****
--- 1,23 ----
+ setup
+ {
+   CREATE TABLE foo (i int);
+   INSERT INTO foo VALUES (1);
+ }
+ 
+ teardown
+ {
+   DROP TABLE foo;
+ }
+ 
+ session "s1"
+ step "begin"
+ {
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ }
+ step "select"	{ SELECT * FROM foo; }
+ step "roll"		{ ROLLBACK; }
+ 
+ session "s2"
+ step "trunc"	{ TRUNCATE TABLE foo; }
#11Marti Raudsepp
marti@juffo.org
In reply to: Simon Riggs (#10)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Sat, Mar 3, 2012 at 14:53, Simon Riggs <simon@2ndquadrant.com> wrote:

Thanks Noah for drawing attention to this thread. I hadn't been
watching. As you say, this work would allow me to freeze rows at load
time and avoid the overhead of hint bit setting, which avoids
performance issues from hint bit setting in checksum patch.

I've reviewed this patch and it seems OK to me. Good work Marti.

Thanks! This approach wasn't universally liked, but if it gets us
tangible benefits (COPY with frozenxid) then I guess it's a reason to
reconsider.

v2 patch attached, updated to latest HEAD. Patch adds
* relvalidxid is reset by VACUUM/ANALYZE to avoid wraparound failure

Personally I'd rather keep this out of ANALYZE -- since its purpose is
to collect stats; VACUUM is responsible for correctness (xid
wraparound etc). But I don't feel strongly about this.

A more important consideration is how this interacts with hot standby.
Currently you compare OldestXmin to relvalidxmin to decide when to
reset it. But the standby's OldestXmin may be older than the master's.
(If VACUUM removes rows then this causes a recovery conflict, but
AFAICT that won't happen if only relvalidxmin changes)

It might be more robust to wait until relfrozenxid exceeds
relvalidxmin -- by then, recovery conflict mechanisms will have taken
care of killing all older snapshots, or am I mistaken?

And a few typos the code...

+ gettext_noop("When enabled viewing a truncated or newly created table "
+ "will throw a serialization error to prevent MVCC "
+ "discrepancies that were possible priot to 9.2.")

"prior" not "priot"

+ * Reset relvalidxmin if its old enough

Should be "it's" in this context.

Regards,
Marti

#12Simon Riggs
simon@2ndQuadrant.com
In reply to: Marti Raudsepp (#11)
1 attachment(s)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Sun, Mar 4, 2012 at 2:28 AM, Marti Raudsepp <marti@juffo.org> wrote:

On Sat, Mar 3, 2012 at 14:53, Simon Riggs <simon@2ndquadrant.com> wrote:

Thanks Noah for drawing attention to this thread. I hadn't been
watching. As you say, this work would allow me to freeze rows at load
time and avoid the overhead of hint bit setting, which avoids
performance issues from hint bit setting in checksum patch.

I've reviewed this patch and it seems OK to me. Good work Marti.

Thanks! This approach wasn't universally liked, but if it gets us
tangible benefits (COPY with frozenxid) then I guess it's a reason to
reconsider.

Comments I see support this idea. If we did this purely for truncate
correctness I probably wouldn't care either, which is why I didn't
read this thread in the first place.

v2 patch attached, updated to latest HEAD. Patch adds
* relvalidxid is reset by VACUUM/ANALYZE to avoid wraparound failure

Personally I'd rather keep this out of ANALYZE -- since its purpose is
to collect stats; VACUUM is responsible for correctness (xid
wraparound etc). But I don't feel strongly about this.

If there were a reason to do it, then I would agree. Later you point
out a reason, so I will make this change.

A more important consideration is how this interacts with hot standby.
Currently you compare OldestXmin to relvalidxmin to decide when to
reset it. But the standby's OldestXmin may be older than the master's.
(If VACUUM removes rows then this causes a recovery conflict, but
AFAICT that won't happen if only relvalidxmin changes)

As of 9.1, the standby's oldestxmin is incorporated into the master's
via hot_standby_feedback, so it wouldn't typically be a problem these
days. It's possible that the standby has this set off by choice, in
which case anomalies could exist in the case that a VACUUM doesn't
clean any rows, as you say.

So we'll use vacrelstats->latestRemovedXid instead of OldestXmin when
we call vac_update_relstats()
which means ANALYZE should always pass InvalidTransactionId.

It might be more robust to wait until relfrozenxid exceeds
relvalidxmin -- by then, recovery conflict mechanisms will have taken
care of killing all older snapshots, or am I mistaken?

It would be better to set it as early as possible to reduce the cost
of the test in heap_begin_scan()

And a few typos the code...

+ gettext_noop("When enabled viewing a truncated or newly created table "
+ "will throw a serialization error to prevent MVCC "
+ "discrepancies that were possible priot to 9.2.")

"prior" not "priot"

Yep

+ * Reset relvalidxmin if its old enough

Should be "it's" in this context.

Cool, thanks for the review.

v3 attached.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

safe_truncate.v3.patchtext/x-diff; charset=US-ASCII; name=safe_truncate.v3.patchDownload
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c910863..4387f49 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -73,7 +73,7 @@
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
-
+bool		StrictMVCC = true;
 
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
@@ -1175,6 +1175,24 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	HeapScanDesc scan;
 
 	/*
+	 * If the snapshot is older than relvalidxmin then that either a table
+	 * has only recently been created or that a TRUNCATE has removed data
+	 * that we should still be able to see. Either way, we aren't allowed
+	 * to view the rows in StrictMVCC mode.
+	 */
+	if (StrictMVCC &&
+		TransactionIdIsValid(relation->rd_rel->relvalidxmin) &&
+		TransactionIdIsValid(snapshot->xmax) &&
+		NormalTransactionIdPrecedes(snapshot->xmax, relation->rd_rel->relvalidxmin))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+				 errmsg("canceling statement due to conflict with TRUNCATE TABLE on %s",
+						 NameStr(relation->rd_rel->relname)),
+				 errdetail("User query attempting to see row versions that have been removed.")));
+	}
+
+	/*
 	 * increment relation ref count while scanning relation
 	 *
 	 * This is just to make really sure the relcache entry won't go away while
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index aef410a..3f9bd5d 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -787,6 +787,7 @@ InsertPgClassTuple(Relation pg_class_desc,
 	values[Anum_pg_class_relhastriggers - 1] = BoolGetDatum(rd_rel->relhastriggers);
 	values[Anum_pg_class_relhassubclass - 1] = BoolGetDatum(rd_rel->relhassubclass);
 	values[Anum_pg_class_relfrozenxid - 1] = TransactionIdGetDatum(rd_rel->relfrozenxid);
+	values[Anum_pg_class_relvalidxmin - 1] = TransactionIdGetDatum(rd_rel->relvalidxmin);
 	if (relacl != (Datum) 0)
 		values[Anum_pg_class_relacl - 1] = relacl;
 	else
@@ -884,6 +885,7 @@ AddNewRelationTuple(Relation pg_class_desc,
 		new_rel_reltup->relfrozenxid = InvalidTransactionId;
 	}
 
+	new_rel_reltup->relvalidxmin = InvalidTransactionId;
 	new_rel_reltup->relowner = relowner;
 	new_rel_reltup->reltype = new_type_oid;
 	new_rel_reltup->reloftype = reloftype;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bfbe642..0d96a6a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2854,7 +2854,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
 		}
 
 		/* We'll build a new physical relation for the index */
-		RelationSetNewRelfilenode(iRel, InvalidTransactionId);
+		RelationSetNewRelfilenode(iRel, InvalidTransactionId, InvalidTransactionId);
 
 		/* Initialize the index and rebuild */
 		/* Note: we do not need to re-establish pkey setting */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index b40e57b..44b891c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -538,6 +538,7 @@ do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh)
 							totalrows,
 							visibilitymap_count(onerel),
 							hasindex,
+							InvalidTransactionId,
 							InvalidTransactionId);
 
 	/*
@@ -558,6 +559,7 @@ do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh)
 								totalindexrows,
 								0,
 								false,
+								InvalidTransactionId,
 								InvalidTransactionId);
 		}
 	}
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 1f95abc..ab4aec2 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -287,7 +287,7 @@ ResetSequence(Oid seq_relid)
 	 * Create a new storage file for the sequence.	We want to keep the
 	 * sequence's relfrozenxid at 0, since it won't contain any unfrozen XIDs.
 	 */
-	RelationSetNewRelfilenode(seq_rel, InvalidTransactionId);
+	RelationSetNewRelfilenode(seq_rel, InvalidTransactionId, InvalidTransactionId);
 
 	/*
 	 * Insert the modified tuple into the new storage file.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cd4490a..aa9d2e9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -19,6 +19,7 @@
 #include "access/reloptions.h"
 #include "access/relscan.h"
 #include "access/sysattr.h"
+#include "access/transam.h"
 #include "access/xact.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -1121,7 +1122,7 @@ ExecuteTruncate(TruncateStmt *stmt)
 			 * as the relfilenode value. The old storage file is scheduled for
 			 * deletion at commit.
 			 */
-			RelationSetNewRelfilenode(rel, RecentXmin);
+			RelationSetNewRelfilenode(rel, RecentXmin, GetCurrentTransactionId());
 
 			heap_relid = RelationGetRelid(rel);
 			toast_relid = rel->rd_rel->reltoastrelid;
@@ -1132,7 +1133,12 @@ ExecuteTruncate(TruncateStmt *stmt)
 			if (OidIsValid(toast_relid))
 			{
 				rel = relation_open(toast_relid, AccessExclusiveLock);
-				RelationSetNewRelfilenode(rel, RecentXmin);
+				RelationSetNewRelfilenode(rel, RecentXmin, InvalidTransactionId);
+
+				/*
+				 * We don't need a cmin here because there can't be any open
+				 * cursors in the same session as the TRUNCATE command.
+				 */
 				heap_close(rel, NoLock);
 			}
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 353af50..9b36cd7 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -573,7 +573,8 @@ void
 vac_update_relstats(Relation relation,
 					BlockNumber num_pages, double num_tuples,
 					BlockNumber num_all_visible_pages,
-					bool hasindex, TransactionId frozenxid)
+					bool hasindex, TransactionId frozenxid,
+					TransactionId oldestxmin)
 {
 	Oid			relid = RelationGetRelid(relation);
 	Relation	rd;
@@ -647,6 +648,21 @@ vac_update_relstats(Relation relation,
 		dirty = true;
 	}
 
+	/*
+	 * Reset relvalidxmin if it's old enough. Resetting this value
+	 * could cause MVCC failures on a standby that doesn't have
+	 * hot_standby_feedback enabled, so only reset the value if we
+	 * are certain that all prior snapshots have ended or been
+	 * removed via conflict. Value should always be invalid for indexes.
+	 */
+	if (TransactionIdIsValid(oldestxmin) &&
+		TransactionIdIsNormal(pgcform->relvalidxmin) &&
+		TransactionIdPrecedes(pgcform->relvalidxmin, oldestxmin))
+	{
+		pgcform->relvalidxmin = InvalidTransactionId;
+		dirty = true;
+	}
+
 	/* If anything changed, write out the tuple. */
 	if (dirty)
 		heap_inplace_update(rd, ctup);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2fc749e..3b88652 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -255,7 +255,8 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
 						new_rel_tuples,
 						new_rel_allvisible,
 						vacrelstats->hasindex,
-						new_frozen_xid);
+						new_frozen_xid,
+						vacrelstats->latestRemovedXid);
 
 	/* report results to the stats collector, too */
 	pgstat_report_vacuum(RelationGetRelid(onerel),
@@ -1185,6 +1186,7 @@ lazy_cleanup_index(Relation indrel,
 							stats->num_index_tuples,
 							0,
 							false,
+							InvalidTransactionId,
 							InvalidTransactionId);
 
 	ereport(elevel,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a59950e..17303e8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2605,7 +2605,8 @@ RelationBuildLocalRelation(const char *relname,
  * the XIDs that will be put into the new relation contents.
  */
 void
-RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid)
+RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid,
+						  TransactionId validxmin)
 {
 	Oid			newrelfilenode;
 	RelFileNodeBackend newrnode;
@@ -2674,6 +2675,10 @@ RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid)
 	}
 	classform->relfrozenxid = freezeXid;
 
+	if (validxmin != InvalidTransactionId &&
+		TransactionIdPrecedes(classform->relvalidxmin, validxmin))
+		classform->relvalidxmin = validxmin;
+
 	simple_heap_update(pg_class, &tuple->t_self, tuple);
 	CatalogUpdateIndexes(pg_class, tuple);
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 486bdcd..e4a19aa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1430,6 +1430,19 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 
 	{
+		{"strict_mvcc", PGC_POSTMASTER, COMPAT_OPTIONS_PREVIOUS,
+			gettext_noop("Enables backward compatibility mode of strict MVCC "
+						 "for TRUNCATE and CREATE TABLE."),
+			gettext_noop("When enabled viewing a truncated or newly created table "
+						 "will throw a serialization error to prevent MVCC "
+						 "discrepancies that were possible prior to 9.2.")
+		},
+		&StrictMVCC,
+		true,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"quote_all_identifiers", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,
 			gettext_noop("When generating SQL fragments, quote all identifiers."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 96da086..f56e374 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -548,6 +548,7 @@
 #quote_all_identifiers = off
 #sql_inheritance = on
 #standard_conforming_strings = on
+#strict_mvcc = on
 #synchronize_seqscans = on
 
 # - Other Platforms and Clients -
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 0335347..223f157 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	201203021
+#define CATALOG_VERSION_NO	201203031
 
 #endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 1567206..d6ee70c 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -67,6 +67,7 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
 	bool		relhastriggers; /* has (or has had) any TRIGGERs */
 	bool		relhassubclass; /* has (or has had) derived classes */
 	TransactionId relfrozenxid; /* all Xids < this are frozen in this rel */
+	TransactionId relvalidxmin;	/* data is only valid for snapshots > this Xid */
 
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* NOTE: These fields are not present in a relcache entry's rd_rel field. */
@@ -77,7 +78,7 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
 
 /* Size of fixed part of pg_class tuples, not counting var-length fields */
 #define CLASS_TUPLE_SIZE \
-	 (offsetof(FormData_pg_class,relfrozenxid) + sizeof(TransactionId))
+	 (offsetof(FormData_pg_class,relvalidxmin) + sizeof(TransactionId))
 
 /* ----------------
  *		Form_pg_class corresponds to a pointer to a tuple with
@@ -91,7 +92,7 @@ typedef FormData_pg_class *Form_pg_class;
  * ----------------
  */
 
-#define Natts_pg_class					27
+#define Natts_pg_class					28
 #define Anum_pg_class_relname			1
 #define Anum_pg_class_relnamespace		2
 #define Anum_pg_class_reltype			3
@@ -117,8 +118,9 @@ typedef FormData_pg_class *Form_pg_class;
 #define Anum_pg_class_relhastriggers	23
 #define Anum_pg_class_relhassubclass	24
 #define Anum_pg_class_relfrozenxid		25
-#define Anum_pg_class_relacl			26
-#define Anum_pg_class_reloptions		27
+#define Anum_pg_class_relvalidxmin		26
+#define Anum_pg_class_relacl			27
+#define Anum_pg_class_reloptions		28
 
 /* ----------------
  *		initial contents of pg_class
@@ -130,13 +132,13 @@ typedef FormData_pg_class *Form_pg_class;
  */
 
 /* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId */
-DATA(insert OID = 1247 (  pg_type		PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1247 (  pg_type		PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f 3 0 _null_ _null_ ));
 DESCR("");
-DATA(insert OID = 1249 (  pg_attribute	PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1249 (  pg_attribute	PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f 3 0 _null_ _null_ ));
 DESCR("");
-DATA(insert OID = 1255 (  pg_proc		PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1255 (  pg_proc		PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f 3 0 _null_ _null_ ));
 DESCR("");
-DATA(insert OID = 1259 (  pg_class		PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1259 (  pg_class		PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f 3 0 _null_ _null_ ));
 DESCR("");
 
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4526648..fa79231 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -151,7 +151,8 @@ extern void vac_update_relstats(Relation relation,
 					double num_tuples,
 					BlockNumber num_all_visible_pages,
 					bool hasindex,
-					TransactionId frozenxid);
+					TransactionId frozenxid,
+					TransactionId oldestxmin);
 extern void vacuum_set_xid_limits(int freeze_min_age, int freeze_table_age,
 					  bool sharedRel,
 					  TransactionId *oldestXmin,
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 610cb59..9d2be31 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -342,6 +342,8 @@ extern ProcessingMode Mode;
 
 #define GetProcessingMode() Mode
 
+/* in access/heap/heapam.c */
+extern bool	StrictMVCC;
 
 /*****************************************************************************
  *	  pinit.h --															 *
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 2f39486..60c2eb6 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -76,7 +76,8 @@ extern Relation RelationBuildLocalRelation(const char *relname,
  * Routine to manage assignment of new relfilenode to a relation
  */
 extern void RelationSetNewRelfilenode(Relation relation,
-						  TransactionId freezeXid);
+						  TransactionId freezeXid,
+						  TransactionId validxmin);
 
 /*
  * Routines for flushing/rebuilding relcache entries in various scenarios
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 669c0f2..bba722a 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -14,3 +14,4 @@ test: fk-contention
 test: fk-deadlock
 test: fk-deadlock2
 test: eval-plan-qual
+test: truncate
diff --git a/src/test/isolation/expected/truncate.out b/src/test/isolation/expected/truncate.out
new file mode 100644
index ...2b9f161
*** a/src/test/isolation/expected/truncate.out
--- b/src/test/isolation/expected/truncate.out
***************
*** 0 ****
--- 1,63 ----
+ Parsed test spec with 2 sessions
+ 
+ starting permutation: begin select roll trunc
+ step begin: 
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ 
+ foo            
+ 
+ 1              
+ step select: SELECT * FROM foo;
+ i              
+ 
+ 1              
+ step roll: ROLLBACK;
+ step trunc: TRUNCATE TABLE foo;
+ 
+ starting permutation: begin select trunc roll
+ step begin: 
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ 
+ foo            
+ 
+ 1              
+ step select: SELECT * FROM foo;
+ i              
+ 
+ 1              
+ step trunc: TRUNCATE TABLE foo; <waiting ...>
+ step roll: ROLLBACK;
+ step trunc: <... completed>
+ 
+ starting permutation: begin trunc select roll
+ step begin: 
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ 
+ foo            
+ 
+ 1              
+ step trunc: TRUNCATE TABLE foo;
+ step select: SELECT * FROM foo;
+ ERROR:  canceling statement due to conflict with TRUNCATE TABLE on foo
+ step roll: ROLLBACK;
+ 
+ starting permutation: trunc begin select roll
+ step trunc: TRUNCATE TABLE foo;
+ step begin: 
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ 
+ foo            
+ 
+ 1              
+ step select: SELECT * FROM foo;
+ i              
+ 
+ step roll: ROLLBACK;
diff --git a/src/test/isolation/specs/truncate.spec b/src/test/isolation/specs/truncate.spec
new file mode 100644
index ...2c7b707
*** a/src/test/isolation/specs/truncate.spec
--- b/src/test/isolation/specs/truncate.spec
***************
*** 0 ****
--- 1,23 ----
+ setup
+ {
+   CREATE TABLE foo (i int);
+   INSERT INTO foo VALUES (1);
+ }
+ 
+ teardown
+ {
+   DROP TABLE foo;
+ }
+ 
+ session "s1"
+ step "begin"
+ {
+   BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+   -- Force snapshot open
+   SELECT 1 AS foo FROM txid_current();
+ }
+ step "select"	{ SELECT * FROM foo; }
+ step "roll"		{ ROLLBACK; }
+ 
+ session "s2"
+ step "trunc"	{ TRUNCATE TABLE foo; }
#13Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#12)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Sun, Mar 4, 2012 at 9:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Sun, Mar 4, 2012 at 2:28 AM, Marti Raudsepp <marti@juffo.org> wrote:

On Sat, Mar 3, 2012 at 14:53, Simon Riggs <simon@2ndquadrant.com> wrote:

Thanks Noah for drawing attention to this thread. I hadn't been
watching. As you say, this work would allow me to freeze rows at load
time and avoid the overhead of hint bit setting, which avoids
performance issues from hint bit setting in checksum patch.

I've reviewed this patch and it seems OK to me. Good work Marti.

...

v3 attached.

More detailed thoughts show that the test in heap_beginscan_internal()
is not right enough, i.e. wrong.

We need a specific XidInMVCCSnapshot test on the relvalidxid, so it
needs to be a specific xid, not an xmin because otherwise we can get
concurrent transactions failing, not just older transactions.

If we're going freeze tuples on load this needs to be watertight, so
some minor rework needed.

Of course, if we only have a valid xid on the class we might get new
columns added when we do repeated SELECT * statements using the same
snapshot while concurrent DDL occurs. That is impractical, so if we
define this as applying to rows it can work; if we want it to apply to
everything its getting more difficult.

Longer term we might fix this by making all catalog access use MVCC,
but that suffers the same problems with ALTER TABLEs that rewrite rows
to add columns. I don't see a neat future solution that is worth
waiting for.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#14Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#13)
1 attachment(s)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Sun, Mar 4, 2012 at 1:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Sun, Mar 4, 2012 at 9:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Sun, Mar 4, 2012 at 2:28 AM, Marti Raudsepp <marti@juffo.org> wrote:

On Sat, Mar 3, 2012 at 14:53, Simon Riggs <simon@2ndquadrant.com> wrote:

Thanks Noah for drawing attention to this thread. I hadn't been
watching. As you say, this work would allow me to freeze rows at load
time and avoid the overhead of hint bit setting, which avoids
performance issues from hint bit setting in checksum patch.

I've reviewed this patch and it seems OK to me. Good work Marti.

...

v3 attached.

More detailed thoughts show that the test in heap_beginscan_internal()
is not right enough, i.e. wrong.

We need a specific XidInMVCCSnapshot test on the relvalidxid, so it
needs to be a specific xid, not an xmin because otherwise we can get
concurrent transactions failing, not just older transactions.

Marti, please review this latest version which has new isolation tests added.

This does both TRUNCATE and CREATE TABLE.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

strictmvcc.v1.patchtext/x-diff; charset=US-ASCII; name=strictmvcc.v1.patchDownload
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c910863..3d90125 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -73,7 +73,7 @@
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
-
+bool		StrictMVCC = true;
 
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
@@ -1175,6 +1175,24 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	HeapScanDesc scan;
 
 	/*
+	 * If the snapshot can't see relvalidxid then that means either the table
+	 * is newly created or has recently been truncated. Either way, we aren't
+	 * allowed to view the rows in StrictMVCC mode.
+	 */
+	if (IsMVCCSnapshot(snapshot) &&
+		StrictMVCC &&
+		XidInMVCCSnapshot(relation->rd_rel->relvalidxid, snapshot))
+	{
+		/* Unless we created or truncated the table recently ourselves */
+		if (relation->rd_createSubid == InvalidSubTransactionId &&
+			relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("cannot access recently added or recently removed data in %s",
+							 NameStr(relation->rd_rel->relname))));
+	}
+
+	/*
 	 * increment relation ref count while scanning relation
 	 *
 	 * This is just to make really sure the relcache entry won't go away while
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index aef410a..eb93a7c 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -787,6 +787,7 @@ InsertPgClassTuple(Relation pg_class_desc,
 	values[Anum_pg_class_relhastriggers - 1] = BoolGetDatum(rd_rel->relhastriggers);
 	values[Anum_pg_class_relhassubclass - 1] = BoolGetDatum(rd_rel->relhassubclass);
 	values[Anum_pg_class_relfrozenxid - 1] = TransactionIdGetDatum(rd_rel->relfrozenxid);
+	values[Anum_pg_class_relvalidxid - 1] = TransactionIdGetDatum(rd_rel->relvalidxid);
 	if (relacl != (Datum) 0)
 		values[Anum_pg_class_relacl - 1] = relacl;
 	else
@@ -872,6 +873,7 @@ AddNewRelationTuple(Relation pg_class_desc,
 		 * that will do.
 		 */
 		new_rel_reltup->relfrozenxid = RecentXmin;
+		new_rel_reltup->relvalidxid = GetCurrentTransactionId();
 	}
 	else
 	{
@@ -882,6 +884,7 @@ AddNewRelationTuple(Relation pg_class_desc,
 		 * commands/sequence.c.)
 		 */
 		new_rel_reltup->relfrozenxid = InvalidTransactionId;
+		new_rel_reltup->relvalidxid = InvalidTransactionId;
 	}
 
 	new_rel_reltup->relowner = relowner;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bfbe642..0d96a6a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2854,7 +2854,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
 		}
 
 		/* We'll build a new physical relation for the index */
-		RelationSetNewRelfilenode(iRel, InvalidTransactionId);
+		RelationSetNewRelfilenode(iRel, InvalidTransactionId, InvalidTransactionId);
 
 		/* Initialize the index and rebuild */
 		/* Note: we do not need to re-establish pkey setting */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index b40e57b..44b891c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -538,6 +538,7 @@ do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh)
 							totalrows,
 							visibilitymap_count(onerel),
 							hasindex,
+							InvalidTransactionId,
 							InvalidTransactionId);
 
 	/*
@@ -558,6 +559,7 @@ do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, bool inh)
 								totalindexrows,
 								0,
 								false,
+								InvalidTransactionId,
 								InvalidTransactionId);
 		}
 	}
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 1f95abc..ab4aec2 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -287,7 +287,7 @@ ResetSequence(Oid seq_relid)
 	 * Create a new storage file for the sequence.	We want to keep the
 	 * sequence's relfrozenxid at 0, since it won't contain any unfrozen XIDs.
 	 */
-	RelationSetNewRelfilenode(seq_rel, InvalidTransactionId);
+	RelationSetNewRelfilenode(seq_rel, InvalidTransactionId, InvalidTransactionId);
 
 	/*
 	 * Insert the modified tuple into the new storage file.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cd4490a..aa9d2e9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -19,6 +19,7 @@
 #include "access/reloptions.h"
 #include "access/relscan.h"
 #include "access/sysattr.h"
+#include "access/transam.h"
 #include "access/xact.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@@ -1121,7 +1122,7 @@ ExecuteTruncate(TruncateStmt *stmt)
 			 * as the relfilenode value. The old storage file is scheduled for
 			 * deletion at commit.
 			 */
-			RelationSetNewRelfilenode(rel, RecentXmin);
+			RelationSetNewRelfilenode(rel, RecentXmin, GetCurrentTransactionId());
 
 			heap_relid = RelationGetRelid(rel);
 			toast_relid = rel->rd_rel->reltoastrelid;
@@ -1132,7 +1133,12 @@ ExecuteTruncate(TruncateStmt *stmt)
 			if (OidIsValid(toast_relid))
 			{
 				rel = relation_open(toast_relid, AccessExclusiveLock);
-				RelationSetNewRelfilenode(rel, RecentXmin);
+				RelationSetNewRelfilenode(rel, RecentXmin, InvalidTransactionId);
+
+				/*
+				 * We don't need a cmin here because there can't be any open
+				 * cursors in the same session as the TRUNCATE command.
+				 */
 				heap_close(rel, NoLock);
 			}
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 353af50..a0cfe54 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -573,7 +573,8 @@ void
 vac_update_relstats(Relation relation,
 					BlockNumber num_pages, double num_tuples,
 					BlockNumber num_all_visible_pages,
-					bool hasindex, TransactionId frozenxid)
+					bool hasindex, TransactionId frozenxid,
+					TransactionId oldestxmin)
 {
 	Oid			relid = RelationGetRelid(relation);
 	Relation	rd;
@@ -647,6 +648,21 @@ vac_update_relstats(Relation relation,
 		dirty = true;
 	}
 
+	/*
+	 * Reset relvalidxid if it's old enough. Resetting this value
+	 * could cause MVCC failures on a standby that doesn't have
+	 * hot_standby_feedback enabled, so only reset the value if we
+	 * are certain that all prior snapshots have ended or been
+	 * removed via conflict. Value should always be invalid for indexes.
+	 */
+	if (TransactionIdIsValid(oldestxmin) &&
+		TransactionIdIsNormal(pgcform->relvalidxid) &&
+		TransactionIdPrecedes(pgcform->relvalidxid, oldestxmin))
+	{
+		pgcform->relvalidxid = InvalidTransactionId;
+		dirty = true;
+	}
+
 	/* If anything changed, write out the tuple. */
 	if (dirty)
 		heap_inplace_update(rd, ctup);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2fc749e..3b88652 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -255,7 +255,8 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
 						new_rel_tuples,
 						new_rel_allvisible,
 						vacrelstats->hasindex,
-						new_frozen_xid);
+						new_frozen_xid,
+						vacrelstats->latestRemovedXid);
 
 	/* report results to the stats collector, too */
 	pgstat_report_vacuum(RelationGetRelid(onerel),
@@ -1185,6 +1186,7 @@ lazy_cleanup_index(Relation indrel,
 							stats->num_index_tuples,
 							0,
 							false,
+							InvalidTransactionId,
 							InvalidTransactionId);
 
 	ereport(elevel,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a59950e..10ac6b8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2605,7 +2605,8 @@ RelationBuildLocalRelation(const char *relname,
  * the XIDs that will be put into the new relation contents.
  */
 void
-RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid)
+RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid,
+						  TransactionId validxid)
 {
 	Oid			newrelfilenode;
 	RelFileNodeBackend newrnode;
@@ -2673,6 +2674,7 @@ RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid)
 		classform->relallvisible = 0;
 	}
 	classform->relfrozenxid = freezeXid;
+	classform->relvalidxid = validxid;
 
 	simple_heap_update(pg_class, &tuple->t_self, tuple);
 	CatalogUpdateIndexes(pg_class, tuple);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 486bdcd..e4a19aa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1430,6 +1430,19 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 
 	{
+		{"strict_mvcc", PGC_POSTMASTER, COMPAT_OPTIONS_PREVIOUS,
+			gettext_noop("Enables backward compatibility mode of strict MVCC "
+						 "for TRUNCATE and CREATE TABLE."),
+			gettext_noop("When enabled viewing a truncated or newly created table "
+						 "will throw a serialization error to prevent MVCC "
+						 "discrepancies that were possible prior to 9.2.")
+		},
+		&StrictMVCC,
+		true,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"quote_all_identifiers", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,
 			gettext_noop("When generating SQL fragments, quote all identifiers."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 96da086..f56e374 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -548,6 +548,7 @@
 #quote_all_identifiers = off
 #sql_inheritance = on
 #standard_conforming_strings = on
+#strict_mvcc = on
 #synchronize_seqscans = on
 
 # - Other Platforms and Clients -
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 31791a7..4b2a119 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -72,9 +72,6 @@ SnapshotData SnapshotSelfData = {HeapTupleSatisfiesSelf};
 SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
 SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
 
-/* local functions */
-static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
-
 
 /*
  * SetHintBits()
@@ -1239,7 +1236,7 @@ HeapTupleSatisfiesVacuum(HeapTupleHeader tuple, TransactionId OldestXmin,
  * by this function.  This is OK for current uses, because we actually only
  * apply this for known-committed XIDs.
  */
-static bool
+bool
 XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 {
 	uint32		i;
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 0335347..223f157 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	201203021
+#define CATALOG_VERSION_NO	201203031
 
 #endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 1567206..4a78bdb 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -67,6 +67,7 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
 	bool		relhastriggers; /* has (or has had) any TRIGGERs */
 	bool		relhassubclass; /* has (or has had) derived classes */
 	TransactionId relfrozenxid; /* all Xids < this are frozen in this rel */
+	TransactionId relvalidxid;	/* data is only valid for snapshots > this Xid */
 
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* NOTE: These fields are not present in a relcache entry's rd_rel field. */
@@ -77,7 +78,7 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
 
 /* Size of fixed part of pg_class tuples, not counting var-length fields */
 #define CLASS_TUPLE_SIZE \
-	 (offsetof(FormData_pg_class,relfrozenxid) + sizeof(TransactionId))
+	 (offsetof(FormData_pg_class,relvalidxid) + sizeof(TransactionId))
 
 /* ----------------
  *		Form_pg_class corresponds to a pointer to a tuple with
@@ -91,7 +92,7 @@ typedef FormData_pg_class *Form_pg_class;
  * ----------------
  */
 
-#define Natts_pg_class					27
+#define Natts_pg_class					28
 #define Anum_pg_class_relname			1
 #define Anum_pg_class_relnamespace		2
 #define Anum_pg_class_reltype			3
@@ -117,8 +118,9 @@ typedef FormData_pg_class *Form_pg_class;
 #define Anum_pg_class_relhastriggers	23
 #define Anum_pg_class_relhassubclass	24
 #define Anum_pg_class_relfrozenxid		25
-#define Anum_pg_class_relacl			26
-#define Anum_pg_class_reloptions		27
+#define Anum_pg_class_relvalidxid		26
+#define Anum_pg_class_relacl			27
+#define Anum_pg_class_reloptions		28
 
 /* ----------------
  *		initial contents of pg_class
@@ -130,13 +132,13 @@ typedef FormData_pg_class *Form_pg_class;
  */
 
 /* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId */
-DATA(insert OID = 1247 (  pg_type		PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1247 (  pg_type		PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f 3 0 _null_ _null_ ));
 DESCR("");
-DATA(insert OID = 1249 (  pg_attribute	PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1249 (  pg_attribute	PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f 3 0 _null_ _null_ ));
 DESCR("");
-DATA(insert OID = 1255 (  pg_proc		PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1255 (  pg_proc		PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f 3 0 _null_ _null_ ));
 DESCR("");
-DATA(insert OID = 1259 (  pg_class		PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1259 (  pg_class		PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f 3 0 _null_ _null_ ));
 DESCR("");
 
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4526648..fa79231 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -151,7 +151,8 @@ extern void vac_update_relstats(Relation relation,
 					double num_tuples,
 					BlockNumber num_all_visible_pages,
 					bool hasindex,
-					TransactionId frozenxid);
+					TransactionId frozenxid,
+					TransactionId oldestxmin);
 extern void vacuum_set_xid_limits(int freeze_min_age, int freeze_table_age,
 					  bool sharedRel,
 					  TransactionId *oldestXmin,
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 610cb59..9d2be31 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -342,6 +342,8 @@ extern ProcessingMode Mode;
 
 #define GetProcessingMode() Mode
 
+/* in access/heap/heapam.c */
+extern bool	StrictMVCC;
 
 /*****************************************************************************
  *	  pinit.h --															 *
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 2f39486..0fb2167 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -76,7 +76,8 @@ extern Relation RelationBuildLocalRelation(const char *relname,
  * Routine to manage assignment of new relfilenode to a relation
  */
 extern void RelationSetNewRelfilenode(Relation relation,
-						  TransactionId freezeXid);
+						  TransactionId freezeXid,
+						  TransactionId validxid);
 
 /*
  * Routines for flushing/rebuilding relcache entries in various scenarios
diff --git a/src/include/utils/tqual.h b/src/include/utils/tqual.h
index 0f8a7f8..ffdcfd5 100644
--- a/src/include/utils/tqual.h
+++ b/src/include/utils/tqual.h
@@ -87,4 +87,6 @@ extern HTSV_Result HeapTupleSatisfiesVacuum(HeapTupleHeader tuple,
 extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 					 uint16 infomask, TransactionId xid);
 
+extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
+
 #endif   /* TQUAL_H */
diff --git a/src/test/isolation/expected/createtableandcopy.out b/src/test/isolation/expected/createtableandcopy.out
new file mode 100644
index 0000000..241bda9
--- /dev/null
+++ b/src/test/isolation/expected/createtableandcopy.out
@@ -0,0 +1,81 @@
+Parsed test spec with 2 sessions
+
+starting permutation: begin select roll create
+step begin: 
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+
+foo            
+
+1              
+step select: SELECT * FROM foo;
+ERROR:  relation "foo" does not exist
+step roll: ROLLBACK;
+step create: 
+  BEGIN;
+  CREATE TABLE foo (i int);
+  COPY foo FROM '/tmp/createtableandcopy.spec.data';
+  COMMIT;
+
+
+starting permutation: begin select create roll
+step begin: 
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+
+foo            
+
+1              
+step select: SELECT * FROM foo;
+ERROR:  relation "foo" does not exist
+step create: 
+  BEGIN;
+  CREATE TABLE foo (i int);
+  COPY foo FROM '/tmp/createtableandcopy.spec.data';
+  COMMIT;
+
+step roll: ROLLBACK;
+
+starting permutation: begin create select roll
+step begin: 
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+
+foo            
+
+1              
+step create: 
+  BEGIN;
+  CREATE TABLE foo (i int);
+  COPY foo FROM '/tmp/createtableandcopy.spec.data';
+  COMMIT;
+
+step select: SELECT * FROM foo;
+ERROR:  cannot access recently added or recently removed data in foo
+step roll: ROLLBACK;
+
+starting permutation: create begin select roll
+step create: 
+  BEGIN;
+  CREATE TABLE foo (i int);
+  COPY foo FROM '/tmp/createtableandcopy.spec.data';
+  COMMIT;
+
+step begin: 
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+
+foo            
+
+1              
+step select: SELECT * FROM foo;
+i              
+
+1              
+2              
+3              
+step roll: ROLLBACK;
diff --git a/src/test/isolation/expected/createtableandinsert.out b/src/test/isolation/expected/createtableandinsert.out
new file mode 100644
index 0000000..b709dee
--- /dev/null
+++ b/src/test/isolation/expected/createtableandinsert.out
@@ -0,0 +1,79 @@
+Parsed test spec with 2 sessions
+
+starting permutation: begin select roll create
+step begin: 
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+
+foo            
+
+1              
+step select: SELECT * FROM foo;
+ERROR:  relation "foo" does not exist
+step roll: ROLLBACK;
+step create: 
+  BEGIN;
+  CREATE TABLE foo (i int);
+  INSERT INTO foo VALUES (1);
+  COMMIT;
+
+
+starting permutation: begin select create roll
+step begin: 
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+
+foo            
+
+1              
+step select: SELECT * FROM foo;
+ERROR:  relation "foo" does not exist
+step create: 
+  BEGIN;
+  CREATE TABLE foo (i int);
+  INSERT INTO foo VALUES (1);
+  COMMIT;
+
+step roll: ROLLBACK;
+
+starting permutation: begin create select roll
+step begin: 
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+
+foo            
+
+1              
+step create: 
+  BEGIN;
+  CREATE TABLE foo (i int);
+  INSERT INTO foo VALUES (1);
+  COMMIT;
+
+step select: SELECT * FROM foo;
+ERROR:  cannot access recently added or recently removed data in foo
+step roll: ROLLBACK;
+
+starting permutation: create begin select roll
+step create: 
+  BEGIN;
+  CREATE TABLE foo (i int);
+  INSERT INTO foo VALUES (1);
+  COMMIT;
+
+step begin: 
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+
+foo            
+
+1              
+step select: SELECT * FROM foo;
+i              
+
+1              
+step roll: ROLLBACK;
diff --git a/src/test/isolation/expected/truncate.out b/src/test/isolation/expected/truncate.out
new file mode 100644
index 0000000..29c20db
--- /dev/null
+++ b/src/test/isolation/expected/truncate.out
@@ -0,0 +1,63 @@
+Parsed test spec with 2 sessions
+
+starting permutation: begin select roll trunc
+step begin: 
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+
+foo            
+
+1              
+step select: SELECT * FROM foo;
+i              
+
+1              
+step roll: ROLLBACK;
+step trunc: TRUNCATE TABLE foo;
+
+starting permutation: begin select trunc roll
+step begin: 
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+
+foo            
+
+1              
+step select: SELECT * FROM foo;
+i              
+
+1              
+step trunc: TRUNCATE TABLE foo; <waiting ...>
+step roll: ROLLBACK;
+step trunc: <... completed>
+
+starting permutation: begin trunc select roll
+step begin: 
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+
+foo            
+
+1              
+step trunc: TRUNCATE TABLE foo;
+step select: SELECT * FROM foo;
+ERROR:  cannot access recently added or recently removed data in foo
+step roll: ROLLBACK;
+
+starting permutation: trunc begin select roll
+step trunc: TRUNCATE TABLE foo;
+step begin: 
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+
+foo            
+
+1              
+step select: SELECT * FROM foo;
+i              
+
+step roll: ROLLBACK;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 669c0f2..6515824 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -1,3 +1,6 @@
+test: truncate
+test: createtableandinsert
+test: createtableandcopy
 test: simple-write-skew
 test: receipt-report
 test: temporal-range-integrity
diff --git a/src/test/isolation/specs/createtableandcopy.spec b/src/test/isolation/specs/createtableandcopy.spec
new file mode 100644
index 0000000..84e3629
--- /dev/null
+++ b/src/test/isolation/specs/createtableandcopy.spec
@@ -0,0 +1,28 @@
+setup
+{
+  CREATE TABLE copydata AS SELECT generate_series(1,3)::integer AS a;
+  COPY copydata TO '/tmp/createtableandcopy.spec.data';
+}
+
+session "s1"
+step "begin"
+{
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+}
+step "select"	{ SELECT * FROM foo; }
+step "roll"		{ ROLLBACK; }
+
+session "s2"
+step "create"	{
+  BEGIN;
+  CREATE TABLE foo (i int);
+  COPY foo FROM '/tmp/createtableandcopy.spec.data';
+  COMMIT;
+}
+
+teardown
+{
+  DROP TABLE foo, copydata;
+}
diff --git a/src/test/isolation/specs/createtableandinsert.spec b/src/test/isolation/specs/createtableandinsert.spec
new file mode 100644
index 0000000..9a152ab
--- /dev/null
+++ b/src/test/isolation/specs/createtableandinsert.spec
@@ -0,0 +1,22 @@
+session "s1"
+step "begin"
+{
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+}
+step "select"	{ SELECT * FROM foo; }
+step "roll"		{ ROLLBACK; }
+
+session "s2"
+step "create"	{
+  BEGIN;
+  CREATE TABLE foo (i int);
+  INSERT INTO foo VALUES (1);
+  COMMIT;
+}
+
+teardown
+{
+  DROP TABLE foo;
+}
diff --git a/src/test/isolation/specs/truncate.spec b/src/test/isolation/specs/truncate.spec
new file mode 100644
index 0000000..2c7b707
--- /dev/null
+++ b/src/test/isolation/specs/truncate.spec
@@ -0,0 +1,23 @@
+setup
+{
+  CREATE TABLE foo (i int);
+  INSERT INTO foo VALUES (1);
+}
+
+teardown
+{
+  DROP TABLE foo;
+}
+
+session "s1"
+step "begin"
+{
+  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+  -- Force snapshot open
+  SELECT 1 AS foo FROM txid_current();
+}
+step "select"	{ SELECT * FROM foo; }
+step "roll"		{ ROLLBACK; }
+
+session "s2"
+step "trunc"	{ TRUNCATE TABLE foo; }
#15Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#14)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Sun, Mar 4, 2012 at 11:39 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Marti, please review this latest version which has new isolation tests added.

This does both TRUNCATE and CREATE TABLE.

I don't see any need for a GUC to control this behavior. The current
behavior is wrong, so if we're going to choose this path to fix it,
then we should just fix it, period. The narrow set of circumstances
in which it might be beneficial to disable this behavior doesn't seem
to me to be sufficient to justify a behavior-changing GUC.

It does not seem right that the logic for detecting the serialization
error is in heap_beginscan_internal(). Surely this is just as much of
a problem for an index-scan or index-only-scan. We don't want to
patch all those places individually, either: I think the check should
happen right around the time we initially lock the relation and build
its relcache entry.

The actual text of the error message could use some work. Maybe
something like "could not serialize access due to concurrent DDL",
although I think we try to avoid using acronyms like DDL in
translatable strings.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#15)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Mon, Mar 5, 2012 at 4:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Mar 4, 2012 at 11:39 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Marti, please review this latest version which has new isolation tests added.

This does both TRUNCATE and CREATE TABLE.

I don't see any need for a GUC to control this behavior.  The current
behavior is wrong, so if we're going to choose this path to fix it,
then we should just fix it, period.  The narrow set of circumstances
in which it might be beneficial to disable this behavior doesn't seem
to me to be sufficient to justify a behavior-changing GUC.

I agree behaviour is wrong, the only question is whether our users
rely in some way on that behaviour. Given the long discussion on that
point earlier I thought it best to add a GUC. Easy to remove, now or
later.

It does not seem right that the logic for detecting the serialization
error is in heap_beginscan_internal().  Surely this is just as much of
a problem for an index-scan or index-only-scan.

err, very good point. Doh.

We don't want to
patch all those places individually, either: I think the check should
happen right around the time we initially lock the relation and build
its relcache entry.

OK, that makes sense and works if we need to rebuild relcache.

The actual text of the error message could use some work.  Maybe
something like "could not serialize access due to concurrent DDL",
although I think we try to avoid using acronyms like DDL in
translatable strings.

Yeh that was designed-to-be-replaced text. We do use DDL already
elsewhere without really explaining it; its also one of those acronyms
that doesn't actually explain what it really means very well. So I
like the phrase you suggest.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#17Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#16)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Mon, Mar 5, 2012 at 4:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

It does not seem right that the logic for detecting the serialization
error is in heap_beginscan_internal().  Surely this is just as much of
a problem for an index-scan or index-only-scan.

err, very good point. Doh.

We don't want to
patch all those places individually, either: I think the check should
happen right around the time we initially lock the relation and build
its relcache entry.

OK, that makes sense and works if we need to rebuild relcache.

Except the reason to do it at the start of the scan is that is the
first time a specific snapshot has been associated with a relation and
also the last point we can apply the check before the errant behaviour
occurs.

If we reject locks against tables that might be used against an
illegal snapshot then we could easily prevent valid snapshot use cases
when a transaction has multiple snapshots, one illegal, one not.

We can certainly have a looser test when we first get the lock and
then another test later, but I don't think we can avoid making all
scans apply this test. And while I'm there, we have to add tests for
things like index build scans.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#18Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#16)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Mon, Mar 5, 2012 at 11:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

I agree behaviour is wrong, the only question is whether our users
rely in some way on that behaviour. Given the long discussion on that
point earlier I thought it best to add a GUC. Easy to remove, now or
later.

AFAICT, all the discussion upthread was about whether we ought to be
trying to implement this in some entirely-different way that doesn't
rely on storing XIDs in the catalog. I have a feeling that there are
a whole lot more cases like this and that we're in for a lot of
unpleasant nastiness if we go very far down this route; pg_constraint
is another one, as SnapshotNow can see constraints that may not be
valid under the query's MVCC snapshot. On the other hand, if someone
comes up with a better way, I suppose we can always rip this out. In
any case, I don't remember anyone saying that this needed to be
configurable.

Speaking of that, though, I have one further thought on this: we need
to be absolutely certain that autovacuum is going to prevent this XID
value from wrapping around. I suppose this is safe since, even if
autovacuum is turned off, we'll forcibly kick it off every so often to
advance relfrozenxid, and that will reset relvalidxid while it's
there. But then again on second thought, what if relvalidxid lags
relfrozenxid? Then the emergency autovacuum might not kick in until
relvalidxid has already wrapped around. I think that could happen
after a TRUNCATE, perhaps, since I think that would leave relfrozenxid
alone while advancing relvalidxid.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#17)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Mon, Mar 5, 2012 at 12:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, Mar 5, 2012 at 4:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

It does not seem right that the logic for detecting the serialization
error is in heap_beginscan_internal().  Surely this is just as much of
a problem for an index-scan or index-only-scan.

err, very good point. Doh.

We don't want to
patch all those places individually, either: I think the check should
happen right around the time we initially lock the relation and build
its relcache entry.

OK, that makes sense and works if we need to rebuild relcache.

Except the reason to do it at the start of the scan is that is the
first time a specific snapshot has been associated with a relation and
also the last point we can apply the check before the errant behaviour
occurs.

If we reject locks against tables that might be used against an
illegal snapshot then we could easily prevent valid snapshot use cases
when a transaction has multiple snapshots, one illegal, one not.

We can certainly have a looser test when we first get the lock and
then another test later, but I don't think we can avoid making all
scans apply this test. And while I'm there, we have to add tests for
things like index build scans.

Well, there's no point that I can see in having two checks. I just
dislike the idea that we have to remember to add this check for every
method of accessing the relation - doesn't seem terribly future-proof.
It gets even worse if you start adding checks to DDL code paths - if
we're going to do that, we really need to cover them all, and that
doesn't seem very practical if they're going to spread out all over
the place.

I don't understand your comment that a snapshot doesn't get associated
with a relation until scan time. I believe we associated a snapshot
with each query before we even know what relations are involved; that
query then gets passed down to all the individual scans. The query
also opens and locks those relations. We ought to be able to arrange
for the query snapshot to be cross-checked at that point, rather than
waiting until scan-start time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#6)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Mon, Feb 13, 2012 at 09:29:56AM -0500, Robert Haas wrote:

On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch <noah@leadboat.com> wrote:

I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
not at READ COMMITTED. ?They tend to be narrow race conditions at READ
COMMITTED, yet easy to demonstrate at REPEATABLE READ. ?Related:
http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php

Yeah. Well, that's actually an interesting example, because it
illustrates how general this problem is. We could potentially get
ourselves into a situation where just about every system catalog table
needs an xmin field to store the point at which the object came into
existence - or for that matter, was updated.

I can see this strategy applying to many relation-pertinent system catalogs.
Do you foresee applications to non-relation catalogs?

In any event, I think a pg_class.relvalidxmin is the right starting point.
One might imagine a family of relvalidxmin, convalidxmin, indcheckxmin
(already exists), inhvalidxmin, and attvalidxmin. relvalidxmin is like the
AccessExclusiveLock of that family; it necessarily blocks everything that
might impugn the others. The value in extending this to more catalogs is the
ability to narrow the impact of failing the check. A failed indcheckxmin
comparison merely excludes plans involving the index. A failed inhvalidxmin
check might just skip recursion to the table in question. Those are further
refinements, much like using weaker heavyweight lock types.

But it's not quite the
same as the xmin of the row itself, because some updates might be
judged not to matter. There could also be intermediate cases where
updates are invalidating for some purposes but not others. I think
we'd better get our hands around more of the problem space before we
start trying to engineer solutions.

I'm not seeing that problem. Any operation that would update some xmin
horizon should set it to the greater of its current value and the value the
operation needs for its own correctness. If you have something in mind that
needs more, could you elaborate?

Incidentally, people use READ COMMITTED because they don't question the
default, not because they know hazards of REPEATABLE READ. ?I don't know the
bustedness you speak of; could we improve the documentation to inform folks?

The example that I remember was related to SELECT FOR UPDATE/SELECT
FOR SHARE. The idea of those statements is that you want to prevent
the row from being updated or deleted until some other concurrent
action is complete; for example, in the case of a foreign key, we'd
like to prevent the referenced row from being deleted or updated in
the relevant columns until the inserting transaction is committed.
But it doesn't work, because when the updating or deleting process
gets done with the lock wait, they are still using the same snapshot
as before, and merrily do exactly the the thing that the lock-wait was
supposed to prevent. If an actual UPDATE is used, it's safe (I
think): anyone who was going to UPDATE or DELETE the row will fail
with some kind of serialization error. But a SELECT FOR UPDATE that
commits is treated more like an UPDATE that rolls back: it's as if the
lock never existed. Someone (Florian?) proposed a patch to change
this, but it seemed problematic for reasons I no longer exactly
remember.

Thanks. I vaguely remember that thread.

nm

#21Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#19)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Mon, Mar 5, 2012 at 5:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Well, there's no point that I can see in having two checks.  I just
dislike the idea that we have to remember to add this check for every
method of accessing the relation - doesn't seem terribly future-proof.
 It gets even worse if you start adding checks to DDL code paths - if
we're going to do that, we really need to cover them all, and that
doesn't seem very practical if they're going to spread out all over
the place.

Understood. Will look.

I don't understand your comment that a snapshot doesn't get associated
with a relation until scan time.  I believe we associated a snapshot
with each query before we even know what relations are involved; that
query then gets passed down to all the individual scans.  The query
also opens and locks those relations.  We ought to be able to arrange
for the query snapshot to be cross-checked at that point, rather than
waiting until scan-start time.

What's to stop other code using an older snapshot explicitly? That
fear may be bogus.

Any suggestions? ISTM we don't know whether we're already locked until
we get to LockAcquire() and there's no easy way to pass down snapshot
information through that, let alone handle RI snapshots. Ideas please.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#22Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#20)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Mon, Mar 5, 2012 at 2:22 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Feb 13, 2012 at 09:29:56AM -0500, Robert Haas wrote:

On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch <noah@leadboat.com> wrote:

I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
not at READ COMMITTED. ?They tend to be narrow race conditions at READ
COMMITTED, yet easy to demonstrate at REPEATABLE READ. ?Related:
http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php

Yeah.  Well, that's actually an interesting example, because it
illustrates how general this problem is.  We could potentially get
ourselves into a situation where just about every system catalog table
needs an xmin field to store the point at which the object came into
existence - or for that matter, was updated.

I can see this strategy applying to many relation-pertinent system catalogs.
Do you foresee applications to non-relation catalogs?

Well, in theory, we have similar issues if, say, a query uses a
function that didn't exist at the time the snapshot as taken; the
actual results the user sees may not be consistent with any serial
execution schedule. And the same could be true for any other SQL
object. It's unclear that those cases are as compelling as this one,
but then again it's unclear that no one will ever want to fix them,
either. For example, suppose we have a view v over a table t that
calls a function f. Somebody alters f to give different results and,
in the same transaction, modifies the contents of t (but no DDL).
This doesn't strike me as a terribly unlikely scenario; the change to
t could well be envisioned as a compensating transaction. But now if
somebody uses the new definition of f against the old contents of t,
the user may fail to get what they were hoping for out of bundling
those changes together in one transaction.

Now, maybe we're never going to fix those kinds of anomalies anyway,
but if we go with this architecture, then I think the chances of it
ever being palatable to try are pretty low.

In any event, I think a pg_class.relvalidxmin is the right starting point.
One might imagine a family of relvalidxmin, convalidxmin, indcheckxmin
(already exists), inhvalidxmin, and attvalidxmin.  relvalidxmin is like the
AccessExclusiveLock of that family; it necessarily blocks everything that
might impugn the others.  The value in extending this to more catalogs is the
ability to narrow the impact of failing the check.  A failed indcheckxmin
comparison merely excludes plans involving the index.  A failed inhvalidxmin
check might just skip recursion to the table in question.  Those are further
refinements, much like using weaker heavyweight lock types.

Yes, good parallel.

But it's not quite the
same as the xmin of the row itself, because some updates might be
judged not to matter.  There could also be intermediate cases where
updates are invalidating for some purposes but not others.  I think
we'd better get our hands around more of the problem space before we
start trying to engineer solutions.

I'm not seeing that problem.  Any operation that would update some xmin
horizon should set it to the greater of its current value and the value the
operation needs for its own correctness.  If you have something in mind that
needs more, could you elaborate?

Well, consider something like CLUSTER. It's perfectly OK for CLUSTER
to operate on a table that has been truncated since CLUSTER's snapshot
was taken, and no serialization anomaly is created that would not have
already existed as a result of the non-MVCC-safe TRUNCATE. On the
other hand, if CLUSTER operates on a table that was created since
CLUSTER's snapshot was taken, then you have a bona fide serialization
anomaly. Maybe not a very important one, but does that prove that
there's no significant problem of this type in general, or just
nobody's thought through all the cases yet? After all, the issues
with CREATE TABLE/TRUNCATE vs. a concurrent SELECT have been around
for a very long time, and we're only just getting around to looking at
them, so I don't have much confidence that there aren't other cases
floating around out there.

I guess another way to put this is that you could need "locks" of a
great number of different strengths to really handle all the cases.
It's going to be unappealing to, say, set the relation xmin when
setting the constraint xmin would do, or to fail for a concurrent
TRUNCATE as well as a concurrent CREATE TABLE when only the latter
logically requires a failure.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#23Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#22)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Mon, Mar 5, 2012 at 8:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:

In any event, I think a pg_class.relvalidxmin is the right starting point.
One might imagine a family of relvalidxmin, convalidxmin, indcheckxmin
(already exists), inhvalidxmin, and attvalidxmin.  relvalidxmin is like the
AccessExclusiveLock of that family; it necessarily blocks everything that
might impugn the others.  The value in extending this to more catalogs is the
ability to narrow the impact of failing the check.  A failed indcheckxmin
comparison merely excludes plans involving the index.  A failed inhvalidxmin
check might just skip recursion to the table in question.  Those are further
refinements, much like using weaker heavyweight lock types.

Yes, good parallel.

Did you guys get my comment about not being able to use an xmin value,
we have to use an xid value and to a an XidInMVCCSnapshot() test? Just
checking whether you agree/disagree.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#24Noah Misch
noah@leadboat.com
In reply to: Simon Riggs (#13)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Sun, Mar 04, 2012 at 01:02:57PM +0000, Simon Riggs wrote:

More detailed thoughts show that the test in heap_beginscan_internal()
is not right enough, i.e. wrong.

We need a specific XidInMVCCSnapshot test on the relvalidxid, so it
needs to be a specific xid, not an xmin because otherwise we can get
concurrent transactions failing, not just older transactions.

Good point; I agree. indcheckxmin's level of pessimism isn't appropriate for
this new check.

If we're going freeze tuples on load this needs to be watertight, so
some minor rework needed.

Of course, if we only have a valid xid on the class we might get new
columns added when we do repeated SELECT * statements using the same
snapshot while concurrent DDL occurs. That is impractical, so if we
define this as applying to rows it can work; if we want it to apply to
everything its getting more difficult.

Excess columns seem less grave to me than excess or missing rows. I'm having
difficulty thinking up an explanation for that opinion.

#25Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#22)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Mon, Mar 05, 2012 at 03:46:16PM -0500, Robert Haas wrote:

On Mon, Mar 5, 2012 at 2:22 PM, Noah Misch <noah@leadboat.com> wrote:

I can see this strategy applying to many relation-pertinent system catalogs.
Do you foresee applications to non-relation catalogs?

Well, in theory, we have similar issues if, say, a query uses a
function that didn't exist at the time the snapshot as taken; the
actual results the user sees may not be consistent with any serial
execution schedule. And the same could be true for any other SQL
object. It's unclear that those cases are as compelling as this one,
but then again it's unclear that no one will ever want to fix them,
either. For example, suppose we have a view v over a table t that
calls a function f. Somebody alters f to give different results and,
in the same transaction, modifies the contents of t (but no DDL).
This doesn't strike me as a terribly unlikely scenario; the change to
t could well be envisioned as a compensating transaction. But now if
somebody uses the new definition of f against the old contents of t,
the user may fail to get what they were hoping for out of bundling
those changes together in one transaction.

Good example.

Now, maybe we're never going to fix those kinds of anomalies anyway,
but if we go with this architecture, then I think the chances of it
ever being palatable to try are pretty low.

Why?

But it's not quite the
same as the xmin of the row itself, because some updates might be
judged not to matter. ?There could also be intermediate cases where
updates are invalidating for some purposes but not others. ?I think
we'd better get our hands around more of the problem space before we
start trying to engineer solutions.

I'm not seeing that problem. ?Any operation that would update some xmin
horizon should set it to the greater of its current value and the value the
operation needs for its own correctness. ?If you have something in mind that
needs more, could you elaborate?

Simon's point about xmin vs. xid probably leads to an example. One value is
fine for TRUNCATE, because only the most recent TRUNCATE matters. Not all DDL
is so simple.

Well, consider something like CLUSTER. It's perfectly OK for CLUSTER
to operate on a table that has been truncated since CLUSTER's snapshot
was taken, and no serialization anomaly is created that would not have
already existed as a result of the non-MVCC-safe TRUNCATE. On the
other hand, if CLUSTER operates on a table that was created since
CLUSTER's snapshot was taken, then you have a bona fide serialization
anomaly.

Core CLUSTER does not use any MVCC snapshot. We do push one for the benefit
of functions called during the reindex phase, but it does not appear that you
speak of that snapshot. Could you elaborate this example?

Maybe not a very important one, but does that prove that
there's no significant problem of this type in general, or just
nobody's thought through all the cases yet? After all, the issues
with CREATE TABLE/TRUNCATE vs. a concurrent SELECT have been around
for a very long time, and we're only just getting around to looking at
them, so I don't have much confidence that there aren't other cases
floating around out there.

Granted.

Thanks,
nm

#26Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#25)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Tue, Mar 6, 2012 at 5:43 AM, Noah Misch <noah@leadboat.com> wrote:

Now, maybe we're never going to fix those kinds of anomalies anyway,
but if we go with this architecture, then I think the chances of it
ever being palatable to try are pretty low.

Why?

Because it'll require at least one XID column in every system catalog
that represents an SQL catalog to plug all the cases, and I doubt very
much that we want to go there.

Simon's point about xmin vs. xid probably leads to an example.  One value is
fine for TRUNCATE, because only the most recent TRUNCATE matters.  Not all DDL
is so simple.

Yep.

Well, consider something like CLUSTER.  It's perfectly OK for CLUSTER
to operate on a table that has been truncated since CLUSTER's snapshot
was taken, and no serialization anomaly is created that would not have
already existed as a result of the non-MVCC-safe TRUNCATE.  On the
other hand, if CLUSTER operates on a table that was created since
CLUSTER's snapshot was taken, then you have a bona fide serialization
anomaly.

Core CLUSTER does not use any MVCC snapshot.  We do push one for the benefit
of functions called during the reindex phase, but it does not appear that you
speak of that snapshot.  Could you elaborate this example?

Imagine this:

- Transaction #1 acquires a snapshot.
- Transaction #2 creates tables A, inserts a row into table B, and then commits.
- Transaction #1 tries to CLUSTER A and then select from B.

The only serial execution schedules are T1 < T2, in which case the
transaction fails, or T2 < T1, in which case the row is seen. But
what actually happens is that the row isn't seen and yet the
transaction doesn't fail.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#27Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#26)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Tue, Mar 06, 2012 at 08:36:05AM -0500, Robert Haas wrote:

On Tue, Mar 6, 2012 at 5:43 AM, Noah Misch <noah@leadboat.com> wrote:

Well, consider something like CLUSTER. ?It's perfectly OK for CLUSTER
to operate on a table that has been truncated since CLUSTER's snapshot
was taken, and no serialization anomaly is created that would not have
already existed as a result of the non-MVCC-safe TRUNCATE. ?On the
other hand, if CLUSTER operates on a table that was created since
CLUSTER's snapshot was taken, then you have a bona fide serialization
anomaly.

Core CLUSTER does not use any MVCC snapshot. ?We do push one for the benefit
of functions called during the reindex phase, but it does not appear that you
speak of that snapshot. ?Could you elaborate this example?

Imagine this:

- Transaction #1 acquires a snapshot.
- Transaction #2 creates tables A, inserts a row into table B, and then commits.
- Transaction #1 tries to CLUSTER A and then select from B.

The only serial execution schedules are T1 < T2, in which case the
transaction fails, or T2 < T1, in which case the row is seen. But
what actually happens is that the row isn't seen and yet the
transaction doesn't fail.

For the purpose of contemplating this anomaly, one could just as well replace
CLUSTER with GRANT, COMMENT ON TABLE, or any other command that operates on a
table, correct?

I agree this test case is good to keep in mind while designing, but we could
well conclude not to bother improving it.

#28Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#27)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Wed, Mar 7, 2012 at 7:49 AM, Noah Misch <noah@leadboat.com> wrote:

On Tue, Mar 06, 2012 at 08:36:05AM -0500, Robert Haas wrote:

On Tue, Mar 6, 2012 at 5:43 AM, Noah Misch <noah@leadboat.com> wrote:

Well, consider something like CLUSTER. ?It's perfectly OK for CLUSTER
to operate on a table that has been truncated since CLUSTER's snapshot
was taken, and no serialization anomaly is created that would not have
already existed as a result of the non-MVCC-safe TRUNCATE. ?On the
other hand, if CLUSTER operates on a table that was created since
CLUSTER's snapshot was taken, then you have a bona fide serialization
anomaly.

Core CLUSTER does not use any MVCC snapshot. ?We do push one for the benefit
of functions called during the reindex phase, but it does not appear that you
speak of that snapshot. ?Could you elaborate this example?

Imagine this:

- Transaction #1 acquires a snapshot.
- Transaction #2 creates tables A, inserts a row into table B, and then commits.
- Transaction #1 tries to CLUSTER A and then select from B.

The only serial execution schedules are T1 < T2, in which case the
transaction fails, or T2 < T1, in which case the row is seen.  But
what actually happens is that the row isn't seen and yet the
transaction doesn't fail.

For the purpose of contemplating this anomaly, one could just as well replace
CLUSTER with GRANT, COMMENT ON TABLE, or any other command that operates on a
table, correct?

I agree this test case is good to keep in mind while designing, but we could
well conclude not to bother improving it.

All true.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#29Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#28)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Wed, Mar 7, 2012 at 2:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

All true.

So gentlemen, do we think this is worth pursuing further for this release?

I'm sure usual arguments apply all round, so I'll skip that part.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#30Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#29)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Wed, Mar 7, 2012 at 10:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, Mar 7, 2012 at 2:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

All true.

So gentlemen, do we think this is worth pursuing further for this release?

I'm sure usual arguments apply all round, so I'll skip that part.

This patch is awfully late to the party, but if we can nail it down
reasonably quickly I guess I'd be in favor of slipping something in.
I am not thrilled with the design as it stands, but bulk loading is a
known and serious pain point for us, so it would be awfully nice to
improve it. I'm not sure whether we should only go as far as setting
HEAP_XMIN_COMMITTED or whether we should actually try to mark the
tuples with FrozenXID. The former has the advantage of (I think) not
requiring any other changes to preserve MVCC semantics while the
latter is, obviously, a bigger performance improvement.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#31Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#30)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Wed, Mar 7, 2012 at 5:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 7, 2012 at 10:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, Mar 7, 2012 at 2:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

All true.

So gentlemen, do we think this is worth pursuing further for this release?

I'm sure usual arguments apply all round, so I'll skip that part.

This patch is awfully late to the party, but if we can nail it down
reasonably quickly I guess I'd be in favor of slipping something in.

Cool

I am not thrilled with the design as it stands, but bulk loading is a
known and serious pain point for us, so it would be awfully nice to
improve it.  I'm not sure whether we should only go as far as setting
HEAP_XMIN_COMMITTED or whether we should actually try to mark the
tuples with FrozenXID.  The former has the advantage of (I think) not
requiring any other changes to preserve MVCC semantics while the
latter is, obviously, a bigger performance improvement.

It's the other way around. Setting to FrozenTransactionId makes the
test in XidInMVCCSnapshot() pass when accessed by later commands in
the same transaction. If we just set the hint we need to play around
to get it accepted. So the frozen route is both best for performance
and least impact on fastpath visibility code. That part of the code is
solid.

The only problem is the visibility from older snapshots, we just
need/desire a better place to put the test. So I'll finish that off.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#32Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#31)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Wed, Mar 7, 2012 at 2:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I am not thrilled with the design as it stands, but bulk loading is a
known and serious pain point for us, so it would be awfully nice to
improve it.  I'm not sure whether we should only go as far as setting
HEAP_XMIN_COMMITTED or whether we should actually try to mark the
tuples with FrozenXID.  The former has the advantage of (I think) not
requiring any other changes to preserve MVCC semantics while the
latter is, obviously, a bigger performance improvement.

It's the other way around. Setting to FrozenTransactionId makes the
test in XidInMVCCSnapshot() pass when accessed by later commands in
the same transaction. If we just set the hint we need to play around
to get it accepted. So the frozen route is both best for performance
and least impact on fastpath visibility code. That part of the code is
solid.

Your comment is reminding me that there are actually two problems
here, or at least I think there are.

1. Some other transaction might look at the tuples.
2. An older snapshot (e.g. cursor) might look at the tuples.

Case #1 can happen when we create a table, insert some data, and
commit, and then some other transaction that took a snapshot before we
committed reads the table. It's OK if the tuples are marked
HEAP_XMIN_COMMITTED, because if we abort no other transaction will
ever see the new pg_class row as alive, and therefore no other
transaction can examine the table contents. But using FrozenXID as
the tuple xmin would allow those tuples to be seen by a transaction
that took its snapshot before we committed; this is the problem that
relvalidxid is designed to fix, and what I was thinking of when I said
that we need more infrastructure to handle the FrozenXID case.

Case #2 is certainly a problem for FrozenXID as well, because anything
that's marked with FrozenXID is going to look visible to everybody,
including our older snapshots. And I gather you're saying it's also a
problem for HEAP_XMIN_COMMITTED. I had assumed that the way we were
fixing this problem was to disable these optimizations for
transactions that had more than one snapshot floating around. I'm not
sure whether the patch does that or not, but I think it probably needs
to, unless you have some other idea for how to fix this. It doesn't
seem like an important restriction in practice because it's unlikely
that anyone would keep a cursor open across a bulk data load - and if
they do, this isn't the only problem they're going to have.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#33Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#32)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Wed, Mar 7, 2012 at 8:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 7, 2012 at 2:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I am not thrilled with the design as it stands, but bulk loading is a
known and serious pain point for us, so it would be awfully nice to
improve it.  I'm not sure whether we should only go as far as setting
HEAP_XMIN_COMMITTED or whether we should actually try to mark the
tuples with FrozenXID.  The former has the advantage of (I think) not
requiring any other changes to preserve MVCC semantics while the
latter is, obviously, a bigger performance improvement.

It's the other way around. Setting to FrozenTransactionId makes the
test in XidInMVCCSnapshot() pass when accessed by later commands in
the same transaction. If we just set the hint we need to play around
to get it accepted. So the frozen route is both best for performance
and least impact on fastpath visibility code. That part of the code is
solid.

Your comment is reminding me that there are actually two problems
here, or at least I think there are.

1. Some other transaction might look at the tuples.
2. An older snapshot (e.g. cursor) might look at the tuples.

Case #1 can happen when we create a table, insert some data, and
commit, and then some other transaction that took a snapshot before we
committed reads the table.  It's OK if the tuples are marked
HEAP_XMIN_COMMITTED, because if we abort no other transaction will
ever see the new pg_class row as alive, and therefore no other
transaction can examine the table contents.  But using FrozenXID as
the tuple xmin would allow those tuples to be seen by a transaction
that took its snapshot before we committed; this is the problem that
relvalidxid is designed to fix, and what I was thinking of when I said
that we need more infrastructure to handle the FrozenXID case.

Yes. If your purpose was to summarise, then yes that's where we're at.

Case #2 is certainly a problem for FrozenXID as well, because anything
that's marked with FrozenXID is going to look visible to everybody,
including our older snapshots.  And I gather you're saying it's also a
problem for HEAP_XMIN_COMMITTED.

The problem there is that later subtransactions often have xids that
are greater than xmax, so the xid shows as running when we do
XidInMVCCSnapshot(), which must then be altered for this one weird
case. I tried that and not happy with result.

I had assumed that the way we were
fixing this problem was to disable these optimizations for
transactions that had more than one snapshot floating around.  I'm not
sure whether the patch does that or not, but I think it probably needs
to

It does. I thought you already read the patch?

, unless you have some other idea for how to fix this.  It doesn't
seem like an important restriction in practice because it's unlikely
that anyone would keep a cursor open across a bulk data load - and if
they do, this isn't the only problem they're going to have.

Exactly.

So we're all good, apart from deciding the exact place to prevent
other transaction's older snapshots from seeing the newly frozen rows.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#34Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#33)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Wed, Mar 7, 2012 at 5:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Case #2 is certainly a problem for FrozenXID as well, because anything
that's marked with FrozenXID is going to look visible to everybody,
including our older snapshots.  And I gather you're saying it's also a
problem for HEAP_XMIN_COMMITTED.

The problem there is that later subtransactions often have xids that
are greater than xmax, so the xid shows as running when we do
XidInMVCCSnapshot(), which must then be altered for this one weird
case. I tried that and not happy with result.

Altering XidInMVCCSnapshot() seems like a good thing to avoid, but I
confess I don't quite follow what you're describing here otherwise.

I had assumed that the way we were
fixing this problem was to disable these optimizations for
transactions that had more than one snapshot floating around.  I'm not
sure whether the patch does that or not, but I think it probably needs
to

It does. I thought you already read the patch?

I glanced over it, but did not look through it in detail. I'll do a
more careful look at your next version.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#35Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#34)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Fri, Mar 9, 2012 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 7, 2012 at 5:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Case #2 is certainly a problem for FrozenXID as well, because anything
that's marked with FrozenXID is going to look visible to everybody,
including our older snapshots.  And I gather you're saying it's also a
problem for HEAP_XMIN_COMMITTED.

The problem there is that later subtransactions often have xids that
are greater than xmax, so the xid shows as running when we do
XidInMVCCSnapshot(), which must then be altered for this one weird
case. I tried that and not happy with result.

Altering XidInMVCCSnapshot() seems like a good thing to avoid, but I
confess I don't quite follow what you're describing here otherwise.

I had assumed that the way we were
fixing this problem was to disable these optimizations for
transactions that had more than one snapshot floating around.  I'm not
sure whether the patch does that or not, but I think it probably needs
to

It does. I thought you already read the patch?

I glanced over it, but did not look through it in detail.  I'll do a
more careful look at your next version.

I'm not confident about the restrictions this patch imposes and we
aren't close enough to a final version for me to honestly request this
be considered for this release. I think its time to close this door
for now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#36Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#35)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Fri, Mar 9, 2012 at 2:59 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Fri, Mar 9, 2012 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 7, 2012 at 5:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Case #2 is certainly a problem for FrozenXID as well, because anything
that's marked with FrozenXID is going to look visible to everybody,
including our older snapshots.  And I gather you're saying it's also a
problem for HEAP_XMIN_COMMITTED.

The problem there is that later subtransactions often have xids that
are greater than xmax, so the xid shows as running when we do
XidInMVCCSnapshot(), which must then be altered for this one weird
case. I tried that and not happy with result.

Altering XidInMVCCSnapshot() seems like a good thing to avoid, but I
confess I don't quite follow what you're describing here otherwise.

I had assumed that the way we were
fixing this problem was to disable these optimizations for
transactions that had more than one snapshot floating around.  I'm not
sure whether the patch does that or not, but I think it probably needs
to

It does. I thought you already read the patch?

I glanced over it, but did not look through it in detail.  I'll do a
more careful look at your next version.

I'm not confident about the restrictions this patch imposes and we
aren't close enough to a final version for me to honestly request this
be considered for this release. I think its time to close this door
for now.

OK, makes sense. I was trying to hold my nose, because I really would
like to see this stuff work better than it does, but I had my doubts,
too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#37Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#14)
Re: RFC: Making TRUNCATE more "MVCC-safe"

On Sun, 2012-03-04 at 16:39 +0000, Simon Riggs wrote:

v3 attached.

Added to next commitfest.

Regards,
Jeff Davis