Failed recovery with new faster 2PC code

Started by Jeff Janes, over 8 years ago. 13 messages.
#1 Jeff Janes
jeff.janes@gmail.com
3 attachment(s)

After this commit, I get crash recovery failures when using prepared
transactions.

commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: Tue Apr 4 15:56:56 2017 -0400

Speedup 2PC recovery by skipping two phase state files in normal path

After the induced crash, I get this failure in recovery:

FATAL: could not access status of transaction 334419347
DETAIL: Could not open file "pg_xact/013E": No such file or directory.
LOG: startup process (PID 60106) exited with exit code 1
LOG: aborting startup due to startup process failure
LOG: database system is shut down

The earliest file which exists in pg_xact is 0176

Other examples:

FATAL: could not access status of transaction 121729737
DETAIL: Could not open file "pg_xact/0074": No such file or directory.
LOG: startup process (PID 23720) exited with exit code 1

FATAL: could not access status of transaction 181325554
DETAIL: Could not open file "pg_xact/00AC": No such file or directory.
LOG: startup process (PID 8375) exited with exit code 1

I experience this in about 1 out of 15 crash-recovery cycles on 8 CPUs.

The patch Pavan posted here did not make any difference:

/messages/by-id/CABOikdMdhS4nYX7xHaF+m=P=q_zAJBCYsZ++VN26AZzDRf_xFA@mail.gmail.com

I've attached the test harness, which I think will look familiar to y'all.
It is the usual injection of torn-page-write crashes with consistency
checks after recovery (which makes no difference, as the issue is that
recovery does not happen), modified to include a very crude transaction
manager to make use of 2PC.

Cheers,

Jeff

Attachments:

count.pl (application/octet-stream)
do.sh (application/x-sh)
crash_REL10.patch (application/octet-stream)
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
new file mode 100644
index 5efbfbd..f7f8a6d
*** a/src/backend/access/transam/varsup.c
--- b/src/backend/access/transam/varsup.c
***************
*** 33,38 ****
--- 33,41 ----
  /* pointer to "variable cache" in shared memory (set up by shmem.c) */
  VariableCache ShmemVariableCache = NULL;
  
+ int JJ_xid=0;
+ extern int	JJ_vac;
+ 
  
  /*
   * Allocate the next XID for a new transaction or subtransaction.
*************** GetNewTransactionId(bool isSubXact)
*** 168,173 ****
--- 171,181 ----
  	 *
  	 * Extend pg_subtrans and pg_commit_ts too.
  	 */
+ 	{
+ 	int		incr;
+ 	for (incr=0; incr <=JJ_xid; incr++)
+ 	{
+ 	xid = ShmemVariableCache->nextXid;
  	ExtendCLOG(xid);
  	ExtendCommitTs(xid);
  	ExtendSUBTRANS(xid);
*************** GetNewTransactionId(bool isSubXact)
*** 179,184 ****
--- 187,194 ----
  	 * more XIDs until there is CLOG space for them.
  	 */
  	TransactionIdAdvance(ShmemVariableCache->nextXid);
+ 	}
+ 	}
  
  	/*
  	 * We must store the new XID into the shared ProcArray before releasing
*************** SetTransactionIdLimit(TransactionId olde
*** 363,370 ****
  	LWLockRelease(XidGenLock);
  
  	/* Log the info */
! 	ereport(DEBUG1,
! 			(errmsg("transaction ID wrap limit is %u, limited by database with OID %u",
  					xidWrapLimit, oldest_datoid)));
  
  	/*
--- 373,380 ----
  	LWLockRelease(XidGenLock);
  
  	/* Log the info */
! 	ereport(LOG,
! 			(errmsg("JJ transaction ID wrap limit is %u, limited by database with OID %u",
  					xidWrapLimit, oldest_datoid)));
  
  	/*
*************** ForceTransactionIdLimitUpdate(void)
*** 441,446 ****
--- 451,467 ----
  	oldestXidDB = ShmemVariableCache->oldestXidDB;
  	LWLockRelease(XidGenLock);
  
+ 	if (JJ_vac) {
+ 	elog(LOG,"JJ ForceTransactionIdLimitUpdate in %d: !normal %d, !valid %d, follows %d (%u, %u), !exists %d", MyDatabaseId,
+ 		!TransactionIdIsNormal(oldestXid),
+ 		!TransactionIdIsValid(xidVacLimit),
+ 		TransactionIdFollowsOrEquals(nextXid, xidVacLimit),
+ 		nextXid, xidVacLimit,
+  		!SearchSysCacheExists1(DATABASEOID, ObjectIdGetDatum(oldestXidDB))
+ 	);
+ 	};
+ 
+ 
  	if (!TransactionIdIsNormal(oldestXid))
  		return true;			/* shouldn't happen, but just in case */
  	if (!TransactionIdIsValid(xidVacLimit))
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
new file mode 100644
index b99ded5..f3ad6e2
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** BootStrapXLOG(void)
*** 5017,5022 ****
--- 5017,5023 ----
  	ShmemVariableCache->oidCount = 0;
  	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
  	AdvanceOldestClogXid(checkPoint.oldestXid);
+ 	//elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
  	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
  	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
*************** StartupXLOG(void)
*** 6624,6629 ****
--- 6625,6631 ----
  	ShmemVariableCache->oidCount = 0;
  	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
  	AdvanceOldestClogXid(checkPoint.oldestXid);
+ 	//elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
  	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
  	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
*************** xlog_redo(XLogReaderState *record)
*** 9623,9628 ****
--- 9625,9632 ----
  
  		MultiXactAdvanceOldest(checkPoint.oldestMulti,
  							   checkPoint.oldestMultiDB);
+ 		//elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
+ 
  		/*
  		 * No need to set oldestClogXid here as well; it'll be set when we
  		 * redo an xl_clog_truncate if it changed since initialization.
*************** xlog_redo(XLogReaderState *record)
*** 9724,9729 ****
--- 9728,9734 ----
  		 */
  		MultiXactAdvanceOldest(checkPoint.oldestMulti,
  							   checkPoint.oldestMultiDB);
+ 		//elog(LOG,"JJ maybe SetTransactionIDLimit %d", checkPoint.oldestXid);
  		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
  								  checkPoint.oldestXid))
  			SetTransactionIdLimit(checkPoint.oldestXid,
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
new file mode 100644
index 9fbb0eb..91ec92c
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
*************** int			vacuum_freeze_min_age;
*** 59,64 ****
--- 59,66 ----
  int			vacuum_freeze_table_age;
  int			vacuum_multixact_freeze_min_age;
  int			vacuum_multixact_freeze_table_age;
+ int			JJ_vac=0;
+ 
  
  
  /* A few variables that don't seem worth passing around as parameters */
*************** vacuum_set_xid_limits(Relation rel,
*** 568,573 ****
--- 570,576 ----
  	}
  
  	*freezeLimit = limit;
+ 	if (JJ_vac) elog(LOG,"JJ freezeLimit %d", *freezeLimit);
  
  	/*
  	 * Compute the multixact age for which freezing is urgent.  This is
*************** vacuum_set_xid_limits(Relation rel,
*** 622,627 ****
--- 625,632 ----
  		 * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
  		 * before anti-wraparound autovacuum is launched.
  		 */
+ 		if (JJ_vac) elog(LOG,"JJ freeze_min_age %d vacuum_freeze_table_age %d freeze_table_age %d ReadNew %d", freeze_min_age, 
+                            vacuum_freeze_table_age, freeze_table_age,ReadNewTransactionId());
  		freezetable = freeze_table_age;
  		if (freezetable < 0)
  			freezetable = vacuum_freeze_table_age;
*************** vac_update_datfrozenxid(void)
*** 1059,1064 ****
--- 1064,1070 ----
  	 * truncate pg_xact and/or pg_multixact.  Also do it if the shared
  	 * XID-wrap-limit info is stale, since this action will update that too.
  	 */
+ 	if (JJ_vac && dirty) elog(LOG,"JJ updating in %d without call to ForceTransactionIdLimitUpdate", MyDatabaseId);
  	if (dirty || ForceTransactionIdLimitUpdate())
  		vac_truncate_clog(newFrozenXid, newMinMulti,
  						  lastSaneFrozenXid, lastSaneMinMulti);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
new file mode 100644
index b74e493..041cffd
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 64,69 ****
--- 64,70 ----
  #include "utils/tqual.h"
  
  
+ extern int JJ_vac;
  /*
   * Space/time tradeoff parameters: do these need to be user-tunable?
   *
*************** lazy_vacuum_rel(Relation onerel, int opt
*** 243,248 ****
--- 244,251 ----
  	if (options & VACOPT_DISABLE_PAGE_SKIPPING)
  		aggressive = true;
  
+ 	if (JJ_vac) elog(LOG,"JJ aggresive %d, relfrozenid %d", aggressive, onerel->rd_rel->relfrozenxid);
+ 
  	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
  
  	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
new file mode 100644
index 33ca749..2dd9df2
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
*************** int			autovacuum_vac_cost_delay;
*** 122,127 ****
--- 122,128 ----
  int			autovacuum_vac_cost_limit;
  
  int			Log_autovacuum_min_duration = -1;
+ extern int  JJ_vac;
  
  /* how long to keep pgstat data in the launcher, in milliseconds */
  #define STATS_READ_DELAY 1000
*************** AutoVacWorkerMain(int argc, char *argv[]
*** 1636,1643 ****
  		InitPostgres(NULL, dbid, NULL, InvalidOid, dbname);
  		SetProcessingMode(NormalProcessing);
  		set_ps_display(dbname, false);
- 		ereport(DEBUG1,
- 				(errmsg("autovacuum: processing database \"%s\"", dbname)));
  
  		if (PostAuthDelay)
  			pg_usleep(PostAuthDelay * 1000000L);
--- 1637,1642 ----
*************** AutoVacWorkerMain(int argc, char *argv[]
*** 1645,1651 ****
--- 1644,1654 ----
  		/* And do an appropriate amount of work */
  		recentXid = ReadNewTransactionId();
  		recentMulti = ReadNextMultiXactId();
+ 		if (JJ_vac) ereport(LOG,
+ 				(errmsg("autovacuum: processing database \"%s\" at recent Xid of %u recent mxid of %u", dbname,recentXid,recentMulti)));
  		do_autovacuum();
+ 		if (JJ_vac) ereport(LOG,
+ 				(errmsg("autovacuum: done processing database \"%s\" at recent Xid of %u recent mxid of %u", dbname,ReadNewTransactionId(),ReadNextMultiXactId())));
  	}
  
  	/*
*************** relation_needs_vacanalyze(Oid relid,
*** 2856,2868 ****
  		 * reset, because if that happens, the last vacuum and analyze counts
  		 * will be reset too.
  		 */
- 		elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f)",
- 			 NameStr(classForm->relname),
- 			 vactuples, vacthresh, anltuples, anlthresh);
  
  		/* Determine if this table needs vacuum or analyze. */
  		*dovacuum = force_vacuum || (vactuples > vacthresh);
  		*doanalyze = (anltuples > anlthresh);
  	}
  	else
  	{
--- 2859,2872 ----
  		 * reset, because if that happens, the last vacuum and analyze counts
  		 * will be reset too.
  		 */
  
  		/* Determine if this table needs vacuum or analyze. */
  		*dovacuum = force_vacuum || (vactuples > vacthresh);
  		*doanalyze = (anltuples > anlthresh);
+ 
+ 		if (JJ_vac) elog(LOG, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f) wraparound %d dovaccum %d doanalyze %d",
+ 			 NameStr(classForm->relname),
+ 			 vactuples, vacthresh, anltuples, anlthresh, *wraparound, *dovacuum, *doanalyze);
  	}
  	else
  	{
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
new file mode 100644
index b0b596d..b903ddb
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 68,73 ****
--- 68,75 ----
  #define FILE_POSSIBLY_DELETED(err)	((err) == ENOENT || (err) == EACCES)
  #endif
  
+ int JJ_torn_page=0;
+ 
  /*
   *	The magnetic disk storage manager keeps track of open file
   *	descriptors in its own descriptor pool.  This is done to make it
*************** mdwrite(SMgrRelation reln, ForkNumber fo
*** 805,810 ****
--- 807,813 ----
  	off_t		seekpos;
  	int			nbytes;
  	MdfdVec    *v;
+         static int counter=0;
  
  	/* This assert is too expensive to have on normally ... */
  #ifdef CHECK_WRITE_VS_EXTEND
*************** mdwrite(SMgrRelation reln, ForkNumber fo
*** 830,836 ****
--- 833,850 ----
  				 errmsg("could not seek to block %u in file \"%s\": %m",
  						blocknum, FilePathName(v->mdfd_vfd))));
  
+         if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
+ 		nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3, WAIT_EVENT_DATA_FILE_WRITE);
+ 		ereport(FATAL,
+ 				(errcode(ERRCODE_DISK_FULL),
+ 				 errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
+ 						blocknum,
+ 						relpath(reln->smgr_rnode, forknum),
+ 						nbytes, BLCKSZ),
+ 				 errhint("JJ is screwing with the database.")));
+         } else {
  	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_WRITE);
+ 	}
  
  	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
  										reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 291bf76..f81e7dd
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 111,116 ****
--- 111,119 ----
  /* XXX these should appear in other modules' header files */
  extern bool Log_disconnections;
  extern int	CommitDelay;
+ int	JJ_torn_page;
+ extern int	JJ_xid;
+ extern int	JJ_vac;
  extern int	CommitSiblings;
  extern char *default_tablespace;
  extern char *temp_tablespaces;
*************** static struct config_int ConfigureNamesI
*** 2368,2373 ****
--- 2371,2403 ----
  	},
  
  	{
+ 		{"JJ_torn_page", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("Simulate a torn-page crash after this number of page writes (0 to turn off)"),
+ 			NULL
+ 		},
+ 		&JJ_torn_page,
+ 		0, 0, 100000, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"JJ_xid", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("Skip this many xid every time we acquire one"),
+ 			NULL
+ 		},
+ 		&JJ_xid,
+ 		0, 0, 1000000, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"JJ_vac", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("turn on verbose logging"),
+ 			NULL
+ 		},
+ 		&JJ_vac,
+ 		0, 0, 1000000, NULL, NULL
+ 	},
+ 
+ 	{
  		{"commit_siblings", PGC_USERSET, WAL_SETTINGS,
  			gettext_noop("Sets the minimum concurrent open transactions before performing "
  						 "commit_delay."),
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
new file mode 100644
index f3b3529..ef4c90d
*** a/src/include/pg_config_manual.h
--- b/src/include/pg_config_manual.h
***************
*** 300,306 ****
  /*
   * Enable debugging print statements for lock-related operations.
   */
! /* #define LOCK_DEBUG */
  
  /*
   * Enable debugging print statements for WAL-related operations; see
--- 300,306 ----
  /*
   * Enable debugging print statements for lock-related operations.
   */
! #define LOCK_DEBUG 1
  
  /*
   * Enable debugging print statements for WAL-related operations; see
#2 Simon Riggs
simon@2ndquadrant.com
In reply to: Jeff Janes (#1)
Re: Failed recovery with new faster 2PC code

On 15 April 2017 at 23:37, Jeff Janes <jeff.janes@gmail.com> wrote:

After this commit, I get crash recovery failures when using prepared
transactions.

commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: Tue Apr 4 15:56:56 2017 -0400

Speedup 2PC recovery by skipping two phase state files in normal path

Thanks Jeff for your tests.

So that's now two crash bugs in as many days and lack of clarity about
how to fix it.

Stas, I thought this patch was very important to you, yet two releases
in a row we are too-late-and-buggy.

If anybody has a reason why I shouldn't revert this, please say so now
fairly soon.

Any further attempts to fix must run Jeff's tests.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Simon Riggs (#2)
Re: Failed recovery with new faster 2PC code

On 17 Apr 2017, at 12:14, Simon Riggs <simon@2ndquadrant.com> wrote:

On 15 April 2017 at 23:37, Jeff Janes <jeff.janes@gmail.com> wrote:

After this commit, I get crash recovery failures when using prepared
transactions.

commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: Tue Apr 4 15:56:56 2017 -0400

Speedup 2PC recovery by skipping two phase state files in normal path

Thanks Jeff for your tests.

So that's now two crash bugs in as many days and lack of clarity about
how to fix it.

Stas, I thought this patch was very important to you, yet two releases
in a row we are too-late-and-buggy.

I’m looking at the pgstat issue in a nearby thread right now and will switch
to this shortly.

If that’s possible, I’m asking to delay the revert for several days.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#4 Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Stas Kelvich (#3)
Re: Failed recovery with new faster 2PC code

commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: Tue Apr 4 15:56:56 2017 -0400

Speedup 2PC recovery by skipping two phase state files in normal path

Thanks Jeff for your tests.

So that's now two crash bugs in as many days and lack of clarity about
how to fix it.

I am trying to reproduce and debug it from my end as well.

Regards,
Nikhils

#5 Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Nikhil Sontakke (#4)
Re: Failed recovery with new faster 2PC code

On 17 April 2017 at 15:02, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:

commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: Tue Apr 4 15:56:56 2017 -0400

Speedup 2PC recovery by skipping two phase state files in normal

path

Thanks Jeff for your tests.

So that's now two crash bugs in as many days and lack of clarity about
how to fix it.

The issue seems to be that a prepared transaction is yet to be committed,
but autovacuum comes in and causes the clog to be truncated beyond this
prepared transaction ID in one of the runs.

We only add the corresponding pgproc entry for a surviving 2PC transaction
on completion of recovery, so there could be a race condition here. Digging
in further.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

#6 Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Nikhil Sontakke (#5)
2 attachment(s)
Re: Failed recovery with new faster 2PC code

Hi,

There was a bug in the redo 2PC remove code path. Because of it, autovac
would think that the 2PC transaction was gone and cause removal of the
corresponding clog entry earlier than needed.

Please find attached, the bug fix: 2pc_redo_remove_bug.patch.

I have been testing this on top of Michael's 2pc-restore-fix.patch and
things have been OK for the past hour or more. Will keep it running longer.

Jeff, thanks for these very useful scripts. I am going to make a habit of
running these scripts on my side from now on. Do you have any other script
that I could try against these patches? Please let me know.

Regards,
Nikhils

On 18 April 2017 at 12:09, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:

On 17 April 2017 at 15:02, Nikhil Sontakke <nikhils@2ndquadrant.com>
wrote:

commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: Tue Apr 4 15:56:56 2017 -0400

Speedup 2PC recovery by skipping two phase state files in normal

path

Thanks Jeff for your tests.

So that's now two crash bugs in as many days and lack of clarity about
how to fix it.

The issue seems to be that a prepared transaction is yet to be committed.

But autovacuum comes in and causes the clog to be truncated beyond this
prepared transaction ID in one of the runs.

We only add the corresponding pgproc entry for a surviving 2PC transaction
on completion of recovery. So could be a race condition here. Digging in
further.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

Attachments:

2pc_redo_remove_bug.patch (application/octet-stream)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7982e16abc..0eba9b67b9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2393,6 +2393,8 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 			Assert(gxact->inredo);
 			break;
 		}
+		else
+			gxact = NULL;
 	}
 	LWLockRelease(TwoPhaseStateLock);
 
2pc-restore-fix.patch (application/octet-stream)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7982e16abc..723552bfb2 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -222,7 +222,7 @@ static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
 static char *ProcessTwoPhaseBuffer(TransactionId xid,
 							XLogRecPtr	prepare_start_lsn,
 							bool fromdisk, bool overwriteOK, bool setParent,
-							TransactionId *result, TransactionId *maxsubxid);
+							bool setNextXid);
 static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
 				const char *gid, TimestampTz prepared_at, Oid owner,
 				Oid databaseid);
@@ -1744,7 +1744,7 @@ restoreTwoPhaseData(void)
 
 			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
 										true, false, false,
-										NULL, NULL);
+										false);
 			if (buf == NULL)
 				continue;
 
@@ -1786,7 +1786,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 {
 	TransactionId origNextXid = ShmemVariableCache->nextXid;
 	TransactionId result = origNextXid;
-	TransactionId maxsubxid = origNextXid;
 	TransactionId *xids = NULL;
 	int			nxids = 0;
 	int			allocsize = 0;
@@ -1806,11 +1805,18 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		buf = ProcessTwoPhaseBuffer(xid,
 				gxact->prepare_start_lsn,
 				gxact->ondisk, false, false,
-				&result, &maxsubxid);
+				true);
 
 		if (buf == NULL)
 			continue;
 
+		/*
+		 * OK, we think this file is valid.  Incorporate xid into the
+		 * running-minimum result.
+		 */
+		if (TransactionIdPrecedes(xid, result))
+			result = xid;
+
 		if (xids_p)
 		{
 			if (nxids == allocsize)
@@ -1839,15 +1845,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		*nxids_p = nxids;
 	}
 
-	/* update nextXid if needed */
-	if (TransactionIdFollowsOrEquals(maxsubxid, ShmemVariableCache->nextXid))
-	{
-		LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
-		ShmemVariableCache->nextXid = maxsubxid;
-		TransactionIdAdvance(ShmemVariableCache->nextXid);
-		LWLockRelease(XidGenLock);
-	}
-
 	return result;
 }
 
@@ -1884,7 +1881,7 @@ StandbyRecoverPreparedTransactions(bool overwriteOK)
 		buf = ProcessTwoPhaseBuffer(xid,
 				gxact->prepare_start_lsn,
 				gxact->ondisk, overwriteOK, true,
-				NULL, NULL);
+				false);
 		if (buf != NULL)
 			pfree(buf);
 	}
@@ -1924,7 +1921,7 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 				gxact->prepare_start_lsn,
 				gxact->ondisk, false, false,
-				NULL, NULL);
+				false);
 		if (buf == NULL)
 			continue;
 
@@ -2012,20 +2009,16 @@ RecoverPreparedTransactions(void)
  * If setParent is true, then use the overwriteOK parameter to set up
  * subtransaction parent linkages.
  *
- * If result and maxsubxid are not NULL, fill them up with smallest
- * running transaction id (lesser than ShmemVariableCache->nextXid)
- * and largest subtransaction id for this transaction respectively.
+ * If setNextXid is true, set ShmemVariableCache->nextXid to the newest
+ * value scanned.
  */
 static char *
 ProcessTwoPhaseBuffer(TransactionId xid,
 					  XLogRecPtr prepare_start_lsn,
 					  bool fromdisk, bool overwriteOK,
-					  bool setParent, TransactionId *result,
-					  TransactionId *maxsubxid)
+					  bool setParent, bool setNextXid)
 {
 	TransactionId origNextXid = ShmemVariableCache->nextXid;
-	TransactionId res = InvalidTransactionId;
-	TransactionId maxsub = InvalidTransactionId;
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2034,11 +2027,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	if (result)
-		res = *result;
-	if (maxsubxid)
-		maxsub = *maxsubxid;
-
 	/* Already processed? */
 	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
 	{
@@ -2121,13 +2109,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	}
 
 	/*
-	 * OK, we think this buffer is valid.  Incorporate xid into the
-	 * running-minimum result.
-	 */
-	if (TransactionIdPrecedes(xid, res))
-		res = xid;
-
-	/*
 	 * Examine subtransaction XIDs ... they should all follow main
 	 * XID, and they may force us to advance nextXid.
 	 */
@@ -2139,17 +2120,27 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		TransactionId subxid = subxids[i];
 
 		Assert(TransactionIdFollows(subxid, xid));
-		if (TransactionIdFollowsOrEquals(subxid, maxsub))
-			maxsub = subxid;
+
+		/* update nextXid if needed */
+		if (setNextXid &&
+			TransactionIdFollowsOrEquals(subxid,
+										 ShmemVariableCache->nextXid))
+		{
+			/*
+			 * We don't expect anyone else to modify nextXid, hence we don't
+			 * need to hold a lock while examining it.  We still acquire the
+			 * lock to modify it, though.
+			 */
+			LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
+			ShmemVariableCache->nextXid = subxid;
+			TransactionIdAdvance(ShmemVariableCache->nextXid);
+			LWLockRelease(XidGenLock);
+		}
+
 		if (setParent)
 			SubTransSetParent(xid, subxid, overwriteOK);
 	}
 
-	if (result)
-		*result = res;
-	if (maxsubxid)
-		*maxsubxid = maxsub;
-
 	return buf;
 }
 
#7 Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Nikhil Sontakke (#6)
1 attachment(s)
Re: Failed recovery with new faster 2PC code

Please find attached a second version of my bug fix which is stylistically
better and clearer than the first one.

Regards,
Nikhils

On 18 April 2017 at 13:47, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:

Hi,

There was a bug in the redo 2PC remove code path. Because of which,
autovac would think that the 2PC is gone and cause removal of the
corresponding clog entry earlier than needed.

Please find attached, the bug fix: 2pc_redo_remove_bug.patch.

I have been testing this on top of Michael's 2pc-restore-fix.patch and
things seem to be ok for the past one+ hour. Will keep it running for long.

Jeff, thanks for these very useful scripts. I am going to make a habit to
run these scripts on my side from now on. Do you have any other script that
I could try against these patches? Please let me know.

Regards,
Nikhils

On 18 April 2017 at 12:09, Nikhil Sontakke <nikhils@2ndquadrant.com>
wrote:

On 17 April 2017 at 15:02, Nikhil Sontakke <nikhils@2ndquadrant.com>
wrote:

commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: Tue Apr 4 15:56:56 2017 -0400

Speedup 2PC recovery by skipping two phase state files in normal

path

Thanks Jeff for your tests.

So that's now two crash bugs in as many days and lack of clarity about
how to fix it.

The issue seems to be that a prepared transaction is yet to be

committed. But autovacuum comes in and causes the clog to be truncated
beyond this prepared transaction ID in one of the runs.

We only add the corresponding pgproc entry for a surviving 2PC
transaction on completion of recovery. So could be a race condition here.
Digging in further.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

Attachments:

2pc_redo_remove_bug_v2.0.patch (application/octet-stream)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7982e16abc..f1ba5ac1af 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2380,6 +2380,7 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 {
 	GlobalTransaction gxact = NULL;
 	int			i;
+	bool		found = false;
 
 	Assert(RecoveryInProgress());
 
@@ -2391,6 +2392,7 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		if (gxact->xid == xid)
 		{
 			Assert(gxact->inredo);
+			found = true;
 			break;
 		}
 	}
@@ -2399,7 +2401,7 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	/*
 	 * Just leave if there is nothing, this is expected during WAL replay.
 	 */
-	if (gxact == NULL)
+	if (!found)
 		return;
 
 	/*
#8 Simon Riggs
simon@2ndquadrant.com
In reply to: Nikhil Sontakke (#7)
Re: Failed recovery with new faster 2PC code

On 18 April 2017 at 09:57, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:

Please find attached a second version of my bug fix which is stylistically
better and clearer than the first one.

Yeh, this is better. Pushed.

The bug was that the search loop left gxact pointing at the last entry in
the array, so the "nothing found" check failed and we then removed the last
gxact from memory even when it didn't match the xid, removing a valid
entry too early. That then allowed xmin to move forward, which let
autovacuum remove pg_xact entries earlier than needed.

Well done for finding that one, thanks for the patch.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#9 Michael Paquier
michael.paquier@gmail.com
In reply to: Simon Riggs (#8)
Re: Failed recovery with new faster 2PC code

On Tue, Apr 18, 2017 at 7:54 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Yeh, this is better. Pushed.

I have been outraced on this one; the error is obvious once you see it ;)

Thanks for the investigation and the fix! I have spent a couple of
hours reviewing the interactions between the shmem entries of 2PC
state data created at the beginning of recovery and all the lookups in
procarray.c and varsup.c, and noticed nothing else amiss, by the way.

The bug was that the loop set gxact to be the last entry in the array,
causing the exit condition to fail and us then to remove the last
gxact from memory even when it didn't match the xid, removing a valid
entry too early. That then allowed xmin to move forwards, which causes
autovac to remove pg_xact entries earlier than needed.

Well done for finding that one, thanks for the patch.

Running Jeff's test suite, I can confirm that there are no problems now.
--
Michael


#10 Simon Riggs
simon@2ndquadrant.com
In reply to: Michael Paquier (#9)
Re: Failed recovery with new faster 2PC code

On 18 April 2017 at 13:12, Michael Paquier <michael.paquier@gmail.com> wrote:

On Tue, Apr 18, 2017 at 7:54 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Yeh, this is better. Pushed.

I have been outraced on this one, the error is obvious once you see it ;)

Didn't realise you were working on it; nothing competitive about it.

It's clear this needed fixing, whether or not it fixes Jeff's report.

I do think it explains the report, so I'm hopeful.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#11 Jeff Janes
jeff.janes@gmail.com
In reply to: Simon Riggs (#10)
Re: Failed recovery with new faster 2PC code

On Tue, Apr 18, 2017 at 5:38 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 18 April 2017 at 13:12, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Tue, Apr 18, 2017 at 7:54 PM, Simon Riggs <simon@2ndquadrant.com>

wrote:

Yeh, this is better. Pushed.

I have been outraced on this one, the error is obvious once you see it ;)

Didn't realise you were working on it, nothing competitive about it.

It's clear this needed fixing, whether or not it fixes Jeff's report.

I do think it explains the report, so I'm hopeful.

The git HEAD code (c727f12) has been surviving so far, with both test cases.

Thanks,

Jeff

#12 Jeff Janes
jeff.janes@gmail.com
In reply to: Nikhil Sontakke (#6)
Re: Failed recovery with new faster 2PC code

On Tue, Apr 18, 2017 at 1:17 AM, Nikhil Sontakke <nikhils@2ndquadrant.com>
wrote:

Hi,

There was a bug in the redo 2PC remove code path. Because of which,
autovac would think that the 2PC is gone and cause removal of the
corresponding clog entry earlier than needed.

Please find attached, the bug fix: 2pc_redo_remove_bug.patch.

I have been testing this on top of Michael's 2pc-restore-fix.patch and
things seem to be ok for the past one+ hour. Will keep it running for long.

Jeff, thanks for these very useful scripts. I am going to make a habit to
run these scripts on my side from now on. Do you have any other script that
I could try against these patches? Please let me know.

This script is the only one I have that specifically targets 2PC. I wrote
it last year when the previous round of speed-up code (which avoided
writing the files upon "PREPARE" by delaying them until the next
checkpoint) was developed. I just decided to dust that test off to try
again here. I don't know how to change it to make it more targeted towards
this set of patches. Would this bug have been seen in a replica server in
the absence of crashes, or was it only vulnerable during crash recovery
rather than streaming replication?

Cheers,

Jeff

#13 Michael Paquier
michael.paquier@gmail.com
In reply to: Jeff Janes (#12)
Re: Failed recovery with new faster 2PC code

On Wed, Apr 19, 2017 at 11:09 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

Would this bug have been seen in a replica server in the absence of crashes,
or was it only vulnerable during crash recovery rather than streaming
replication?

This issue could have been seen on a streaming standby as well:
leaving a stale TwoPhaseState entry around also affects its linked
PGPROC, so CLOG truncation would have been messed up there too. That's
also the case for the first issue, the one involving incorrect XID
updates, though the chances of seeing it are, I think, lower, as only
the beginning of recovery was affected.
--
Michael
