PANIC in pg_commit_ts slru after crashes

Started by Jeff Janesover 8 years ago10 messages
#1Jeff Janes
jeff.janes@gmail.com
4 attachment(s)

In the first statement executed after crash recovery, I sometimes get this
error:

PANIC: XX000: could not access status of transaction 207580505
DETAIL: Could not read from file "pg_commit_ts/1EF0" at offset 131072:
Success.
LOCATION: SlruReportIOError, slru.c:918
STATEMENT: create temporary table aldjf (x serial)

Other examples:

PANIC: XX000: could not access status of transaction 3483853232
DETAIL: Could not read from file "pg_commit_ts/20742" at offset 237568:
Success.
LOCATION: SlruReportIOError, slru.c:918
STATEMENT: create temporary table aldjf (x serial)

PANIC: XX000: could not access status of transaction 802552883
DETAIL: Could not read from file "pg_commit_ts/779E" at offset 114688:
Success.
LOCATION: SlruReportIOError, slru.c:918
STATEMENT: create temporary table aldjf (x serial)

Based on the errno, I'm assuming the read was successful but returned the
wrong number of bytes (which was zero in the case I saw after changing the
code to log short reads).

It then goes through recovery again and the problem does not immediately
re-occur if you attempt to connect again. I don't know why the file size
would have changed between attempts.

The problem bisects to the commit:

commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: Tue Apr 4 15:56:56 2017 -0400

Speedup 2PC recovery by skipping two phase state files in normal path

It is not obvious to me how that is relevant. My test doesn't use prepared
transactions (and leaves the guc at zero), and this commit doesn't touch
the slru.c.

I'm attaching the test harness. There is a patch which injects the
crash-faults and also allows xid fast-forward, a perl script that runs
until crash and assesses the consistency of the post-crash results, and a
shell script which sets up the database and then calls the perl script in a
loop. On 8 CPU machine, it takes about an hour for the PANIC to occur.

The attached script bails out once it sees the PANIC (count.pl line 158) if
it didn't do that then it will proceed to connect again and retry, and
works fine. No apparent permanent data corruption.

Any clues on how to investigate this further?

Cheers,

Jeff

Attachments:

count.plapplication/octet-stream; name=count.plDownload
crash_REL10.patchapplication/octet-stream; name=crash_REL10.patchDownload
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
new file mode 100644
index 5efbfbd..f7f8a6d
*** a/src/backend/access/transam/varsup.c
--- b/src/backend/access/transam/varsup.c
***************
*** 33,38 ****
--- 33,41 ----
  /* pointer to "variable cache" in shared memory (set up by shmem.c) */
  VariableCache ShmemVariableCache = NULL;
  
+ int JJ_xid=0;
+ extern int	JJ_vac;
+ 
  
  /*
   * Allocate the next XID for a new transaction or subtransaction.
*************** GetNewTransactionId(bool isSubXact)
*** 168,173 ****
--- 171,181 ----
  	 *
  	 * Extend pg_subtrans and pg_commit_ts too.
  	 */
+ 	{
+ 	int		incr;
+ 	for (incr=0; incr <=JJ_xid; incr++)
+ 	{
+ 	xid = ShmemVariableCache->nextXid;
  	ExtendCLOG(xid);
  	ExtendCommitTs(xid);
  	ExtendSUBTRANS(xid);
*************** GetNewTransactionId(bool isSubXact)
*** 179,184 ****
--- 187,194 ----
  	 * more XIDs until there is CLOG space for them.
  	 */
  	TransactionIdAdvance(ShmemVariableCache->nextXid);
+ 	}
+ 	}
  
  	/*
  	 * We must store the new XID into the shared ProcArray before releasing
*************** SetTransactionIdLimit(TransactionId olde
*** 363,370 ****
  	LWLockRelease(XidGenLock);
  
  	/* Log the info */
! 	ereport(DEBUG1,
! 			(errmsg("transaction ID wrap limit is %u, limited by database with OID %u",
  					xidWrapLimit, oldest_datoid)));
  
  	/*
--- 373,380 ----
  	LWLockRelease(XidGenLock);
  
  	/* Log the info */
! 	ereport(LOG,
! 			(errmsg("JJ transaction ID wrap limit is %u, limited by database with OID %u",
  					xidWrapLimit, oldest_datoid)));
  
  	/*
*************** ForceTransactionIdLimitUpdate(void)
*** 441,446 ****
--- 451,467 ----
  	oldestXidDB = ShmemVariableCache->oldestXidDB;
  	LWLockRelease(XidGenLock);
  
+ 	if (JJ_vac) {
+ 	elog(LOG,"JJ ForceTransactionIdLimitUpdate in %d: !normal %d, !valid %d, follows %d (%u, %u), !exists %d", MyDatabaseId,
+ 		!TransactionIdIsNormal(oldestXid),
+ 		!TransactionIdIsValid(xidVacLimit),
+ 		TransactionIdFollowsOrEquals(nextXid, xidVacLimit),
+ 		nextXid, xidVacLimit,
+  		!SearchSysCacheExists1(DATABASEOID, ObjectIdGetDatum(oldestXidDB))
+ 	);
+ 	};
+ 
+ 
  	if (!TransactionIdIsNormal(oldestXid))
  		return true;			/* shouldn't happen, but just in case */
  	if (!TransactionIdIsValid(xidVacLimit))
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
new file mode 100644
index b99ded5..f3ad6e2
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** BootStrapXLOG(void)
*** 5017,5022 ****
--- 5017,5023 ----
  	ShmemVariableCache->oidCount = 0;
  	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
  	AdvanceOldestClogXid(checkPoint.oldestXid);
+ 	//elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
  	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
  	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
*************** StartupXLOG(void)
*** 6624,6629 ****
--- 6625,6631 ----
  	ShmemVariableCache->oidCount = 0;
  	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
  	AdvanceOldestClogXid(checkPoint.oldestXid);
+ 	//elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
  	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
  	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
*************** xlog_redo(XLogReaderState *record)
*** 9623,9628 ****
--- 9625,9632 ----
  
  		MultiXactAdvanceOldest(checkPoint.oldestMulti,
  							   checkPoint.oldestMultiDB);
+ 		//elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
+ 
  		/*
  		 * No need to set oldestClogXid here as well; it'll be set when we
  		 * redo an xl_clog_truncate if it changed since initialization.
*************** xlog_redo(XLogReaderState *record)
*** 9724,9729 ****
--- 9728,9734 ----
  		 */
  		MultiXactAdvanceOldest(checkPoint.oldestMulti,
  							   checkPoint.oldestMultiDB);
+ 		//elog(LOG,"JJ maybe SetTransactionIDLimit %d", checkPoint.oldestXid);
  		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
  								  checkPoint.oldestXid))
  			SetTransactionIdLimit(checkPoint.oldestXid,
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
new file mode 100644
index 9fbb0eb..91ec92c
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
*************** int			vacuum_freeze_min_age;
*** 59,64 ****
--- 59,66 ----
  int			vacuum_freeze_table_age;
  int			vacuum_multixact_freeze_min_age;
  int			vacuum_multixact_freeze_table_age;
+ int			JJ_vac=0;
+ 
  
  
  /* A few variables that don't seem worth passing around as parameters */
*************** vacuum_set_xid_limits(Relation rel,
*** 568,573 ****
--- 570,576 ----
  	}
  
  	*freezeLimit = limit;
+ 	if (JJ_vac) elog(LOG,"JJ freezeLimit %d", *freezeLimit);
  
  	/*
  	 * Compute the multixact age for which freezing is urgent.  This is
*************** vacuum_set_xid_limits(Relation rel,
*** 622,627 ****
--- 625,632 ----
  		 * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
  		 * before anti-wraparound autovacuum is launched.
  		 */
+ 		if (JJ_vac) elog(LOG,"JJ freeze_min_age %d vacuum_freeze_table_age %d freeze_table_age %d ReadNew %d", freeze_min_age, 
+                            vacuum_freeze_table_age, freeze_table_age,ReadNewTransactionId());
  		freezetable = freeze_table_age;
  		if (freezetable < 0)
  			freezetable = vacuum_freeze_table_age;
*************** vac_update_datfrozenxid(void)
*** 1059,1064 ****
--- 1064,1070 ----
  	 * truncate pg_xact and/or pg_multixact.  Also do it if the shared
  	 * XID-wrap-limit info is stale, since this action will update that too.
  	 */
+ 	if (JJ_vac && dirty) elog(LOG,"JJ updating in %d without call to ForceTransactionIdLimitUpdate", MyDatabaseId);
  	if (dirty || ForceTransactionIdLimitUpdate())
  		vac_truncate_clog(newFrozenXid, newMinMulti,
  						  lastSaneFrozenXid, lastSaneMinMulti);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
new file mode 100644
index b74e493..041cffd
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 64,69 ****
--- 64,70 ----
  #include "utils/tqual.h"
  
  
+ extern int JJ_vac;
  /*
   * Space/time tradeoff parameters: do these need to be user-tunable?
   *
*************** lazy_vacuum_rel(Relation onerel, int opt
*** 243,248 ****
--- 244,251 ----
  	if (options & VACOPT_DISABLE_PAGE_SKIPPING)
  		aggressive = true;
  
+ 	if (JJ_vac) elog(LOG,"JJ aggresive %d, relfrozenid %d", aggressive, onerel->rd_rel->relfrozenxid);
+ 
  	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
  
  	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
new file mode 100644
index 33ca749..2dd9df2
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
*************** int			autovacuum_vac_cost_delay;
*** 122,127 ****
--- 122,128 ----
  int			autovacuum_vac_cost_limit;
  
  int			Log_autovacuum_min_duration = -1;
+ extern int  JJ_vac;
  
  /* how long to keep pgstat data in the launcher, in milliseconds */
  #define STATS_READ_DELAY 1000
*************** AutoVacWorkerMain(int argc, char *argv[]
*** 1636,1643 ****
  		InitPostgres(NULL, dbid, NULL, InvalidOid, dbname);
  		SetProcessingMode(NormalProcessing);
  		set_ps_display(dbname, false);
- 		ereport(DEBUG1,
- 				(errmsg("autovacuum: processing database \"%s\"", dbname)));
  
  		if (PostAuthDelay)
  			pg_usleep(PostAuthDelay * 1000000L);
--- 1637,1642 ----
*************** AutoVacWorkerMain(int argc, char *argv[]
*** 1645,1651 ****
--- 1644,1654 ----
  		/* And do an appropriate amount of work */
  		recentXid = ReadNewTransactionId();
  		recentMulti = ReadNextMultiXactId();
+ 		if (JJ_vac) ereport(LOG,
+ 				(errmsg("autovacuum: processing database \"%s\" at recent Xid of %u recent mxid of %u", dbname,recentXid,recentMulti)));
  		do_autovacuum();
+ 		if (JJ_vac) ereport(LOG,
+ 				(errmsg("autovacuum: done processing database \"%s\" at recent Xid of %u recent mxid of %u", dbname,ReadNewTransactionId(),ReadNextMultiXactId())));
  	}
  
  	/*
*************** relation_needs_vacanalyze(Oid relid,
*** 2856,2868 ****
  		 * reset, because if that happens, the last vacuum and analyze counts
  		 * will be reset too.
  		 */
- 		elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f)",
- 			 NameStr(classForm->relname),
- 			 vactuples, vacthresh, anltuples, anlthresh);
  
  		/* Determine if this table needs vacuum or analyze. */
  		*dovacuum = force_vacuum || (vactuples > vacthresh);
  		*doanalyze = (anltuples > anlthresh);
  	}
  	else
  	{
--- 2859,2872 ----
  		 * reset, because if that happens, the last vacuum and analyze counts
  		 * will be reset too.
  		 */
  
  		/* Determine if this table needs vacuum or analyze. */
  		*dovacuum = force_vacuum || (vactuples > vacthresh);
  		*doanalyze = (anltuples > anlthresh);
+ 
+ 		if (JJ_vac) elog(LOG, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f) wraparound %d dovaccum %d doanalyze %d",
+ 			 NameStr(classForm->relname),
+ 			 vactuples, vacthresh, anltuples, anlthresh, *wraparound, *dovacuum, *doanalyze);
  	}
  	else
  	{
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
new file mode 100644
index b0b596d..b903ddb
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 68,73 ****
--- 68,75 ----
  #define FILE_POSSIBLY_DELETED(err)	((err) == ENOENT || (err) == EACCES)
  #endif
  
+ int JJ_torn_page=0;
+ 
  /*
   *	The magnetic disk storage manager keeps track of open file
   *	descriptors in its own descriptor pool.  This is done to make it
*************** mdwrite(SMgrRelation reln, ForkNumber fo
*** 805,810 ****
--- 807,813 ----
  	off_t		seekpos;
  	int			nbytes;
  	MdfdVec    *v;
+         static int counter=0;
  
  	/* This assert is too expensive to have on normally ... */
  #ifdef CHECK_WRITE_VS_EXTEND
*************** mdwrite(SMgrRelation reln, ForkNumber fo
*** 830,836 ****
--- 833,850 ----
  				 errmsg("could not seek to block %u in file \"%s\": %m",
  						blocknum, FilePathName(v->mdfd_vfd))));
  
+         if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
+ 		nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3, WAIT_EVENT_DATA_FILE_WRITE);
+ 		ereport(FATAL,
+ 				(errcode(ERRCODE_DISK_FULL),
+ 				 errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
+ 						blocknum,
+ 						relpath(reln->smgr_rnode, forknum),
+ 						nbytes, BLCKSZ),
+ 				 errhint("JJ is screwing with the database.")));
+         } else {
  	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, WAIT_EVENT_DATA_FILE_WRITE);
+ 	}
  
  	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
  										reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 291bf76..f81e7dd
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 111,116 ****
--- 111,119 ----
  /* XXX these should appear in other modules' header files */
  extern bool Log_disconnections;
  extern int	CommitDelay;
+ int	JJ_torn_page;
+ extern int	JJ_xid;
+ extern int	JJ_vac;
  extern int	CommitSiblings;
  extern char *default_tablespace;
  extern char *temp_tablespaces;
*************** static struct config_int ConfigureNamesI
*** 2368,2373 ****
--- 2371,2403 ----
  	},
  
  	{
+ 		{"JJ_torn_page", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("Simulate a torn-page crash after this number of page writes (0 to turn off)"),
+ 			NULL
+ 		},
+ 		&JJ_torn_page,
+ 		0, 0, 100000, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"JJ_xid", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("Skip this many xid every time we acquire one"),
+ 			NULL
+ 		},
+ 		&JJ_xid,
+ 		0, 0, 1000000, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"JJ_vac", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("turn on verbose logging"),
+ 			NULL
+ 		},
+ 		&JJ_vac,
+ 		0, 0, 1000000, NULL, NULL
+ 	},
+ 
+ 	{
  		{"commit_siblings", PGC_USERSET, WAL_SETTINGS,
  			gettext_noop("Sets the minimum concurrent open transactions before performing "
  						 "commit_delay."),
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
new file mode 100644
index f3b3529..ef4c90d
*** a/src/include/pg_config_manual.h
--- b/src/include/pg_config_manual.h
***************
*** 300,306 ****
  /*
   * Enable debugging print statements for lock-related operations.
   */
! /* #define LOCK_DEBUG */
  
  /*
   * Enable debugging print statements for WAL-related operations; see
--- 300,306 ----
  /*
   * Enable debugging print statements for lock-related operations.
   */
! #define LOCK_DEBUG 1
  
  /*
   * Enable debugging print statements for WAL-related operations; see
do.shapplication/x-sh; name=do.shDownload
slru_log_read_size.patchapplication/octet-stream; name=slru_log_read_size.patchDownload
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
new file mode 100644
index 7ae7831..87059f4
*** a/src/backend/access/transam/slru.c
--- b/src/backend/access/transam/slru.c
*************** SlruPhysicalReadPage(SlruCtl ctl, int pa
*** 640,645 ****
--- 640,646 ----
  	int			offset = rpageno * BLCKSZ;
  	char		path[MAXPGPATH];
  	int			fd;
+ 	int			sz;
  
  	SlruFileName(ctl, path, segno);
  
*************** SlruPhysicalReadPage(SlruCtl ctl, int pa
*** 677,688 ****
  
  	errno = 0;
  	pgstat_report_wait_start(WAIT_EVENT_SLRU_READ);
! 	if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
  	{
  		pgstat_report_wait_end();
  		slru_errcause = SLRU_READ_FAILED;
  		slru_errno = errno;
  		CloseTransientFile(fd);
  		return false;
  	}
  	pgstat_report_wait_end();
--- 678,691 ----
  
  	errno = 0;
  	pgstat_report_wait_start(WAIT_EVENT_SLRU_READ);
! 	sz = read(fd, shared->page_buffer[slotno], BLCKSZ);
! 	if (sz != BLCKSZ)
  	{
  		pgstat_report_wait_end();
  		slru_errcause = SLRU_READ_FAILED;
  		slru_errno = errno;
  		CloseTransientFile(fd);
+ 		elog(LOG,"JJ slru read obtained %d bytes with errno %d", sz, slru_errno);
  		return false;
  	}
  	pgstat_report_wait_end();
#2Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Jeff Janes (#1)
1 attachment(s)
Re: PANIC in pg_commit_ts slru after crashes

On Sat, Apr 15, 2017 at 12:53 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

In the first statement executed after crash recovery, I sometimes get this
error:

PANIC: XX000: could not access status of transaction 207580505
DETAIL: Could not read from file "pg_commit_ts/1EF0" at offset 131072:
Success.
LOCATION: SlruReportIOError, slru.c:918
STATEMENT: create temporary table aldjf (x serial)

Other examples:

PANIC: XX000: could not access status of transaction 3483853232
DETAIL: Could not read from file "pg_commit_ts/20742" at offset 237568:
Success.
LOCATION: SlruReportIOError, slru.c:918
STATEMENT: create temporary table aldjf (x serial)

PANIC: XX000: could not access status of transaction 802552883
DETAIL: Could not read from file "pg_commit_ts/779E" at offset 114688:
Success.
LOCATION: SlruReportIOError, slru.c:918
STATEMENT: create temporary table aldjf (x serial)

Based on the errno, I'm assuming the read was successful but returned the
wrong number of bytes (which was zero in the case I saw after changing the
code to log short reads).

It then goes through recovery again and the problem does not immediately
re-occur if you attempt to connect again. I don't know why the file size
would have changed between attempts.

The problem bisects to the commit:

commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: Tue Apr 4 15:56:56 2017 -0400

Speedup 2PC recovery by skipping two phase state files in normal path

It is not obvious to me how that is relevant. My test doesn't use
prepared transactions (and leaves the guc at zero), and this commit doesn't
touch the slru.c.

I'm attaching the test harness. There is a patch which injects the
crash-faults and also allows xid fast-forward, a perl script that runs
until crash and assesses the consistency of the post-crash results, and a
shell script which sets up the database and then calls the perl script in a
loop. On 8 CPU machine, it takes about an hour for the PANIC to occur.

The attached script bails out once it sees the PANIC (count.pl line 158)
if it didn't do that then it will proceed to connect again and retry, and
works fine. No apparent permanent data corruption.

Any clues on how to investigate this further?

Since all those offsets fall on a page boundary, my guess is that we're
somehow failing to handle a new page correctly.

Looking at the patch itself, my feeling is that the following code
in src/backend/access/transam/twophase.c might be causing the problem.

1841
1842 /* update nextXid if needed */
1843 if (TransactionIdFollowsOrEquals(maxsubxid,
ShmemVariableCache->nextXid))
1844 {
1845 LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
1846 ShmemVariableCache->nextXid = maxsubxid;
1847 TransactionIdAdvance(ShmemVariableCache->nextXid);
1848 LWLockRelease(XidGenLock);
1849 }

The function PrescanPreparedTransactions() gets called at the start of the
redo recovery and this specific block will get exercised irrespective of
whether there are any prepared transactions or not. What I find
particularly wrong here is that we are initialising maxsubxid to current
value of ShmemVariableCache->nextXid when the function enters, but this
block would then again increment ShmemVariableCache->nextXid, when there
are no prepared transactions in the system.

I wonder if we should do as in attached patch.

It's not entirely clear to me why only CommitTS fails and not other slrus.
One possible reason could be that CommitTS is started before this function
gets called whereas CLOG and SUBTRANS get started after the function
returns.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

fix_commit_ts_crash.patchapplication/octet-stream; name=fix_commit_ts_crash.patchDownload
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7982e16..6ba545a 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1792,6 +1792,12 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 	int			allocsize = 0;
 	int			i;
 
+	/*
+	 * Set result and maxsubxid to the largest XID known to us.
+	 */
+	TransactionIdRetreat(result);
+	TransactionIdRetreat(maxsubxid);
+
 	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
#3Jeff Janes
jeff.janes@gmail.com
In reply to: Pavan Deolasee (#2)
Re: PANIC in pg_commit_ts slru after crashes

On Fri, Apr 14, 2017 at 9:33 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:

Since all those offsets fall on a page boundary, my guess is that we're
somehow failing to handle a new page correctly.

Looking at the patch itself, my feeling is that the following code
in src/backend/access/transam/twophase.c might be causing the problem.

1841
1842 /* update nextXid if needed */
1843 if (TransactionIdFollowsOrEquals(maxsubxid,
ShmemVariableCache->nextXid))
1844 {
1845 LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
1846 ShmemVariableCache->nextXid = maxsubxid;
1847 TransactionIdAdvance(ShmemVariableCache->nextXid);
1848 LWLockRelease(XidGenLock);
1849 }

The function PrescanPreparedTransactions() gets called at the start of
the redo recovery and this specific block will get exercised irrespective
of whether there are any prepared transactions or not. What I find
particularly wrong here is that we are initialising maxsubxid to current
value of ShmemVariableCache->nextXid when the function enters, but this
block would then again increment ShmemVariableCache->nextXid, when there
are no prepared transactions in the system.

I wonder if we should do as in attached patch.

That solves it for me.

Thanks,

Jeff

#4Simon Riggs
simon@2ndquadrant.com
In reply to: Jeff Janes (#3)
Re: PANIC in pg_commit_ts slru after crashes

On 15 April 2017 at 21:30, Jeff Janes <jeff.janes@gmail.com> wrote:

On Fri, Apr 14, 2017 at 9:33 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:

Since all those offsets fall on a page boundary, my guess is that we're
somehow failing to handle a new page correctly.

Looking at the patch itself, my feeling is that the following code in
src/backend/access/transam/twophase.c might be causing the problem.

1841
1842 /* update nextXid if needed */
1843 if (TransactionIdFollowsOrEquals(maxsubxid,
ShmemVariableCache->nextXid))
1844 {
1845 LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
1846 ShmemVariableCache->nextXid = maxsubxid;
1847 TransactionIdAdvance(ShmemVariableCache->nextXid);
1848 LWLockRelease(XidGenLock);
1849 }

The function PrescanPreparedTransactions() gets called at the start of the
redo recovery and this specific block will get exercised irrespective of
whether there are any prepared transactions or not. What I find particularly
wrong here is that we are initialising maxsubxid to current value of
ShmemVariableCache->nextXid when the function enters, but this block would
then again increment ShmemVariableCache->nextXid, when there are no prepared
transactions in the system.

I wonder if we should do as in attached patch.

That solves it for me.

Thanks for patching and testing. I'll review and probably commit tomorrow.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Michael Paquier
michael.paquier@gmail.com
In reply to: Simon Riggs (#4)
1 attachment(s)
Re: PANIC in pg_commit_ts slru after crashes

On Sun, Apr 16, 2017 at 6:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 15 April 2017 at 21:30, Jeff Janes <jeff.janes@gmail.com> wrote:

On Fri, Apr 14, 2017 at 9:33 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:

Since all those offsets fall on a page boundary, my guess is that we're
somehow failing to handle a new page correctly.

Looking at the patch itself, my feeling is that the following code in
src/backend/access/transam/twophase.c might be causing the problem.

1841
1842 /* update nextXid if needed */
1843 if (TransactionIdFollowsOrEquals(maxsubxid,
ShmemVariableCache->nextXid))
1844 {
1845 LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
1846 ShmemVariableCache->nextXid = maxsubxid;
1847 TransactionIdAdvance(ShmemVariableCache->nextXid);
1848 LWLockRelease(XidGenLock);
1849 }

The function PrescanPreparedTransactions() gets called at the start of the
redo recovery and this specific block will get exercised irrespective of
whether there are any prepared transactions or not. What I find particularly
wrong here is that we are initialising maxsubxid to current value of
ShmemVariableCache->nextXid when the function enters, but this block would
then again increment ShmemVariableCache->nextXid, when there are no prepared
transactions in the system.

I wonder if we should do as in attached patch.

That solves it for me.

Thanks for patching and testing. I'll review and probably commit tomorrow.

Actually I think that the fix proposed is not correct as we should
just never take the risk to call TransactionIdAdvance() if there are
no 2PC files to scan, and it adds more difficulty in understanding
something that the 2PC restore code has introduced and already made
harder to catch (2nd thoughts and regrets here). If you look closely
at the code, ProcessTwoPhaseBuffer() uses *result and *maxsubid only
for PrescanPreparedTransactions() so there is a link between both.
Attached is a patch to rework ProcessTwoPhaseBuffer() by removing
those two arguments and replace them by a boolean to decide if the
next cached transaction ID should be updated or not. The cached next
TXID gets updated only if caller of ProcessTwoPhaseBuffer() wants to
do so and if the subxid scanned follows the XID of the scanned 2PC
file. This looks safer to me in the long run.

Jeff, does this patch make the situation better? The fix is rather
simple as it just makes sure that the next XID never gets updated if
there are no 2PC files.
--
Michael

Attachments:

2pc-restore-fix.patchapplication/octet-stream; name=2pc-restore-fix.patchDownload
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7982e16abc..723552bfb2 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -222,7 +222,7 @@ static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
 static char *ProcessTwoPhaseBuffer(TransactionId xid,
 							XLogRecPtr	prepare_start_lsn,
 							bool fromdisk, bool overwriteOK, bool setParent,
-							TransactionId *result, TransactionId *maxsubxid);
+							bool setNextXid);
 static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
 				const char *gid, TimestampTz prepared_at, Oid owner,
 				Oid databaseid);
@@ -1744,7 +1744,7 @@ restoreTwoPhaseData(void)
 
 			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
 										true, false, false,
-										NULL, NULL);
+										false);
 			if (buf == NULL)
 				continue;
 
@@ -1786,7 +1786,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 {
 	TransactionId origNextXid = ShmemVariableCache->nextXid;
 	TransactionId result = origNextXid;
-	TransactionId maxsubxid = origNextXid;
 	TransactionId *xids = NULL;
 	int			nxids = 0;
 	int			allocsize = 0;
@@ -1806,11 +1805,18 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		buf = ProcessTwoPhaseBuffer(xid,
 				gxact->prepare_start_lsn,
 				gxact->ondisk, false, false,
-				&result, &maxsubxid);
+				true);
 
 		if (buf == NULL)
 			continue;
 
+		/*
+		 * OK, we think this file is valid.  Incorporate xid into the
+		 * running-minimum result.
+		 */
+		if (TransactionIdPrecedes(xid, result))
+			result = xid;
+
 		if (xids_p)
 		{
 			if (nxids == allocsize)
@@ -1839,15 +1845,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		*nxids_p = nxids;
 	}
 
-	/* update nextXid if needed */
-	if (TransactionIdFollowsOrEquals(maxsubxid, ShmemVariableCache->nextXid))
-	{
-		LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
-		ShmemVariableCache->nextXid = maxsubxid;
-		TransactionIdAdvance(ShmemVariableCache->nextXid);
-		LWLockRelease(XidGenLock);
-	}
-
 	return result;
 }
 
@@ -1884,7 +1881,7 @@ StandbyRecoverPreparedTransactions(bool overwriteOK)
 		buf = ProcessTwoPhaseBuffer(xid,
 				gxact->prepare_start_lsn,
 				gxact->ondisk, overwriteOK, true,
-				NULL, NULL);
+				false);
 		if (buf != NULL)
 			pfree(buf);
 	}
@@ -1924,7 +1921,7 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 				gxact->prepare_start_lsn,
 				gxact->ondisk, false, false,
-				NULL, NULL);
+				false);
 		if (buf == NULL)
 			continue;
 
@@ -2012,20 +2009,16 @@ RecoverPreparedTransactions(void)
  * If setParent is true, then use the overwriteOK parameter to set up
  * subtransaction parent linkages.
  *
- * If result and maxsubxid are not NULL, fill them up with smallest
- * running transaction id (lesser than ShmemVariableCache->nextXid)
- * and largest subtransaction id for this transaction respectively.
+ * If setNextXid is true, set ShmemVariableCache->nextXid to the newest
+ * value scanned.
  */
 static char *
 ProcessTwoPhaseBuffer(TransactionId xid,
 					  XLogRecPtr prepare_start_lsn,
 					  bool fromdisk, bool overwriteOK,
-					  bool setParent, TransactionId *result,
-					  TransactionId *maxsubxid)
+					  bool setParent, bool setNextXid)
 {
 	TransactionId origNextXid = ShmemVariableCache->nextXid;
-	TransactionId res = InvalidTransactionId;
-	TransactionId maxsub = InvalidTransactionId;
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2034,11 +2027,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	if (result)
-		res = *result;
-	if (maxsubxid)
-		maxsub = *maxsubxid;
-
 	/* Already processed? */
 	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
 	{
@@ -2121,13 +2109,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	}
 
 	/*
-	 * OK, we think this buffer is valid.  Incorporate xid into the
-	 * running-minimum result.
-	 */
-	if (TransactionIdPrecedes(xid, res))
-		res = xid;
-
-	/*
 	 * Examine subtransaction XIDs ... they should all follow main
 	 * XID, and they may force us to advance nextXid.
 	 */
@@ -2139,17 +2120,27 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		TransactionId subxid = subxids[i];
 
 		Assert(TransactionIdFollows(subxid, xid));
-		if (TransactionIdFollowsOrEquals(subxid, maxsub))
-			maxsub = subxid;
+
+		/* update nextXid if needed */
+		if (setNextXid &&
+			TransactionIdFollowsOrEquals(subxid,
+										 ShmemVariableCache->nextXid))
+		{
+			/*
+			 * We don't expect anyone else to modify nextXid, hence we don't
+			 * need to hold a lock while examining it.  We still acquire the
+			 * lock to modify it, though.
+			 */
+			LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
+			ShmemVariableCache->nextXid = subxid;
+			TransactionIdAdvance(ShmemVariableCache->nextXid);
+			LWLockRelease(XidGenLock);
+		}
+
 		if (setParent)
 			SubTransSetParent(xid, subxid, overwriteOK);
 	}
 
-	if (result)
-		*result = res;
-	if (maxsubxid)
-		*maxsubxid = maxsub;
-
 	return buf;
 }
 
#6Jeff Janes
jeff.janes@gmail.com
In reply to: Michael Paquier (#5)
Re: PANIC in pg_commit_ts slru after crashes

On Sun, Apr 16, 2017 at 6:59 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

Jeff, does this patch make the situation better? The fix is rather
simple as it just makes sure that the next XID never gets updated if
there are no 2PC files.

Yes, that fixes the reported case when 2PC are not being used.

Thanks,

Jeff

#7Michael Paquier
michael.paquier@gmail.com
In reply to: Jeff Janes (#6)
Re: PANIC in pg_commit_ts slru after crashes

On Tue, Apr 18, 2017 at 12:33 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Sun, Apr 16, 2017 at 6:59 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

Jeff, does this patch make the situation better? The fix is rather
simple as it just makes sure that the next XID never gets updated if
there are no 2PC files.

Yes, that fixes the reported case when 2PC are not being used.

Thanks for the confirmation. I am able to run your test suite by the
way after a couple of tweaks, and I can see the failures. Still
investigating the 2nd report...
--
Michael
VMware vCenter Server
www.vmware.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Simon Riggs
simon@2ndquadrant.com
In reply to: Jeff Janes (#6)
Re: PANIC in pg_commit_ts slru after crashes

On 17 April 2017 at 16:33, Jeff Janes <jeff.janes@gmail.com> wrote:

On Sun, Apr 16, 2017 at 6:59 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

Jeff, does this patch make the situation better? The fix is rather
simple as it just makes sure that the next XID never gets updated if
there are no 2PC files.

Yes, that fixes the reported case when 2PC are not being used.

Thanks Jeff.

I certainly prefer the simplicity of Michael's approach. I'll move to commit.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#8)
1 attachment(s)
Re: PANIC in pg_commit_ts slru after crashes

On 18 April 2017 at 09:51, Simon Riggs <simon@2ndquadrant.com> wrote:

On 17 April 2017 at 16:33, Jeff Janes <jeff.janes@gmail.com> wrote:

On Sun, Apr 16, 2017 at 6:59 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

Jeff, does this patch make the situation better? The fix is rather
simple as it just makes sure that the next XID never gets updated if
there are no 2PC files.

Yes, that fixes the reported case when 2PC are not being used.

Thanks Jeff.

I certainly prefer the simplicity of Michael's approach. I'll move to commit.

Minor change to patch.

I've added a recheck in ProcessTwoPhaseBuffer() after we acquire the lock.

If its worth acquiring the lock its worth checking we don't have a race.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

2pc-restore-fix.v2.patchapplication/octet-stream; name=2pc-restore-fix.v2.patchDownload
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7982e16abc..6e6678cfff 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -222,7 +222,7 @@ static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
 static char *ProcessTwoPhaseBuffer(TransactionId xid,
 							XLogRecPtr	prepare_start_lsn,
 							bool fromdisk, bool overwriteOK, bool setParent,
-							TransactionId *result, TransactionId *maxsubxid);
+							bool setNextXid);
 static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
 				const char *gid, TimestampTz prepared_at, Oid owner,
 				Oid databaseid);
@@ -1744,7 +1744,7 @@ restoreTwoPhaseData(void)
 
 			buf = ProcessTwoPhaseBuffer(xid, InvalidXLogRecPtr,
 										true, false, false,
-										NULL, NULL);
+										false);
 			if (buf == NULL)
 				continue;
 
@@ -1786,7 +1786,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 {
 	TransactionId origNextXid = ShmemVariableCache->nextXid;
 	TransactionId result = origNextXid;
-	TransactionId maxsubxid = origNextXid;
 	TransactionId *xids = NULL;
 	int			nxids = 0;
 	int			allocsize = 0;
@@ -1806,11 +1805,18 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		buf = ProcessTwoPhaseBuffer(xid,
 				gxact->prepare_start_lsn,
 				gxact->ondisk, false, false,
-				&result, &maxsubxid);
+				true);
 
 		if (buf == NULL)
 			continue;
 
+		/*
+		 * OK, we think this file is valid.  Incorporate xid into the
+		 * running-minimum result.
+		 */
+		if (TransactionIdPrecedes(xid, result))
+			result = xid;
+
 		if (xids_p)
 		{
 			if (nxids == allocsize)
@@ -1839,15 +1845,6 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 		*nxids_p = nxids;
 	}
 
-	/* update nextXid if needed */
-	if (TransactionIdFollowsOrEquals(maxsubxid, ShmemVariableCache->nextXid))
-	{
-		LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
-		ShmemVariableCache->nextXid = maxsubxid;
-		TransactionIdAdvance(ShmemVariableCache->nextXid);
-		LWLockRelease(XidGenLock);
-	}
-
 	return result;
 }
 
@@ -1884,7 +1881,7 @@ StandbyRecoverPreparedTransactions(bool overwriteOK)
 		buf = ProcessTwoPhaseBuffer(xid,
 				gxact->prepare_start_lsn,
 				gxact->ondisk, overwriteOK, true,
-				NULL, NULL);
+				false);
 		if (buf != NULL)
 			pfree(buf);
 	}
@@ -1924,7 +1921,7 @@ RecoverPreparedTransactions(void)
 		buf = ProcessTwoPhaseBuffer(xid,
 				gxact->prepare_start_lsn,
 				gxact->ondisk, false, false,
-				NULL, NULL);
+				false);
 		if (buf == NULL)
 			continue;
 
@@ -2012,20 +2009,16 @@ RecoverPreparedTransactions(void)
  * If setParent is true, then use the overwriteOK parameter to set up
  * subtransaction parent linkages.
  *
- * If result and maxsubxid are not NULL, fill them up with smallest
- * running transaction id (lesser than ShmemVariableCache->nextXid)
- * and largest subtransaction id for this transaction respectively.
+ * If setNextXid is true, set ShmemVariableCache->nextXid to the newest
+ * value scanned.
  */
 static char *
 ProcessTwoPhaseBuffer(TransactionId xid,
 					  XLogRecPtr prepare_start_lsn,
 					  bool fromdisk, bool overwriteOK,
-					  bool setParent, TransactionId *result,
-					  TransactionId *maxsubxid)
+					  bool setParent, bool setNextXid)
 {
 	TransactionId origNextXid = ShmemVariableCache->nextXid;
-	TransactionId res = InvalidTransactionId;
-	TransactionId maxsub = InvalidTransactionId;
 	TransactionId *subxids;
 	char	   *buf;
 	TwoPhaseFileHeader *hdr;
@@ -2034,11 +2027,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!fromdisk)
 		Assert(prepare_start_lsn != InvalidXLogRecPtr);
 
-	if (result)
-		res = *result;
-	if (maxsubxid)
-		maxsub = *maxsubxid;
-
 	/* Already processed? */
 	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
 	{
@@ -2121,13 +2109,6 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	}
 
 	/*
-	 * OK, we think this buffer is valid.  Incorporate xid into the
-	 * running-minimum result.
-	 */
-	if (TransactionIdPrecedes(xid, res))
-		res = xid;
-
-	/*
 	 * Examine subtransaction XIDs ... they should all follow main
 	 * XID, and they may force us to advance nextXid.
 	 */
@@ -2139,17 +2120,31 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		TransactionId subxid = subxids[i];
 
 		Assert(TransactionIdFollows(subxid, xid));
-		if (TransactionIdFollowsOrEquals(subxid, maxsub))
-			maxsub = subxid;
+
+		/* update nextXid if needed */
+		if (setNextXid &&
+			TransactionIdFollowsOrEquals(subxid,
+										 ShmemVariableCache->nextXid))
+		{
+			/*
+			 * We don't expect anyone else to modify nextXid, hence we don't
+			 * need to hold a lock while examining it.  We still acquire the
+			 * lock to modify it, though, so we recheck.
+			 */
+			LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
+			if (TransactionIdFollowsOrEquals(subxid,
+										 ShmemVariableCache->nextXid))
+			{
+				ShmemVariableCache->nextXid = subxid;
+				TransactionIdAdvance(ShmemVariableCache->nextXid);
+			}
+			LWLockRelease(XidGenLock);
+		}
+
 		if (setParent)
 			SubTransSetParent(xid, subxid, overwriteOK);
 	}
 
-	if (result)
-		*result = res;
-	if (maxsubxid)
-		*maxsubxid = maxsub;
-
 	return buf;
 }
 
#10Michael Paquier
michael.paquier@gmail.com
In reply to: Simon Riggs (#9)
Re: PANIC in pg_commit_ts slru after crashes

On Tue, Apr 18, 2017 at 7:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I've added a recheck in ProcessTwoPhaseBuffer() after we acquire the lock.

If its worth acquiring the lock its worth checking we don't have a race.

I see. No objections to that.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers