Assertion failure twophase.c (2) (testing HS/SR)

Started by Erik Rijkersalmost 16 years ago8 messages
#1Erik Rijkers
er@xs4all.nl

in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30

With three patches:

new_smart_shutdown_20100201.patch
extend_format_of_recovery_info_funcs_v4.20100303.patch
fix-KnownAssignedXidsRemoveMany-1.patch

pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary

FailedAssertion, File: "twophase.c", Line: 1201.

The standby was restarted and seems to catch up OK again.

LOG: database system was interrupted; last known up at 2010-03-04 01:35:23 CET
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001':
No such file or directory
FATAL: the database system is starting up
LOG: entering standby mode
LOG: redo starts at 0/1000020
LOG: consistent recovery state reached at 0/2000000
LOG: database system is ready to accept read only connections
ERROR: cannot execute CREATE TABLE in a read-only transaction
STATEMENT: create table t (c text);
ERROR: cannot execute SELECT INTO in a read-only transaction
STATEMENT: create table t as select 1;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
FATAL: database "ms" does not exist
TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File: "twophase.c", Line: 1201)
LOG: startup process (PID 18107) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: database system was interrupted while in recovery at log time 2010-03-04 05:00:24 CET
HINT: If this has occurred more than once some data might be corrupted and you might need to
choose an earlier recovery target.
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000001C0000007F':
No such file or directory
LOG: entering standby mode
LOG: redo starts at 1C/7800F8E0
LOG: consistent recovery state reached at 1C/ADB9C178
LOG: database system is ready to accept read only connections

The ERRORs (and FATAL) were accidentally issued commands; I can't tell if they were causing the
assertion. (database 'ms' indeed was not present on this instance)

see also:
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02221.php

thanks,

Erik Rijkers

#2Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Erik Rijkers (#1)
Re: Assertion failure twophase.c (2) (testing HS/SR)

Erik Rijkers wrote:

in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30

With three patches:

new_smart_shutdown_20100201.patch
extend_format_of_recovery_info_funcs_v4.20100303.patch

Got a link to these two patches? I couldn't find them with a quick search.

fix-KnownAssignedXidsRemoveMany-1.patch

pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary

FailedAssertion, File: "twophase.c", Line: 1201.

The standby was restarted and seems to catch up OK again.
...
see also:
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02221.php

I'm still not any wiser on what's causing that, but I've fixed the bug
in KnownAssignedXidsMany() now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#2)
2 attachment(s)
Re: Assertion failure twophase.c (2) (testing HS/SR)

On Thu, Mar 11, 2010 at 6:29 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Erik Rijkers wrote:

in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30

With three patches:

  new_smart_shutdown_20100201.patch

http://archives.postgresql.org/pgsql-hackers/2010-01/msg03116.php

  extend_format_of_recovery_info_funcs_v4.20100303.patch

http://archives.postgresql.org/pgsql-hackers/2010-03/msg00175.php

Got a link to these two patches? I couldn't find them with a quick search.

For your convenience, I attached those patches in this post.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

new_smart_shutdown_20100201.patchtext/x-patch; charset=US-ASCII; name=new_smart_shutdown_20100201.patchDownload
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 278,283 **** typedef enum
--- 278,284 ----
  	PM_RECOVERY_CONSISTENT,		/* consistent recovery mode */
  	PM_RUN,						/* normal "database is alive" state */
  	PM_WAIT_BACKUP,				/* waiting for online backup mode to end */
+ 	PM_WAIT_READONLY,			/* waiting for read only backends to exit */
  	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
  	PM_SHUTDOWN,				/* waiting for bgwriter to do shutdown ckpt */
  	PM_SHUTDOWN_2,				/* waiting for archiver and walsenders to finish */
***************
*** 2165,2171 **** pmdie(SIGNAL_ARGS)
  				/* and the walwriter too */
  				if (WalWriterPID != 0)
  					signal_child(WalWriterPID, SIGTERM);
! 				pmState = PM_WAIT_BACKUP;
  			}
  
  			/*
--- 2166,2173 ----
  				/* and the walwriter too */
  				if (WalWriterPID != 0)
  					signal_child(WalWriterPID, SIGTERM);
! 				/* online backup mode is active only when normal processing */
! 				pmState = (pmState == PM_RUN) ? PM_WAIT_BACKUP : PM_WAIT_READONLY;
  			}
  
  			/*
***************
*** 2840,2845 **** PostmasterStateMachine(void)
--- 2842,2870 ----
  	}
  
  	/*
+ 	 * If we are in a state-machine state that implies waiting for read only
+ 	 * backends to exit, see if they're all gone, and change state if so.
+ 	 */
+ 	if (pmState == PM_WAIT_READONLY)
+ 	{
+ 		/*
+ 		 * PM_WAIT_READONLY state ends when we have no regular backends that
+ 		 * have been started during recovery. Since those backends might be
+ 		 * waiting for the WAL record that conflicts with their queries to be
+ 		 * replayed, recovery and replication need to remain until all read
+ 		 * only backends have been gone away.
+ 		 */
+ 		if (CountChildren(BACKEND_TYPE_NORMAL) == 0)
+ 		{
+ 			if (StartupPID != 0)
+ 				signal_child(StartupPID, SIGTERM);
+ 			if (WalReceiverPID != 0)
+ 				signal_child(WalReceiverPID, SIGTERM);
+ 			pmState = PM_WAIT_BACKENDS;
+ 		}
+ 	}
+ 
+ 	/*
  	 * If we are in a state-machine state that implies waiting for backends to
  	 * exit, see if they're all gone, and change state if so.
  	 */
extend_format_of_recovery_info_funcs_v4.patchtext/x-patch; charset=US-ASCII; name=extend_format_of_recovery_info_funcs_v4.patchDownload
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 13199,13204 **** postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
--- 13199,13208 ----
      This is usually the desired behavior for managing transaction log archiving
      behavior, since the preceding file is the last one that currently
      needs to be archived.
+     These functions also accept as a parameter the string that consists of timeline and
+     location, separated by a slash. In this case a transaction log file name is computed
+     by using the given timeline. On the other hand, if timeline is not supplied, the
+     current timeline is used for the computation.
     </para>
  
     <para>
***************
*** 13245,13257 **** postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
          <literal><function>pg_last_xlog_receive_location</function>()</literal>
          </entry>
         <entry><type>text</type></entry>
!        <entry>Get last transaction log location received and synced to disk during
!         streaming recovery. If streaming recovery is still in progress
          this will increase monotonically. If streaming recovery has completed
          then this value will remain static at the value of the last WAL record
          received and synced to disk during that recovery. When the server has
          been started without a streaming recovery then the return value will be
!         InvalidXLogRecPtr (0/0).
         </entry>
        </row>
        <row>
--- 13249,13263 ----
          <literal><function>pg_last_xlog_receive_location</function>()</literal>
          </entry>
         <entry><type>text</type></entry>
!        <entry>Get timeline and location of last transaction log received and synced
!         to disk during streaming recovery. The return string is separated by a slash,
!         the first value indicates the timeline and the other the location.
!         If streaming recovery is still in progress
          this will increase monotonically. If streaming recovery has completed
          then this value will remain static at the value of the last WAL record
          received and synced to disk during that recovery. When the server has
          been started without a streaming recovery then the return value will be
!         <literal>0/0/0</>.
         </entry>
        </row>
        <row>
***************
*** 13259,13270 **** postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
          <literal><function>pg_last_xlog_replay_location</function>()</literal>
          </entry>
         <entry><type>text</type></entry>
!        <entry>Get last transaction log location replayed during recovery.
          If recovery is still in progress this will increase monotonically.
          If recovery has completed then this value will remain static at
          the value of the last WAL record applied during that recovery.
          When the server has been started normally without a recovery
!         then the return value will be InvalidXLogRecPtr (0/0).
         </entry>
        </row>
       </tbody>
--- 13265,13278 ----
          <literal><function>pg_last_xlog_replay_location</function>()</literal>
          </entry>
         <entry><type>text</type></entry>
!        <entry>Get timeline and location of last transaction log replayed during
!         recovery. The return string is separated by a slash, the first value
!         indicates the timeline and the other the location.
          If recovery is still in progress this will increase monotonically.
          If recovery has completed then this value will remain static at
          the value of the last WAL record applied during that recovery.
          When the server has been started normally without a recovery
!         then the return value will be <literal>0/0/0</>.
         </entry>
        </row>
       </tbody>
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 392,397 **** typedef struct XLogCtlData
--- 392,399 ----
  	TimestampTz recoveryLastXTime;
  	/* end+1 of the last record replayed */
  	XLogRecPtr	recoveryLastRecPtr;
+ 	/* tli of last record replayed */
+ 	TimeLineID	recoveryLastTLI;
  
  	slock_t		info_lck;		/* locks shared variables shown above */
  } XLogCtlData;
***************
*** 471,476 **** static uint32 readRecordBufSize = 0;
--- 473,479 ----
  static XLogRecPtr ReadRecPtr;	/* start of last record read */
  static XLogRecPtr EndRecPtr;	/* end+1 of last record read */
  static TimeLineID lastPageTLI = 0;
+ static TimeLineID lastRecTLI = 0;	/* tli of last record read */
  
  static XLogRecPtr minRecoveryPoint;		/* local copy of
  										 * ControlFile->minRecoveryPoint */
***************
*** 3943,3949 **** ValidXLOGHeader(XLogPageHeader hdr, int emode)
  						readId, readSeg, readOff)));
  		return false;
  	}
! 	lastPageTLI = hdr->xlp_tli;
  	return true;
  }
  
--- 3946,3952 ----
  						readId, readSeg, readOff)));
  		return false;
  	}
! 	lastRecTLI = lastPageTLI = hdr->xlp_tli;
  	return true;
  }
  
***************
*** 5782,5791 **** StartupXLOG(void)
  			/* use volatile pointer to prevent code rearrangement */
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
! 			/* initialize shared replayEndRecPtr and recoveryLastRecPtr */
  			SpinLockAcquire(&xlogctl->info_lck);
  			xlogctl->replayEndRecPtr = ReadRecPtr;
  			xlogctl->recoveryLastRecPtr = ReadRecPtr;
  			SpinLockRelease(&xlogctl->info_lck);
  
  			InRedo = true;
--- 5785,5801 ----
  			/* use volatile pointer to prevent code rearrangement */
  			volatile XLogCtlData *xlogctl = XLogCtl;
  
! 			/*
! 			 * initialize shared replayEndRecPtr, recoveryLastRecPtr and
! 			 * recoveryLastTLI. Actually, the latter two variables don't need to
! 			 * be initialized here since they are expected to be updated at least
! 			 * once until read only connections will have read them. But just in
! 			 * case.
! 			 */
  			SpinLockAcquire(&xlogctl->info_lck);
  			xlogctl->replayEndRecPtr = ReadRecPtr;
  			xlogctl->recoveryLastRecPtr = ReadRecPtr;
+ 			xlogctl->recoveryLastTLI = lastRecTLI;
  			SpinLockRelease(&xlogctl->info_lck);
  
  			InRedo = true;
***************
*** 5913,5923 **** StartupXLOG(void)
  				error_context_stack = errcontext.previous;
  
  				/*
! 				 * Update shared recoveryLastRecPtr after this record has been
! 				 * replayed.
  				 */
  				SpinLockAcquire(&xlogctl->info_lck);
  				xlogctl->recoveryLastRecPtr = EndRecPtr;
  				SpinLockRelease(&xlogctl->info_lck);
  
  				LastRec = ReadRecPtr;
--- 5923,5934 ----
  				error_context_stack = errcontext.previous;
  
  				/*
! 				 * Update shared recoveryLastRecPtr and recoveryLastTLI
! 				 * after this record has been replayed.
  				 */
  				SpinLockAcquire(&xlogctl->info_lck);
  				xlogctl->recoveryLastRecPtr = EndRecPtr;
+ 				xlogctl->recoveryLastTLI = lastRecTLI;
  				SpinLockRelease(&xlogctl->info_lck);
  
  				LastRec = ReadRecPtr;
***************
*** 7479,7484 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
--- 7490,7496 ----
  			/* Following WAL records should be run with new TLI */
  			ThisTimeLineID = checkPoint.ThisTimeLineID;
  		}
+ 		lastRecTLI = ThisTimeLineID;
  
  		RecoveryRestartPoint(&checkPoint);
  	}
***************
*** 8274,8280 **** pg_current_xlog_insert_location(PG_FUNCTION_ARGS)
  }
  
  /*
!  * Report the last WAL receive location (same format as pg_start_backup etc)
   *
   * This is useful for determining how much of WAL is guaranteed to be received
   * and synced to disk by walreceiver.
--- 8286,8292 ----
  }
  
  /*
!  * Report the last WAL receive tli and location
   *
   * This is useful for determining how much of WAL is guaranteed to be received
   * and synced to disk by walreceiver.
***************
*** 8287,8299 **** pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
  
  	recptr = GetWalRcvWriteRecPtr();
  
! 	snprintf(location, sizeof(location), "%X/%X",
  			 recptr.xlogid, recptr.xrecoff);
  	PG_RETURN_TEXT_P(cstring_to_text(location));
  }
  
  /*
!  * Report the last WAL replay location (same format as pg_start_backup etc)
   *
   * This is useful for determining how much of WAL is visible to read-only
   * connections during recovery.
--- 8299,8312 ----
  
  	recptr = GetWalRcvWriteRecPtr();
  
! 	snprintf(location, sizeof(location), "%X/%X/%X",
! 			 XLogRecPtrIsInvalid(recptr) ? 0 : GetRecoveryTargetTLI(),
  			 recptr.xlogid, recptr.xrecoff);
  	PG_RETURN_TEXT_P(cstring_to_text(location));
  }
  
  /*
!  * Report the last WAL replay tli and location
   *
   * This is useful for determining how much of WAL is visible to read-only
   * connections during recovery.
***************
*** 8303,8317 **** pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile XLogCtlData *xlogctl = XLogCtl;
  	XLogRecPtr	recptr;
  	char		location[MAXFNAMELEN];
  
  	SpinLockAcquire(&xlogctl->info_lck);
  	recptr = xlogctl->recoveryLastRecPtr;
  	SpinLockRelease(&xlogctl->info_lck);
  
! 	snprintf(location, sizeof(location), "%X/%X",
! 			 recptr.xlogid, recptr.xrecoff);
  	PG_RETURN_TEXT_P(cstring_to_text(location));
  }
  
--- 8316,8332 ----
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile XLogCtlData *xlogctl = XLogCtl;
+ 	TimeLineID	tli;
  	XLogRecPtr	recptr;
  	char		location[MAXFNAMELEN];
  
  	SpinLockAcquire(&xlogctl->info_lck);
+ 	tli = xlogctl->recoveryLastTLI;
  	recptr = xlogctl->recoveryLastRecPtr;
  	SpinLockRelease(&xlogctl->info_lck);
  
! 	snprintf(location, sizeof(location), "%X/%X/%X",
! 			 tli, recptr.xlogid, recptr.xrecoff);
  	PG_RETURN_TEXT_P(cstring_to_text(location));
  }
  
***************
*** 8319,8324 **** pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
--- 8334,8343 ----
   * Compute an xlog file name and decimal byte offset given a WAL location,
   * such as is returned by pg_stop_backup() or pg_xlog_switch().
   *
+  * Also use the tli for the computation if it's given with a location,
+  * such as is returned by pg_last_xlog_receive_location() or
+  * pg_last_xlog_replay_location().
+  *
   * Note that a location exactly at a segment boundary is taken to be in
   * the previous segment.  This is usually the right thing, since the
   * expected usage is to determine which xlog file(s) are ready to archive.
***************
*** 8328,8338 **** pg_xlogfile_name_offset(PG_FUNCTION_ARGS)
--- 8347,8359 ----
  {
  	text	   *location = PG_GETARG_TEXT_P(0);
  	char	   *locationstr;
+ 	unsigned int utli;
  	unsigned int uxlogid;
  	unsigned int uxrecoff;
  	uint32		xlogid;
  	uint32		xlogseg;
  	uint32		xrecoff;
+ 	TimeLineID	tli = ThisTimeLineID;
  	XLogRecPtr	locationpoint;
  	char		xlogfilename[MAXFNAMELEN];
  	Datum		values[2];
***************
*** 8346,8352 **** pg_xlogfile_name_offset(PG_FUNCTION_ARGS)
  	 */
  	locationstr = text_to_cstring(location);
  
! 	if (sscanf(locationstr, "%X/%X", &uxlogid, &uxrecoff) != 2)
  		ereport(ERROR,
  				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  				 errmsg("could not parse transaction log location \"%s\"",
--- 8367,8375 ----
  	 */
  	locationstr = text_to_cstring(location);
  
! 	if (sscanf(locationstr, "%X/%X/%X", &utli, &uxlogid, &uxrecoff) == 3)
! 		tli = (TimeLineID) utli;
! 	else if (sscanf(locationstr, "%X/%X", &uxlogid, &uxrecoff) != 2)
  		ereport(ERROR,
  				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  				 errmsg("could not parse transaction log location \"%s\"",
***************
*** 8371,8377 **** pg_xlogfile_name_offset(PG_FUNCTION_ARGS)
  	 * xlogfilename
  	 */
  	XLByteToPrevSeg(locationpoint, xlogid, xlogseg);
! 	XLogFileName(xlogfilename, ThisTimeLineID, xlogid, xlogseg);
  
  	values[0] = CStringGetTextDatum(xlogfilename);
  	isnull[0] = false;
--- 8394,8400 ----
  	 * xlogfilename
  	 */
  	XLByteToPrevSeg(locationpoint, xlogid, xlogseg);
! 	XLogFileName(xlogfilename, tli, xlogid, xlogseg);
  
  	values[0] = CStringGetTextDatum(xlogfilename);
  	isnull[0] = false;
***************
*** 8397,8418 **** pg_xlogfile_name_offset(PG_FUNCTION_ARGS)
  /*
   * Compute an xlog file name given a WAL location,
   * such as is returned by pg_stop_backup() or pg_xlog_switch().
   */
  Datum
  pg_xlogfile_name(PG_FUNCTION_ARGS)
  {
  	text	   *location = PG_GETARG_TEXT_P(0);
  	char	   *locationstr;
  	unsigned int uxlogid;
  	unsigned int uxrecoff;
  	uint32		xlogid;
  	uint32		xlogseg;
  	XLogRecPtr	locationpoint;
  	char		xlogfilename[MAXFNAMELEN];
  
  	locationstr = text_to_cstring(location);
  
! 	if (sscanf(locationstr, "%X/%X", &uxlogid, &uxrecoff) != 2)
  		ereport(ERROR,
  				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  				 errmsg("could not parse transaction log location \"%s\"",
--- 8420,8449 ----
  /*
   * Compute an xlog file name given a WAL location,
   * such as is returned by pg_stop_backup() or pg_xlog_switch().
+  *
+  * Also use the tli for the computation if it's given with a location,
+  * such as is returned by pg_last_xlog_receive_location() or
+  * pg_last_xlog_replay_location().
   */
  Datum
  pg_xlogfile_name(PG_FUNCTION_ARGS)
  {
  	text	   *location = PG_GETARG_TEXT_P(0);
  	char	   *locationstr;
+ 	unsigned int utli;
  	unsigned int uxlogid;
  	unsigned int uxrecoff;
  	uint32		xlogid;
  	uint32		xlogseg;
+ 	TimeLineID	tli = ThisTimeLineID;
  	XLogRecPtr	locationpoint;
  	char		xlogfilename[MAXFNAMELEN];
  
  	locationstr = text_to_cstring(location);
  
! 	if (sscanf(locationstr, "%X/%X/%X", &utli, &uxlogid, &uxrecoff) == 3)
! 		tli = (TimeLineID) utli;
! 	else if (sscanf(locationstr, "%X/%X", &uxlogid, &uxrecoff) != 2)
  		ereport(ERROR,
  				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  				 errmsg("could not parse transaction log location \"%s\"",
***************
*** 8422,8428 **** pg_xlogfile_name(PG_FUNCTION_ARGS)
  	locationpoint.xrecoff = uxrecoff;
  
  	XLByteToPrevSeg(locationpoint, xlogid, xlogseg);
! 	XLogFileName(xlogfilename, ThisTimeLineID, xlogid, xlogseg);
  
  	PG_RETURN_TEXT_P(cstring_to_text(xlogfilename));
  }
--- 8453,8459 ----
  	locationpoint.xrecoff = uxrecoff;
  
  	XLByteToPrevSeg(locationpoint, xlogid, xlogseg);
! 	XLogFileName(xlogfilename, tli, xlogid, xlogseg);
  
  	PG_RETURN_TEXT_P(cstring_to_text(xlogfilename));
  }
#4Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#2)
Re: Assertion failure twophase.c (2) (testing HS/SR)

On Thu, 2010-03-11 at 11:29 +0200, Heikki Linnakangas wrote:

I'm still not any wiser on what's causing that, but I've fixed the bug
in KnownAssignedXidsMany() now.

Yeh, I've been mulling this over for a few days now and I can't see a
way that could have happened either.

I agree with your fix and the stronger placement of the Assertion.
Thanks.

I will be doing some further investigation in that area as well, over
next week or so.

--
Simon Riggs www.2ndQuadrant.com

#5Erik Rijkers
er@xs4all.nl
In reply to: Erik Rijkers (#1)
Re: Assertion failure twophase.c (3) (testing HS/SR)

On Thu, March 4, 2010 17:00, Erik Rijkers wrote:

in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30

With three patches:

new_smart_shutdown_20100201.patch
extend_format_of_recovery_info_funcs_v4.20100303.patch
fix-KnownAssignedXidsRemoveMany-1.patch

pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary

FailedAssertion, File: "twophase.c", Line: 1201.

For the record, this still happens (FailedAssertion, File: "twophase.c", Line: 1201.)
(created 2010.03.13 23:49 cvs).

Unfortunately, it does not happen always, or predictably.

patches:
new_smart_shutdown_20100201.patch
extend_format_of_recovery_info_funcs_v4.20100303.patch
(both here: http://archives.postgresql.org/pgsql-hackers/2010-03/msg00446.php )

(fix-KnownAssignedXidsRemoveMany-1.patch has been committed, I think?)

I use commandlines like this to copy schemas across from 8.4.2 to 9.0devel:
pg_dump -c -h /tmp -p 5432 -n myschema --no-owner --no-privileges mydb \
| psql -1qtA -h /tmp -p 7575 -d replicas

(the copied schemas were together 175 GB)

As I seem to be the only one who finds this, I started looking what could be unique in this
install: and it would be postbio, which we use for its gist-indexing on ranges
(http://pgfoundry.org/projects/postbio/). We use postbio's int_interval type as a column type.
But keep in mind that sometimes the whole dump+restore+replication completes OK.

Other installed modules are:
contrib/btree_gist
contrib/seg
contrib/adminpack

log_line_prefix = '%t %p %d %u start=%s ' # slave

pgsql.sr_hotslave/logfile:

2010-03-13 23:54:59 CET 15765 start=2010-03-13 23:54:59 CET LOG: database system was
interrupted; last known up at 2010-03-13 23:54:31 CET
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001':
No such file or directory
2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: entering standby mode
2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: redo starts at 0/1000020
2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: consistent recovery state
reached at 0/2000000
2010-03-13 23:55:00 CET 15763 start=2010-03-13 23:54:59 CET LOG: database system is ready to
accept read only connections
TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File: "twophase.c", Line: 1201)
2010-03-14 05:28:59 CET 15763 start=2010-03-13 23:54:59 CET LOG: startup process (PID 15765)
was terminated by signal 6: Aborted
2010-03-14 05:28:59 CET 15763 start=2010-03-13 23:54:59 CET LOG: terminating any other active
server processes

Maybe I'll try now to setup a similar instance without postbio, to see if the crash still occurs.

hth,

Erik Rijkers

#6Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Erik Rijkers (#5)
Re: Assertion failure twophase.c (3) (testing HS/SR)

Can you still reproduce this or has some of the changes since then fixed
it? We never quite figured out the cause..

Erik Rijkers wrote:

On Thu, March 4, 2010 17:00, Erik Rijkers wrote:

in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30

With three patches:

new_smart_shutdown_20100201.patch
extend_format_of_recovery_info_funcs_v4.20100303.patch
fix-KnownAssignedXidsRemoveMany-1.patch

pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary

FailedAssertion, File: "twophase.c", Line: 1201.

For the record, this still happens (FailedAssertion, File: "twophase.c", Line: 1201.)
(created 2010.03.13 23:49 cvs).

Unfortunately, it does not happen always, or predictably.

patches:
new_smart_shutdown_20100201.patch
extend_format_of_recovery_info_funcs_v4.20100303.patch
(both here: http://archives.postgresql.org/pgsql-hackers/2010-03/msg00446.php )

(fix-KnownAssignedXidsRemoveMany-1.patch has been committed, I think?)

I use commandlines like this to copy schemas across from 8.4.2 to 9.0devel:
pg_dump -c -h /tmp -p 5432 -n myschema --no-owner --no-privileges mydb \
| psql -1qtA -h /tmp -p 7575 -d replicas

(the copied schemas were together 175 GB)

As I seem to be the only one who finds this, I started looking what could be unique in this
install: and it would be postbio, which we use for its gist-indexing on ranges
(http://pgfoundry.org/projects/postbio/). We use postbio's int_interval type as a column type.
But keep in mind that sometimes the whole dump+restore+replication completes OK.

Other installed modules are:
contrib/btree_gist
contrib/seg
contrib/adminpack

log_line_prefix = '%t %p %d %u start=%s ' # slave

pgsql.sr_hotslave/logfile:

2010-03-13 23:54:59 CET 15765 start=2010-03-13 23:54:59 CET LOG: database system was
interrupted; last known up at 2010-03-13 23:54:31 CET
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001':
No such file or directory
2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: entering standby mode
2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: redo starts at 0/1000020
2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: consistent recovery state
reached at 0/2000000
2010-03-13 23:55:00 CET 15763 start=2010-03-13 23:54:59 CET LOG: database system is ready to
accept read only connections
TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File: "twophase.c", Line: 1201)
2010-03-14 05:28:59 CET 15763 start=2010-03-13 23:54:59 CET LOG: startup process (PID 15765)
was terminated by signal 6: Aborted
2010-03-14 05:28:59 CET 15763 start=2010-03-13 23:54:59 CET LOG: terminating any other active
server processes

Maybe I'll try now to setup a similar instance without postbio, to see if the crash still occurs.

hth,

Erik Rijkers

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7Erik Rijkers
er@xs4all.nl
In reply to: Heikki Linnakangas (#6)
Re: Assertion failure twophase.c (3) (testing HS/SR)

On Thu, April 22, 2010 09:53, Heikki Linnakangas wrote:

Can you still reproduce this or has some of the changes since then fixed
it? We never quite figured out the cause..

I don't know for sure:

Unfortunately, it does not happen always, or predictably.

The only thing that I established after that email was sent,
is that the error can also occur without the postbio package
being been installed (this has happened once).

It's a very easy test; I will probably run it a few more times.

Show quoted text

Erik Rijkers wrote:

On Thu, March 4, 2010 17:00, Erik Rijkers wrote:

in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30

With three patches:

new_smart_shutdown_20100201.patch
extend_format_of_recovery_info_funcs_v4.20100303.patch
fix-KnownAssignedXidsRemoveMany-1.patch

pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary

FailedAssertion, File: "twophase.c", Line: 1201.

For the record, this still happens (FailedAssertion, File: "twophase.c", Line: 1201.)
(created 2010.03.13 23:49 cvs).

Unfortunately, it does not happen always, or predictably.

patches:
new_smart_shutdown_20100201.patch
extend_format_of_recovery_info_funcs_v4.20100303.patch
(both here: http://archives.postgresql.org/pgsql-hackers/2010-03/msg00446.php )

(fix-KnownAssignedXidsRemoveMany-1.patch has been committed, I think?)

I use commandlines like this to copy schemas across from 8.4.2 to 9.0devel:
pg_dump -c -h /tmp -p 5432 -n myschema --no-owner --no-privileges mydb \
| psql -1qtA -h /tmp -p 7575 -d replicas

(the copied schemas were together 175 GB)

As I seem to be the only one who finds this, I started looking what could be unique in this
install: and it would be postbio, which we use for its gist-indexing on ranges
(http://pgfoundry.org/projects/postbio/). We use postbio's int_interval type as a column type.
But keep in mind that sometimes the whole dump+restore+replication completes OK.

Other installed modules are:
contrib/btree_gist
contrib/seg
contrib/adminpack

log_line_prefix = '%t %p %d %u start=%s ' # slave

pgsql.sr_hotslave/logfile:

2010-03-13 23:54:59 CET 15765 start=2010-03-13 23:54:59 CET LOG: database system was
interrupted; last known up at 2010-03-13 23:54:31 CET
cp: cannot stat
`/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001':
No such file or directory
2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: entering standby mode
2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: redo starts at 0/1000020
2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: consistent recovery state
reached at 0/2000000
2010-03-13 23:55:00 CET 15763 start=2010-03-13 23:54:59 CET LOG: database system is ready to
accept read only connections
TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File: "twophase.c", Line: 1201)
2010-03-14 05:28:59 CET 15763 start=2010-03-13 23:54:59 CET LOG: startup process (PID 15765)
was terminated by signal 6: Aborted
2010-03-14 05:28:59 CET 15763 start=2010-03-13 23:54:59 CET LOG: terminating any other active
server processes

Maybe I'll try now to setup a similar instance without postbio, to see if the crash still
occurs.

hth,

Erik Rijkers

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8Simon Riggs
simon@2ndQuadrant.com
In reply to: Erik Rijkers (#7)
Re: Assertion failure twophase.c (3) (testing HS/SR)

On Fri, 2010-04-23 at 03:08 +0200, Erik Rijkers wrote:

It's a very easy test; I will probably run it a few more times.

Please share details of your system and hardware.

--
Simon Riggs www.2ndQuadrant.com