8.5alpha3 hot standby crash report (DatabasePath related?)

Started by Robert Treatabout 16 years ago11 messages
#1Robert Treat
xzilla@users.sourceforge.net

Howdy folks,

Doing some testing with 8.5alpha3, my standby crashed this morning whilst
doing some testing. Seems I have a core file, so thought I would send a report.

Version: PostgreSQL 8.5alpha3 on i386-pc-solaris2.10, compiled by GCC gcc
(GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 32-bit

Error in the log was simply this:
LOG: restored log file "000000010000000300000032" from archive
LOG: startup process (PID 385) was terminated by signal 11
LOG: terminating any other active server processes

info from core file was:

postgres@postgresdev[/export/home/postgres/pgwork/data/slave]dbx
/export/home/postgres/pgwork/inst/slave/bin/postgres core
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your
.dbxrc
Reading postgres
core file header read successfully
Reading ld.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libsocket.so.1
Reading libm.so.2
Reading libc.so.1
Reading libaio.so.1
Reading libmd.so.1
program terminated by signal SEGV (no mapping at the fault address)
0xfedb578c: strlen+0x000c: movl (%eax),%edx
Current function is RelationCacheInitFileInvalidate
4237 snprintf(initfilename, sizeof(initfilename), "%s/%s",
(dbx) where
[1]: strlen(0x0), at 0xfedb578c
[2]: _ndoprnt(0x8417d81, 0x8042e7c, 0x8042e50, 0x0), at 0xfee0d976
[3]: snprintf(0x8042e90, 0x400, 0x8417d7f, 0x0, 0x8417d30, 0x2f676f6c, 0x4f434552, 0x59524556), at 0xfee10734 =>[4] RelationCacheInitFileInvalidate(beforeSend = '\0'), line 4237 in "relcache.c"
0x4f434552, 0x59524556), at 0xfee10734
=>[4] RelationCacheInitFileInvalidate(beforeSend = '\0'), line 4237 in
"relcache.c"
[5]: xact_redo_commit(xlrec = 0x84ed434, xid = 0, lsn = RECORD), line 4414 in "xact.c"
"xact.c"
[6]: StartupXLOG(), line 5834 in "xlog.c"
[7]: StartupProcessMain(), line 8359 in "xlog.c"
[8]: AuxiliaryProcessMain(argc = 2, argv = 0x80475e0), line 408 in "bootstrap.c"
"bootstrap.c"
[9]: StartChildProcess(type = 134510104), line 4218 in "postmaster.c"
[10]: PostmasterMain(argc = 3, argv = 0x84c6e00), line 1061 in "postmaster.c"
[11]: main(argc = 3, argv = 0x84c6e00), line 188 in "main.c" (dbx) print initfilename initfilename = "XLOG" (dbx) list 4237 snprintf(initfilename, sizeof(initfilename), "%s/%s", 4238 DatabasePath, RELCACHE_INIT_FILENAME); 4239 4240 if (beforeSend) 4241 { 4242 /* no interlock needed here */ 4243 unlink(initfilename); 4244 } 4245 else 4246 { (dbx) print DatabasePath DatabasePath = (nil)
(dbx) print initfilename
initfilename = "XLOG"
(dbx) list
4237 snprintf(initfilename, sizeof(initfilename), "%s/%s",
4238 DatabasePath, RELCACHE_INIT_FILENAME);
4239
4240 if (beforeSend)
4241 {
4242 /* no interlock needed here */
4243 unlink(initfilename);
4244 }
4245 else
4246 {
(dbx) print DatabasePath
DatabasePath = (nil)

Theo mentioned that he expects the above would work under Linux, but that we
need DatabasePath to not be null in the above.

FWIW, the SQL I was running on the master did involve two create database
commands back to back.

LMK if you have any questions.

--
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com

#2Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Treat (#1)
Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

On Thu, 2010-01-07 at 14:41 -0500, Robert Treat wrote:

Doing some testing with 8.5alpha3, my standby crashed this morning whilst
doing some testing. Seems I have a core file, so thought I would send a report.

Thanks for the report.

--
Simon Riggs www.2ndQuadrant.com

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Treat (#1)
Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

Robert Treat <xzilla@users.sourceforge.net> writes:

Doing some testing with 8.5alpha3, my standby crashed this morning whilst
doing some testing. Seems I have a core file, so thought I would send a report.

Hmm. Looks like it's crashing because DatabasePath is not set yet.
We could probably kluge around that by having this function return
without doing anything if DatabasePath is null, but I wonder if this
doesn't indicate that things are being done in the wrong order.
If DatabasePath isn't set, then there's a whole lot of other
infrastructure that's not good either:

Current function is RelationCacheInitFileInvalidate
4237 snprintf(initfilename, sizeof(initfilename), "%s/%s",
4238 DatabasePath, RELCACHE_INIT_FILENAME);
(dbx) where
[1] strlen(0x0), at 0xfedb578c
[2] _ndoprnt(0x8417d81, 0x8042e7c, 0x8042e50, 0x0), at 0xfee0d976
[3] snprintf(0x8042e90, 0x400, 0x8417d7f, 0x0, 0x8417d30, 0x2f676f6c,
0x4f434552, 0x59524556), at 0xfee10734
=>[4] RelationCacheInitFileInvalidate(beforeSend = '\0'), line 4237 in
"relcache.c"
[5] xact_redo_commit(xlrec = 0x84ed434, xid = 0, lsn = RECORD), line 4414 in
"xact.c"
[6] StartupXLOG(), line 5834 in "xlog.c"
[7] StartupProcessMain(), line 8359 in "xlog.c"
[8] AuxiliaryProcessMain(argc = 2, argv = 0x80475e0), line 408 in
"bootstrap.c"
[9] StartChildProcess(type = 134510104), line 4218 in "postmaster.c"
[10] PostmasterMain(argc = 3, argv = 0x84c6e00), line 1061 in "postmaster.c"
[11] main(argc = 3, argv = 0x84c6e00), line 188 in "main.c"

regards, tom lane

#4Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Treat (#1)
Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

On Thu, 2010-01-07 at 14:41 -0500, Robert Treat wrote:

Doing some testing with 8.5alpha3, my standby crashed this morning whilst
doing some testing. Seems I have a core file, so thought I would send a report.

Version: PostgreSQL 8.5alpha3 on i386-pc-solaris2.10, compiled by GCC gcc
(GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 32-bit

Error in the log was simply this:
LOG: restored log file "000000010000000300000032" from archive
LOG: startup process (PID 385) was terminated by signal 11
LOG: terminating any other active server processes

Please send some more of the prior log, so we can tell the
initialization state at that point.

If you can reproduce it, that would help also. Thanks.

--
Simon Riggs www.2ndQuadrant.com

#5Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#3)
Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

On Thu, 2010-01-07 at 16:05 -0500, Tom Lane wrote:

Robert Treat <xzilla@users.sourceforge.net> writes:

Doing some testing with 8.5alpha3, my standby crashed this morning whilst
doing some testing. Seems I have a core file, so thought I would send a report.

Hmm. Looks like it's crashing because DatabasePath is not set yet.
We could probably kluge around that by having this function return
without doing anything if DatabasePath is null, but I wonder if this
doesn't indicate that things are being done in the wrong order.
If DatabasePath isn't set, then there's a whole lot of other
infrastructure that's not good either:

OK, I think I see this now. Will fix.

--
Simon Riggs www.2ndQuadrant.com

#6Robert Treat
robert@omniti.com
In reply to: Simon Riggs (#4)
Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

On Jan 7, 2010, at 4:15 PM, Simon Riggs wrote:

On Thu, 2010-01-07 at 14:41 -0500, Robert Treat wrote:

Doing some testing with 8.5alpha3, my standby crashed this morning
whilst
doing some testing. Seems I have a core file, so thought I would
send a report.

Version: PostgreSQL 8.5alpha3 on i386-pc-solaris2.10, compiled by
GCC gcc
(GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 32-bit

Error in the log was simply this:
LOG: restored log file "000000010000000300000032" from archive
LOG: startup process (PID 385) was terminated by signal 11
LOG: terminating any other active server processes

Please send some more of the prior log, so we can tell the
initialization state at that point.

Not much more to send really. It's just "restored log file" lines
going back to the startup (as expected)

If you can reproduce it, that would help also. Thanks.

Not sure about reproducing, but if I restart the database, it crashes
again during log replay (stack trace info looks the same).

LOG: database system was interrupted while in recovery at log time
2010-01-07 12:01:57 EST
HINT: If this has occurred more than once some data might be
corrupted and you might need to choose an earlier recovery target.
LOG: starting archive recovery
LOG: restore_command = '/export/home/postgres/pgwork/inst/slave/bin/
pg_standby /export/home/postgres/pgwork/data/.hot_standby %f %p %r'
cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/
00000001.history
cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/
00000001.history
cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/
00000001.history
cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/
00000001.history
LOG: restored log file "00000001000000030000002A" from archive
LOG: automatic recovery in progress
LOG: initializing recovery connections
LOG: redo starts at 3/2A000020, consistency will be reached at
3/2D000120
LOG: restored log file "00000001000000030000002B" from archive
LOG: restored log file "00000001000000030000002C" from archive
LOG: restored log file "00000001000000030000002D" from archive
LOG: consistent recovery state reached at 3/2D000120
LOG: database system is ready to accept read only connections
LOG: restored log file "00000001000000030000002E" from archive
LOG: restored log file "00000001000000030000002F" from archive
LOG: restored log file "000000010000000300000030" from archive
LOG: restored log file "000000010000000300000031" from archive
LOG: restored log file "000000010000000300000032" from archive
LOG: startup process (PID 4397) was terminated by signal 11
LOG: terminating any other active server processes

--
Robert Treat
http://www.omniti.com/

#7Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Treat (#6)
Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

On Thu, 2010-01-07 at 16:56 -0500, Robert Treat wrote:

Not much more to send really.

No problem. I can see what causes it, nothing more required, thanks.
What I don't fully understand yet is why the error hasn't shown itself
before, because it appears such a basic error.

--
Simon Riggs www.2ndQuadrant.com

#8Robert Treat
robert@omniti.com
In reply to: Simon Riggs (#7)
Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

On Jan 7, 2010, at 5:18 PM, Simon Riggs wrote:

On Thu, 2010-01-07 at 16:56 -0500, Robert Treat wrote:

Not much more to send really.

No problem. I can see what causes it, nothing more required, thanks.
What I don't fully understand yet is why the error hasn't shown itself
before, because it appears such a basic error.

Is anyone doing regular testing on non-linux? (to the extent that
this is a problem due to Solaris ideas about snprintf).

Robert Treat
http://www.omniti.com/

#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Treat (#8)
Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

On Thu, 2010-01-07 at 18:02 -0500, Robert Treat wrote:

On Jan 7, 2010, at 5:18 PM, Simon Riggs wrote:

On Thu, 2010-01-07 at 16:56 -0500, Robert Treat wrote:

Not much more to send really.

No problem. I can see what causes it, nothing more required, thanks.
What I don't fully understand yet is why the error hasn't shown itself
before, because it appears such a basic error.

Is anyone doing regular testing on non-linux? (to the extent that
this is a problem due to Solaris ideas about snprintf).

OK, understand now. The Solaris test failed in a place where it would
have been good if Linux had done so; this has revealed that the relcache
init file deletion required in some places was actually failing. The
path was never set correctly in the first place, it appears. I've
updated the TODO to reflect the new bug report.

I've almost finished hacking a solution. I'll test and commit tomorrow,
since it's a fairly standalone problem, if fiddly.

--
Simon Riggs www.2ndQuadrant.com

#10Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#9)
1 attachment(s)
Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

On Fri, 2010-01-08 at 01:26 +0000, Simon Riggs wrote:

I'll test and commit tomorrow, since it's a fairly standalone problem

Fix attached, thanks for testing.

Works for me and I don't expect it to fail on Solaris, since the root
cause of the failure has been addressed and a correctly designed test
executed.

I will wait a day for your confirmation and/or other's comments.

--
Simon Riggs www.2ndQuadrant.com

Attachments:

relcache_inval.patchtext/x-patch; charset=UTF-8; name=relcache_inval.patchDownload
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 4404,4423 **** xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, XLogRecPtr lsn)
  		 * maintain the same order of invalidation then release locks
  		 * as occurs in 	.
  		 */
! 		if (xlrec->nmsgs > 0)
! 		{
! 			/*
! 			 * Relcache init file invalidation requires processing both
! 			 * before and after we send the SI messages. See AtEOXact_Inval()
! 			 */
! 			if (XactCompletionRelcacheInitFileInval(xlrec))
! 				RelationCacheInitFileInvalidate(true);
! 
! 			SendSharedInvalidMessages(inval_msgs, xlrec->nmsgs);
! 
! 			if (XactCompletionRelcacheInitFileInval(xlrec))
! 				RelationCacheInitFileInvalidate(false);
! 		}
  
  		/*
  		 * Release locks, if any. We do this for both two phase and normal
--- 4404,4411 ----
  		 * maintain the same order of invalidation then release locks
  		 * as occurs in 	.
  		 */
! 		ProcessCommittedInvalidationMessages(inval_msgs, xlrec->nmsgs,
! 									XactCompletionRelcacheInitFileInval(xlrec));
  
  		/*
  		 * Release locks, if any. We do this for both two phase and normal
*** a/src/backend/utils/cache/inval.c
--- b/src/backend/utils/cache/inval.c
***************
*** 89,94 ****
--- 89,95 ----
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
  #include "catalog/catalog.h"
+ #include "catalog/pg_tablespace.h"
  #include "miscadmin.h"
  #include "storage/sinval.h"
  #include "storage/smgr.h"
***************
*** 871,876 **** xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs,
--- 872,981 ----
  	return numSharedInvalidMessagesArray;
  }
  
+ #define RecoveryRelationCacheInitFileInvalidate(dbo, tbo, tf) \
+ { \
+ 	DatabasePath = GetDatabasePath(dbo, tbo); \
+ 	elog(trace_recovery(DEBUG4), "removing relcache init file in %s", DatabasePath); \
+ 	RelationCacheInitFileInvalidate(tf); \
+ 	pfree(DatabasePath); \
+ }
+ 
+ /*
+  * ProcessCommittedInvalidationMessages is executed by xact_redo_commit()
+  * to process invalidation messages added to commit records.
+  *
+  * If we have to invalidate the relcache init file we need to extract
+  * the database id from each message so we can correctly locate the database
+  * path and so remove that database's init file. We note that the relcache
+  * only contains entries for catalog tables from a single database, or
+  * shared relations. There are smgr invalidations that reference other
+  * databases but they never cause relcache file invalidations.
+  * So we need not scan pg_database to discover tablespace oids, since
+  * we only have either global or default tablespaces. Just lucky, I guess.
+  *
+  * Relcache init file invalidation requires processing both
+  * before and after we send the SI messages. See AtEOXact_Inval()
+  *
+  * We deliberately avoid SetDatabasePath() since it is intended to be used
+  * only once by normal backends, so we just use DatabasePath directly and
+  * immediately pfree after use.
+  */
+ void
+ ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
+ 										int nmsgs, bool RelcacheInitFileInval)
+ {
+ 	Oid 		dboid = 0;
+ 	bool		invalidate_global = false;
+ 
+ 	if (nmsgs > 0)
+ 		elog(trace_recovery(DEBUG4), "replaying commit with %d messages%s", nmsgs,
+ 					(RelcacheInitFileInval ? " and relcache file invalidation" : ""));
+ 	else
+ 		return;
+ 
+ 	if (RelcacheInitFileInval)
+ 	{
+ 		int			i;
+ 
+ 		/*
+ 		 * Check messages to record dboid
+ 		 */
+ 		for (i = 0; i < nmsgs; i++)
+ 		{
+ 			SharedInvalidationMessage *inval_msg = &(msgs[i]);
+ 			Oid 		loop_dboid = 0;
+ 
+ 			/*
+ 			 * Extract the database Oid from the message
+ 			 */
+ 			if (inval_msg->id >= 0)
+ 				loop_dboid = inval_msg->cc.dbId;
+ 			else if (inval_msg->id == SHAREDINVALRELCACHE_ID)
+ 				loop_dboid = inval_msg->rc.dbId;
+ 			else
+ 			{
+ 				/* 
+ 				 * Invalidation message is a SHAREDINVALSMGR_ID
+ 				 * which never cause relcache file invalidation,
+ 				 * so we ignore them, whichever database they're for.
+ 				 */
+ 			}
+ 
+ 			if (loop_dboid == 0)
+ 				invalidate_global = true;
+ 			else
+ 			{
+ 				Assert(dboid == 0 || dboid == loop_dboid);
+ 				dboid = loop_dboid;
+ 			}
+ 		}
+ 
+ 		/*
+ 		 * If shared, dboid will be the global tablespace, otherwise it will
+ 		 * be a shared catalog relation in the default tablespace.
+ 		 */
+ 		if (invalidate_global)
+ 			RecoveryRelationCacheInitFileInvalidate(0, GLOBALTABLESPACE_OID, true);
+ 
+ 		if (dboid != 0)
+ 			RecoveryRelationCacheInitFileInvalidate(dboid, DEFAULTTABLESPACE_OID, true);
+ 	}
+ 
+ 	SendSharedInvalidMessages(msgs, nmsgs);
+ 
+ 	if (RelcacheInitFileInval)
+ 	{
+ 		/*
+ 		 * Second invalidation, very similar to above.
+ 		 */
+ 		if (invalidate_global)
+ 			RecoveryRelationCacheInitFileInvalidate(0, GLOBALTABLESPACE_OID, false);
+ 
+ 		if (dboid != 0)
+ 			RecoveryRelationCacheInitFileInvalidate(dboid, DEFAULTTABLESPACE_OID, false);
+ 	}
+ }
+ 
  /*
   * AtEOXact_Inval
   *		Process queued-up invalidation messages at end of main transaction.
*** a/src/include/storage/sinval.h
--- b/src/include/storage/sinval.h
***************
*** 102,106 **** extern bool DisableCatchupInterrupt(void);
--- 102,108 ----
  
  extern int xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs,
  										bool *RelcacheInitFileInval);
+ extern void ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
+ 										int nmsgs, bool RelcacheInitFileInval);
  
  #endif   /* SINVAL_H */
#11Robert Treat
robert@omniti.com
In reply to: Simon Riggs (#10)
Re: 8.5alpha3 hot standby crash report (DatabasePath related?)

On Jan 8, 2010, at 7:36 AM, Simon Riggs wrote:

On Fri, 2010-01-08 at 01:26 +0000, Simon Riggs wrote:

I'll test and commit tomorrow, since it's a fairly standalone problem

Fix attached, thanks for testing.

Works for me and I don't expect it to fail on Solaris, since the root
cause of the failure has been addressed and a correctly designed test
executed.

I will wait a day for your confirmation and/or other's comments.

Looks good from my end, thanks Simon.

Robert Treat
http://www.omniti.com