8.5alpha3 hot standby crash report (DatabasePath related?)
Howdy folks,
Doing some testing with 8.5alpha3, my standby crashed this morning whilst
doing some testing. Seems I have a core file, so thought I would send a report.
Version: PostgreSQL 8.5alpha3 on i386-pc-solaris2.10, compiled by GCC gcc
(GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 32-bit
Error in the log was simply this:
LOG: restored log file "000000010000000300000032" from archive
LOG: startup process (PID 385) was terminated by signal 11
LOG: terminating any other active server processes
info from core file was:
postgres@postgresdev[/export/home/postgres/pgwork/data/slave]dbx
/export/home/postgres/pgwork/inst/slave/bin/postgres core
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your
.dbxrc
Reading postgres
core file header read successfully
Reading ld.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libsocket.so.1
Reading libm.so.2
Reading libc.so.1
Reading libaio.so.1
Reading libmd.so.1
program terminated by signal SEGV (no mapping at the fault address)
0xfedb578c: strlen+0x000c: movl (%eax),%edx
Current function is RelationCacheInitFileInvalidate
4237 snprintf(initfilename, sizeof(initfilename), "%s/%s",
(dbx) where
[1]: strlen(0x0), at 0xfedb578c
[2]: _ndoprnt(0x8417d81, 0x8042e7c, 0x8042e50, 0x0), at 0xfee0d976
[3]: snprintf(0x8042e90, 0x400, 0x8417d7f, 0x0, 0x8417d30, 0x2f676f6c, 0x4f434552, 0x59524556), at 0xfee10734 =>[4] RelationCacheInitFileInvalidate(beforeSend = '\0'), line 4237 in "relcache.c"
0x4f434552, 0x59524556), at 0xfee10734
=>[4] RelationCacheInitFileInvalidate(beforeSend = '\0'), line 4237 in
"relcache.c"
[5]: xact_redo_commit(xlrec = 0x84ed434, xid = 0, lsn = RECORD), line 4414 in "xact.c"
"xact.c"
[6]: StartupXLOG(), line 5834 in "xlog.c"
[7]: StartupProcessMain(), line 8359 in "xlog.c"
[8]: AuxiliaryProcessMain(argc = 2, argv = 0x80475e0), line 408 in "bootstrap.c"
"bootstrap.c"
[9]: StartChildProcess(type = 134510104), line 4218 in "postmaster.c"
[10]: PostmasterMain(argc = 3, argv = 0x84c6e00), line 1061 in "postmaster.c"
[11]: main(argc = 3, argv = 0x84c6e00), line 188 in "main.c" (dbx) print initfilename initfilename = "XLOG" (dbx) list 4237 snprintf(initfilename, sizeof(initfilename), "%s/%s", 4238 DatabasePath, RELCACHE_INIT_FILENAME); 4239 4240 if (beforeSend) 4241 { 4242 /* no interlock needed here */ 4243 unlink(initfilename); 4244 } 4245 else 4246 { (dbx) print DatabasePath DatabasePath = (nil)
(dbx) print initfilename
initfilename = "XLOG"
(dbx) list
4237 snprintf(initfilename, sizeof(initfilename), "%s/%s",
4238 DatabasePath, RELCACHE_INIT_FILENAME);
4239
4240 if (beforeSend)
4241 {
4242 /* no interlock needed here */
4243 unlink(initfilename);
4244 }
4245 else
4246 {
(dbx) print DatabasePath
DatabasePath = (nil)
Theo mentioned that he expects the above would work under Linux, but that we
need DatabasePath to not be null in the above.
FWIW, the SQL I was running on the master did involve two create database
commands back to back.
LMK if you have any questions.
--
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com
On Thu, 2010-01-07 at 14:41 -0500, Robert Treat wrote:
Doing some testing with 8.5alpha3, my standby crashed this morning whilst
doing some testing. Seems I have a core file, so thought I would send a report.
Thanks for the report.
--
Simon Riggs www.2ndQuadrant.com
Robert Treat <xzilla@users.sourceforge.net> writes:
Doing some testing with 8.5alpha3, my standby crashed this morning whilst
doing some testing. Seems I have a core file, so thought I would send a report.
Hmm. Looks like it's crashing because DatabasePath is not set yet.
We could probably kluge around that by having this function return
without doing anything if DatabasePath is null, but I wonder if this
doesn't indicate that things are being done in the wrong order.
If DatabasePath isn't set, then there's a whole lot of other
infrastructure that's not good either:
Current function is RelationCacheInitFileInvalidate
4237 snprintf(initfilename, sizeof(initfilename), "%s/%s",
4238 DatabasePath, RELCACHE_INIT_FILENAME);
(dbx) where
[1] strlen(0x0), at 0xfedb578c
[2] _ndoprnt(0x8417d81, 0x8042e7c, 0x8042e50, 0x0), at 0xfee0d976
[3] snprintf(0x8042e90, 0x400, 0x8417d7f, 0x0, 0x8417d30, 0x2f676f6c,
0x4f434552, 0x59524556), at 0xfee10734
=>[4] RelationCacheInitFileInvalidate(beforeSend = '\0'), line 4237 in
"relcache.c"
[5] xact_redo_commit(xlrec = 0x84ed434, xid = 0, lsn = RECORD), line 4414 in
"xact.c"
[6] StartupXLOG(), line 5834 in "xlog.c"
[7] StartupProcessMain(), line 8359 in "xlog.c"
[8] AuxiliaryProcessMain(argc = 2, argv = 0x80475e0), line 408 in
"bootstrap.c"
[9] StartChildProcess(type = 134510104), line 4218 in "postmaster.c"
[10] PostmasterMain(argc = 3, argv = 0x84c6e00), line 1061 in "postmaster.c"
[11] main(argc = 3, argv = 0x84c6e00), line 188 in "main.c"
regards, tom lane
On Thu, 2010-01-07 at 14:41 -0500, Robert Treat wrote:
Doing some testing with 8.5alpha3, my standby crashed this morning whilst
doing some testing. Seems I have a core file, so thought I would send a report.Version: PostgreSQL 8.5alpha3 on i386-pc-solaris2.10, compiled by GCC gcc
(GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 32-bitError in the log was simply this:
LOG: restored log file "000000010000000300000032" from archive
LOG: startup process (PID 385) was terminated by signal 11
LOG: terminating any other active server processes
Please send some more of the prior log, so we can tell the
initialization state at that point.
If you can reproduce it, that would help also. Thanks.
--
Simon Riggs www.2ndQuadrant.com
On Thu, 2010-01-07 at 16:05 -0500, Tom Lane wrote:
Robert Treat <xzilla@users.sourceforge.net> writes:
Doing some testing with 8.5alpha3, my standby crashed this morning whilst
doing some testing. Seems I have a core file, so thought I would send a report.Hmm. Looks like it's crashing because DatabasePath is not set yet.
We could probably kluge around that by having this function return
without doing anything if DatabasePath is null, but I wonder if this
doesn't indicate that things are being done in the wrong order.
If DatabasePath isn't set, then there's a whole lot of other
infrastructure that's not good either:
OK, I think I see this now. Will fix.
--
Simon Riggs www.2ndQuadrant.com
On Jan 7, 2010, at 4:15 PM, Simon Riggs wrote:
On Thu, 2010-01-07 at 14:41 -0500, Robert Treat wrote:
Doing some testing with 8.5alpha3, my standby crashed this morning
whilst
doing some testing. Seems I have a core file, so thought I would
send a report.Version: PostgreSQL 8.5alpha3 on i386-pc-solaris2.10, compiled by
GCC gcc
(GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 32-bitError in the log was simply this:
LOG: restored log file "000000010000000300000032" from archive
LOG: startup process (PID 385) was terminated by signal 11
LOG: terminating any other active server processesPlease send some more of the prior log, so we can tell the
initialization state at that point.
Not much more to send really. It's just "restored log file" lines
going back to the startup (as expected)
If you can reproduce it, that would help also. Thanks.
Not sure about reproducing, but if I restart the database, it crashes
again during log replay (stack trace info looks the same).
LOG: database system was interrupted while in recovery at log time
2010-01-07 12:01:57 EST
HINT: If this has occurred more than once some data might be
corrupted and you might need to choose an earlier recovery target.
LOG: starting archive recovery
LOG: restore_command = '/export/home/postgres/pgwork/inst/slave/bin/
pg_standby /export/home/postgres/pgwork/data/.hot_standby %f %p %r'
cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/
00000001.history
cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/
00000001.history
cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/
00000001.history
cp: cannot access /export/home/postgres/pgwork/data/.hot_standby/
00000001.history
LOG: restored log file "00000001000000030000002A" from archive
LOG: automatic recovery in progress
LOG: initializing recovery connections
LOG: redo starts at 3/2A000020, consistency will be reached at
3/2D000120
LOG: restored log file "00000001000000030000002B" from archive
LOG: restored log file "00000001000000030000002C" from archive
LOG: restored log file "00000001000000030000002D" from archive
LOG: consistent recovery state reached at 3/2D000120
LOG: database system is ready to accept read only connections
LOG: restored log file "00000001000000030000002E" from archive
LOG: restored log file "00000001000000030000002F" from archive
LOG: restored log file "000000010000000300000030" from archive
LOG: restored log file "000000010000000300000031" from archive
LOG: restored log file "000000010000000300000032" from archive
LOG: startup process (PID 4397) was terminated by signal 11
LOG: terminating any other active server processes
--
Robert Treat
http://www.omniti.com/
On Thu, 2010-01-07 at 16:56 -0500, Robert Treat wrote:
Not much more to send really.
No problem. I can see what causes it, nothing more required, thanks.
What I don't fully understand yet is why the error hasn't shown itself
before, because it appears such a basic error.
--
Simon Riggs www.2ndQuadrant.com
On Jan 7, 2010, at 5:18 PM, Simon Riggs wrote:
On Thu, 2010-01-07 at 16:56 -0500, Robert Treat wrote:
Not much more to send really.
No problem. I can see what causes it, nothing more required, thanks.
What I don't fully understand yet is why the error hasn't shown itself
before, because it appears such a basic error.
Is anyone doing regular testing on non-linux? (to the extent that
this is a problem due to Solaris ideas about snprintf).
Robert Treat
http://www.omniti.com/
On Thu, 2010-01-07 at 18:02 -0500, Robert Treat wrote:
On Jan 7, 2010, at 5:18 PM, Simon Riggs wrote:
On Thu, 2010-01-07 at 16:56 -0500, Robert Treat wrote:
Not much more to send really.
No problem. I can see what causes it, nothing more required, thanks.
What I don't fully understand yet is why the error hasn't shown itself
before, because it appears such a basic error.Is anyone doing regular testing on non-linux? (to the extent that
this is a problem due to Solaris ideas about snprintf).
OK, understand now. The Solaris test failed in a place where it would
have been good if Linux had done so; this has revealed that the relcache
init file deletion required in some places was actually failing. The
path was never set correctly in the first place, it appears. I've
updated the TODO to reflect the new bug report.
I've almost finished hacking a solution. I'll test and commit tomorrow,
since it's a fairly standalone problem, if fiddly.
--
Simon Riggs www.2ndQuadrant.com
On Fri, 2010-01-08 at 01:26 +0000, Simon Riggs wrote:
I'll test and commit tomorrow, since it's a fairly standalone problem
Fix attached, thanks for testing.
Works for me and I don't expect it to fail on Solaris, since the root
cause of the failure has been addressed and a correctly designed test
executed.
I will wait a day for your confirmation and/or other's comments.
--
Simon Riggs www.2ndQuadrant.com
Attachments:
relcache_inval.patchtext/x-patch; charset=UTF-8; name=relcache_inval.patchDownload
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 4404,4423 **** xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, XLogRecPtr lsn)
* maintain the same order of invalidation then release locks
* as occurs in .
*/
! if (xlrec->nmsgs > 0)
! {
! /*
! * Relcache init file invalidation requires processing both
! * before and after we send the SI messages. See AtEOXact_Inval()
! */
! if (XactCompletionRelcacheInitFileInval(xlrec))
! RelationCacheInitFileInvalidate(true);
!
! SendSharedInvalidMessages(inval_msgs, xlrec->nmsgs);
!
! if (XactCompletionRelcacheInitFileInval(xlrec))
! RelationCacheInitFileInvalidate(false);
! }
/*
* Release locks, if any. We do this for both two phase and normal
--- 4404,4411 ----
* maintain the same order of invalidation then release locks
* as occurs in .
*/
! ProcessCommittedInvalidationMessages(inval_msgs, xlrec->nmsgs,
! XactCompletionRelcacheInitFileInval(xlrec));
/*
* Release locks, if any. We do this for both two phase and normal
*** a/src/backend/utils/cache/inval.c
--- b/src/backend/utils/cache/inval.c
***************
*** 89,94 ****
--- 89,95 ----
#include "access/twophase_rmgr.h"
#include "access/xact.h"
#include "catalog/catalog.h"
+ #include "catalog/pg_tablespace.h"
#include "miscadmin.h"
#include "storage/sinval.h"
#include "storage/smgr.h"
***************
*** 871,876 **** xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs,
--- 872,981 ----
return numSharedInvalidMessagesArray;
}
+ #define RecoveryRelationCacheInitFileInvalidate(dbo, tbo, tf) \
+ { \
+ DatabasePath = GetDatabasePath(dbo, tbo); \
+ elog(trace_recovery(DEBUG4), "removing relcache init file in %s", DatabasePath); \
+ RelationCacheInitFileInvalidate(tf); \
+ pfree(DatabasePath); \
+ }
+
+ /*
+ * ProcessCommittedInvalidationMessages is executed by xact_redo_commit()
+ * to process invalidation messages added to commit records.
+ *
+ * If we have to invalidate the relcache init file we need to extract
+ * the database id from each message so we can correctly locate the database
+ * path and so remove that database's init file. We note that the relcache
+ * only contains entries for catalog tables from a single database, or
+ * shared relations. There are smgr invalidations that reference other
+ * databases but they never cause relcache file invalidations.
+ * So we need not scan pg_database to discover tablespace oids, since
+ * we only have either global or default tablespaces. Just lucky, I guess.
+ *
+ * Relcache init file invalidation requires processing both
+ * before and after we send the SI messages. See AtEOXact_Inval()
+ *
+ * We deliberately avoid SetDatabasePath() since it is intended to be used
+ * only once by normal backends, so we just use DatabasePath directly and
+ * immediately pfree after use.
+ */
+ void
+ ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
+ int nmsgs, bool RelcacheInitFileInval)
+ {
+ Oid dboid = 0;
+ bool invalidate_global = false;
+
+ if (nmsgs > 0)
+ elog(trace_recovery(DEBUG4), "replaying commit with %d messages%s", nmsgs,
+ (RelcacheInitFileInval ? " and relcache file invalidation" : ""));
+ else
+ return;
+
+ if (RelcacheInitFileInval)
+ {
+ int i;
+
+ /*
+ * Check messages to record dboid
+ */
+ for (i = 0; i < nmsgs; i++)
+ {
+ SharedInvalidationMessage *inval_msg = &(msgs[i]);
+ Oid loop_dboid = 0;
+
+ /*
+ * Extract the database Oid from the message
+ */
+ if (inval_msg->id >= 0)
+ loop_dboid = inval_msg->cc.dbId;
+ else if (inval_msg->id == SHAREDINVALRELCACHE_ID)
+ loop_dboid = inval_msg->rc.dbId;
+ else
+ {
+ /*
+ * Invalidation message is a SHAREDINVALSMGR_ID
+ * which never cause relcache file invalidation,
+ * so we ignore them, whichever database they're for.
+ */
+ }
+
+ if (loop_dboid == 0)
+ invalidate_global = true;
+ else
+ {
+ Assert(dboid == 0 || dboid == loop_dboid);
+ dboid = loop_dboid;
+ }
+ }
+
+ /*
+ * If shared, dboid will be the global tablespace, otherwise it will
+ * be a shared catalog relation in the default tablespace.
+ */
+ if (invalidate_global)
+ RecoveryRelationCacheInitFileInvalidate(0, GLOBALTABLESPACE_OID, true);
+
+ if (dboid != 0)
+ RecoveryRelationCacheInitFileInvalidate(dboid, DEFAULTTABLESPACE_OID, true);
+ }
+
+ SendSharedInvalidMessages(msgs, nmsgs);
+
+ if (RelcacheInitFileInval)
+ {
+ /*
+ * Second invalidation, very similar to above.
+ */
+ if (invalidate_global)
+ RecoveryRelationCacheInitFileInvalidate(0, GLOBALTABLESPACE_OID, false);
+
+ if (dboid != 0)
+ RecoveryRelationCacheInitFileInvalidate(dboid, DEFAULTTABLESPACE_OID, false);
+ }
+ }
+
/*
* AtEOXact_Inval
* Process queued-up invalidation messages at end of main transaction.
*** a/src/include/storage/sinval.h
--- b/src/include/storage/sinval.h
***************
*** 102,106 **** extern bool DisableCatchupInterrupt(void);
--- 102,108 ----
extern int xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs,
bool *RelcacheInitFileInval);
+ extern void ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
+ int nmsgs, bool RelcacheInitFileInval);
#endif /* SINVAL_H */
On Jan 8, 2010, at 7:36 AM, Simon Riggs wrote:
On Fri, 2010-01-08 at 01:26 +0000, Simon Riggs wrote:
I'll test and commit tomorrow, since it's a fairly standalone problem
Fix attached, thanks for testing.
Works for me and I don't expect it to fail on Solaris, since the root
cause of the failure has been addressed and a correctly designed test
executed.I will wait a day for your confirmation and/or other's comments.
Looks good from my end, thanks Simon.
Robert Treat
http://www.omniti.com