failed runcheck

Started by Patrick Welcheabout 25 years ago11 messages
#1Patrick Welche
prlw1@newn.cam.ac.uk

First a core dump which can be relieved by:

Index: catalog.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/catalog/catalog.c,v
retrieving revision 1.34
diff -c -u -r1.34 catalog.c
--- catalog.c	2000/10/16 14:52:02	1.34
+++ catalog.c	2000/10/21 17:07:09
@@ -173,7 +173,7 @@
 bool
 IsSystemRelationName(const char *relname)
 {
-	if (relname[0] && relname[1] && relname[2])
+	if (relname && relname[0] && relname[1] && relname[2])
 		return (relname[0] == 'p' &&
 				relname[1] == 'g' &&
 				relname[2] == '_');

(symptoms at the end of message)

But now the bit I don't see how to solve: the regression postmaster doesn't
startup because it can't find tmp_check/data/base/1/1259. The only files I
see are 1/{1255,PG_VERSION}. Where does 1259 come from?

Cheers,

Patrick

#0 IsSystemRelationName (relname=0x0) at catalog.c:176
#1 0x807ed9a in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
#2 0x80e9272 in RelationInitLockInfo (relation=0x82af018) at lmgr.c:119
#3 0x81202ef in formrdesc (relationName=0x816ad7e "pg_class", natts=22,
att=0x8173600) at relcache.c:1193
#4 0x8120c12 in RelationCacheInitialize () at relcache.c:1953
#5 0x81266b3 in InitPostgres (dbname=0xbfbfd666 "template1", username=0x0)
at postinit.c:329
#6 0x807dde0 in BootstrapMain (argc=7, argv=0xbfbfd510) at bootstrap.c:358
#7 0x80bc67c in main (argc=8, argv=0xbfbfd50c) at main.c:119
#8 0x806367e in ___start ()
(gdb) print *relation
$3 = {rd_fd = -1, rd_nblocks = 0, rd_refcnt = 0, rd_myxactonly = 0 '\000',
rd_isnailed = 0 '\000', rd_unlinked = 0 '\000', rd_indexfound = 0 '\000',
rd_uniqueindex = 0 '\000', rd_am = 0x1000001, rd_rel = 0x0, rd_id = 0,
rd_indexlist = 0x82af0a0, rd_lockInfo = {lockRelId = {relId = 0, dbId = 0}},
rd_att = 0x0, rd_rules = 0x0, rd_rulescxt = 0x82af128, rd_istrat = 0x0,
rd_support = 0x0, trigdesc = 0x0}

#2Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Patrick Welche (#1)
Re: failed runcheck

Applied

First a core dump which can be relieved by:

Index: catalog.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/catalog/catalog.c,v
retrieving revision 1.34
diff -c -u -r1.34 catalog.c
--- catalog.c	2000/10/16 14:52:02	1.34
+++ catalog.c	2000/10/21 17:07:09
@@ -173,7 +173,7 @@
bool
IsSystemRelationName(const char *relname)
{
-	if (relname[0] && relname[1] && relname[2])
+	if (relname && relname[0] && relname[1] && relname[2])
return (relname[0] == 'p' &&
relname[1] == 'g' &&
relname[2] == '_');

(symptoms at the end of message)

But now the bit I don't see how to solve: the regression postmaster doesn't
startup because it can't find tmp_check/data/base/1/1259. The only files I
see are 1/{1255,PG_VERSION}. Where does 1259 come from?

Cheers,

Patrick

#0 IsSystemRelationName (relname=0x0) at catalog.c:176
#1 0x807ed9a in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
#2 0x80e9272 in RelationInitLockInfo (relation=0x82af018) at lmgr.c:119
#3 0x81202ef in formrdesc (relationName=0x816ad7e "pg_class", natts=22,
att=0x8173600) at relcache.c:1193
#4 0x8120c12 in RelationCacheInitialize () at relcache.c:1953
#5 0x81266b3 in InitPostgres (dbname=0xbfbfd666 "template1", username=0x0)
at postinit.c:329
#6 0x807dde0 in BootstrapMain (argc=7, argv=0xbfbfd510) at bootstrap.c:358
#7 0x80bc67c in main (argc=8, argv=0xbfbfd50c) at main.c:119
#8 0x806367e in ___start ()
(gdb) print *relation
$3 = {rd_fd = -1, rd_nblocks = 0, rd_refcnt = 0, rd_myxactonly = 0 '\000',
rd_isnailed = 0 '\000', rd_unlinked = 0 '\000', rd_indexfound = 0 '\000',
rd_uniqueindex = 0 '\000', rd_am = 0x1000001, rd_rel = 0x0, rd_id = 0,
rd_indexlist = 0x82af0a0, rd_lockInfo = {lockRelId = {relId = 0, dbId = 0}},
rd_att = 0x0, rd_rules = 0x0, rd_rulescxt = 0x82af128, rd_istrat = 0x0,
rd_support = 0x0, trigdesc = 0x0}

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#3Vadim Mikheev
vmikheev@sectorbase.com
In reply to: Patrick Welche (#1)
Re: failed runcheck

Did you run make distclean? I've run regtests before committing changes.

Vadim

----- Original Message -----
From: "Patrick Welche" <prlw1@newn.cam.ac.uk>
To: <pgsql-hackers@postgresql.org>
Sent: Saturday, October 21, 2000 10:17 AM
Subject: [HACKERS] failed runcheck

First a core dump which can be relieved by:

Index: catalog.c
===================================================================
RCS file:

/home/projects/pgsql/cvsroot/pgsql/src/backend/catalog/catalog.c,v

retrieving revision 1.34
diff -c -u -r1.34 catalog.c
--- catalog.c 2000/10/16 14:52:02 1.34
+++ catalog.c 2000/10/21 17:07:09
@@ -173,7 +173,7 @@
bool
IsSystemRelationName(const char *relname)
{
- if (relname[0] && relname[1] && relname[2])
+ if (relname && relname[0] && relname[1] && relname[2])
return (relname[0] == 'p' &&
relname[1] == 'g' &&
relname[2] == '_');

(symptoms at the end of message)

But now the bit I don't see how to solve: the regression postmaster

doesn't

startup because it can't find tmp_check/data/base/1/1259. The only files I
see are 1/{1255,PG_VERSION}. Where does 1259 come from?

Cheers,

Patrick

#0 IsSystemRelationName (relname=0x0) at catalog.c:176
#1 0x807ed9a in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
#2 0x80e9272 in RelationInitLockInfo (relation=0x82af018) at lmgr.c:119
#3 0x81202ef in formrdesc (relationName=0x816ad7e "pg_class", natts=22,
att=0x8173600) at relcache.c:1193
#4 0x8120c12 in RelationCacheInitialize () at relcache.c:1953
#5 0x81266b3 in InitPostgres (dbname=0xbfbfd666 "template1",

username=0x0)

at postinit.c:329
#6 0x807dde0 in BootstrapMain (argc=7, argv=0xbfbfd510) at

bootstrap.c:358

#7 0x80bc67c in main (argc=8, argv=0xbfbfd50c) at main.c:119
#8 0x806367e in ___start ()
(gdb) print *relation
$3 = {rd_fd = -1, rd_nblocks = 0, rd_refcnt = 0, rd_myxactonly = 0 '\000',
rd_isnailed = 0 '\000', rd_unlinked = 0 '\000', rd_indexfound = 0

'\000',

rd_uniqueindex = 0 '\000', rd_am = 0x1000001, rd_rel = 0x0, rd_id = 0,
rd_indexlist = 0x82af0a0, rd_lockInfo = {lockRelId = {relId = 0, dbId =

0}},

Show quoted text

rd_att = 0x0, rd_rules = 0x0, rd_rulescxt = 0x82af128, rd_istrat = 0x0,
rd_support = 0x0, trigdesc = 0x0}

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Patrick Welche (#1)
Re: failed runcheck

Patrick Welche <prlw1@newn.cam.ac.uk> writes:

[ core dump due to ]
#0 IsSystemRelationName (relname=0x0) at catalog.c:176
#1 0x807ed9a in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
#2 0x80e9272 in RelationInitLockInfo (relation=0x82af018) at lmgr.c:119
#3 0x81202ef in formrdesc (relationName=0x816ad7e "pg_class", natts=22,
att=0x8173600) at relcache.c:1193
#4 0x8120c12 in RelationCacheInitialize () at relcache.c:1953

and proposes to fix this by having IsSystemRelationName make an
arbitrary decision about whether a NULL input pointer should be
considered to represent a system relation name or not. I do not
like that, because AFAICS the decision is completely arbitrary.
IsSystemRelationName shouldn't be called with a NULL pointer
in the first place, and it has every right to cause a coredump
if that happens.

It looks to me like the immediate bug here is that
RelationGetPhysicalRelationName is returning a NULL pointer. Probably
there are some missing or out-of-order steps in relcache initialization?

I've been offline for a couple days due to DSL line failure :-( so
I haven't seen Vadim's latest checkins. But I'm betting this is a bug
in the changes to use OIDs as physical relnames. Do we even need
RelationGetPhysicalRelationName anymore, and if so, what does it mean?

Next question: why is RelationInitLockInfo using
RelationGetPhysicalRelationName to get the input data for
IsSharedSystemRelationName --- shouldn't that be a test on logical
relation name? Or maybe the entire premise of
IsSharedSystemRelationName is bogus now, and we ought to use some other
way to decide if a relation is cross-database or not?

regards, tom lane

#5Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#4)
Re: failed runcheck

Patrick Welche <prlw1@newn.cam.ac.uk> writes:

[ core dump due to ]
#0 IsSystemRelationName (relname=0x0) at catalog.c:176
#1 0x807ed9a in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
#2 0x80e9272 in RelationInitLockInfo (relation=0x82af018) at lmgr.c:119
#3 0x81202ef in formrdesc (relationName=0x816ad7e "pg_class", natts=22,
att=0x8173600) at relcache.c:1193
#4 0x8120c12 in RelationCacheInitialize () at relcache.c:1953

and proposes to fix this by having IsSystemRelationName make an
arbitrary decision about whether a NULL input pointer should be
considered to represent a system relation name or not. I do not
like that, because AFAICS the decision is completely arbitrary.
IsSystemRelationName shouldn't be called with a NULL pointer
in the first place, and it has every right to cause a coredump
if that happens.

I have removed the change. It will dump core again.

It looks to me like the immediate bug here is that
RelationGetPhysicalRelationName is returning a NULL pointer. Probably
there are some missing or out-of-order steps in relcache initialization?

I've been offline for a couple days due to DSL line failure :-( so

That is bad. Mine has been great, but I had my first DSL burp for 3
hours on Friday after 4 months of continuous uptime.

I haven't seen Vadim's latest checkins. But I'm betting this is a bug
in the changes to use OIDs as physical relnames. Do we even need
RelationGetPhysicalRelationName anymore, and if so, what does it mean?

Good bet.

Next question: why is RelationInitLockInfo using
RelationGetPhysicalRelationName to get the input data for
IsSharedSystemRelationName --- shouldn't that be a test on logical
relation name? Or maybe the entire premise of
IsSharedSystemRelationName is bogus now, and we ought to use some other
way to decide if a relation is cross-database or not?

No, because if they create a temp table that masks a system table in the
current session, you want the physical name so it can know if it is a
real system table, or a temp/fake one.

You can ask why would someone try this, but it will work, or do
something while trying.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#5)
Re: failed runcheck

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Next question: why is RelationInitLockInfo using
RelationGetPhysicalRelationName to get the input data for
IsSharedSystemRelationName --- shouldn't that be a test on logical
relation name? Or maybe the entire premise of
IsSharedSystemRelationName is bogus now, and we ought to use some other
way to decide if a relation is cross-database or not?

No, because if they create a temp table that masks a system table in the
current session, you want the physical name so it can know if it is a
real system table, or a temp/fake one.

Well, you clearly don't want to be fooled by temp relations. I was
sorta visualizing a check based on relation OIDs instead of names...

regards, tom lane

#7Patrick Welche
prlw1@newn.cam.ac.uk
In reply to: Vadim Mikheev (#3)
Re: failed runcheck

On Sat, Oct 21, 2000 at 01:48:39PM -0700, Vadim Mikheev wrote:

Did you run make distclean? I've run regtests before committing changes.

Just made sure - different computer - fresh cvs update/distclean/configure/make
cd src/test/regress
gmake clean
gmake all
gmake runcheck

same coredump

#1 0x807f4be in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
^^^^^^^^^^^
ie. relname[0] at catalog.c:176 is dereferencing a null pointer.

Cheers,

Patrick

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Patrick Welche (#7)
Re: failed runcheck

Patrick Welche <prlw1@newn.cam.ac.uk> writes:

On Sat, Oct 21, 2000 at 01:48:39PM -0700, Vadim Mikheev wrote:

Did you run make distclean? I've run regtests before committing changes.

Just made sure - different computer - fresh cvs update/distclean/configure/make
same coredump

#1 0x807f4be in IsSharedSystemRelationName (relname=0x0) at catalog.c:197
^^^^^^^^^^^
ie. relname[0] at catalog.c:176 is dereferencing a null pointer.

Interesting. Current sources pass regress tests on my machine, same as
for Vadim. I think you have found a platform-specific bug.

Could you dig into it a little further and try to determine where the
NULL is coming from?

regards, tom lane

#9Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Tom Lane (#6)
Re: failed runcheck

On Sun, Oct 22, 2000 at 01:26:07AM -0400, Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Next question: why is RelationInitLockInfo using
RelationGetPhysicalRelationName to get the input data for
IsSharedSystemRelationName --- shouldn't that be a test on logical
relation name? Or maybe the entire premise of
IsSharedSystemRelationName is bogus now, and we ought to use some other
way to decide if a relation is cross-database or not?

No, because if they create a temp table that masks a system table in the
current session, you want the physical name so it can know if it is a
real system table, or a temp/fake one.

Well, you clearly don't want to be fooled by temp relations. I was
sorta visualizing a check based on relation OIDs instead of names...

Well, when I did a test implementation of OID filenames, lo these many
moons ago, I hacked around this problem by adding the (fixed) shared
system table oids to the static array that is searched for matches by
IsSharedSystemRelationName.

Admittedly, a hack, but it got past all the regresion tests.

Ross
--
Open source code is like a natural resource, it's the result of providing
food and sunshine to programmers, and then staying out of their way.
[...] [It] is not going away because it has utility for both the developers
and users independent of economic motivations. Jim Flynn, Sunnyvale, Calif.

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ross J. Reedstrom (#9)
Re: failed runcheck

"Ross J. Reedstrom" <reedstrm@rice.edu> writes:

Well, you clearly don't want to be fooled by temp relations. I was
sorta visualizing a check based on relation OIDs instead of names...

Well, when I did a test implementation of OID filenames, lo these many
moons ago, I hacked around this problem by adding the (fixed) shared
system table oids to the static array that is searched for matches by
IsSharedSystemRelationName.
Admittedly, a hack, but it got past all the regresion tests.

Not a hack at all, IMHO, since all the shared system rels have
nailed-down OIDs. It's pure historical artifact that
IsSharedSystemRelationName wasn't IsSharedSystemRelationOID in the
first place.

We probably want to be thinking about merging the "shared system
relation" concept together with the "tablespace" concept once we start
to implement tablespaces. But for now, I think testing the OIDs is
fine.

regards, tom lane

#11Patrick Welche
prlw1@newn.cam.ac.uk
In reply to: Tom Lane (#8)
Re: failed runcheck

On Mon, Oct 23, 2000 at 10:26:39AM -0400, Tom Lane wrote:

Could you dig into it a little further and try to determine where the
NULL is coming from?

All clear now! (I did do another cvs update in the meantime, but either way,
I can't now repeat the previously repeatable core dump)

Cheers,

Patrick