Potential RC1-stoppers

Started by Tom Lane almost 25 years ago, 7 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

I'm currently concerned about these recent reports:

* Joel Burton's report of disappearing files, 3/20. This is real scary,
but no one else has reported anything like it.

* Tatsuo's weird failure in XLogFileInit ("ZeroFill: no such file or
directory"). I'm hoping this can be explained away, but probably we
ought to alter the code so that we can detect the case where no errno
is set by write() and avoid printing a bogus message.
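
A minimal sketch of the idea in the second item, assuming a hypothetical helper named write_all_or_errno (it is not the actual XLogFileInit code): clear errno before calling write(), and if the write comes up short without errno being set, substitute ENOSPC so that the later error message is not printed from a stale errno. Treating a short write as "disk full" is an assumption of this sketch, not something established in this thread.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): write exactly len bytes to fd.
 * Returns 0 on success, or -1 with errno set to a meaningful value.
 */
int
write_all_or_errno(int fd, const void *buf, size_t len)
{
    ssize_t written;

    errno = 0;                  /* a stale errno must not leak through */
    written = write(fd, buf, len);
    if (written == (ssize_t) len)
        return 0;

    /*
     * write() can return a short count without touching errno.
     * Assumption: the likeliest cause is a full disk, so report ENOSPC
     * instead of whatever errno held before the call.
     */
    if (errno == 0)
        errno = ENOSPC;
    return -1;
}
```

A caller can then format its message from errno after a -1 return without risking a bogus "no such file or directory" left over from an earlier call.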

Do people feel comfortable putting out RC1 when we don't know the
reasons for these reports?

Another thing I'd like to fix before RC1 is Adriaan's complaint about
mishandling of int8-sized numeric constants on Alpha. Seems to me that
we want Alpha to behave like other platforms, ie T_Integer parse nodes
should only be generated for values that fit in int4. Otherwise Alpha
will have different type resolution behavior for expressions that
contain such constants, and that's going to be real confusing. I'm
thinking about making scan.l do

    long    x;

    errno = 0;
    x = strtol((char *) yytext, &endptr, 10);
    if (*endptr != '\0' || errno == ERANGE
#ifdef HAVE_LONG_INT_64
        /* if long is wider than 32 bits, check for overflow */
        || x != (long) ((int32) x)
#endif
        )
    {
        /* integer too large, treat it as a float */
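
For readers who want to try the check outside the lexer, here is a self-contained sketch of the same test. The function name literal_fits_int4 is hypothetical, the HAVE_LONG_INT_64 configure symbol is replaced by a comparison of LONG_MAX against INT32_MAX from the standard headers, and the cast round-trip is written as an explicit range comparison. It also rejects an empty string, which the fragment above does not need to do because the lexer only passes it digit strings.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (illustration only): returns 1 and stores the value
 * in *result if the decimal literal fits in a 32-bit signed integer;
 * returns 0 if the caller should treat the literal as a float instead.
 */
int
literal_fits_int4(const char *text, int32_t *result)
{
    char   *endptr;
    long    x;

    errno = 0;
    x = strtol(text, &endptr, 10);
    if (endptr == text || *endptr != '\0' || errno == ERANGE)
        return 0;
#if LONG_MAX > INT32_MAX
    /* long is wider than 32 bits, so strtol alone cannot catch overflow */
    if (x < INT32_MIN || x > INT32_MAX)
        return 0;
#endif
    *result = (int32_t) x;
    return 1;
}
```

With this shape, a platform with 64-bit long and one with 32-bit long give the same answer for any literal, which is the cross-platform behavior being asked for.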

Objections?

regards, tom lane

#2 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#1)
Re: Potential RC1-stoppers

> I'm currently concerned about these recent reports:
>
> * Joel Burton's report of disappearing files, 3/20. This is real scary,
> but no one else has reported anything like it.
>
> * Tatsuo's weird failure in XLogFileInit ("ZeroFill: no such file or
> directory"). I'm hoping this can be explained away, but probably we
> ought to alter the code so that we can detect the case where no errno
> is set by write() and avoid printing a bogus message.
>
> Do people feel comfortable putting out RC1 when we don't know the
> reasons for these reports?

Can we keep an eye on these and address in 7.1.1? 7.1 will need fixes
anyway.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#3 Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Bruce Momjian (#2)
RE: Potential RC1-stoppers

> * Joel Burton's report of disappearing files, 3/20. This is
> real scary, but no one else has reported anything like it.

Can you please remind me of that report?

Vadim

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#3)
Re: Potential RC1-stoppers

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

> > * Joel Burton's report of disappearing files, 3/20. This is
> > real scary, but no one else has reported anything like it.
>
> Can you please remind me of that report?

It's the "pg_inherits: not found, but visible" thread in pghackers
on 3/20 and 3/21. Briefly, he had two separate occurrences of a table
file disappearing while the pg_class row remained (and he hadn't
tried to delete it, either). The only idea I can come up with is that
a removal of some other table removed the wrong file. Ugly.

Joel, can you give us any more info? Do you have a postmaster log of
the queries that were issued while this was happening?

regards, tom lane

#5 Joel Burton
jburton@scw.org
In reply to: Tom Lane (#4)
Re: Potential RC1-stoppers

On Thu, 22 Mar 2001, Tom Lane wrote:

> "Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>
> > > * Joel Burton's report of disappearing files, 3/20. This is
> > > real scary, but no one else has reported anything like it.
> >
> > Can you please remind me of that report?
>
> It's the "pg_inherits: not found, but visible" thread in pghackers
> on 3/20 and 3/21. Briefly, he had two separate occurrences of a table
> file disappearing while the pg_class row remained (and he hadn't
> tried to delete it, either). The only idea I can come up with is that
> a removal of some other table removed the wrong file. Ugly.
>
> Joel, can you give us any more info? Do you have a postmaster log of
> the queries that were issued while this was happening?

Sorry; I've been at client sites for the past day.

I rebooted my machine, and it didn't happen again that night. Yesterday,
my staff reinstalled Pg straight from the CVS but without (!) tarring up
the old Pg install, so I'm afraid I don't have any logs. I run Pg w/debug
switches on my development machine; this machine did not have such.

After rebooting, and since reinstalling Pg
beta-6-or-whatever-we're-at-now, it hasn't happened again. I'm afraid I
can't think of anything unusual about the PC.

Unbranded, decent-quality components AMD K6-III/550
256MB RAM
Linux-Mandrake 7.2 w/the secure version of the kernel (2.2.17, IIRC)
Pg beta4

I don't have a log, but do have the query that was issued, multiple times,
overlapping:

SELECT * FROM zope_facinst LIMIT 1000;

where zope_facinst is the view

SELECT DISTINCT ON (t.lname,
                    t.fname,
                    c.fulltitle, c.classcode,
                    t.trainid)
       c.classcode,
       t.trainid,
       scw_namecode(t.fname, t.lname) AS namecode,
       t.fullname,
       c.fulltitle,
       c.descrip,
       t.descripshort AS train_descripshort,
       c.descripshort AS class_descripshort
  FROM vlkpclass c,
       vlkptrain t,
       tblinst i,
       trelinsttrain it
 WHERE (((c.classcode = i.classcode) AND
         (i.instid = it.instid))
        AND (it.trainid = t.trainid))
 ORDER BY t.lname,
          t.fname,
          c.fulltitle,
          c.classcode,
          t.trainid;

So it's pretty complicated, but not terrible.

The classes starting w/'t' are tables; those starting with 'v' are
views; none of the views are too complex.

scw_namecode() is a simple pl/pgsql routine that just joins the strings
together in a particular way.

There are about 400 records returned by the view.

EXPLAIN for it looks like this:

reg2=# explain select * from zope_Facinst;
NOTICE: QUERY PLAN:

Subquery Scan zope_facinst (cost=339.93..356.42 rows=132 width=141)
  -> Unique (cost=339.93..356.42 rows=132 width=141)
    -> Sort (cost=339.93..339.93 rows=1319 width=141)
      -> Merge Join (cost=261.33..271.56 rows=1319 width=141)
        -> Sort (cost=223.52..223.52 rows=597 width=92)
          -> Merge Join (cost=131.72..195.99 rows=597 width=92)
            -> Index Scan using tblinst_pkey on tblinst i (cost=0.00..53.69 rows=769 width=16)
            -> Sort (cost=131.72..131.72 rows=78 width=76)
              -> Merge Join (cost=52.15..129.28 rows=78 width=76)
                -> Merge Join (cost=52.15..59.96 rows=976 width=68)
                  -> Sort (cost=27.28..27.28 rows=316 width=40)
                    -> Seq Scan on tblpers p (cost=0.00..14.16 rows=316 width=40)
                  -> Sort (cost=24.87..24.87 rows=309 width=28)
                    -> Seq Scan on tbltrain t (cost=0.00..12.09 rows=309 width=28)
                -> Index Scan using trelinsttrain_trainid_idx on trelinsttrain it (cost=0.00..42.75 rows=795 width=8)
        -> Sort (cost=37.82..37.82 rows=221 width=49)
          -> Seq Scan on tblclass c (cost=0.00..29.21 rows=221 width=49)

I can provide a dump of the database if anyone would like, or copies of
the Zope scripts (very, very simple: they just call the ZSQL method
'select * from zope_facinst limit 1000')

Sorry I can't provide much more, and, yes, I know it sucks to have a
problem I can't replicate. Err. Computers can be like that.

I hope this helps.

--
Joel Burton <jburton@scw.org>
Director of Information Systems, Support Center of Washington

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Burton (#5)
Re: Potential RC1-stoppers

Joel Burton <jburton@scw.org> writes:

> I rebooted my machine, and it didn't happen again that night. Yesterday,
> my staff reinstalled Pg straight from the CVS but without (!) tarring up
> the old Pg install, so I'm afraid I don't have any logs. I run Pg w/debug
> switches on my development machine; this machine did not have such.

Drat.

> I don't have a log, but do have the query that was issued, multiple times,
> overlapping:
> SELECT * FROM zope_facinst LIMIT 1000;

It's really unlikely (I hope) that the clients running SELECTs had
anything to do with it. You had mentioned that you were busy making
manual schema revisions while this went on; that process seems more
likely to be the guilty party. But if you don't have the logs anymore,
I suppose there's not much chance of reconstructing what you did :-(

I spent much of this afternoon groveling through the deletion-related
code, looking for some code path that could lead to a deletion operation
deleting the wrong file. I didn't find anything that looked plausible
enough to be worth pursuing. So I'm stumped for the moment. We'll have
to hope that if it happens again, we can gather more data.

regards, tom lane

#7 Joel Burton
jburton@scw.org
In reply to: Tom Lane (#6)
Re: Potential RC1-stoppers

On Thu, 22 Mar 2001, Tom Lane wrote:

> Joel Burton <jburton@scw.org> writes:
>
> > I rebooted my machine, and it didn't happen again that night. Yesterday,
> > my staff reinstalled Pg straight from the CVS but without (!) tarring up
> > the old Pg install, so I'm afraid I don't have any logs. I run Pg w/debug
> > switches on my development machine; this machine did not have such.
>
> Drat.
>
> > I don't have a log, but do have the query that was issued, multiple times,
> > overlapping:
> > SELECT * FROM zope_facinst LIMIT 1000;
>
> It's really unlikely (I hope) that the clients running SELECTs had
> anything to do with it. You had mentioned that you were busy making
> manual schema revisions while this went on; that process seems more
> likely to be the guilty party. But if you don't have the logs anymore,
> I suppose there's not much chance of reconstructing what you did :-(

What I was dropping and re-creating was the zope_facinst view listed in my
email. I was tinkering with various parameters, trying to see if distinct on
(list) was faster than distinct list, etc.

> I spent much of this afternoon groveling through the deletion-related
> code, looking for some code path that could lead to a deletion operation
> deleting the wrong file. I didn't find anything that looked plausible
> enough to be worth pursuing. So I'm stumped for the moment. We'll have
> to hope that if it happens again, we can gather more data.

It could be my machine; it's not a heavily used machine, so I can't vouch
for its stability.

Sorry I couldn't help more.

As always, thanks.
--
Joel Burton <jburton@scw.org>
Director of Information Systems, Support Center of Washington