error: could not find pg_class tuple for index 2662
My client has been seeing regular instances of the following sort of problem:
...
03:06:09.453 exec_simple_query, postgres.c:900
03:06:12.042 XX000: could not find pg_class tuple for index 2662 at character 13
03:06:12.042 RelationReloadIndexInfo, relcache.c:1740
03:06:12.042 INSERT INTO zzz_k(k) SELECT ...
03:06:12.045 00000: statement: ABORT
03:06:12.045 exec_simple_query, postgres.c:900
03:06:12.045 00000: duration: 0.100 ms
03:06:12.045 exec_simple_query, postgres.c:1128
03:06:12.046 00000: statement: INSERT INTO temp_807
VALUES (...)
03:06:12.046 exec_simple_query, postgres.c:900
03:06:12.046 XX000: could not find pg_class tuple for index 2662 at character 13
03:06:12.046 RelationReloadIndexInfo, relcache.c:1740
03:06:12.046 INSERT INTO temp_807
VALUES (...)
03:06:12.096 08P01: unexpected EOF on client connection
03:06:12.096 SocketBackend, postgres.c:348
03:06:12.096 XX000: could not find pg_class tuple for index 2662
03:06:12.096 RelationReloadIndexInfo, relcache.c:1740
03:06:12.121 00000: disconnection: session time: 0:06:08.537 user=ZZZ database=ZZZ_01
03:06:12.121 log_disconnections, postgres.c:4339
The above happens regularly (though not completely predictably), coinciding
with a daily cronjob that checks the catalogs for bloat and does vacuum full
and/or reindex as needed. Since some of the applications make very heavy
use of temp tables, this usually means pg_class and pg_index get vacuum
full and reindex.
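The bloat check described might boil down to something like this (purely illustrative; the client's actual heuristic, thresholds, and table list are not shown in the thread):

```python
def needs_vacuum_full(relpages, expected_pages, bloat_factor=2.0, min_pages=128):
    """Flag a catalog table whose on-disk size is well past its ideal size.

    relpages is the current size from pg_class; expected_pages is an
    estimate of the size the table would have with no dead space.
    Both thresholds here are made-up placeholders.
    """
    if relpages < min_pages:
        return False  # too small for a rewrite to be worth the lock
    return relpages > bloat_factor * expected_pages
```

A table that passes this check would then get the VACUUM FULL (and, pre-9.0, REINDEX) treatment the cronjob applies.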
Sometimes queries will fail because a table's underlying file cannot be
opened. On investigation the file is absent from both the catalogs and the
filesystem, so I don't know what table it refers to:
20:41:19.063 ERROR: could not open file "pg_tblspc/16401/PG_9.0_201008051/16413/1049145092": No such file or directory
20:41:19.063 STATEMENT: insert into r_ar__30
select aid, mid, pid, sum(wdata) as wdata, ...
--
20:41:19.430 ERROR: could not open file "pg_tblspc/16401/PG_9.0_201008051/16413/1049145092": No such file or directory
20:41:19.430 STATEMENT: SELECT nextval('j_id_seq')
Finally, I have seen several instances of failures to read data by
vacuum full itself:
03:05:45.699 00000: statement: vacuum full pg_catalog.pg_index;
03:05:45.699 exec_simple_query, postgres.c:900
03:05:46.142 XX001: could not read block 65 in file "pg_tblspc/16401/PG_9.0_201008051/16416/1049146489": read only 0 of 8192 bytes
03:05:46.142 mdread, md.c:656
03:05:46.142 vacuum full pg_catalog.pg_index;
This occurs on PostgreSQL 9.0.4 on 32-core 512GB Dell boxes. We have
identical systems still running 8.4.8 that do not have this issue, so I'm
assuming it is related to the vacuum full work done for 9.0. Oddly, we don't
see this on the smaller hosts (8 core, 64GB, slower CPUs) running 9.0.4,
so it may be timing related.
This seems possibly related to the issues in:
Bizarre buildfarm failure on baiji: can't find pg_class_oid_index
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02038.php
Broken HOT chains in system catalogs
http://archives.postgresql.org/pgsql-hackers/2011-04/msg00777.php
As far as I can tell from the logs I have, once a session sees one of these
errors, any subsequent query will hit it again until the session exits.
However, it does not seem to harm other sessions or leave any persistent
damage (crossing fingers and hoping here).
I'm ready to do any testing/investigation/instrumented builds etc that may be
helpful in resolving this.
Regards
-dg
--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.
On Wed, Jul 27, 2011 at 8:28 PM, daveg <daveg@sonic.net> wrote:
My client has been seeing regular instances of the following sort of problem:
On what version of PostgreSQL?
If simplicity worked, the world would be overrun with insects.
I thought it was... :-)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 09:46:41AM -0400, Robert Haas wrote:
On Wed, Jul 27, 2011 at 8:28 PM, daveg <daveg@sonic.net> wrote:
My client has been seeing regular instances of the following sort of problem:
On what version of PostgreSQL?
9.0.4.
I previously said:
This occurs on postgresql 9.0.4. on 32 core 512GB Dell boxes. We have
identical systems still running 8.4.8 that do not have this issue, so I'm
assuming it is related to the vacuum full work done for 9.0. Oddly, we don't
see this on the smaller hosts (8 core, 64GB, slower cpus) running 9.0.4,
so it may be timing related.
-dg
--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.
On Thu, Jul 28, 2011 at 5:46 PM, daveg <daveg@sonic.net> wrote:
On Thu, Jul 28, 2011 at 09:46:41AM -0400, Robert Haas wrote:
On Wed, Jul 27, 2011 at 8:28 PM, daveg <daveg@sonic.net> wrote:
My client has been seeing regular instances of the following sort of problem:
On what version of PostgreSQL?
9.0.4.
I previously said:
This occurs on postgresql 9.0.4. on 32 core 512GB Dell boxes. We have
identical systems still running 8.4.8 that do not have this issue, so I'm
assuming it is related to the vacuum full work done for 9.0. Oddly, we don't
see this on the smaller hosts (8 core, 64GB, slower cpus) running 9.0.4,
so it may be timing related.
Ah, OK, sorry. Well, in 9.0, VACUUM FULL is basically CLUSTER, which
means that a REINDEX is happening as part of the same operation. In
9.0, there's no point in doing VACUUM FULL immediately followed by
REINDEX. My guess is that this is happening either right around the
time the VACUUM FULL commits or right around the time the REINDEX
commits. It'd be helpful to know which, if you can figure it out.
If there's not a hardware problem causing those read errors, maybe a
backend is somehow ending up with a stale or invalid relcache entry.
I'm not sure exactly how that could be happening, though...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 07:45:01PM -0400, Robert Haas wrote:
On Thu, Jul 28, 2011 at 5:46 PM, daveg <daveg@sonic.net> wrote:
On Thu, Jul 28, 2011 at 09:46:41AM -0400, Robert Haas wrote:
On Wed, Jul 27, 2011 at 8:28 PM, daveg <daveg@sonic.net> wrote:
My client has been seeing regular instances of the following sort of problem:
On what version of PostgreSQL?
9.0.4.
I previously said:
This occurs on postgresql 9.0.4. on 32 core 512GB Dell boxes. We have
identical systems still running 8.4.8 that do not have this issue, so I'm
assuming it is related to the vacuum full work done for 9.0. Oddly, we don't
see this on the smaller hosts (8 core, 64GB, slower cpus) running 9.0.4,
so it may be timing related.
Ah, OK, sorry. Well, in 9.0, VACUUM FULL is basically CLUSTER, which
means that a REINDEX is happening as part of the same operation. In
9.0, there's no point in doing VACUUM FULL immediately followed by
REINDEX. My guess is that this is happening either right around the
time the VACUUM FULL commits or right around the time the REINDEX
commits. It'd be helpful to know which, if you can figure it out.
I'll update my vacuum script to skip reindexes after vacuum full for 9.0
servers and see if that makes the problem go away. Thanks for reminding
me that they are not needed. However, I suspect it is the vacuum, not the
reindex causing the problem. I'll update when I know.
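A version gate of the sort described might look like this (a hypothetical sketch; the actual vacuum script is not shown in the thread):

```python
def maintenance_statements(server_version_num, table):
    """Return the maintenance statements for one catalog table.

    On 9.0+, VACUUM FULL rewrites the table and rebuilds its indexes
    (it is essentially CLUSTER), so a separate REINDEX is redundant.
    server_version_num is e.g. 90004 for 9.0.4, 80408 for 8.4.8.
    """
    stmts = ["VACUUM FULL %s;" % table]
    if server_version_num < 90000:
        # Pre-9.0 VACUUM FULL compacts in place and can bloat indexes,
        # so a follow-up REINDEX is still worthwhile there.
        stmts.append("REINDEX TABLE %s;" % table)
    return stmts
```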
If there's not a hardware problem causing those read errors, maybe a
backend is somehow ending up with a stale or invalid relcache entry.
I'm not sure exactly how that could be happening, though...
It does not appear to be a hardware problem. I also suspect it is a stale
relcache.
-dg
--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.
daveg <daveg@sonic.net> writes:
On Thu, Jul 28, 2011 at 07:45:01PM -0400, Robert Haas wrote:
Ah, OK, sorry. Well, in 9.0, VACUUM FULL is basically CLUSTER, which
means that a REINDEX is happening as part of the same operation. In
9.0, there's no point in doing VACUUM FULL immediately followed by
REINDEX. My guess is that this is happening either right around the
time the VACUUM FULL commits or right around the time the REINDEX
commits. It'd be helpful to know which, if you can figure it out.
I'll update my vacuum script to skip reindexes after vacuum full for 9.0
servers and see if that makes the problem go away.
The thing that was bizarre about the one instance in the buildfarm was
that the error was persistent, ie, once a session had failed all its
subsequent attempts to access pg_class failed too. I gather from Dave's
description that it's working that way for him too. I can think of ways
that there might be a transient race condition against a REINDEX, but
it's very unclear why the failure would persist across multiple
attempts. The best idea I can come up with is that the session has
somehow cached a wrong commit status for the reindexing transaction,
causing it to believe that both old and new copies of the index's
pg_class row are dead ... but how could that happen? The underlying
state in the catalog is not wrong, because no concurrent sessions are
upset (at least not in the buildfarm case ... Dave, do you see more than
one session doing this at a time?).
regards, tom lane
On Fri, Jul 29, 2011 at 9:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
daveg <daveg@sonic.net> writes:
On Thu, Jul 28, 2011 at 07:45:01PM -0400, Robert Haas wrote:
Ah, OK, sorry. Well, in 9.0, VACUUM FULL is basically CLUSTER, which
means that a REINDEX is happening as part of the same operation. In
9.0, there's no point in doing VACUUM FULL immediately followed by
REINDEX. My guess is that this is happening either right around the
time the VACUUM FULL commits or right around the time the REINDEX
commits. It'd be helpful to know which, if you can figure it out.
I'll update my vacuum script to skip reindexes after vacuum full for 9.0
servers and see if that makes the problem go away.
The thing that was bizarre about the one instance in the buildfarm was
that the error was persistent, ie, once a session had failed all its
subsequent attempts to access pg_class failed too. I gather from Dave's
description that it's working that way for him too. I can think of ways
that there might be a transient race condition against a REINDEX, but
it's very unclear why the failure would persist across multiple
attempts. The best idea I can come up with is that the session has
somehow cached a wrong commit status for the reindexing transaction,
causing it to believe that both old and new copies of the index's
pg_class row are dead ... but how could that happen? The underlying
state in the catalog is not wrong, because no concurrent sessions are
upset (at least not in the buildfarm case ... Dave, do you see more than
one session doing this at a time?).
I was thinking more along the lines of a failure while processing a
sinval message emitted by the REINDEX. The sinval message doesn't get
fully processed and therefore we get confused about what the
relfilenode is for pg_class. If that happened for any other relation,
we could recover by scanning pg_class. But if it happens for pg_class
or pg_class_oid_index, we're toast, because we can't scan them without
knowing what relfilenode to open.
Now that can't be quite right, because of course those are mapped
relations... and I'm pretty sure there are some other holes in my line
of reasoning too. Just thinking out loud...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Jul 29, 2011 at 9:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The thing that was bizarre about the one instance in the buildfarm was
that the error was persistent, ie, once a session had failed all its
subsequent attempts to access pg_class failed too.
I was thinking more along the lines of a failure while processing a
sinval message emitted by the REINDEX. The sinval message doesn't get
fully processed and therefore we get confused about what the
relfilenode is for pg_class. If that happened for any other relation,
we could recover by scanning pg_class. But if it happens for pg_class
or pg_class_oid_index, we're toast, because we can't scan them without
knowing what relfilenode to open.
Well, no, because the ScanPgRelation call is not failing internally.
It's performing a seqscan of pg_class and not finding a matching tuple.
You could hypothesize about maybe an sinval message got missed leading
us to scan the old (pre-VAC-FULL) copy of pg_class, but that still
wouldn't explain how come we can't find a valid-looking entry for
pg_class_oid_index in it.
Tis a puzzlement.
regards, tom lane
On Fri, Jul 29, 2011 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Jul 29, 2011 at 9:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The thing that was bizarre about the one instance in the buildfarm was
that the error was persistent, ie, once a session had failed all its
subsequent attempts to access pg_class failed too.
I was thinking more along the lines of a failure while processing a
sinval message emitted by the REINDEX. The sinval message doesn't get
fully processed and therefore we get confused about what the
relfilenode is for pg_class. If that happened for any other relation,
we could recover by scanning pg_class. But if it happens for pg_class
or pg_class_oid_index, we're toast, because we can't scan them without
knowing what relfilenode to open.
Well, no, because the ScanPgRelation call is not failing internally.
It's performing a seqscan of pg_class and not finding a matching tuple.
SnapshotNow race?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Jul 29, 2011 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Well, no, because the ScanPgRelation call is not failing internally.
It's performing a seqscan of pg_class and not finding a matching tuple.
SnapshotNow race?
That's what I would have guessed to start with, except that then it
wouldn't be a persistent failure. You'd get one, but then the next
query would succeed.
regards, tom lane
On Fri, Jul 29, 2011 at 09:55:46AM -0400, Tom Lane wrote:
The thing that was bizarre about the one instance in the buildfarm was
that the error was persistent, ie, once a session had failed all its
subsequent attempts to access pg_class failed too. I gather from Dave's
description that it's working that way for him too. I can think of ways
that there might be a transient race condition against a REINDEX, but
it's very unclear why the failure would persist across multiple
attempts. The best idea I can come up with is that the session has
somehow cached a wrong commit status for the reindexing transaction,
causing it to believe that both old and new copies of the index's
pg_class row are dead ... but how could that happen? The underlying
It is definitely persistent. Once triggered, the error occurs for any new
statement until the session exits.
state in the catalog is not wrong, because no concurrent sessions are
upset (at least not in the buildfarm case ... Dave, do you see more than
one session doing this at a time?).
It looks like it happens to multiple sessions so far as one can tell from
the timestamps of the errors:
timestamp sessionid error
------------ ------------- ----------------------------------------------
03:05:37.434 4e26a861.4a6d could not find pg_class tuple for index 2662
03:05:37.434 4e26a861.4a6f could not find pg_class tuple for index 2662
03:06:12.041 4e26a731.438e could not find pg_class tuple for index 2662
03:06:12.042 4e21b6a3.629b could not find pg_class tuple for index 2662
03:06:12.042 4e26a723.42ec could not find pg_class tuple for index 2662 at character 13
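A summary like the one above can be pulled out of the logs with a few lines of scripting (a sketch; it assumes a `timestamp sessionid message` line layout, which may differ from the real log_line_prefix):

```python
from collections import defaultdict

def sessions_hit(log_lines, needle="could not find pg_class tuple"):
    """Map each session id to the timestamps at which it hit the error."""
    hits = defaultdict(list)
    for line in log_lines:
        if needle in line:
            timestamp, sessionid, _rest = line.split(None, 2)
            hits[sessionid].append(timestamp)
    return dict(hits)

lines = [
    "03:05:37.434 4e26a861.4a6d could not find pg_class tuple for index 2662",
    "03:05:37.434 4e26a861.4a6f could not find pg_class tuple for index 2662",
    "03:06:12.041 4e26a731.438e could not find pg_class tuple for index 2662",
]
# Two distinct sessions failed within the same millisecond at 03:05:37.434,
# which is what suggests the problem is not confined to one backend.
```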
-dg
--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.
On Thu, Jul 28, 2011 at 11:31:31PM -0700, daveg wrote:
On Thu, Jul 28, 2011 at 07:45:01PM -0400, Robert Haas wrote:
REINDEX. My guess is that this is happening either right around the
time the VACUUM FULL commits or right around the time the REINDEX
commits. It'd be helpful to know which, if you can figure it out.
I'll update my vacuum script to skip reindexes after vacuum full for 9.0
servers and see if that makes the problem go away. Thanks for reminding
me that they are not needed. However, I suspect it is the vacuum, not the
reindex causing the problem. I'll update when I know.
Here is the update: the problem happens with vacuum full alone; no reindex
is needed to trigger it. I updated the script to avoid reindexing after
vacuum. Over the past two days there are still many occurrences of this
error coincident with the vacuum.
-dg
--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.
daveg <daveg@sonic.net> writes:
Here is the update: the problem happens with vacuum full alone, no reindex
is needed to trigger it. I updated the script to avoid reindexing after
vacuum. Over the past two days there are still many occurrences of this
error coincident with the vacuum.
Well, that jibes with the assumption that the one case we saw in
the buildfarm was the same thing, because the regression tests were
certainly only doing a VACUUM FULL and not a REINDEX of pg_class.
But it doesn't get us much closer to understanding what's happening.
In particular, it seems to knock out most ideas associated with race
conditions, because the VAC FULL should hold exclusive lock on pg_class
until it's completely done (including index rebuilds).
I think we need to start adding some instrumentation so we can get a
better handle on what's going on in your database. If I were to send
you a source-code patch for the server that adds some more logging
printout when this happens, would you be willing/able to run a patched
build on your machine?
(BTW, just to be perfectly clear ... the "could not find pg_class tuple"
errors always mention index 2662, right, never any other number?)
regards, tom lane
On Sun, Jul 31, 2011 at 11:44:39AM -0400, Tom Lane wrote:
daveg <daveg@sonic.net> writes:
Here is the update: the problem happens with vacuum full alone, no reindex
is needed to trigger it. I updated the script to avoid reindexing after
vacuum. Over the past two days there are still many occurrences of this
error coincident with the vacuum.
Well, that jibes with the assumption that the one case we saw in
the buildfarm was the same thing, because the regression tests were
certainly only doing a VACUUM FULL and not a REINDEX of pg_class.
But it doesn't get us much closer to understanding what's happening.
In particular, it seems to knock out most ideas associated with race
conditions, because the VAC FULL should hold exclusive lock on pg_class
until it's completely done (including index rebuilds).
I think we need to start adding some instrumentation so we can get a
better handle on what's going on in your database. If I were to send
you a source-code patch for the server that adds some more logging
printout when this happens, would you be willing/able to run a patched
build on your machine?
Yes, we can run an instrumented server so long as the instrumentation does
not interfere with normal operation. However, scheduling downtime to switch
binaries is difficult, and generally needs to happen on a weekend, but
sometimes can be expedited. I'll look into that.
(BTW, just to be perfectly clear ... the "could not find pg_class tuple"
errors always mention index 2662, right, never any other number?)
Yes, only index 2662, never any other.
I'm attaching a somewhat redacted log for two different databases on the same
instance around the time of vacuum full of pg_class in each database.
My observations so far are:
- the error occurs at commit of vacuum full of pg_class
- in these cases the error hits autovacuum after it waited for a lock on pg_class
- in these two cases there was a new process startup while the vacuum was
running. Don't know if this is relevant.
- while these hit autovacuum, the error does hit other processes (just not in
these sessions). Unknown if autovacuum is a required component.
-dg
--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.
daveg <daveg@sonic.net> writes:
On Sun, Jul 31, 2011 at 11:44:39AM -0400, Tom Lane wrote:
I think we need to start adding some instrumentation so we can get a
better handle on what's going on in your database. If I were to send
you a source-code patch for the server that adds some more logging
printout when this happens, would you be willing/able to run a patched
build on your machine?
Yes we can run an instrumented server so long as the instrumentation does
not interfere with normal operation. However, scheduling downtime to switch
binaries is difficult, and generally needs to be happen on a weekend, but
sometimes can be expedited. I'll look into that.
OK, attached is a patch against 9.0 branch that will re-scan pg_class
after a failure of this sort occurs, and log what it sees in the tuple
header fields for each tuple for the target index. This should give us
some useful information. It might be worthwhile for you to also log the
results of
select relname,pg_relation_filenode(oid) from pg_class
where relname like 'pg_class%';
in your script that does VACUUM FULL, just before and after each time it
vacuums pg_class. That will help in interpreting the relfilenodes in
the log output.
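One way to wire that logging into the vacuum script (a sketch only; it just builds the psql invocations, and the database name and connection options are placeholders):

```python
FILENODE_SQL = ("select relname,pg_relation_filenode(oid) from pg_class "
                "where relname like 'pg_class%';")

def vacuum_pg_class_commands(dbname):
    """Build psql commands: log relfilenodes, vacuum, then log again.

    Running the filenode query on both sides of the VACUUM FULL records
    the old and new relfilenodes, which is what makes the numbers in the
    server log interpretable.
    """
    log_cmd = ["psql", "-d", dbname, "-c", FILENODE_SQL]
    vac_cmd = ["psql", "-d", dbname, "-c", "VACUUM FULL pg_catalog.pg_class;"]
    return [log_cmd, vac_cmd, log_cmd]

# Each command list can be handed to subprocess.run() in turn.
```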
My observations so far are:
- the error occurs at commit of vacuum full of pg_class
- in these cases error hits autovacuum after it waited for a lock on pg_class
- in these two cases there was a new process startup while the vacuum was
running. Don't know if this is relevant.
Interesting. We'll want to know whether that happens every time.
- while these hit autovacuum, the error does hit other processes (just not in
these sessions). Unknown if autovacuum is a required component.
Good question. Please consider setting log_autovacuum_min_duration = 0
so that the log also traces all autovacuum activity.
regards, tom lane
On Mon, Aug 01, 2011 at 01:23:49PM -0400, Tom Lane wrote:
daveg <daveg@sonic.net> writes:
On Sun, Jul 31, 2011 at 11:44:39AM -0400, Tom Lane wrote:
I think we need to start adding some instrumentation so we can get a
better handle on what's going on in your database. If I were to send
you a source-code patch for the server that adds some more logging
printout when this happens, would you be willing/able to run a patched
build on your machine?
Yes we can run an instrumented server so long as the instrumentation does
not interfere with normal operation. However, scheduling downtime to switch
binaries is difficult, and generally needs to happen on a weekend, but
sometimes can be expedited. I'll look into that.
OK, attached is a patch against 9.0 branch that will re-scan pg_class
after a failure of this sort occurs, and log what it sees in the tuple
header fields for each tuple for the target index. This should give us
some useful information. It might be worthwhile for you to also log the
results of
select relname,pg_relation_filenode(oid) from pg_class
where relname like 'pg_class%';
in your script that does VACUUM FULL, just before and after each time it
vacuums pg_class. That will help in interpreting the relfilenodes in
the log output.
We have installed the patch and have encountered the error as usual.
However there is no additional output from the patch. I'm speculating
that the pg_class scan in ScanPgRelationDetailed() fails to return
tuples somehow.
I have also been trying to trace it further by reading the code, but have not
got any solid hypothesis yet. In the absence of any debugging output I've
been trying to deduce the call tree leading to the original failure. So far
it looks like this:
RelationReloadIndexInfo(Relation)          // Relation is 2662 and !rd_isvalid
  pg_class_tuple = ScanPgRelation(2662, indexOK=false)  // returns NULL
    pg_class_desc = heap_open(1259, ACC_SHARE)
      r = relation_open(1259, ACC_SHARE)   // locks oid, ensures RelationIsValid(r)
        r = RelationIdGetRelation(1259)
          r = RelationIdCacheLookup(1259)  // assume success
          if !rd_isvalid:
            RelationClearRelation(r, true)
              RelationInitPhysicalAddr(r)  // r is pg_class relcache
-dg
--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.
daveg <daveg@sonic.net> writes:
We have installed the patch and have encountered the error as usual.
However there is no additional output from the patch. I'm speculating
that the pg_class scan in ScanPgRelationDetailed() fails to return
tuples somehow.
Evidently not, if it's not logging anything, but now the question is
why. One possibility is that for some reason RelationGetNumberOfBlocks
is persistently lying about the file size. (We've seen kernel bugs
before that resulted in transiently wrong values, so this isn't totally
beyond the realm of possibility.) Please try the attached patch, which
extends the previous one to add a summary line including the number of
blocks physically scanned by the seqscan.
regards, tom lane
On Wed, Aug 03, 2011 at 11:18:20AM -0400, Tom Lane wrote:
Evidently not, if it's not logging anything, but now the question is
why. One possibility is that for some reason RelationGetNumberOfBlocks
is persistently lying about the file size. (We've seen kernel bugs
before that resulted in transiently wrong values, so this isn't totally
beyond the realm of possibility.) Please try the attached patch, which
extends the previous one to add a summary line including the number of
blocks physically scanned by the seqscan.
Ok, I have results from the latest patch and have attached a redacted
server log with the select relfilenode output added inline. This is the
shorter of the logs and shows the sequence pretty clearly. I have additional
logs if wanted.
Summary: the failing process reads 0 rows from 0 blocks from the OLD
relfilenode.
-dg
--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.
Attachment: c27.log (text/plain; charset=us-ascii)
daveg <daveg@sonic.net> writes:
Summary: the failing process reads 0 rows from 0 blocks from the OLD
relfilenode.
Hmm. This seems to mean that we're somehow missing a relation mapping
invalidation message, or perhaps not processing it soon enough during
some complex set of invalidations. I did some testing with that in mind
but couldn't reproduce the failure. It'd be awfully nice to get a look
at the call stack when this happens for you ... what OS are you running?
regards, tom lane
On Thu, Aug 04, 2011 at 12:28:31PM -0400, Tom Lane wrote:
daveg <daveg@sonic.net> writes:
Summary: the failing process reads 0 rows from 0 blocks from the OLD
relfilenode.
Hmm. This seems to mean that we're somehow missing a relation mapping
invalidation message, or perhaps not processing it soon enough during
some complex set of invalidations. I did some testing with that in mind
but couldn't reproduce the failure. It'd be awfully nice to get a look
at the call stack when this happens for you ... what OS are you running?
cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Linux version 2.6.18-194.el5
I can use gdb as well if we can get a core or stop the correct process.
Perhaps a long sleep when it hits this?
Or perhaps we could log invalidate processing for pg_class?
-dg
--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.