could not open relation with OID

Started by Benabout 4 years ago3 messagesgeneral

bench@silentmedia.com

about 4 years ago

We do a lot of queries per day, over a lot of hosts, all of which are on
12.9. We've recently started doing a better job at analyzing our db logs
and have found that, a few times a day, every day, we see some of our
queries fail with errors like:

could not open relation with OID 201940279

In the cases we've examined so far, the failed query succeeds just fine
when we run it manually. The failed query also had run on an async
streaming replica, and the primary has completed at least one autovacuum
since the failure. I don't know if either of those two facts are
relevant, but I'm not sure what else to blame. The internet seems to
want to blame issues like this on temp tables, which makes sense, but in
our case, most of the queries that are failing this way are simple PK
scans, which then fall back to the table to pull all the columns. The
tables themselves are small in row count - although some values are
likely TOASTed - so I would be surprised if anything is spilling to disk
for sorting, which might have counted as a temp table enough to give
such an error.

This is a minuscule failure percentage, so replicating it is going to be
hard, but it is still breaking for reasons I don't understand, and so
I'd like to fix it. Has anybody else seen this, or have an ideas of what
to look at?

Other things we've considered:
    - we run pg_repack, which certainly seems like it could make an
error like this, but we see this error in places and times that
pg_repack isn't currently running
    - although all our servers are currently on 12.9, I don't think
this is a new error for us. I believe we might have seen it on previous
minor versions of 12 and probably on 9.5 as well.
    - our filesystem is xfs and seems reliable. I would expect that if
it was a filesystem level error, it would not be so transient. We do,
occasionally, expand our filesystems, but not at all the times we've
seen this error.

Michael Paquier

michael@paquier.xyz

about 4 years ago

In reply to: Ben (#1)

Re: could not open relation with OID

On Wed, Jan 26, 2022 at 05:30:01PM -0800, Ben Chobot wrote:

We do a lot of queries per day, over a lot of hosts, all of which are on
12.9. We've recently started doing a better job at analyzing our db logs and
have found that, a few times a day, every day, we see some of our queries
fail with errors like:

could not open relation with OID 201940279

In the cases we've examined so far, the failed query succeeds just fine when
we run it manually. The failed query also had run on an async streaming
replica, and the primary has completed at least one autovacuum since the
failure. I don't know if either of those two facts are relevant, but I'm not
sure what else to blame. The internet seems to want to blame issues like
this on temp tables, which makes sense, but in our case, most of the queries
that are failing this way are simple PK scans, which then fall back to the
table to pull all the columns. The tables themselves are small in row count
- although some values are likely TOASTed - so I would be surprised if
anything is spilling to disk for sorting, which might have counted as a temp
table enough to give such an error.

Do those OIDs point to some specific relations? It should be easy
enough to guess to which pg_class entry they point to, especially if
you have a persistent schema, and it these are indeed temporary
entries or not depending on their pg_class.relnamespace.

This is a minuscule failure percentage, so replicating it is going to be
hard, but it is still breaking for reasons I don't understand, and so I'd
like to fix it. Has anybody else seen this, or have an ideas of what to look
at?

I don't recall seeing such reports recently.

Other things we've considered:
- we run pg_repack, which certainly seems like it could make an error
like this, but we see this error in places and times that pg_repack isn't
currently running

It could also take time for the issue to show up, depending on the
state of the relcache.
--
Michael

Ben

bench@silentmedia.com

about 4 years ago

In reply to: Michael Paquier (#2)

Re: could not open relation with OID

Michael Paquier wrote on 1/26/22 9:14 PM:

On Wed, Jan 26, 2022 at 05:30:01PM -0800, Ben Chobot wrote:
Other things we've considered:

ï¿½ï¿½ï¿½ - we run pg_repack, which certainly seems like it could make an error
like this, but we see this error in places and times that pg_repack isn't
currently running

It could also take time for the issue to show up, depending on the
state of the relcache.

So.... tell me more about stale relcaches? It turns out I was totally
wrong and this is being driven by pg_repack. I can even make it happen
pretty easily:

1. Put some data in a table with a single btree index on a primary db.
2. Set up streaming replication to a secondary db.
3. In a loop on the primary, have pg_repack repack the indices of that
table. (the -x flag)
4. In a loop on the secondary, have psql query the secondary db for an
indexed value of that table.

When I do this with replication, I can get the OID error consistently
within 30 minutes. Without replication, I've been unable to get it to
happen after 2 hours.

Given that this fails much faster on the secondary than the primary
(where it has yet to fail at all) I'm leaning towards a postgres bug,
but I'm happy to do more research to point the blame at something
pg_repack is doing, if you could point me at a thing to research.