cache lookup failed dropping public schema with trgm index

Started by Wyatt Altover 2 years ago9 messagesbugs
Jump to latest
#1Wyatt Alt
wyatt.alt@gmail.com

Hi,

This reproduces on 15.4 and 13.12:

create table foo(t text);
create extension pg_trgm;
create index on foo using gist(t gist_trgm_ops);
drop schema public cascade;

NOTICE: drop cascades to 2 other objects
DETAIL: drop cascades to table foo
drop cascades to extension pg_trgm
ERROR: cache lookup failed for function 1195999
Time: 20.968 ms

#2Michael Paquier
michael@paquier.xyz
In reply to: Wyatt Alt (#1)
Re: cache lookup failed dropping public schema with trgm index

On Mon, Aug 21, 2023 at 11:40:15AM -0700, Wyatt Alt wrote:

This reproduces on 15.4 and 13.12:

create table foo(t text);
create extension pg_trgm;
create index on foo using gist(t gist_trgm_ops);
drop schema public cascade;

Indeed, reproduced here down to 13. 12 and older versions don't react
like that. I'll go bisect that..
--
Michael

#3Andres Freund
andres@anarazel.de
In reply to: Wyatt Alt (#1)
Re: cache lookup failed dropping public schema with trgm index

Hi,

On 2023-08-21 11:40:15 -0700, Wyatt Alt wrote:

This reproduces on 15.4 and 13.12:

Also reproduces on HEAD.

create table foo(t text);
create extension pg_trgm;
create index on foo using gist(t gist_trgm_ops);
drop schema public cascade;

NOTICE: drop cascades to 2 other objects
DETAIL: drop cascades to table foo
drop cascades to extension pg_trgm
ERROR: cache lookup failed for function 1195999
Time: 20.968 ms

It also seems to work without even involving a drop schema. Just dropping
pg_trgm with cascade is sufficient.

<several wrong theories>

I think we might primarily dealing with missing invalidations. Dropping
objects is scheduled in an order where we first drop
schedule deletion of function 10 (text, text) of operator family t.gist_trgm_ops for access method gist: t.gtrgm_options(internal) at 0
schedule deletion of function t.gtrgm_options(internal) at 1
and then later
schedule deletion of index t.foo_t at 32

During the index deletion we try to initialize the access method. But haven't
performed sufficient invalidation and still think "function 10"
exists. Calling it then causes the error.

One can "verify" this theory by adding an InvalidateSystemCaches() at the end
of deleteOneObject(). That "fixes" the issue. This also explains why dropping
pg_trgm in a new session works - we don't have old cache entries that could be
out of date.

Not quite sure where we are dropping the ball with invalidations yet.

However, I suspect there's more wrong than this, albeit perhaps not
problematic in a huge way. While debugging I added a getObjectDescription()
description call to deleteObjectsInList() *before* calling
deleteOneObject(). That fails when dropping pg_trgmp, because we end up
dropping the type gtrgm before trgm_out(). The reason this happens is because
we reach gtrgm_out() via the extension dependency, rather than via the
type. When recursing to gtrgm_out(), we recurse to type gtrgm, recurse to
gtrgm_in(), schedule it for deletion, then recurse to gtrgm_out(), but find
it's in the stack and do *not* schedule it for deletion, before scheduling
gtrgm for deletion. Only then we delete gtrgm_out().

Now, this isn't a real issue in practice (without such a debugging statement,
which likely can't work in some cases), but I strongly suspect that it
indicates a scheduling order issue that's more widespread. Despite, I think,
correct dependencies, we end up with a topologically inconsistent drop
order. There aren't any cycles in the directed dependency graph from what I
can see.

Greetings,

Andres Freund

#4Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#3)
Re: cache lookup failed dropping public schema with trgm index

On Mon, Aug 21, 2023 at 03:36:10PM -0700, Andres Freund wrote:

It also seems to work without even involving a drop schema. Just dropping
pg_trgm with cascade is sufficient.

FWIW, after a bisect I can see that 911e7020 is the origin of the
failure (`git bisect start b5d69b7 9e1c9f9` based on two merge-bases).

Now, this isn't a real issue in practice (without such a debugging statement,
which likely can't work in some cases), but I strongly suspect that it
indicates a scheduling order issue that's more widespread. Despite, I think,
correct dependencies, we end up with a topologically inconsistent drop
order. There aren't any cycles in the directed dependency graph from what I
can see.

Yeah, guess so. I was first betting on a missing shared inval here.
Now note that for example, this command works:
psql -v ON_ERROR_STOP=1 -c 'create table foo(t text); create extension pg_trgm; create index on foo using gist(t gist_trgm_ops);create index on foo using gist(t gist_trgm_ops); drop schema public cascade;'
--
Michael

#5Wyatt Alt
wyatt.alt@gmail.com
In reply to: Michael Paquier (#4)
Re: cache lookup failed dropping public schema with trgm index

Thanks for having a look. Note also that switching the first two lines of
my example prevents the failure.

On Mon, Aug 21, 2023 at 3:55 PM Michael Paquier <michael@paquier.xyz> wrote:

Show quoted text

On Mon, Aug 21, 2023 at 03:36:10PM -0700, Andres Freund wrote:

It also seems to work without even involving a drop schema. Just dropping
pg_trgm with cascade is sufficient.

FWIW, after a bisect I can see that 911e7020 is the origin of the
failure (`git bisect start b5d69b7 9e1c9f9` based on two merge-bases).

Now, this isn't a real issue in practice (without such a debugging

statement,

which likely can't work in some cases), but I strongly suspect that it
indicates a scheduling order issue that's more widespread. Despite, I

think,

correct dependencies, we end up with a topologically inconsistent drop
order. There aren't any cycles in the directed dependency graph from

what I

can see.

Yeah, guess so. I was first betting on a missing shared inval here.
Now note that for example, this command works:
psql -v ON_ERROR_STOP=1 -c 'create table foo(t text); create extension
pg_trgm; create index on foo using gist(t gist_trgm_ops);create index on
foo using gist(t gist_trgm_ops); drop schema public cascade;'
--
Michael

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#4)
Re: cache lookup failed dropping public schema with trgm index

Michael Paquier <michael@paquier.xyz> writes:

On Mon, Aug 21, 2023 at 03:36:10PM -0700, Andres Freund wrote:

It also seems to work without even involving a drop schema. Just dropping
pg_trgm with cascade is sufficient.

FWIW, after a bisect I can see that 911e7020 is the origin of the
failure (`git bisect start b5d69b7 9e1c9f9` based on two merge-bases).

Hmm. I see that 911e7020 modified pg_trgm's install script with

+ALTER OPERATOR FAMILY gist_trgm_ops USING gist
+ADD FUNCTION 10 (text) gtrgm_options (internal);

I wonder whether that correctly adds a dependency to ensure the
opfamily is dropped before the function.

regards, tom lane

#7Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#6)
Re: cache lookup failed dropping public schema with trgm index

On Mon, Aug 21, 2023 at 07:27:19PM -0400, Tom Lane wrote:

Hmm. I see that 911e7020 modified pg_trgm's install script with

+ALTER OPERATOR FAMILY gist_trgm_ops USING gist
+ADD FUNCTION 10 (text) gtrgm_options (internal);

I wonder whether that correctly adds a dependency to ensure the
opfamily is dropped before the function.

The dependencies look OK seen from here when it comes to the
extension, the indexes or the option function. See, for instance:
=# SELECT pg_describe_object(classid, objid, objsubid) as obj,
pg_describe_object(refclassid,refobjid,refobjsubid) as objref,
deptype FROM pg_depend
WHERE (classid = 'pg_proc'::regclass AND objid = 'gtrgm_options'::regproc)
OR (refclassid = 'pg_proc'::regclass AND refobjid = 'gtrgm_options'::regproc);
-[ RECORD 1]------------------------------------------------------------------------------------------------------
obj | function gtrgm_options(internal)
objref | schema public
deptype | n
-[ RECORD 2
]------------------------------------------------------------------------------------------------------
obj | function gtrgm_options(internal)
objref | extension pg_trgm
deptype | e
-[ RECORD 3]------------------------------------------------------------------------------------------------------
obj | function 10 (text, text) of operator family gist_trgm_ops for access method gist: gtrgm_options(internal)
objref | function gtrgm_options(internal)
deptype | a
--
Michael

#8Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#6)
Re: cache lookup failed dropping public schema with trgm index

Hi,

On 2023-08-21 19:27:19 -0400, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

On Mon, Aug 21, 2023 at 03:36:10PM -0700, Andres Freund wrote:

It also seems to work without even involving a drop schema. Just dropping
pg_trgm with cascade is sufficient.

FWIW, after a bisect I can see that 911e7020 is the origin of the
failure (`git bisect start b5d69b7 9e1c9f9` based on two merge-bases).

Hmm. I see that 911e7020 modified pg_trgm's install script with

+ALTER OPERATOR FAMILY gist_trgm_ops USING gist
+ADD FUNCTION 10 (text) gtrgm_options (internal);

I wonder whether that correctly adds a dependency to ensure the
opfamily is dropped before the function.

It's not quite that, I think. Whether it fails or succeeds depends on the
state of the system caches. The dependencies lead to the equivalent of
ALTER OPERATOR FAMILY ... USING ... DROP
being performed, before dropping the index.

I think there are at least three levels of problems here:

- I don't think we currently properly force index relcache entries to be
invalidated when the relevant opfamily changes

- LookupOpclassInfo() doesn't have *any* invalidation support, so even if we
were to perform invalidation, we'd still return a potentially stale
entry. The function says:

* Note there is no provision for flushing the cache. This is OK at the
* moment because there is no way to ALTER any interesting properties of an
* existing opclass --- all you can do is drop it, which will result in
* a useless but harmless dead entry in the cache.

But that's not true (amymore?), because it does do pg_amproc lookups, and
they *can* change.

- Minor: Even if you force LookupOpclassInfo() to perform lookups again, it
doesn't work, because when an amproc entry doesn't exist anymore, the old
value in opcentry->supportProcs[i] survives, because nothing resets it.

Greetings,

Andres Freund

#9Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#8)
Re: cache lookup failed dropping public schema with trgm index

On Tue, Aug 22, 2023 at 11:45:58AM -0700, Andres Freund wrote:

It's not quite that, I think. Whether it fails or succeeds depends on the
state of the system caches. The dependencies lead to the equivalent of
ALTER OPERATOR FAMILY ... USING ... DROP
being performed, before dropping the index.

I think there are at least three levels of problems here:

- I don't think we currently properly force index relcache entries to be
invalidated when the relevant opfamily changes

- LookupOpclassInfo() doesn't have *any* invalidation support, so even if we
were to perform invalidation, we'd still return a potentially stale
entry. The function says:

* Note there is no provision for flushing the cache. This is OK at the
* moment because there is no way to ALTER any interesting properties of an
* existing opclass --- all you can do is drop it, which will result in
* a useless but harmless dead entry in the cache.

But that's not true (amymore?), because it does do pg_amproc lookups, and
they *can* change.

- Minor: Even if you force LookupOpclassInfo() to perform lookups again, it
doesn't work, because when an amproc entry doesn't exist anymore, the old
value in opcentry->supportProcs[i] survives, because nothing resets it.

Another thing: we also need to think harder about
RelationReloadIndexInfo() which is the cheap path for cache reload for
the index information. This is quite different from rd_options,
though, because we are going to need an extra facility to look at the
pg_attribute tuples for the indexes to get access to the new option
options? We touch that a bit in RelationBuildTupleDesc(), but not in
this cheap reload path. This is going to require more infrastructure,
at quick glance..
--
Michael