Backend crashes - what's going on here???

Started by Noname, almost 28 years ago (4 messages)

jwieck@debis.com

almost 28 years ago

Hey,

the current snapshot dumps core the fourth time I run

REVOKE ALL ON pg_user FROM public;

It crashes in other situations too, but this is the simplest to
reproduce. The segmentation fault happens in nocachegetattr()
due to a destroyed tuple descriptor (natts = 0!!! and the
other fields don't look good either) for syscache 21 (USENAME).
But the destruction must happen somewhere else.

With the 02/13 snapshot I don't have any problems with it,
but I cannot find the error with diff.

BTW: Doing last checks on view permissions - sending a patch
soon.

Until later, Jan

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

Bruce Momjian

maillist@candle.pha.pa.us

almost 28 years ago

In reply to: Noname (#1)

Re: [HACKERS] Backend crashes - what's going on here???

Yep, I saw this too when testing my password acl null patch. Couldn't
reproduce it, so I thought it was a fluke.

--
Bruce Momjian
maillist@candle.pha.pa.us

Noname

jwieck@debis.com

almost 28 years ago

In reply to: Bruce Momjian (#2)

Re: [HACKERS] Backend crashes - what's going on here???

Wow - gdb is a nice tool

I now have a clue about what causes the crash. It happens when
pg_user is looked up in the syscache. It must have to do with
the fact that during initialization, SetUserId() in miscinit.c
fetches the user tuple using SearchSysCacheTuple(). This
initializes SysCache entry 21, but later, at transaction
start, the cache reset frees the memory for the cc_tupdesc in
the cache. So I assume that when SetUserId() is called, the
syscache is not ready for use yet.

I don't have a solution right now. Is someone more familiar
with the handling of the syscache during startup? Is
SetUserId() just called a little too early, or is the syscache
unusable during InitPostgres altogether?

But the fact that CatalogCacheInitializeCache() is called
only for pg_user during startup makes me feel sure that
looking up the user with SearchSysCacheTuple() is wrong at
this point. I think it should be done without using the
syscache.

Back on Monday - maybe with a solution.

Jan

Noname

jwieck@debis.com

almost 28 years ago

In reply to: Noname (#3)

Re: [HACKERS] Backend crashes - what's going on here???

Uhhh - much uglier than I first thought :-(

The crash is due to the cache invalidations on updates to
pg_class (and it can also happen on updates to pg_attribute
and others).

When a tuple in pg_class or one of the others is modified, its
cache invalidation causes a RelationFlushRelation() for the
affected relation. Revoking from pg_user, for example, means
that RelationFlushRelation() is called for pg_user, and this
frees the tuple descriptor. The same tuple descriptor is also
used in the SysCache, which isn't flushed or freed!
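
The aliasing described above can be illustrated with a toy model. This is a minimal sketch in Python, not PostgreSQL's C code: the classes `TupleDesc`, `RelCache` and `SysCache`, the function `demo()`, and the attribute count of 8 are all hypothetical names and values chosen for illustration. Since Python has no free(), the destroyed descriptor is simulated by zeroing `natts`, which matches the symptom seen in the core dump.

```python
from dataclasses import dataclass, field


@dataclass
class TupleDesc:
    """Toy stand-in for a tuple descriptor; only natts is modelled."""
    natts: int


@dataclass
class RelCache:
    """Toy relation cache: relation name -> descriptor it owns."""
    entries: dict = field(default_factory=dict)

    def flush_relation(self, relname: str) -> None:
        # Models RelationFlushRelation(): the descriptor is destroyed.
        # Destruction is simulated by zeroing natts (assumption).
        desc = self.entries.pop(relname, None)
        if desc is not None:
            desc.natts = 0


@dataclass
class SysCache:
    """Toy system cache that borrows the relcache's descriptor."""
    relname: str
    tupdesc: TupleDesc  # shared reference, not a copy

    def lookup(self) -> int:
        # Models the attribute fetch that fails on a destroyed descriptor.
        if self.tupdesc.natts == 0:
            raise RuntimeError("destroyed tuple descriptor (natts = 0)")
        return self.tupdesc.natts


def demo() -> str:
    relcache = RelCache()
    relcache.entries["pg_user"] = TupleDesc(natts=8)  # 8 is arbitrary
    usename_cache = SysCache("pg_user", relcache.entries["pg_user"])
    usename_cache.lookup()              # works before the flush
    relcache.flush_relation("pg_user")  # e.g. after REVOKE on pg_user
    try:
        usename_cache.lookup()          # the syscache was never told
    except RuntimeError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(demo())
```

The point of the model is only that the second cache keeps a reference to something the first cache considers its own to destroy; in the real backend the result is a read of freed memory rather than a clean exception.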

There are more ways to trigger this. A simple

UPDATE pg_class SET relname = relname;

makes the backend crash on the very next command. And

REVOKE ALL ON pg_class FROM public;

crashes immediately because the cache invalidation needs the
just invalidated heap tuple for pg_class in pg_class. Sounds
a bit hairy.

I think this is also the reason for the backend crashes I had
when defining rewrite rules on relations that already exist
(I expect others have already noticed them too).

I still don't have a solution, but this must get fixed
before releasing 6.3. I think a walk through the SysCache on
RelationFlushRelation(), checking whether this relation is in
the SysCache and resetting that cache if it is, can help
(except for the revoke on pg_class).
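
The proposed walk could look roughly like the following. Again this is a toy Python sketch of the idea as stated in this message, not the backend's C code and not necessarily what was eventually committed: `SysCache`, `reset()` and `flush_relation()` are hypothetical names, and the relcache is modelled as a plain dict. It does not address the pg_class case mentioned above.

```python
class SysCache:
    """Toy system cache built on one relation (hypothetical model)."""

    def __init__(self, cache_id: int, relname: str) -> None:
        self.cache_id = cache_id
        self.relname = relname
        self.tupdesc = None  # borrowed from the relcache on first use
        self.tuples = {}     # cached lookup results

    def reset(self) -> None:
        # Drop the borrowed descriptor and all cached tuples so the
        # next lookup has to re-initialize the cache from scratch.
        self.tupdesc = None
        self.tuples.clear()


def flush_relation(relcache: dict, syscaches: list, relname: str) -> list:
    """Flush relname from the relcache and reset every syscache on it.

    Returns the ids of the caches that were reset.
    """
    relcache.pop(relname, None)
    reset_ids = []
    for cache in syscaches:  # the "walk through the SysCache"
        if cache.relname == relname:
            cache.reset()
            reset_ids.append(cache.cache_id)
    return reset_ids


if __name__ == "__main__":
    relcache = {"pg_user": object()}
    usename = SysCache(21, "pg_user")  # 21 is the USENAME cache id
    usename.tupdesc = relcache["pg_user"]
    print(flush_relation(relcache, [usename], "pg_user"))
```

Resetting the whole cache rather than patching the descriptor in place keeps the sketch simple: a reset cache holds no reference to the flushed relation, so nothing stale can be dereferenced on the next lookup.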

Append this to TODO!

Jan