autovacuum crash due to null pointer

Started by Tom Lanealmost 18 years ago7 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

There's a fairly interesting crash here:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=jaguar&dt=2008-07-16%2003:00:02
The buildfarm was nice enough to provide a stack trace at the bottom of
the page, which shows clearly that autovac tried to pfree a null
pointer.

What I think happened was that the table that was selected to be
autovacuumed got dropped during the setup steps, leading get_rel_name()
to return NULL at line 2167. vacuum() itself would have fallen out
silently ... however, had it errored out, the attempts at error
reporting in the PG_CATCH block would have crashed.

I see that we already noticed and fixed this type of problem in
autovac_report_activity(), but do_autovacuum() didn't get the word.
Is there anyplace else in there with the same issue? For that matter,
why is autovac_report_activity repeating the lookups already done
at the outer level?

One other point is that the postmaster log just says

TRAP: FailedAssertion("!(pointer != ((void *)0))", File: "mcxt.c", Line: 580)
[487d6715.3a87:2] LOG: server process (PID 16885) was terminated by signal 6: Aborted

Could we get that to say "autovacuum worker" instead of "server"?

regards, tom lane

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#1)
Re: autovacuum crash due to null pointer

Tom Lane wrote:

There's a fairly interesting crash here:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=jaguar&dt=2008-07-16%2003:00:02
The buildfarm was nice enough to provide a stack trace at the bottom of
the page, which shows clearly that autovac tried to pfree a null
pointer.

What I think happened was that the table that was selected to be
autovacuumed got dropped during the setup steps, leading get_rel_name()
to return NULL at line 2167. vacuum() itself would have fallen out
silently ... however, had it errored out, the attempts at error
reporting in the PG_CATCH block would have crashed.

Ouch. This was introduced when we added the PG_TRY block to add the
errcontext and resume vacuuming with the next table; the original code
was clean about this problem.

I see that we already noticed and fixed this type of problem in
autovac_report_activity(), but do_autovacuum() didn't get the word.
Is there anyplace else in there with the same issue?

I'll have a more detailed look.

For that matter, why is autovac_report_activity repeating the lookups
already done at the outer level?

Ah, good catch -- it's because the outer code was added later. It
should be easy to pass the names down, I think.

One other point is that the postmaster log just says

TRAP: FailedAssertion("!(pointer != ((void *)0))", File: "mcxt.c", Line: 580)
[487d6715.3a87:2] LOG: server process (PID 16885) was terminated by signal 6: Aborted

Could we get that to say "autovacuum worker" instead of "server"?

Hmm, that would require moving code from HandleChildCrash to
CleanupBackend, I think. (We're creating the process name in
CleanupBackend without scanning the BackendList, which we would need to
figure out if it's a worker.)

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#1)
Re: autovacuum crash due to null pointer

Tom Lane wrote:

There's a fairly interesting crash here:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=jaguar&dt=2008-07-16%2003:00:02
The buildfarm was nice enough to provide a stack trace at the bottom of
the page, which shows clearly that autovac tried to pfree a null
pointer.

What I think happened was that the table that was selected to be
autovacuumed got dropped during the setup steps, leading get_rel_name()
to return NULL at line 2167. vacuum() itself would have fallen out
silently ... however, had it errored out, the attempts at error
reporting in the PG_CATCH block would have crashed.

I'm having a hard time reproducing this problem; I added a pg_usleep()
just before get_rel_name to have the chance to drop the table, but
strangely enough it doesn't return NULL. It seems that the cache entry
is not getting invalidated. Maybe there's something else that's needed
to reproduce the crash.

Anyway, I propose this patch in HEAD to close this hole. For 8.3 I'm
thinking in just rearranging the get_*_name calls and adding the goto.
Thoughts?

I think there's another hole here: what happens if the table exists when
get_rel_name is called, but the schema is dropped between that point and
calling get_namespace_name? It is pretty unlikely but can it be
discarded completely? What should we do in that case?

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#3)
Re: autovacuum crash due to null pointer

Alvaro Herrera wrote:

Anyway, I propose this patch in HEAD to close this hole. For 8.3 I'm
thinking in just rearranging the get_*_name calls and adding the goto.
Thoughts?

Sorry, really attached.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Attachments:

autovac-dropped-table.patchtext/x-diff; charset=us-asciiDownload+78-78
#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#3)
Re: autovacuum crash due to null pointer

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

What I think happened was that the table that was selected to be
autovacuumed got dropped during the setup steps, leading get_rel_name()
to return NULL at line 2167. vacuum() itself would have fallen out
silently ... however, had it errored out, the attempts at error
reporting in the PG_CATCH block would have crashed.

I'm having a hard time reproducing this problem; I added a pg_usleep()
just before get_rel_name to have the chance to drop the table, but
strangely enough it doesn't return NULL. It seems that the cache entry
is not getting invalidated. Maybe there's something else that's needed
to reproduce the crash.

I think cache invals would get noticed at points where we had to open
some relation, so you probably need the sleep somewhere earlier than
that. Or just throw in an AcceptInvalidationMessages() after the sleep.

I think there's another hole here: what happens if the table exists when
get_rel_name is called, but the schema is dropped between that point and
calling get_namespace_name? It is pretty unlikely but can it be
discarded completely? What should we do in that case?

I think you can just skip the table if any of those three calls return a
null. That probably means the table/schema/whatever got dropped, and
if for some reason there's a transient failure, it'll get vacuumed next
time anyway.

regards, tom lane

#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#5)
Re: autovacuum crash due to null pointer

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

I'm having a hard time reproducing this problem; I added a pg_usleep()
just before get_rel_name to have the chance to drop the table, but
strangely enough it doesn't return NULL. It seems that the cache entry
is not getting invalidated. Maybe there's something else that's needed
to reproduce the crash.

I think cache invals would get noticed at points where we had to open
some relation, so you probably need the sleep somewhere earlier than
that. Or just throw in an AcceptInvalidationMessages() after the sleep.

AcceptInvalidationMessages did the trick.

Patches for 8.3 and HEAD attached.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Attachments:

autovac-dropped-table-HEAD.patchtext/x-diff; charset=us-asciiDownload+81-79
autovac-dropped-table-83.patchtext/x-diff; charset=us-asciiDownload+19-16
#7Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#6)
Re: autovacuum crash due to null pointer

Alvaro Herrera wrote:

Patches for 8.3 and HEAD attached.

And applied.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.