canceling autovacuum task woes
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Me: Are you running any DDL commands anywhere in the cluster?
Customer: Nope, absolutely none.
Me: Well you've got to be running something somewhere or it wouldn't
be having a lock conflict.
Customer: OK, well I don't know of any. What should I do?
It would be awfully nice if the process that does the cancelling would
provide the same kind of reporting that we do for a deadlock: the
relevant lock tag, the PID of the process sending the cancel, and the
query string.
Personally, I'm starting to have a sneaky suspicion that there is
something actually broken here - that is, that there are lock
conflicts involved here other than the obvious one (SHARE UPDATE
EXCLUSIVE on the table) that are allowing autovac to get cancelled
more often than we realize. But whether that's true or not, the
current logging is wholly inadequate.
Thoughts? Anybody else have this problem?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote:
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Me: Are you running any DDL commands anywhere in the cluster?
Customer: Nope, absolutely none.
Me: Well you've got to be running something somewhere or it wouldn't
be having a lock conflict.
Customer: OK, well I don't know of any. What should I do?
It would be awfully nice if the process that does the cancelling would
provide the same kind of reporting that we do for a deadlock: the
relevant lock tag, the PID of the process sending the cancel, and the
query string.
Personally, I'm starting to have a sneaky suspicion that there is
something actually broken here - that is, that there are lock
conflicts involved here other than the obvious one (SHARE UPDATE
EXCLUSIVE on the table) that are allowing autovac to get cancelled
more often than we realize. But whether that's true or not, the
current logging is wholly inadequate.
Very, very, very quick guess: The relation extension lock?
Thoughts? Anybody else have this problem?
I have seen spuriously high occurrences of that message before, but I never
really investigated it.
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12-07-24 01:48 PM, Robert Haas wrote:
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Could autovacuum be compacting a lot of space at the end of the table?
This is described
in the thread
http://archives.postgresql.org/message-id/4D8DF88E.7080205@Yahoo.com
Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Me: Are you running any DDL commands anywhere in the cluster?
Customer: Nope, absolutely none.
Me: Well you've got to be running something somewhere or it wouldn't
be having a lock conflict.
Customer: OK, well I don't know of any. What should I do?
It would be awfully nice if the process that does the cancelling would
provide the same kind of reporting that we do for a deadlock: the
relevant lock tag, the PID of the process sending the cancel, and the
query string.
Personally, I'm starting to have a sneaky suspicion that there is
something actually broken here - that is, that there are lock
conflicts involved here other than the obvious one (SHARE UPDATE
EXCLUSIVE on the table) that are allowing autovac to get cancelled
more often than we realize. But whether that's true or not, the
current logging is wholly inadequate.
Thoughts? Anybody else have this problem?
On Tue, Jul 24, 2012 at 2:11 PM, Steve Singer <ssinger@ca.afilias.info> wrote:
On 12-07-24 01:48 PM, Robert Haas wrote:
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Could autovacuum be compacting a lot of space at the end of the table? This
is described
in the thread
http://archives.postgresql.org/message-id/4D8DF88E.7080205@Yahoo.com
You (and Andres) may well be right, but I think the way we find out is
to add some better logging.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 07/24/2012 01:48 PM, Robert Haas wrote:
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Me: Are you running any DDL commands anywhere in the cluster?
Customer: Nope, absolutely none.
Me: Well you've got to be running something somewhere or it wouldn't
be having a lock conflict.
Customer: OK, well I don't know of any. What should I do?
It would be awfully nice if the process that does the cancelling would
provide the same kind of reporting that we do for a deadlock: the
relevant lock tag, the PID of the process sending the cancel, and the
query string.
+1 for more information on this.
cheers
andrew
On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote:
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Me: Are you running any DDL commands anywhere in the cluster?
Customer: Nope, absolutely none.
Me: Well you've got to be running something somewhere or it wouldn't
be having a lock conflict.
Customer: OK, well I don't know of any. What should I do?
It would be awfully nice if the process that does the cancelling would
provide the same kind of reporting that we do for a deadlock: the
relevant lock tag, the PID of the process sending the cancel, and the
query string.
Hm, autovacuum is supposed to set an errcontext callback that would tell
you the table name it's working on at the time of the cancel. So if
even that is missing, something strange is going on.
No objections to the general idea of adding more info about the process
blocked on autovacuum.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Alvaro Herrera's message of Tue Jul 24 15:30:49 -0400 2012:
On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote:
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Hm, autovacuum is supposed to set an errcontext callback that would tell
you the table name it's working on at the time of the cancel. So if
even that is missing, something strange is going on.
Yep, it says:
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "alvherre.public.foo"
So at least that part seems pilot error more than anything else.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Yep, it says:
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "alvherre.public.foo"
So at least that part seems pilot error more than anything else.
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail. Here's an
example of what the output looks like:
LOG: sending cancel to blocking autovacuum PID 21595
DETAIL: Process 21618 waits for AccessExclusiveLock on relation 27863
of database 16384
STATEMENT: drop table if exists pgbench_accounts
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "rhaas.public.pgbench_accounts"
I think that's a lot more useful than just getting those last two lines...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
more-autovac-cancel-logging.patch (application/octet-stream)
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 39,44 ****
--- 39,45 ----
#include "access/twophase.h"
#include "access/xact.h"
#include "miscadmin.h"
+ #include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "replication/syncrep.h"
#include "storage/ipc.h"
***************
*** 1086,1097 **** ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
!(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
{
int pid = autovac->pid;
! elog(DEBUG2, "sending cancel to blocking autovacuum PID %d",
! pid);
! /* don't hold the lock across the kill() syscall */
! LWLockRelease(ProcArrayLock);
/* send the autovacuum worker Back to Old Kent Road */
if (kill(pid, SIGINT) < 0)
--- 1087,1115 ----
!(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
{
int pid = autovac->pid;
+ StringInfoData locktagbuf;
+ StringInfoData logbuf; /* errdetail for server log */
+
+ initStringInfo(&locktagbuf);
+ initStringInfo(&logbuf);
+ DescribeLockTag(&locktagbuf, &lock->tag);
+ appendStringInfo(&logbuf,
+ _("Process %d waits for %s on %s"),
+ MyProcPid,
+ GetLockmodeName(lock->tag.locktag_lockmethodid,
+ lockmode),
+ locktagbuf.data);
+
+ /* release lock as quickly as possible */
+ LWLockRelease(ProcArrayLock);
! ereport(LOG,
! (errmsg("sending cancel to blocking autovacuum PID %d",
! pid),
! errdetail_log("%s", logbuf.data)));
! pfree(logbuf.data);
! pfree(locktagbuf.data);
/* send the autovacuum worker Back to Old Kent Road */
if (kill(pid, SIGINT) < 0)
Excerpts from Robert Haas's message of Tue Jul 24 15:52:23 -0400 2012:
On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Yep, it says:
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "alvherre.public.foo"
So at least that part seems pilot error more than anything else.
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail. Here's an
example of what the output looks like:
LOG: sending cancel to blocking autovacuum PID 21595
DETAIL: Process 21618 waits for AccessExclusiveLock on relation 27863
of database 16384
STATEMENT: drop table if exists pgbench_accounts
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "rhaas.public.pgbench_accounts"
Looks great. Are you considering backpatching this?
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Excerpts from Robert Haas's message of Tue Jul 24 15:52:23 -0400 2012:
On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Yep, it says:
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "alvherre.public.foo"
So at least that part seems pilot error more than anything else.
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail. Here's an
example of what the output looks like:
LOG: sending cancel to blocking autovacuum PID 21595
DETAIL: Process 21618 waits for AccessExclusiveLock on relation 27863
of database 16384
STATEMENT: drop table if exists pgbench_accounts
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "rhaas.public.pgbench_accounts"
Looks great. Are you considering backpatching this?
Well, that would certainly make MY life easier. I am not sure whether
it would be in line with project policy, however.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail.
Uh, what's the added dependency on pgstat.h for? Looks sane to the
eyeball otherwise.
regards, tom lane
On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail.
Uh, what's the added dependency on pgstat.h for? Looks sane to the
eyeball otherwise.
Woops, that was leftovers from some earlier silliness that I (mostly)
removed before posting.
New version attached.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
more-autovac-cancel-logging.patch (application/octet-stream)
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 1086,1097 **** ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
!(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
{
int pid = autovac->pid;
! elog(DEBUG2, "sending cancel to blocking autovacuum PID %d",
! pid);
! /* don't hold the lock across the kill() syscall */
! LWLockRelease(ProcArrayLock);
/* send the autovacuum worker Back to Old Kent Road */
if (kill(pid, SIGINT) < 0)
--- 1086,1114 ----
!(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
{
int pid = autovac->pid;
+ StringInfoData locktagbuf;
+ StringInfoData logbuf; /* errdetail for server log */
+
+ initStringInfo(&locktagbuf);
+ initStringInfo(&logbuf);
+ DescribeLockTag(&locktagbuf, &lock->tag);
+ appendStringInfo(&logbuf,
+ _("Process %d waits for %s on %s"),
+ MyProcPid,
+ GetLockmodeName(lock->tag.locktag_lockmethodid,
+ lockmode),
+ locktagbuf.data);
+
+ /* release lock as quickly as possible */
+ LWLockRelease(ProcArrayLock);
! ereport(LOG,
! (errmsg("sending cancel to blocking autovacuum PID %d",
! pid),
! errdetail_log("%s", logbuf.data)));
! pfree(logbuf.data);
! pfree(locktagbuf.data);
/* send the autovacuum worker Back to Old Kent Road */
if (kill(pid, SIGINT) < 0)
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Looks great. Are you considering backpatching this?
Well, that would certainly make MY life easier. I am not sure whether
it would be in line with project policy, however.
+1 for a backpatch. Otherwise it'll be years before we gain any
information about the unexpected cancels that you think exist.
However, after looking some more at deadlock.c, I wonder whether
(a) this patch gives sufficient detail, and (b) whether there isn't a
problem that's obvious by inspection. It appears to me that as the
blocking_autovacuum_proc stuff is coded, it will finger an AV proc as
needing to be killed even though it may be several graph edges out from
the current proc. This means that with respect to (a), the connection
from the process doing the kill to the AV proc may be inadequately
documented by this patch, and with respect to (b), there might well be
cases where we found an AV proc somewhere in the graph traversal but
it's not actually guilty of blocking the current process ... especially
not after the queue reorderings that we may have done. I think I'd be
happier with that code if it restricted its AV targets to procs that
*directly* block the current process, which not incidentally would make
this amount of log detail sufficient.
regards, tom lane
On Jul 24, 2012, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Looks great. Are you considering backpatching this?
Well, that would certainly make MY life easier. I am not sure whether
it would be in line with project policy, however.
+1 for a backpatch. Otherwise it'll be years before we gain any
information about the unexpected cancels that you think exist.
OK, great.
However, after looking some more at deadlock.c, I wonder whether
(a) this patch gives sufficient detail, and (b) whether there isn't a
problem that's obvious by inspection. It appears to me that as the
blocking_autovacuum_proc stuff is coded, it will finger an AV proc as
needing to be killed even though it may be several graph edges out from
the current proc. This means that with respect to (a), the connection
from the process doing the kill to the AV proc may be inadequately
documented by this patch, and with respect to (b), there might well be
cases where we found an AV proc somewhere in the graph traversal but
it's not actually guilty of blocking the current process ... especially
not after the queue reorderings that we may have done. I think I'd be
happier with that code if it restricted its AV targets to procs that
*directly* block the current process, which not incidentally would make
this amount of log detail sufficient.
Uggh. Well, that certainly sounds like something that could cause spurious cancels - or excessively fast ones, since presumably if we limit it to things that directly block the current process, you'll always allow the full deadlock_timeout before nuking the autovac worker. So +1 for changing that.
Does an edge in this context mean any lock, or just an ungranted one? I assume the latter, which still leaves the question of where the edges are coming from in the first place.
...Robert
Robert Haas <robertmhaas@gmail.com> writes:
On Jul 24, 2012, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
... This means that with respect to (a), the connection
from the process doing the kill to the AV proc may be inadequately
documented by this patch, and with respect to (b), there might well be
cases where we found an AV proc somewhere in the graph traversal but
it's not actually guilty of blocking the current process ... especially
not after the queue reorderings that we may have done. I think I'd be
happier with that code if it restricted its AV targets to procs that
*directly* block the current process, which not incidentally would make
this amount of log detail sufficient.
Uggh. Well, that certainly sounds like something that could cause spurious cancels - or excessively fast ones, since presumably if we limit it to things that directly block the current process, you'll always allow the full deadlock_timeout before nuking the autovac worker. So +1 for changing that.
I think something as simple as the attached would do the trick. I've
verified that this still allows the expected cancel cases, though of
course it's harder to prove that it gives any benefit.
Does an edge in this context mean any lock, or just an ungranted one? I assume the latter, which still leaves the question of where the edges are coming from in the first place.
The deadlock code follows "hard" edges, which mean "A wants a lock that
B has already got", as well as "soft" edges, which mean "A wants a lock
that B also wants, and B is ahead of A in the queue for it". We don't
kill autovacs that are the end of soft edges, which I think is good,
but the fact that we follow them at all makes it a little unclear what
we might reach recursively. Your point about determinacy of the timeout
is probably even a better argument for only allowing the direct blockee
to wield the knife.
regards, tom lane
On Tue, 2012-07-24 at 16:14 -0400, Robert Haas wrote:
On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail.
Uh, what's the added dependency on pgstat.h for? Looks sane to the
eyeball otherwise.
Woops, that was leftovers from some earlier silliness that I (mostly)
removed before posting.
New version attached.
The detail message should have a period at the end.
On Mon, Aug 13, 2012 at 11:59 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
On Tue, 2012-07-24 at 16:14 -0400, Robert Haas wrote:
On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail.
Uh, what's the added dependency on pgstat.h for? Looks sane to the
eyeball otherwise.
Woops, that was leftovers from some earlier silliness that I (mostly)
removed before posting.
New version attached.
The detail message should have a period at the end.
Oops. Fixed, sorry it took me so long to get to this.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company