canceling autovacuum task woes
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Me: Are you running any DDL commands anywhere in the cluster?
Customer: Nope, absolutely none.
Me: Well you've got to be running something somewhere or it wouldn't
be having a lock conflict.
Customer: OK, well I don't know of any. What should I do?
It would be awfully nice if the process that does the cancelling would
provide the same kind of reporting that we do for a deadlock: the
relevant lock tag, the PID of the process sending the cancel, and the
query string.
Personally, I'm starting to have a sneaky suspicion that there is
something actually broken here - that is, that there are lock
conflicts involved here other than the obvious one (SHARE UPDATE
EXCLUSIVE on the table) that are allowing autovac to get cancelled
more often than we realize. But whether that's true or not, the
current logging is wholly inadequate.
Thoughts? Anybody else have this problem?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote:
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Me: Are you running any DDL commands anywhere in the cluster?
Customer: Nope, absolutely none.
Me: Well you've got to be running something somewhere or it wouldn't
be having a lock conflict.
Customer: OK, well I don't know of any. What should I do?
It would be awfully nice if the process that does the cancelling would
provide the same kind of reporting that we do for a deadlock: the
relevant lock tag, the PID of the process sending the cancel, and the
query string.
Personally, I'm starting to have a sneaky suspicion that there is
something actually broken here - that is, that there are lock
conflicts involved here other than the obvious one (SHARE UPDATE
EXCLUSIVE on the table) that are allowing autovac to get cancelled
more often than we realize. But whether that's true or not, the
current logging is wholly inadequate.
Very, very, very quick guess: The relation extension lock?
Thoughts? Anybody else have this problem?
I have seen spuriously high occurrences of that message before, but I never
really investigated it.
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12-07-24 01:48 PM, Robert Haas wrote:
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Could autovacuum be compacting a lot of space at the end of the table?
This is described
in the thread
http://archives.postgresql.org/message-id/4D8DF88E.7080205@Yahoo.com
Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Me: Are you running any DDL commands anywhere in the cluster?
Customer: Nope, absolutely none.
Me: Well you've got to be running something somewhere or it wouldn't
be having a lock conflict.
Customer: OK, well I don't know of any. What should I do?
It would be awfully nice if the process that does the cancelling would
provide the same kind of reporting that we do for a deadlock: the
relevant lock tag, the PID of the process sending the cancel, and the
query string.
Personally, I'm starting to have a sneaky suspicion that there is
something actually broken here - that is, that there are lock
conflicts involved here other than the obvious one (SHARE UPDATE
EXCLUSIVE on the table) that are allowing autovac to get cancelled
more often than we realize. But whether that's true or not, the
current logging is wholly inadequate.
Thoughts? Anybody else have this problem?
On Tue, Jul 24, 2012 at 2:11 PM, Steve Singer <ssinger@ca.afilias.info> wrote:
On 12-07-24 01:48 PM, Robert Haas wrote:
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Could autovacuum be compacting a lot of space at the end of the table? This
is described
in the thread
http://archives.postgresql.org/message-id/4D8DF88E.7080205@Yahoo.com
You (and Andres) may well be right, but I think the way we find out is
to add some better logging.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 07/24/2012 01:48 PM, Robert Haas wrote:
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Me: Are you running any DDL commands anywhere in the cluster?
Customer: Nope, absolutely none.
Me: Well you've got to be running something somewhere or it wouldn't
be having a lock conflict.
Customer: OK, well I don't know of any. What should I do?
It would be awfully nice if the process that does the cancelling would
provide the same kind of reporting that we do for a deadlock: the
relevant lock tag, the PID of the process sending the cancel, and the
query string.
+1 for more information on this.
cheers
andrew
On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote:
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Me: Are you running any DDL commands anywhere in the cluster?
Customer: Nope, absolutely none.
Me: Well you've got to be running something somewhere or it wouldn't
be having a lock conflict.
Customer: OK, well I don't know of any. What should I do?
It would be awfully nice if the process that does the cancelling would
provide the same kind of reporting that we do for a deadlock: the
relevant lock tag, the PID of the process sending the cancel, and the
query string.
Hm, autovacuum is supposed to set an errcontext callback that would tell
you the table name it's working on at the time of the cancel. So if
even that is missing, something strange is going on.
No objections to the general idea of adding more info about the process
blocked on autovacuum.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Alvaro Herrera's message of Tue Jul 24 15:30:49 -0400 2012:
On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote:
I am running into a lot of customer situations where the customer
reports that "canceling autovacuum task" shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.
Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Hm, autovacuum is supposed to set an errcontext callback that would tell
you the table name it's working on at the time of the cancel. So if
even that is missing, something strange is going on.
Yep, it says:
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "alvherre.public.foo"
So at least that part seems pilot error more than anything else.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Yep, it says:
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "alvherre.public.foo"
So at least that part seems pilot error more than anything else.
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail. Here's an
example of what the output looks like:
LOG: sending cancel to blocking autovacuum PID 21595
DETAIL: Process 21618 waits for AccessExclusiveLock on relation 27863
of database 16384
STATEMENT: drop table if exists pgbench_accounts
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "rhaas.public.pgbench_accounts"
I think that's a lot more useful than just getting those last two lines...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
more-autovac-cancel-logging.patch (application/octet-stream)
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 39,44 ****
--- 39,45 ----
#include "access/twophase.h"
#include "access/xact.h"
#include "miscadmin.h"
+ #include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "replication/syncrep.h"
#include "storage/ipc.h"
***************
*** 1086,1097 **** ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
!(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
{
int pid = autovac->pid;
! elog(DEBUG2, "sending cancel to blocking autovacuum PID %d",
! pid);
! /* don't hold the lock across the kill() syscall */
! LWLockRelease(ProcArrayLock);
/* send the autovacuum worker Back to Old Kent Road */
if (kill(pid, SIGINT) < 0)
--- 1087,1115 ----
!(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
{
int pid = autovac->pid;
+ StringInfoData locktagbuf;
+ StringInfoData logbuf; /* errdetail for server log */
+
+ initStringInfo(&locktagbuf);
+ initStringInfo(&logbuf);
+ DescribeLockTag(&locktagbuf, &lock->tag);
+ appendStringInfo(&logbuf,
+ _("Process %d waits for %s on %s"),
+ MyProcPid,
+ GetLockmodeName(lock->tag.locktag_lockmethodid,
+ lockmode),
+ locktagbuf.data);
+
+ /* release lock as quickly as possible */
+ LWLockRelease(ProcArrayLock);
! ereport(LOG,
! (errmsg("sending cancel to blocking autovacuum PID %d",
! pid),
! errdetail_log("%s", logbuf.data)));
! pfree(logbuf.data);
! pfree(locktagbuf.data);
/* send the autovacuum worker Back to Old Kent Road */
if (kill(pid, SIGINT) < 0)
Excerpts from Robert Haas's message of Tue Jul 24 15:52:23 -0400 2012:
On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Yep, it says:
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "alvherre.public.foo"
So at least that part seems pilot error more than anything else.
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail. Here's an
example of what the output looks like:
LOG: sending cancel to blocking autovacuum PID 21595
DETAIL: Process 21618 waits for AccessExclusiveLock on relation 27863
of database 16384
STATEMENT: drop table if exists pgbench_accounts
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "rhaas.public.pgbench_accounts"
Looks great. Are you considering backpatching this?
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Excerpts from Robert Haas's message of Tue Jul 24 15:52:23 -0400 2012:
On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Yep, it says:
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "alvherre.public.foo"
So at least that part seems pilot error more than anything else.
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail. Here's an
example of what the output looks like:
LOG: sending cancel to blocking autovacuum PID 21595
DETAIL: Process 21618 waits for AccessExclusiveLock on relation 27863
of database 16384
STATEMENT: drop table if exists pgbench_accounts
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "rhaas.public.pgbench_accounts"
Looks great. Are you considering backpatching this?
Well, that would certainly make MY life easier. I am not sure whether
it would be in line with project policy, however.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail.
Uh, what's the added dependency on pgstat.h for? Looks sane to the
eyeball otherwise.
regards, tom lane
On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail.
Uh, what's the added dependency on pgstat.h for? Looks sane to the
eyeball otherwise.
Woops, that was leftovers from some earlier silliness that I (mostly)
removed before posting.
New version attached.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
more-autovac-cancel-logging.patch (application/octet-stream)
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 1086,1097 **** ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
!(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
{
int pid = autovac->pid;
! elog(DEBUG2, "sending cancel to blocking autovacuum PID %d",
! pid);
! /* don't hold the lock across the kill() syscall */
! LWLockRelease(ProcArrayLock);
/* send the autovacuum worker Back to Old Kent Road */
if (kill(pid, SIGINT) < 0)
--- 1086,1114 ----
!(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
{
int pid = autovac->pid;
+ StringInfoData locktagbuf;
+ StringInfoData logbuf; /* errdetail for server log */
+
+ initStringInfo(&locktagbuf);
+ initStringInfo(&logbuf);
+ DescribeLockTag(&locktagbuf, &lock->tag);
+ appendStringInfo(&logbuf,
+ _("Process %d waits for %s on %s"),
+ MyProcPid,
+ GetLockmodeName(lock->tag.locktag_lockmethodid,
+ lockmode),
+ locktagbuf.data);
+
+ /* release lock as quickly as possible */
+ LWLockRelease(ProcArrayLock);
! ereport(LOG,
! (errmsg("sending cancel to blocking autovacuum PID %d",
! pid),
! errdetail_log("%s", logbuf.data)));
! pfree(logbuf.data);
! pfree(locktagbuf.data);
/* send the autovacuum worker Back to Old Kent Road */
if (kill(pid, SIGINT) < 0)
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Looks great. Are you considering backpatching this?
Well, that would certainly make MY life easier. I am not sure whether
it would be in line with project policy, however.
+1 for a backpatch. Otherwise it'll be years before we gain any
information about the unexpected cancels that you think exist.
However, after looking some more at deadlock.c, I wonder whether
(a) this patch gives sufficient detail, and (b) whether there isn't a
problem that's obvious by inspection. It appears to me that as the
blocking_autovacuum_proc stuff is coded, it will finger an AV proc as
needing to be killed even though it may be several graph edges out from
the current proc. This means that with respect to (a), the connection
from the process doing the kill to the AV proc may be inadequately
documented by this patch, and with respect to (b), there might well be
cases where we found an AV proc somewhere in the graph traversal but
it's not actually guilty of blocking the current process ... especially
not after the queue reorderings that we may have done. I think I'd be
happier with that code if it restricted its AV targets to procs that
*directly* block the current process, which not incidentally would make
this amount of log detail sufficient.
regards, tom lane
On Jul 24, 2012, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Looks great. Are you considering backpatching this?
Well, that would certainly make MY life easier. I am not sure whether
it would be in line with project policy, however.
+1 for a backpatch. Otherwise it'll be years before we gain any
information about the unexpected cancels that you think exist.
OK, great.
However, after looking some more at deadlock.c, I wonder whether
(a) this patch gives sufficient detail, and (b) whether there isn't a
problem that's obvious by inspection. It appears to me that as the
blocking_autovacuum_proc stuff is coded, it will finger an AV proc as
needing to be killed even though it may be several graph edges out from
the current proc. This means that with respect to (a), the connection
from the process doing the kill to the AV proc may be inadequately
documented by this patch, and with respect to (b), there might well be
cases where we found an AV proc somewhere in the graph traversal but
it's not actually guilty of blocking the current process ... especially
not after the queue reorderings that we may have done. I think I'd be
happier with that code if it restricted its AV targets to procs that
*directly* block the current process, which not incidentally would make
this amount of log detail sufficient.
Uggh. Well, that certainly sounds like something that could cause spurious cancels - or excessively fast ones, since presumably if we limit it to things that directly block the current process, you'll always allow the full deadlock_timeout before nuking the autovac worker. So +1 for changing that.
Does an edge in this context mean any lock, or just an ungranted one? I assume the latter, which still leaves the question of where the edges are coming from in the first place.
...Robert
Robert Haas <robertmhaas@gmail.com> writes:
On Jul 24, 2012, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
... This means that with respect to (a), the connection
from the process doing the kill to the AV proc may be inadequately
documented by this patch, and with respect to (b), there might well be
cases where we found an AV proc somewhere in the graph traversal but
it's not actually guilty of blocking the current process ... especially
not after the queue reorderings that we may have done. I think I'd be
happier with that code if it restricted its AV targets to procs that
*directly* block the current process, which not incidentally would make
this amount of log detail sufficient.
Uggh. Well, that certainly sounds like something that could cause spurious cancels - or excessively fast ones, since presumably if we limit it to things that directly block the current process, you'll always allow the full deadlock_timeout before nuking the autovac worker. So +1 for changing that.
I think something as simple as the attached would do the trick. I've
verified that this still allows the expected cancel cases, though of
course it's harder to prove that it gives any benefit.
Does an edge in this context mean any lock, or just an ungranted one? I assume the latter, which still leaves the question of where the edges are coming from in the first place.
The deadlock code follows "hard" edges, which mean "A wants a lock that
B has already got", as well as "soft" edges, which mean "A wants a lock
that B also wants, and B is ahead of A in the queue for it". We don't
kill autovacs that are the end of soft edges, which I think is good,
but the fact that we follow them at all makes it a little unclear what
we might reach recursively. Your point about determinacy of the timeout
is probably even a better argument for only allowing the direct blockee
to wield the knife.
regards, tom lane
On Tue, 2012-07-24 at 16:14 -0400, Robert Haas wrote:
On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail.
Uh, what's the added dependency on pgstat.h for? Looks sane to the
eyeball otherwise.
Woops, that was leftovers from some earlier silliness that I (mostly)
removed before posting.
New version attached.
The detail message should have a period at the end.
On Mon, Aug 13, 2012 at 11:59 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
On Tue, 2012-07-24 at 16:14 -0400, Robert Haas wrote:
On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Yeah, you're right. So you do get the table name. But you don't get
the cause, which is what you really need to understand why it's
happening. Attached is a patch that adds some more detail.
Uh, what's the added dependency on pgstat.h for? Looks sane to the
eyeball otherwise.
Woops, that was leftovers from some earlier silliness that I (mostly)
removed before posting.
New version attached.
The detail message should have a period at the end.
Oops. Fixed, sorry it took me so long to get to this.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company