We need to log aborted autovacuums

Started by Josh Berkus · over 15 years ago · 51 messages · pgsql-hackers
#1 Josh Berkus
josh@agliodbs.com

Hackers,

I've been trying to diagnose in a production database why certain tables
never get autovacuumed despite having a substantial % of updates. The
obvious reason is locks blocking autovacuum from vacuuming the table ...
but the trick is we don't log such blocking behavior, at all. This
means that there is no possible way for a user to figure out *why* his
tables aren't getting autovacuumed.

And yes, this is a common problem. A quick survey on IRC found 3 active
users on channel (out of 20 or so) who'd encountered it. The current
case I'm looking at is a table with over 70% bloat which hasn't been
autovacuumed since the database was upgraded a month ago.

What I'd like to do is add some logging code to autovacuum.c so that if
log_autovacuum is any value other than -1, failure to vacuum due to
locks gets logged. Does this make sense?

And does anyone have suggestions on how users can diagnose this on older
versions?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#2 Joshua D. Drake
jd@commandprompt.com
In reply to: Josh Berkus (#1)
Re: We need to log aborted autovacuums

On Wed, 2010-11-17 at 13:46 -0800, Josh Berkus wrote:

Hackers,

I've been trying to diagnose in a production database why certain tables
never get autovacuumed despite having a substantial % of updates. The
obvious reason is locks blocking autovacuum from vacuuming the table ...
but the trick is we don't log such blocking behavior, at all. This
means that there is no possible way for a user to figure out *why* his
tables aren't getting autovacuumed.

Yes we do log this.

JD

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#1)
Re: We need to log aborted autovacuums

Josh Berkus <josh@agliodbs.com> writes:

I've been trying to diagnose in a production database why certain tables
never get autovacuumed despite having a substantial % of updates. The
obvious reason is locks blocking autovacuum from vacuuming the table ...
but the trick is we don't log such blocking behavior, at all. This
means that there is no possible way for a user to figure out *why* his
tables aren't getting autovacuumed.

And yes, this is a common problem. A quick survey on IRC found 3 active
users on channel (out of 20 or so) who'd encountered it. The current
case I'm looking at is a table with over 70% bloat which hasn't been
autovacuumed since the database was upgraded a month ago.

What I'd like to do is add some logging code to autovacuum.c so that if
log_autovacuum is any value other than -1, failure to vacuum due to
locks gets logged. Does this make sense?

It's hard to tell, because you're just handwaving about what it is you
think isn't being logged; nor is it clear whether you have any evidence
that locks are the problem. Offhand I'd think it at least as likely
that autovacuum thinks it doesn't need to do anything, perhaps because
of a statistics issue. There *is* an elog(DEBUG3) in autovacuum.c
that reports whether autovac thinks a table needs vacuumed/analyzed ...
maybe that needs to be a tad more user-accessible.

regards, tom lane

#4 Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#3)
Re: We need to log aborted autovacuums

It's hard to tell, because you're just handwaving about what it is you
think isn't being logged; nor is it clear whether you have any evidence
that locks are the problem. Offhand I'd think it at least as likely
that autovacuum thinks it doesn't need to do anything, perhaps because
of a statistics issue. There *is* an elog(DEBUG3) in autovacuum.c
that reports whether autovac thinks a table needs vacuumed/analyzed ...
maybe that needs to be a tad more user-accessible.

Yeah, it would be really good to be able to log that without bumping the
log levels of the server in general to DEBUG3. On a busy production
server, using any of the DEBUG levels is pretty much out of the question
... they can produce up to 1GB/minute in output.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#4)
Re: We need to log aborted autovacuums

Josh Berkus <josh@agliodbs.com> writes:

There *is* an elog(DEBUG3) in autovacuum.c
that reports whether autovac thinks a table needs vacuumed/analyzed ...
maybe that needs to be a tad more user-accessible.

Yeah, it would be really good to be able to log that without bumping the
log levels of the server in general to DEBUG3. On a busy production
server, using any of the DEBUG levels is pretty much out of the question
... they can produce up to 1GB/minute in output.

Well, the way to deal with that would be to add a GUC that enables
reporting of those messages at LOG level. But it's a bit hard to argue
that we need such a thing without more evidence. Maybe you could just
locally modify the DEBUG3 to LOG and see whether it teaches you
anything?

regards, tom lane

#6 Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Tom Lane (#5)
Re: We need to log aborted autovacuums

On Thu, Nov 18, 2010 at 08:35, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah, it would be really good to be able to log that without bumping the
log levels of the server in general to DEBUG3.

Well, the way to deal with that would be to add a GUC that enables
reporting of those messages at LOG level.  But it's a bit hard to argue
that we need such a thing without more evidence.  Maybe you could just
locally modify the DEBUG3 to LOG and see whether it teaches you
anything?

How about adding a special role for autovacuum, and running autovacuum
by the role instead of the database owner? If we have "autovacuum" role
for autovacuum processes, we could set log_min_messages to DEBUG3 for
only the role with per-user configuration.
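
For what it's worth, the per-role mechanism itself already exists; if such
an "autovacuum" role existed (it doesn't today -- that's the proposal), the
setting would look something like this sketch:

```sql
-- Hypothetical: "autovacuum" here is the role proposed above, not an
-- existing one. log_min_messages is superuser-only to set.
ALTER ROLE autovacuum SET log_min_messages = 'debug3';
```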

--
Itagaki Takahiro

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Itagaki Takahiro (#6)
Re: We need to log aborted autovacuums

Itagaki Takahiro <itagaki.takahiro@gmail.com> writes:

On Thu, Nov 18, 2010 at 08:35, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, the way to deal with that would be to add a GUC that enables
reporting of those messages at LOG level.  But it's a bit hard to argue
that we need such a thing without more evidence.  Maybe you could just
locally modify the DEBUG3 to LOG and see whether it teaches you
anything?

How about adding a special role for autovacuum, and running autovacuum
by the role instead of the database owner? If we have "autovacuum" role
for autovacuum processes, we could set log_min_messages to DEBUG3 for
only the role with per-user configuration.

That seems like a major kluge ... and anyway it doesn't fix the problem,
because DEBUG3 is still going to result in a lot of unwanted log output,
even if it's confined to autovacuum.

regards, tom lane

#8 Greg Smith
gsmith@gregsmith.com
In reply to: Josh Berkus (#1)
Re: We need to log aborted autovacuums

Josh Berkus wrote:

I've been trying to diagnose in a production database why certain tables
never get autovacuumed despite having a substantial % of updates. The
obvious reason is locks blocking autovacuum from vacuuming the table ...

Missed this discussion when it popped up but have plenty to say about it
now. What I do here is look for such anomalies using
pg_stat_user_tables: checking whether the dead row count has exceeded the
vacuum threshold. That comparison is logged in the code at DEBUG3:

elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f)",
     NameStr(classForm->relname),
     vactuples, vacthresh, anltuples, anlthresh);

But a rough computation isn't too hard to derive in a report, so long as
you haven't customized per-table parameters. As you suggested in your
message, the really bad cases here stick out a whole lot. If you pay
the slightest amount of attention to the dead row percentages they jump
right out at you. This all works easily on any version back to 8.3.
Not having as much relevant data stored in pg_stat_user_tables makes the
problem cases less obvious to spot in older versions.
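
The rough userland computation described above can be sketched as a query
like this one (my sketch, not from the original message), using the
documented formula vacuum threshold = autovacuum_vacuum_threshold +
autovacuum_vacuum_scale_factor * reltuples, and assuming no per-table
storage-parameter overrides:

```sql
-- Tables whose dead-row count exceeds the global autovacuum threshold.
-- Assumes no per-table autovacuum settings (reloptions) are in play.
SELECT s.relname,
       s.n_dead_tup,
       s.last_autovacuum,
       current_setting('autovacuum_vacuum_threshold')::integer
         + current_setting('autovacuum_vacuum_scale_factor')::numeric
           * c.reltuples AS vac_threshold
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.n_dead_tup >
        current_setting('autovacuum_vacuum_threshold')::integer
      + current_setting('autovacuum_vacuum_scale_factor')::numeric
        * c.reltuples
ORDER BY s.n_dead_tup DESC;
```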

If I start seeing these badly maintained tables and suspect locking is
getting in the way, I then dump traces from pg_locks+pg_stat_activity
often enough that I can estimate how often someone has an interfering
lock and what they're doing.
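
Those pg_locks + pg_stat_activity snapshots can be taken with a query
along these lines (a sketch using the column names of that era,
procpid/current_query; adjust for newer versions), run periodically from
cron:

```sql
-- Snapshot of who holds or waits for locks on user relations,
-- joined to what each backend is running (8.3-9.1 column names).
SELECT now() AS sampled_at,
       l.relation::regclass AS rel,
       l.mode,
       l.granted,
       a.procpid,
       a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE l.relation IS NOT NULL
ORDER BY rel, l.granted DESC;
```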

Should the log level on this message go up from DEBUG3? I could see
rewriting it so that it logs at DEBUG1 instead when
Log_autovacuum_min_duration is set *and* when the trigger threshold is
crossed, and at DEBUG3 the rest of the time. Given you can derive this
with a bit of work in userland, I don't see this even being justified as
an INFO or LOG level message. Anytime I can script a SQL-level monitor
for something that's easy to tie into Nagios or something, I greatly
prefer that to log file scraping for it anyway.

What I'd like to do is add some logging code to autovacuum.c so that if
log_autovacuum is any value other than -1, failure to vacuum due to
locks gets logged. Does this make sense?

The general idea is interesting and probably more productive for the
situation you theorize is happening than messing with the logging
discussed above. But that's not where the code needs to go--the lock
isn't opened until much further down the function call stack. The
attached quickie patch, only tested for compilation, probably does what you
want here. Since this would eliminate the messy follow-up step I
sometimes have gone through, dumping pg_locks data to confirm or rule
out locking issues messing with AV processing, I can see some potential
that it may have simplified situations I've run into in the past. And
it's not out of line with the logging verbosity of similar failure mode
tests that follow it. Right now failure to acquire a lock is just not
considered a log-worthy issue, and I agree that it's worth considering
whether it should be.

If you could gather more info on whether this logging catches the
problem cases you're seeing, that would really be the right test for the
patch's usefulness. I'd give you solid 50/50 odds that you've correctly
diagnosed the issue, and knowing for sure would make advocating for this
logging a pretty easy sell to me at least.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Attachments:

av-lock-failure-v1.diff (text/x-patch; +8 -0)
#9 Magnus Hagander
magnus@hagander.net
In reply to: Greg Smith (#8)
Re: We need to log aborted autovacuums

On Wed, Jan 5, 2011 at 07:55, Greg Smith <greg@2ndquadrant.com> wrote:

<snip>

a bit of work in userland, I don't see this even being justified as an INFO
or LOG level message.  Anytime I can script a SQL-level monitor for
something that's easy to tie into Nagios or something, I greatly prefer that
to log file scraping for it anyway.

+<INF-1>

Log files can be very useful for details, but they suck for noticing
the pattern in the first place :-)

<snip>

verbosity of similar failure mode tests that follow it.  Right now failure
to acquire a lock is just not considered a log-worthy issue, and I agree
that it's worth considering whether it should be.

Or should it perhaps be a per-table counter in pg_stat_user_tables,
given your statement above?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#10 Josh Berkus
josh@agliodbs.com
In reply to: Greg Smith (#8)
Re: We need to log aborted autovacuums

If you could gather more info on whether this logging catches the
problem cases you're seeing, that would really be the right test for the
patch's usefulness. I'd give you solid 50/50 odds that you've correctly
diagnosed the issue, and knowing for sure would make advocating for this
logging a pretty easy sell to me at least.

Well, I already resolved the issue through polling pg_locks. The issue
was idle-in-transaction (IIT) sessions which lasted up to an hour, which
we fixed in the application, so I don't have the test case available
anymore. I'd have
to generate a synthetic test case, and since I agree that a SQL-callable
diagnostic is superior to logging, I don't think we should pursue the
log levels further.

Or should it perhaps be a per-table counter in pg_stat_user_tables,
given your statement above?

Or even a timestamp: last_autovacuum_attempt, which would record the
last time autovacuum was tried. If that's fairly recent and you have a
large number of dead rows, you know what kind of problem you have and
can turn on debug.
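
With the columns that already exist, a rough version of that check is
possible today (my sketch; the 10000-row and one-day cutoffs are arbitrary
placeholders, not values from the thread):

```sql
-- Tables with many dead rows but no recent recorded autovacuum.
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
  AND (last_autovacuum IS NULL
       OR last_autovacuum < now() - interval '1 day');
```

What this can't show is whether autovacuum *tried* and failed, which is
exactly the gap the proposed last_autovacuum_attempt column would fill.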

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#11 Robert Treat
xzilla@users.sourceforge.net
In reply to: Josh Berkus (#10)
Re: We need to log aborted autovacuums

On Wed, Jan 5, 2011 at 2:27 PM, Josh Berkus <josh@agliodbs.com> wrote:

If you could gather more info on whether this logging catches the
problem cases you're seeing, that would really be the right test for the
patch's usefulness.  I'd give you solid 50/50 odds that you've correctly
diagnosed the issue, and knowing for sure would make advocating for this
logging a pretty easy sell to me at least.

Well, I already resolved the issue through polling pg_locks.  The issue
was IIT sessions which lasted up to an hour, which we fixed in the
application, so I don't have the test case available anymore.  I'd have
to generate a synthetic test case, and since I agree that a SQL-callable
diagnostic is superior to logging, I don't think we should pursue the
log levels further.

This is a great use case for user level tracing support. Add a probe
around these bits, and you can capture the information when you need
it.

Robert Treat
http://www.xzilla.net

#12 Josh Berkus
josh@agliodbs.com
In reply to: Robert Treat (#11)
Re: We need to log aborted autovacuums

This is a great use case for user level tracing support. Add a probe
around these bits, and you can capture the information when you need
it.

Yeah, would be lovely if user-level tracing existed on all platforms.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#13 Greg Smith
gsmith@gregsmith.com
In reply to: Robert Treat (#11)
Re: We need to log aborted autovacuums

Robert Treat wrote:

This is a great use case for user level tracing support. Add a probe
around these bits, and you can capture the information when you need
it.

Sure. I would also like a pony.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

#14 Greg Smith
gsmith@gregsmith.com
In reply to: Josh Berkus (#10)
Re: We need to log aborted autovacuums

Josh Berkus wrote:

Or should it perhaps be a per-table counter in pg_stat_user_tables,
given your statement above?

Or even a timestamp: last_autovacuum_attempt, which would record the
last time autovacuum was tried. If that's fairly recent and you have a
large number of dead rows, you know what kind of problem you have and
can turn on debug.

These are both reasonable ideas. But there was just some kickback on
Tomas's "keeping timestamp of the last stats reset" patch recently,
from the perspective of trying to limit per-table stats bloat. I think
it's relatively easy to make a case that this situation is difficult
enough to diagnose that a little bit of extra low-level logging is
worthwhile. That Josh and I have both been bitten by it enough to be
thinking about patches to make it easier to diagnose suggests it's
obviously too hard to nail down. But is this so common and difficult to
recognize that it's worth making all the table stats bigger? That's a
harder call.

It's already possible to detect the main symptom--dead row percentage is
much higher than the autovacuum threshold, but there's been no recent
autovacuum. That makes me less enthusiastic that there's such a genuine
need to justify the overhead of storing more table stats just to detect
the same thing a little more easily. I've been playing with the Munin
PG plug-in more recently, and I was just thinking of adding a dead row
trend graph/threshold to it to address this general area instead.

We could argue both sides of the trade-off of tracking this directly in
stats for some time, and I'd never expect there to be a clear victory
for either perspective. I've run into this vacuum problem a few times,
but certainly less than I've run into "why is the stats table so huge?"

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

#15 Josh Berkus
josh@agliodbs.com
In reply to: Greg Smith (#14)
Re: We need to log aborted autovacuums

Greg,

It's already possible to detect the main symptom--dead row percentage is
much higher than the autovacuum threshold, but there's been no recent
autovacuum. That makes me less enthusiastic that there's such a genuine
need to justify the overhead of storing more table stats just to detect
the same thing a little more easily.

The problem is that while this gives you the symptoms, it doesn't give
you the cause. The lack of vacuum could be occurring for any of 4 reasons:

1) Locking
2) You have a lot of tables and not enough autovac_workers / too much
sleep time
3) You need to autovac this particular table more frequently, since it
gets dirtied really fast
4) The table has been set with special autovac settings which keep it
from being autovac'd

We can currently distinguish between cases 2, 3, and 4 based on existing
available facts. However, distinguishing case 1 from 2 or 3, in
particular, isn't currently possible except by methods which require
collecting a lot of ad-hoc monitoring data over a period of time. This
makes the effort required for the diagnosis completely out of proportion
with the magnitude of the problem.

It occurs to me that another way of diagnosis would simply be a way to
cause the autovac daemon to spit out output we could camp on, *without*
requiring the huge volumes of output also required for DEBUG3. This
brings us back to the logging idea again.

We could argue both sides of the trade-off of tracking this directly in
stats for some time, and I'd never expect there to be a clear victory
for either perspective. I've run into this vacuum problem a few times,
but certainly less than I've run into "why is the stats table so huge?"

I really don't think that argument applies to either patch;
last_autovac_attempt *or* the last_stats_reset time, since neither event
is expected to occur frequently. If you have last_autovac_attempt (for
example) being updated frequently, you clearly had a db problem bigger
than the size of the stats table.

Given that our road ahead necessarily includes adding more and more
monitoring and admin data to PostgreSQL (because our DBA users demand
it), I think that we should give some thought to the issue of storing
DBA stats in general. By DBA stats I mean statistics which aren't used
in query planning, but are used in monitoring, trending, and
troubleshooting. I'm thinking these ought to have their own relation or
relations.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#16 Greg Smith
gsmith@gregsmith.com
In reply to: Josh Berkus (#15)
Re: We need to log aborted autovacuums

Josh Berkus wrote:

It occurs to me that another way of diagnosis would simply be a way to
cause the autovac daemon to spit out output we could camp on, *without*
requiring the huge volumes of output also required for DEBUG3. This
brings us back to the logging idea again.

Right. I don't know how much you looked at the little patch I threw out
there yet, but I think it's a reasonable 90% solution to the problem of
"how do I distinguish between a locking block and other reasons for AV
not running?" without generating a huge amount of log data. Since
there's no real overhead or code complexity added by it, it's a
relatively easy thing to throw in the codebase without annoying anyone.

I really don't think that argument applies to either patch;
last_autovac_attempt *or* the last_stats_reset time, since neither event
is expected to occur frequently. If you have last_autovac_attempt (for
example) being updated frequently, you clearly had a db problem bigger
than the size of the stats table.

It just returns to the usual "why make everyone pay overhead for
something that only happens to a small percentage of users?"
argument.[1] I personally have all sorts of additional data I'd like to
instrument in myriad obtrusive spots of code. All I'm saying is that
this one in particular I don't quite feel strongly enough about to make
it the spot I try to establish that foothold at. If a stats tracking
patch for this appeared that seemed reasonably well written (i.e. it
tries to keep the added value NULL whenever it's not relevant), I'd be
in favor of it going in. I'm just not so excited about the idea that
I'm going to write it. I'm lukewarm to per-table last_stats_reset time
just because there's so few places I'd actually use it in practice,
relative to the guaranteed overhead it adds.

[1]: Silly aside: I was thinking today that I should draw a chart of all
the common objections to code that show up here, looking like those maps
you see when walking into a mall. Then we can give a copy to new
submitters with a big "you are here" X marking where they have
inadvertently wandered onto.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

#17 David Fetter
david@fetter.org
In reply to: Greg Smith (#16)
Re: We need to log aborted autovacuums

On Fri, Jan 07, 2011 at 08:15:12PM -0500, Greg Smith wrote:

[1] Silly aside: I was thinking today that I should draw a chart of
all the common objections to code that show up here, looking like
those maps you see when walking into a mall. Then we can give a
copy to new submitters with a big "you are here" X marking where
they have inadvertently wandered onto.

Actually, that'd make a great flow chart on a T-shirt :)

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#18 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: David Fetter (#17)
Re: We need to log aborted autovacuums

David Fetter <david@fetter.org> writes:

On Fri, Jan 07, 2011 at 08:15:12PM -0500, Greg Smith wrote:

[1] Silly aside: I was thinking today that I should draw a chart of
all the common objections to code that show up here, looking like
those maps you see when walking into a mall. Then we can give a
copy to new submitters with a big "you are here" X marking where
they have inadvertently wandered onto.

Actually, that'd make a great flow chart on a T-shirt :)

Yeah, here's some more visual inspiration to get that:

http://xkcd.com/844/

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#19 Greg Smith
gsmith@gregsmith.com
In reply to: Josh Berkus (#15)
Re: We need to log aborted autovacuums

Josh Berkus wrote:

The lack of vacuum could be occurring for any of 4 reasons:

1) Locking
2) You have a lot of tables and not enough autovac_workers / too much
sleep time
3) You need to autovac this particular table more frequently, since it
gets dirtied really fast
4) The table has been set with special autovac settings which keep it
from being autovac'd

We can currently distinguish between cased 2, 3, 4 based on existing
available facts. However, distinguishing case 1 from 2 or 3, in
particular, isn't currently possible except by methods which require
collecting a lot of ad-hoc monitoring data over a period of time. This
makes the effort required for the diagnosis completely out of proportion
with the magnitude of the problem.

Since no one coded up something more exciting, I just did a round of
self-review of the little logging patch I sent upthread to see if it was
a worthy CF candidate. I agree that *something* should be done, and I'd
rather have a logging-based solution than nothing at all here.

So that patch is, to use a polite UK term I've adopted recently,
rubbish. But unless I'm confused (it is late), I think there's a
problem here that's much worse than what you described--in the case
where someone has grabbed a really large lock on a table that needs
cleanup at least. I started with the following test case:

SHOW log_autovacuum_min_duration;
DROP TABLE t;
CREATE TABLE t(s serial,i integer);
INSERT INTO t(i) SELECT generate_series(1,100000);
SELECT relname,last_autovacuum,autovacuum_count FROM pg_stat_user_tables
WHERE relname='t';
DELETE FROM t WHERE s<50000;
\q

Which, so long as the min duration is 0, does mostly what I expected. I
followed immediately by starting a new session, doing:

BEGIN;
LOCK t;

Before autovacuum got to the table; I left this session open, stuck
there idle in a transaction holding a lock. I threw some extra logging
in the code path leading up to where we'd both like the lack of lock
acquisition to be logged, and it shows the following (at DEBUG3):

DEBUG: t: vac: 49999 (threshold 50), anl: 149999 (threshold 50)
DEBUG: autovacuum: t: vac needed
INFO: vacuum_rel: processing 16528
INFO: vacuum_rel: trying to open relation 16528

All but the first one of those lines are not in the current code, and as
noted already that one existing line is at DEBUG3. This was trying to
simulate the case you were complaining about: autovacuum has been
triggered, it knows there is work to be done, and it can't do said work.

It hasn't triggered the error message I tried to add yet, though,
because it's not hitting that spot. Look where it's actually blocked at:

$ ps -eaf | grep postgres
gsmith 5490 4880 0 04:09 ? 00:00:00 postgres: autovacuum
worker process gsmith waiting

$ psql -c "select procpid,now() - query_start as runtime,current_query
from pg_stat_activity where current_query like 'autovacuum%'"
procpid | runtime | current_query
---------+-----------------+-------------------------------------
5490 | 00:11:35.813727 | autovacuum: VACUUM ANALYZE public.t

I haven't just failed to vacuum the table that needs it without any
entry in the logs saying as much. I've actually tied down an autovacuum
worker and kept it from doing anything else until that lock is
released! When vacuum_rel inside of vacuum.c does this on a relation it
wants to acquire a lock on:

onerel = try_relation_open(relid, lmode);

It just hangs out there forever, if the only issue is not being able to
get the lock it wants. The abort code path I tried to insert logging
into only occurs if the relation was deleted. Watch what's happened
while I've been typing:

procpid | runtime | current_query
---------+-----------------+-------------------------------------
5490 | 00:36:24.426528 | autovacuum: VACUUM ANALYZE public.t

I've now hit the AV worker with a denial of service attack, essentially,
for over 35 minutes. I'd bet that if I was a malicious user, I could
make AV come to a halt altogether for the whole server with this
approach, even with relatively limited rights on the database.

Does try_relation_open need to have a lock acquisition timeout when AV
is calling it? I don't see any other way to get a log message noting
when this problem passed by; you can only see it live in these
process views. And the current behavior doesn't feel right to me.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

#20 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#19)
Re: We need to log aborted autovacuums

Greg Smith <greg@2ndquadrant.com> writes:

Does try_relation_open need to have a lock acquisition timeout when AV
is calling it?

Hmm. I think when looking at the AV code, I've always subconsciously
assumed that try_relation_open would fail immediately if it couldn't get
the lock. That certainly seems like it would be a more appropriate way
to behave than delaying indefinitely.

In practice, the scenario you're worried about seems unlikely to persist
indefinitely: as soon as someone else comes along and blocks behind
autovacuum's request, the deadlock checker will kick AV off the lock
queue. But it would probably be better not to depend on that.

regards, tom lane

#21 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#20)
#22 Greg Smith
gsmith@gregsmith.com
In reply to: Robert Haas (#21)
#23 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#22)
#24 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#23)
#25 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#24)
#26 Greg Smith
gsmith@gregsmith.com
In reply to: Simon Riggs (#24)
#27 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#26)
#28 Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#23)
#29 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#28)
#30 Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Smith (#26)
#31 Josh Berkus
josh@agliodbs.com
In reply to: Simon Riggs (#30)
#32 Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#31)
#33 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#32)
#34 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#31)
#35 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#34)
#36 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#35)
#37 Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#34)
#38 Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#37)
#39 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#37)
#40 Peter Eisentraut
peter_e@gmx.net
In reply to: Josh Berkus (#37)
#41 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#36)
#42 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#41)
#43 Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#42)
#44 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#43)
#45 Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#1)
#46 Cédric Villemain
cedric.villemain.debian@gmail.com
In reply to: Robert Haas (#44)
#47 Robert Haas
robertmhaas@gmail.com
In reply to: Cédric Villemain (#46)
#48 Cédric Villemain
cedric.villemain.debian@gmail.com
In reply to: Robert Haas (#47)
#49 Robert Haas
robertmhaas@gmail.com
In reply to: Cédric Villemain (#48)
#50 Cédric Villemain
cedric.villemain.debian@gmail.com
In reply to: Robert Haas (#49)
#51 Robert Haas
robertmhaas@gmail.com
In reply to: Cédric Villemain (#50)