Lock problem with autovacuum truncating heap
We have run across a problem with autovacuum that occurs when it can
truncate off a large number of empty blocks. It behaves differently in
version 9.0 than in previous versions. Both behaviors are bad.
Consider a relation that receives constant inserts/updates satisfied
using free space at the beginning of the heap, while delete operations
have removed large numbers of tuples at the end of the relation. The
following autovacuum will find a large number of blocks at the end that
can be truncated.
Vacuumlazy then takes out an access exclusive lock and scans the relation
*backwards* to find out if concurrent access has created new tuples in
the to-be-truncated space. Apparently such a backward scan in 8K blocks
isn't really a good access strategy.
Up to 8.4, it simply holds the lock until it is done, which in our case
stalled a production system for 12 minutes! This is obviously bad.
In 9.0, the autovacuum process will be aborted about 1 second after
another transaction starts waiting for a lock. The result is that even a
simple INSERT can take 1 second. The autovacuum restarts shortly
afterwards and somehow gets to a point where it causes this 1-second
hiccup every 2 minutes. This is slightly better, but still far from
optimal in a world where transaction response times are measured in
milliseconds.
My current idea for a fix is to modify lazy_truncate_heap(). It does
acquire and release the exclusive lock, so it should be possible to do
this in smaller chunks, releasing and reacquiring the lock so that
client transactions can get their work done as well. At the same time I
would change count_nondeletable_pages() so that it uses a forward scan
direction (if that leads to a speedup).
Jan
--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
On Sat, Mar 26, 2011 at 2:30 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
My current idea for a fix is to modify lazy_truncate_heap(). It does acquire
and release the exclusive lock, so it should be possible to do this in
smaller chunks, releasing and reacquiring the lock so that client
transactions can get their work done as well.
Agreed, presumably with vacuum delay in there as well?
At the same time I would
change count_nondeletable_pages() so that it uses a forward scan direction
(if that leads to a speedup).
Do we need that? Linux readahead works in both directions doesn't it?
Guess it wouldn't hurt too much.
BTW does it read the blocks at that point using a buffer strategy?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Mar 27, 2011 at 01:12, Simon Riggs <simon@2ndquadrant.com> wrote:
At the same time I would
change count_nondeletable_pages() so that it uses a forward scan direction
(if that leads to a speedup).
+1.
Do we need that? Linux readahead works in both directions doesn't it?
Guess it wouldn't hurt too much.
Yes, probably. AFAIK, RHEL 5 cannot do readahead in backward scans.
It might be improved in the latest kernels, but it would be safer
not to rely on the kernel for anything but simple forward scans.
--
Itagaki Takahiro
On Mar 26, 2011, at 1:44 PM, Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote:
On Sun, Mar 27, 2011 at 01:12, Simon Riggs <simon@2ndquadrant.com> wrote:
At the same time I would
change count_nondeletable_pages() so that it uses a forward scan direction
(if that leads to a speedup).
+1.
Hmm. That would speed up truncations that are large relative to the table size, but slow down small truncations. And small truncations are likely to be more common than big ones.
Maybe we could do a mix... back up 16MB and scan forward; if all those pages are empty then back up 16MB from the start point and scan forward from there. Or whatever we think the right chunk size is to get some benefit from kernel readahead without making the "truncate 1 block" case slow.
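Robert's back-up-and-scan-forward idea can be sketched as follows. This is a toy simulation over an array of per-page flags, not the buffer-manager code vacuum would actually use; CHUNK_PAGES and find_new_rel_size are hypothetical names, with 2048 pages standing in for 16MB of 8K blocks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CHUNK_PAGES 2048        /* 16MB / 8KB, hypothetical chunk size */

/*
 * Find the new relation size: one past the last page that still contains
 * tuples.  Instead of stepping backwards one page at a time, back up a
 * whole chunk and scan it *forwards*, so the kernel sees sequential reads;
 * if the whole chunk is empty, back up by another chunk and repeat.
 */
static size_t
find_new_rel_size(const bool *page_has_tuples, size_t rel_pages)
{
    size_t  chunk_end = rel_pages;

    while (chunk_end > 0)
    {
        size_t  chunk_start = chunk_end > CHUNK_PAGES ? chunk_end - CHUNK_PAGES : 0;
        size_t  last_nonempty = 0;
        bool    found = false;

        /* forward scan within the chunk, remembering the last nonempty page */
        for (size_t blk = chunk_start; blk < chunk_end; blk++)
        {
            if (page_has_tuples[blk])
            {
                last_nonempty = blk;
                found = true;
            }
        }

        if (found)
            return last_nonempty + 1;   /* keep pages 0 .. last_nonempty */

        chunk_end = chunk_start;        /* whole chunk empty: back up further */
    }

    return 0;                           /* relation is entirely empty */
}
```

Note that the "truncate 1 block" case only pays for one chunk's worth of forward scanning, which is the trade-off being discussed.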
...Robert
On 3/26/2011 12:12 PM, Simon Riggs wrote:
On Sat, Mar 26, 2011 at 2:30 PM, Jan Wieck<JanWieck@yahoo.com> wrote:
My current idea for a fix is to modify lazy_truncate_heap(). It does acquire
and release the exclusive lock, so it should be possible to do this in
smaller chunks, releasing and reacquiring the lock so that client
transactions can get their work done as well.
Agreed, presumably with vacuum delay in there as well?
Not sure about that. My theory is that unless somebody needs access to
that table, just have at it like it is now.
The current implementation seems to assume that the blocks checked for
being empty are still found in memory (vacuum just scanned them). And
that seems to be correct most of the time, in which case adding vacuum
delay only gives the blocks more time to get evicted and then have to be
read back in.
At the same time I would
change count_nondeletable_pages() so that it uses a forward scan direction
(if that leads to a speedup).
Do we need that? Linux readahead works in both directions doesn't it?
Guess it wouldn't hurt too much.
BTW does it read the blocks at that point using a buffer strategy?
Is reading a file backwards "in 8K blocks" actually an access pattern
that may confuse buffer strategies?
I don't know. I also don't know if what I am suggesting is much better.
If you think about it, I merely suggested trying the same access
pattern with larger chunks. We need to run some tests to find out.
Jan
--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
On 3/26/2011 3:17 PM, Robert Haas wrote:
On Mar 26, 2011, at 1:44 PM, Itagaki Takahiro<itagaki.takahiro@gmail.com> wrote:
On Sun, Mar 27, 2011 at 01:12, Simon Riggs<simon@2ndquadrant.com> wrote:
At the same time I would
change count_nondeletable_pages() so that it uses a forward scan direction
(if that leads to a speedup).
+1.
Hmm. That would speed up truncations that are large relative to the table size, but slow down small truncations. And small truncations are likely to be more common than big ones.
For small truncations the blocks to check are most likely found in
memory (shared or OS buffer) anyway, in which case the access pattern
should be rather irrelevant.
Maybe we could do a mix... back up 16MB and scan forward; if all those pages are empty then back up 16MB from the start point and scan forward from there. Or whatever we think the right chunk size is to get some benefit from kernel readahead without making the "truncate 1 block" case slow.
That was what I meant. Go in steps of 16-64MB backwards and scan from
there to the current end in forward direction to find a nondeletable
block. In between these steps, release and reacquire the exclusive lock
so that client transactions can get their work done.
Jan
--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
On Sat, Mar 26, 2011 at 8:05 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
On 3/26/2011 12:12 PM, Simon Riggs wrote:
On Sat, Mar 26, 2011 at 2:30 PM, Jan Wieck<JanWieck@yahoo.com> wrote:
My current idea for a fix is to modify lazy_truncate_heap(). It does
acquire and release the exclusive lock, so it should be possible to do
this in smaller chunks, releasing and reacquiring the lock so that
client transactions can get their work done as well.
Agreed, presumably with vacuum delay in there as well?
Not sure about that. My theory is that unless somebody needs access to that
table, just have at it like it is now.
The current implementation seems to assume that the blocks checked for
being empty are still found in memory (vacuum just scanned them). And that
seems to be correct most of the time, in which case adding vacuum delay only
gives the blocks more time to get evicted and have to be read back in.
I think someone fairly clever already thought of that.
vacuum_cost_page_hit = 1 by default, so the cost of accessing pages
still in memory is 1/10th the cost of disk access. So we will only
perform the delay for each chunk if we had to read it from disk.
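The accounting Simon describes can be sketched roughly like this. It is a simplified model of the cost-based delay machinery with the 9.0-era default settings; charge_page is a hypothetical stand-in for the bookkeeping that vacuum_delay_point() performs:

```c
#include <assert.h>
#include <stdbool.h>

/* default cost settings (as of 9.0) */
#define VACUUM_COST_PAGE_HIT   1    /* page found in shared_buffers */
#define VACUUM_COST_PAGE_MISS 10    /* page had to be read from disk */
#define VACUUM_COST_LIMIT    200    /* nap once this much cost accrues */

static int cost_balance = 0;

/* charge one page access; returns true when the backend should sleep
 * for vacuum_cost_delay milliseconds */
static bool
charge_page(bool hit)
{
    cost_balance += hit ? VACUUM_COST_PAGE_HIT : VACUUM_COST_PAGE_MISS;
    if (cost_balance >= VACUUM_COST_LIMIT)
    {
        cost_balance = 0;
        return true;            /* time to nap */
    }
    return false;
}
```

With these defaults, 200 buffer hits trigger one nap while the same number of disk reads triggers ten, which is why the delay mostly throttles the disk-bound case.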
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mar 26, 2011, at 4:16 PM, Jan Wieck <JanWieck@Yahoo.com> wrote:
That was what I meant. Go in steps of 16-64MB backwards and scan from there to the current end in forward direction to find a nondeletable block. In between these steps, release and reacquire the exclusive lock so that client transactions can get their work done.
Well, VACUUM uses a 16MB ring buffer, so anything that size or smaller should hit shared_buffers most of the time.
I wonder though if this might defeat read-behind on operating systems that do have a working implementation. With our current approach each read will end at the point the previous read started, which might be an algorithm somebody is using to detect a backward scan.
...Robert
On Sun, Mar 27, 2011 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mar 26, 2011, at 4:16 PM, Jan Wieck <JanWieck@Yahoo.com> wrote:
That was what I meant. Go in steps of 16-64MB backwards and scan from there to the current end in forward direction to find a nondeletable block. In between these steps, release and reacquire the exclusive lock so that client transactions can get their work done.
Well, VACUUM uses a 16MB ring buffer, so anything that size or smaller should hit shared_buffers most of the time.
I wonder though if this might defeat read-behind on operating systems that do have a working implementation. With our current approach each read will end at the point the previous read started, which might be an algorithm somebody is using to detect a backward scan.
Good point. That means the last 16MB of buffers will be in
shared_buffers. Anything more than that will definitely not be,
because we wrote them out ourselves.
So we should truncate in 16MB chunks also.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 3/27/2011 1:24 PM, Simon Riggs wrote:
So we should truncate in 16MB chunks also.
On second thought, fiddling with the scan direction is probably too
much of a change for back releases anyway. That 8.3/8.4 can get into a
situation where autovacuum causes a 12-minute freeze of a production
server could be argued to be a bug. Likewise that 9.0 will never succeed
in truncating, but instead causes a 1-second hiccup every two minutes.
Since we are talking about stable releases, I think just releasing and
reacquiring the exclusive lock is enough. We can then try to further
improve things for future releases.
Jan
--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
On Sun, Mar 27, 2011 at 8:25 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
Since we are talking about stable releases, I think just releasing and
reacquiring the exclusive lock is enough. We can then try to further improve
things for future releases.
I like all of:
1) Move the truncating to a new transaction, just like we currently
vacuum toast tables in a separate transaction from the main vacuum. I'm
not sure why we do it for toast tables, but it makes sense here. If we
get killed by autovacuum detecting a lock conflict, we want to commit
the changes to the pg_class entry so that autovacuum doesn't invoke us
again.
2) Don't bother trying to truncate if we've been called from
autovacuum at all. This doesn't help people who run vacuum from a cron
job but it does help anyone who doesn't know what's going on and is
just randomly having their table exclusive-locked at arbitrary times
in the middle of peak production hours. I doubt the truncation really
helps much in normal operation anyways and if you've deleted all the
rows in your table it's not a bad recommendation to say you should run
vacuum manually and not rely on autovacuum in that instance.
3) Scan backwards 8MB at a time, scanning each 8MB chunk forwards,
instead of just going backwards block by block.
--
greg
On Sun, Mar 27, 2011 at 3:25 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
Since we are talking about stable releases, I think just releasing and
reacquiring the exclusive lock is enough. We can then try to further improve
things for future releases.
That seems unsafe - things can change under you while you don't hold the lock...
I kind of like the idea of committing the transaction and then
beginning a new one just to do the truncation. Given the way the
deadlock detector treats autovacuum, the current coding seems quite
risky.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3/27/2011 6:21 PM, Robert Haas wrote:
On Sun, Mar 27, 2011 at 3:25 PM, Jan Wieck<JanWieck@yahoo.com> wrote:
Since we are talking about stable releases, I think just releasing and
reacquiring the exclusive lock is enough. We can then try to further improve
things for future releases.
That seems unsafe - things can change under you while you don't hold the lock...
The only change relevant in this case would be some concurrent client
extending the relation while we don't hold the lock. A call to
RelationGetNumberOfBlocks() after reacquiring the lock will tell. Safety
reestablished.
I kind of like the idea of committing the transaction and then
beginning a new one just to do the truncation. Given the way the
deadlock detector treats autovacuum, the current coding seems quite
risky.
I don't like a 1,000 ms hiccup in my system, regardless of how many
transaction hoops you make it go through.
Jan
--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
On Sun, Mar 27, 2011 at 9:41 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
On 3/27/2011 6:21 PM, Robert Haas wrote:
On Sun, Mar 27, 2011 at 3:25 PM, Jan Wieck<JanWieck@yahoo.com> wrote:
Since we are talking about stable releases, I think just releasing and
reacquiring the exclusive lock is enough. We can then try to further
improve things for future releases.
That seems unsafe - things can change under you while you don't hold the
lock...
The only change relevant in this case would be some concurrent client
extending the relation while we don't hold the lock. A call to
RelationGetNumberOfBlocks() after reacquiring the lock will tell. Safety
reestablished.
I thought that the risk was that someone might write tuples into the
blocks that we're thinking of truncating.
I kind of like the idea of committing the transaction and then
beginning a new one just to do the truncation. Given the way the
deadlock detector treats autovacuum, the current coding seems quite
risky.
I don't like a 1,000 ms hiccup in my system, regardless of how many
transaction hoops you make it go through.
I can't argue with that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Sun, Mar 27, 2011 at 3:25 PM, Jan Wieck <JanWieck@yahoo.com> wrote:
Since we are talking about stable releases, I think just releasing and
reacquiring the exclusive lock is enough. We can then try to further improve
things for future releases.
That seems unsafe - things can change under you while you don't hold the lock...
And more to the point, it wouldn't actually fix anything, unless you
chop things up so finely that autovac is never holding the lock for more
than milliseconds. (I believe it wouldn't even be enough if you could
guarantee that autovac didn't hold the lock for more than
deadlock_timeout, because some other process could reach the timeout and
run the deadlock detector very shortly after autovac acquires its lock.)
I don't believe that *any* of what's being discussed here is suitable
material for back-patching. And it's not material for 9.1, either.
The time for rewriting VACUUM for 9.1 was three months ago.
regards, tom lane
Greg Stark <gsstark@mit.edu> writes:
I like all of:
1) move the truncating to a new transaction just like we currently do
toast tables in a separate transaction from the main vacuum.
+1 if we are going to continue the behavior of allowing other
transactions to kick autovac off the exclusive lock. However, if we can
find a way to avoid the need of that, then breaking it into multiple
transactions would just be useless complication (and extra cycles).
In particular, I thought the direction Jan was headed was to release and
reacquire the lock between truncating off limited-size chunks of the
file. If we do that, we probably *don't* want or need to allow autovac
to be booted off the lock more quickly.
2) Don't bother trying to truncate if we've been called from
autovacuum at all.
No, I think that's seriously going in the wrong direction. We are
trying to make autovacuum more useful and transparent, not find new
reasons why people have to use manual vacuuming.
3) Scanning backwards 8MB at a time scanning each 8MB forwards instead
of just going back by block backwards.
Maybe. I'd want to see some experimental evidence justifying the choice
of chunk size; I'm pretty sure this will become counterproductive once
the chunk size is too large.
regards, tom lane
On 3/27/2011 9:51 PM, Robert Haas wrote:
On Sun, Mar 27, 2011 at 9:41 PM, Jan Wieck<JanWieck@yahoo.com> wrote:
On 3/27/2011 6:21 PM, Robert Haas wrote:
On Sun, Mar 27, 2011 at 3:25 PM, Jan Wieck<JanWieck@yahoo.com> wrote:
Since we are talking about stable releases, I think just releasing and
reacquiring the exclusive lock is enough. We can then try to further
improve things for future releases.
That seems unsafe - things can change under you while you don't hold the
lock...
The only change relevant in this case would be some concurrent client
extending the relation while we don't hold the lock. A call to
RelationGetNumberOfBlocks() after reacquiring the lock will tell. Safety
reestablished.
I thought that the risk was that someone might write tuples into the
blocks that we're thinking of truncating.
Currently the risk is that while vacuum is doing its main work, someone
can either extend the relation or reuse space inside one of the empty
blocks (that are about to be truncated away). That is why the function
lazy_truncate_heap() does the following:
1) acquire exclusive lock
2) check via RelationGetNumberOfBlocks() if it has been extended
before locking - abort if so
3) check via count_nondeletable_pages() for the highest block
in the to-be-truncated range that contains a (newly created)
tuple
4) truncate the relation
5) release the lock
The function count_nondeletable_pages() is the one doing the block-wise
reverse scan. It does check for interrupts, and that is the place where
the deadlock code will boot vacuum.
What I am proposing is to put all those 5 steps into a loop that tries
to bite off smaller bits from the end of the table, instead of trying to
swallow the whole dead space at once.
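The loop over the five steps above might be sketched like this. It is a toy simulation using an array of per-page flags in place of the real heap, and it omits all locking; TRUNCATE_BATCH and truncate_in_batches are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TRUNCATE_BATCH 512      /* pages verified and truncated per lock cycle */

/*
 * Repeatedly bite TRUNCATE_BATCH pages off the end of the relation:
 * verify the batch is still empty (step 3), truncate it (step 4), then
 * drop the lock so client transactions can run (step 5), and loop.
 * Returns the final relation size in pages.
 */
static size_t
truncate_in_batches(const bool *page_has_tuples, size_t rel_pages)
{
    while (rel_pages > 0)
    {
        size_t  batch_start = rel_pages > TRUNCATE_BATCH ?
                              rel_pages - TRUNCATE_BATCH : 0;
        size_t  keep = batch_start;

        /* step 3: find the highest page in this batch containing a tuple */
        for (size_t blk = batch_start; blk < rel_pages; blk++)
            if (page_has_tuples[blk])
                keep = blk + 1;

        if (keep == rel_pages)
            break;              /* nothing truncatable in this batch: done */

        rel_pages = keep;       /* step 4: truncate this batch */
        /* step 5: here the real code releases and reacquires the lock */
    }
    return rel_pages;
}
```

The lock is only held for one batch at a time, so a waiting client transaction stalls for at most one batch's worth of work rather than for the whole dead space.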
count_nondeletable_pages() is a static function and only called from
lazy_truncate_heap(), so fiddling with the scan direction inside of it
would be totally safe from a functional side effect point of view. Doing
so or not depends on whether reversing its scan direction does have a
performance benefit or not. I agree with Tom that at some "chunk" size,
the effect might be negative. That is because currently it scans
backwards and returns at the first block containing a tuple. To scan
forward, it has to scan all the blocks, remembering the last that
contained a tuple.
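That asymmetry between the two scan directions can be illustrated with a toy comparison (hypothetical helpers operating on an array of per-page flags rather than real heap blocks):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Backward scan: can return as soon as the first tuple-bearing block is
 * seen from the end -- but issues single-block reads in reverse order.
 */
static size_t
scan_backward(const bool *page_has_tuples, size_t rel_pages)
{
    for (size_t blk = rel_pages; blk-- > 0; )
        if (page_has_tuples[blk])
            return blk + 1;     /* new relation size */
    return 0;
}

/*
 * Forward scan: reads sequentially (readahead-friendly) but must visit
 * every block in the range, remembering the last one that had a tuple.
 */
static size_t
scan_forward(const bool *page_has_tuples, size_t rel_pages)
{
    size_t  keep = 0;

    for (size_t blk = 0; blk < rel_pages; blk++)
        if (page_has_tuples[blk])
            keep = blk + 1;
    return keep;
}
```

Both return the same answer; the question is purely whether sequential I/O buys back the cost of not being able to return early.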
I don't like a 1,000 ms hiccup in my system, regardless of how many
transaction hoops you make it go through.
I can't argue with that.
I assumed we had a consensus that both locking a system for 10+
minutes and a 1,000 ms hiccup every 2 minutes are problems we need
to fix.
Jan
--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
On 3/27/2011 10:43 PM, Tom Lane wrote:
In particular, I thought the direction Jan was headed was to release and
reacquire the lock between truncating off limited-size chunks of the
file. If we do that, we probably *don't* want or need to allow autovac
to be booted off the lock more quickly.
That is correct.
3) Scanning backwards 8MB at a time scanning each 8MB forwards instead
of just going back by block backwards.
Maybe. I'd want to see some experimental evidence justifying the choice
of chunk size; I'm pretty sure this will become counterproductive once
the chunk size is too large.
Me too, which is why that part of my proposal is highly questionable and
requires a lot of evidence to be even remotely considered for back releases.
Jan
--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
On 3/28/2011 12:35 PM, Jan Wieck wrote:
On 3/27/2011 10:43 PM, Tom Lane wrote:
In particular, I thought the direction Jan was headed was to release and
reacquire the lock between truncating off limited-size chunks of the
file. If we do that, we probably *don't* want or need to allow autovac
to be booted off the lock more quickly.
That is correct.
3) Scanning backwards 8MB at a time scanning each 8MB forwards instead
of just going back by block backwards.
Maybe. I'd want to see some experimental evidence justifying the choice
of chunk size; I'm pretty sure this will become counterproductive once
the chunk size is too large.
Me too, which is why that part of my proposal is highly questionable and
requires a lot of evidence to be even remotely considered for back releases.
Attached is a patch against HEAD that implements the part that truncates
the heap in small batches (512 pages at a time) without fiddling with
the scan direction.
It does several retries when attempting to get the exclusive lock. This
is because when doing it this way I discovered that locks queued up
behind the exclusive lock held by autovacuum make it too likely that it
fails after just a few batches.
I am going to see what a similar logic will do to 8.4, where the
exclusive lock has far more severe consequences to client connections.
Jan
--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
Attachments:
master-autovac-lock.diff (text/x-patch)
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a5c024c..a3e9a99 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -63,8 +63,17 @@
* REL_TRUNCATE_MINIMUM or (relsize / REL_TRUNCATE_FRACTION) (whichever
* is less) potentially-freeable pages.
*/
-#define REL_TRUNCATE_MINIMUM 1000
-#define REL_TRUNCATE_FRACTION 16
+#define REL_TRUNCATE_MINIMUM 1000
+#define REL_TRUNCATE_FRACTION 16
+
+/*
+ * When truncating a heap, we need an exclusive lock while rescanning
+ * the relation for newly created rows. We do it in batches of pages
+ * to minimize interference between autovacuum and concurrent transactions.
+ */
+#define REL_TRUNCATE_BATCHSIZE 512
+#define REL_TRUNCATE_LOCKATTEMPTS 10
+#define REL_TRUNCATE_LOCKWAIT 50
/*
* Guesstimation of number of dead tuples per page. This is used to
@@ -1017,72 +1026,140 @@ lazy_cleanup_index(Relation indrel,
static void
lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
{
+ BlockNumber orig_rel_pages = vacrelstats->rel_pages;
BlockNumber old_rel_pages = vacrelstats->rel_pages;
- BlockNumber new_rel_pages;
+ BlockNumber old_nonempty_pages = vacrelstats->nonempty_pages;
+ BlockNumber new_rel_pages = 0;
+ int retry;
PGRUsage ru0;
pg_rusage_init(&ru0);
- /*
- * We need full exclusive lock on the relation in order to do truncation.
- * If we can't get it, give up rather than waiting --- we don't want to
- * block other backends, and we don't want to deadlock (which is quite
- * possible considering we already hold a lower-grade lock).
- */
- if (!ConditionalLockRelation(onerel, AccessExclusiveLock))
- return;
-
- /*
- * Now that we have exclusive lock, look to see if the rel has grown
- * whilst we were vacuuming with non-exclusive lock. If so, give up; the
- * newly added pages presumably contain non-deletable tuples.
- */
- new_rel_pages = RelationGetNumberOfBlocks(onerel);
- if (new_rel_pages != old_rel_pages)
+ vacrelstats->nonempty_pages = old_rel_pages;
+ while (vacrelstats->nonempty_pages > old_nonempty_pages)
{
- /* might as well use the latest news when we update pg_class stats */
- vacrelstats->rel_pages = new_rel_pages;
- UnlockRelation(onerel, AccessExclusiveLock);
- return;
- }
+ /*
+ * We need full exclusive lock on the relation in order to do
+ * truncation. Rather than waiting we retry this a number of
+ * times --- we don't want to block other backends, and we don't
+ * want to deadlock (which is quite possible considering we
+ * already hold a lower-grade lock).
+ */
+ for (retry = 1; retry <= REL_TRUNCATE_LOCKATTEMPTS; retry++)
+ {
+ CHECK_FOR_INTERRUPTS();
- /*
- * Scan backwards from the end to verify that the end pages actually
- * contain no tuples. This is *necessary*, not optional, because other
- * backends could have added tuples to these pages whilst we were
- * vacuuming.
- */
- new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
+ if (!ConditionalLockRelation(onerel, AccessExclusiveLock))
+ {
+ if (retry == REL_TRUNCATE_LOCKATTEMPTS)
+ {
+ vacrelstats->nonempty_pages = old_nonempty_pages;
- if (new_rel_pages >= old_rel_pages)
- {
- /* can't do anything after all */
- UnlockRelation(onerel, AccessExclusiveLock);
- return;
- }
+ if (old_rel_pages < orig_rel_pages)
+ {
+ ereport(elevel,
+ (errmsg("\"%s\": truncated %u to %u pages",
+ RelationGetRelationName(onerel),
+ orig_rel_pages, old_rel_pages),
+ errdetail("%s.",
+ pg_rusage_show(&ru0))));
+ }
- /*
- * Okay to truncate.
- */
- RelationTruncate(onerel, new_rel_pages);
+ return;
+ }
+ else
+ pg_usleep(REL_TRUNCATE_LOCKWAIT * 1000L);
+ }
+ else
+ break;
+ }
- /*
- * We can release the exclusive lock as soon as we have truncated. Other
- * backends can't safely access the relation until they have processed the
- * smgr invalidation that smgrtruncate sent out ... but that should happen
- * as part of standard invalidation processing once they acquire lock on
- * the relation.
- */
- UnlockRelation(onerel, AccessExclusiveLock);
+ /*
+ * Now that we have exclusive lock, look to see if the rel has grown
+ * whilst we did not hold an exclusive lock. If so, give up; the
+ * newly added pages presumably contain non-deletable tuples.
+ */
+ new_rel_pages = RelationGetNumberOfBlocks(onerel);
+ if (new_rel_pages != old_rel_pages)
+ {
+ /* We can use the latest news when we update pg_class stats */
+ vacrelstats->rel_pages = new_rel_pages;
+ UnlockRelation(onerel, AccessExclusiveLock);
+
+ if (old_rel_pages < orig_rel_pages)
+ {
+ ereport(elevel,
+ (errmsg("\"%s\": truncated %u to %u pages",
+ RelationGetRelationName(onerel),
+ orig_rel_pages, old_rel_pages),
+ errdetail("%s.",
+ pg_rusage_show(&ru0))));
+ }
+
+ return;
+ }
+
+ /*
+ * Adjust vacrelstats so that we at maximum scan one batch
+ * size in count_nondeletable_pages().
+ */
+ if ((vacrelstats->nonempty_pages - old_nonempty_pages) >
+ REL_TRUNCATE_BATCHSIZE)
+ vacrelstats->nonempty_pages -= REL_TRUNCATE_BATCHSIZE;
+ else
+ vacrelstats->nonempty_pages = old_nonempty_pages;
+
+ /*
+ * Scan backwards from the end to verify that the end pages actually
+ * contain no tuples. This is *necessary*, not optional, because other
+ * backends could have added tuples to these pages whilst we were
+ * vacuuming.
+ */
+ new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
+
+ if (new_rel_pages >= old_rel_pages)
+ {
+ /* can't do anything after all */
+ UnlockRelation(onerel, AccessExclusiveLock);
+
+ if (old_rel_pages < orig_rel_pages)
+ {
+ ereport(elevel,
+ (errmsg("\"%s\": truncated %u to %u pages",
+ RelationGetRelationName(onerel),
+ orig_rel_pages, old_rel_pages),
+ errdetail("%s.",
+ pg_rusage_show(&ru0))));
+ }
+
+ return;
+ }
+
+ /*
+ * Okay to truncate this batch.
+ */
+ RelationTruncate(onerel, new_rel_pages);
+
+ /*
+ * We can release the exclusive lock as soon as we have truncated.
+ * Other backends can't safely access the relation until they have
+ * processed the smgr invalidation that smgrtruncate sent out ...
+ * but that should happen as part of standard invalidation processing
+ * once they acquire lock on the relation.
+ */
+ UnlockRelation(onerel, AccessExclusiveLock);
+
+ /* update statistics */
+ vacrelstats->rel_pages = new_rel_pages;
+ vacrelstats->pages_removed = orig_rel_pages - new_rel_pages;
+ old_rel_pages = new_rel_pages;
- /* update statistics */
- vacrelstats->rel_pages = new_rel_pages;
- vacrelstats->pages_removed = old_rel_pages - new_rel_pages;
+ }
ereport(elevel,
(errmsg("\"%s\": truncated %u to %u pages",
RelationGetRelationName(onerel),
- old_rel_pages, new_rel_pages),
+ orig_rel_pages, new_rel_pages),
errdetail("%s.",
pg_rusage_show(&ru0))));
}
On Mar 27, 2011, at 9:43 PM, Tom Lane wrote:
1) move the truncating to a new transaction just like we currently do
toast tables in a separate transaction from the main vacuum.
+1 if we are going to continue the behavior of allowing other
transactions to kick autovac off the exclusive lock. However, if we can
find a way to avoid the need of that, then breaking it into multiple
transactions would just be useless complication (and extra cycles).
In particular, I thought the direction Jan was headed was to release and
reacquire the lock between truncating off limited-size chunks of the
file. If we do that, we probably *don't* want or need to allow autovac
to be booted off the lock more quickly.
<snip>
3) Scanning backwards 8MB at a time scanning each 8MB forwards instead
of just going back by block backwards.
Maybe. I'd want to see some experimental evidence justifying the choice
of chunk size; I'm pretty sure this will become counterproductive once
the chunk size is too large.
I have an idea that might be better than protecting truncation by exclusive-locking the table: a "virtual end of relation" pointer we can lock, which would make the truncation fast and easy. The idea is that vacuum would grab the lock protecting that pointer, see what the last valid page actually is, and then truncate everything else. Incrementing that pointer is one more thing we'd have to do when extending a relation, but it's just an increment (and AFAIK we already have a lock for extending a rel).
The challenge would come in actually moving that pointer backwards. My thought is that vacuum can simply set it back to the last known page with data and set a bit indicating that the pointer *might* be ahead of the actual valid end of the relation. Anything that wants to extend the relation when that bit is set would need to
- See if the next page actually exists
- See if it actually contains data (vacuum could mark completely empty pages as invalid to speed that up)
Additionally, we'd need to make sure that nothing else could write to a page > virtual end of relation.
Once that's set, vacuum could start at virtual_end_of_relation and scan forward, ensuring no later pages have data. After that's done it can truncate anything > virtual_end (which might have moved forward in the meantime).
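A very rough sketch of the proposed pointer, under the assumption that both paths run while holding the relation-extension lock. All names and the struct itself are hypothetical; none of this exists in PostgreSQL:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* hypothetical shared state guarded by the relation-extension lock */
typedef struct
{
    size_t  virtual_end;    /* first page number beyond known data */
    bool    maybe_stale;    /* set when vacuum moved the pointer back */
} VirtualEnd;

/*
 * Extension path: hand out the next page and bump the pointer.  When
 * maybe_stale is set, the real extender would first probe whether the
 * page at virtual_end actually exists and holds data; after extending,
 * the pointer is known-accurate again.
 */
static size_t
extend_one_page(VirtualEnd *ve)
{
    size_t  new_page = ve->virtual_end;

    ve->virtual_end++;
    ve->maybe_stale = false;
    return new_page;
}

/*
 * Vacuum path: pull the pointer back to one past the last page known to
 * hold data, flagging that the true end of data may differ.
 */
static void
vacuum_set_end(VirtualEnd *ve, size_t last_data_page_plus_one)
{
    ve->virtual_end = last_data_page_plus_one;
    ve->maybe_stale = true;
}
```

The attraction is that truncation then only needs the pointer's lock plus a forward verification scan, not an access exclusive lock on the whole table.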
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net