heap vacuum & cleanup locks
We've occasionally seen problems with VACUUM getting stuck for failure
to acquire a cleanup lock due to, for example, a cursor holding a pin
on the buffer page. In the worst case, this can cause an undetected
deadlock, if the backend holding the buffer pin blocks trying to
acquire a heavyweight lock that is in turn blocked by VACUUM. A while
back, someone (Greg Stark? me?) floated the idea of not waiting for
the cleanup lock. If we can't get it immediately, or within some
short period of time, then we just skip the page and continue on.
Today I had what might be a better idea: don't try to acquire a
cleanup lock at all. Instead, acquire an exclusive lock. After
having done so, observe the pin count. If there are no other buffer
pins, that means our exclusive lock is actually a cleanup lock, and we
proceed as now. If other buffer pins do exist, then we can't
defragment the page, but that doesn't mean no useful work can be done:
we can still mark used line pointers dead, or dead line pointers
unused. We cannot defragment, but that can be done either by the next
VACUUM or by a HOT cleanup. We can even arrange - using existing
mechanism - to leave behind a hint that the page is a good candidate
for a HOT cleanup, by setting pd_prune_xid to, say, FrozenXID.
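To make the shape of that check concrete, here is a toy, self-contained C
model of the decision (all of the type and function names here are invented
for illustration; the real version would consult the shared buffer header's
pin count in bufmgr.c):

/* Toy model only -- not backend code. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct ToyBuffer
{
    pthread_mutex_t content_lock;   /* stand-in for the buffer content lock */
    int             refcount;       /* stand-in for the shared pin count */
} ToyBuffer;

/*
 * Take the exclusive content lock unconditionally, then report whether we
 * also hold the only pin, in which case the lock is effectively a cleanup
 * lock and full pruning plus defragmentation is safe.
 */
static bool
lock_and_check_cleanup(ToyBuffer *buf)
{
    pthread_mutex_lock(&buf->content_lock);
    return buf->refcount == 1;
}

int
main(void)
{
    ToyBuffer buf;

    pthread_mutex_init(&buf.content_lock, NULL);
    buf.refcount = 2;               /* e.g. a cursor holds a second pin */

    if (lock_and_check_cleanup(&buf))
        printf("cleanup lock: prune, defragment, set all-visible\n");
    else
        printf("exclusive lock only: mark line pointers, skip defragmentation\n");

    pthread_mutex_unlock(&buf.content_lock);
    return 0;
}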
Like the idea of skipping pages on which we can't acquire a cleanup
lock altogether, this should prevent VACUUM from getting stuck trying
to lock a heap page. While buffer pins can be held for extended
periods of time, I don't think there is any operation that holds a
buffer content lock more than very briefly. Furthermore, unlike the
idea of skipping the page altogether, we could use this approach even
during an anti-wraparound vacuum.
Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jun 5, 2011 at 12:03, Robert Haas <robertmhaas@gmail.com> wrote:
If other buffer pins do exist, then we can't
defragment the page, but that doesn't mean no useful work can be done:
we can still mark used line pointers dead, or dead line pointers
unused. We cannot defragment, but that can be done either by the next
VACUUM or by a HOT cleanup.
This is just an idea -- Is it possible to have copy-on-write techniques?
VACUUM allocates a duplicated page for the pinned page, and copy valid
tuples into the new page. Following buffer readers after the VACUUM will
see the cloned page instead of the old pinned one.
Of course, copy-on-write is more complex than skipping pinned pages,
but I wonder whether we might be unable to vacuum at all in some edge
cases with the skipping method.
--
Itagaki Takahiro
On Mon, Jun 6, 2011 at 12:19 AM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:
On Sun, Jun 5, 2011 at 12:03, Robert Haas <robertmhaas@gmail.com> wrote:
If other buffer pins do exist, then we can't
defragment the page, but that doesn't mean no useful work can be done:
we can still mark used line pointers dead, or dead line pointers
unused. We cannot defragment, but that can be done either by the next
VACUUM or by a HOT cleanup.
This is just an idea -- Is it possible to have copy-on-write techniques?
VACUUM allocates a duplicated page for the pinned page, and copy valid
tuples into the new page. Following buffer readers after the VACUUM will
see the cloned page instead of the old pinned one.
Heikki suggested the same thing, and it's not a bad idea, but I think
it would be more work to implement than what I proposed. The caller
would need to be aware that, if it tries to re-acquire a content lock
on the same page, the offset of the tuple within the page might
change. I'm not sure how much work would be required to cope with
that possibility.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Jun 6, 2011, at 1:00 AM, Robert Haas wrote:
On Mon, Jun 6, 2011 at 12:19 AM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:
On Sun, Jun 5, 2011 at 12:03, Robert Haas <robertmhaas@gmail.com> wrote:
If other buffer pins do exist, then we can't
defragment the page, but that doesn't mean no useful work can be done:
we can still mark used line pointers dead, or dead line pointers
unused. We cannot defragment, but that can be done either by the next
VACUUM or by a HOT cleanup.
This is just an idea -- Is it possible to have copy-on-write techniques?
VACUUM allocates a duplicated page for the pinned page, and copy valid
tuples into the new page. Following buffer readers after the VACUUM will
see the cloned page instead of the old pinned one.
Heikki suggested the same thing, and it's not a bad idea, but I think
it would be more work to implement than what I proposed. The caller
would need to be aware that, if it tries to re-acquire a content lock
on the same page, the offset of the tuple within the page might
change. I'm not sure how much work would be required to cope with
that possibility.
I've had a related idea that I haven't looked into... if you're scanning a relation (ie: index scan, seq scan) I've wondered if it would be more efficient to deal with the entire page at once, possibly by making a copy of it. This would reduce the number of times you pin the page (often quite dramatically). I realize that means copying the entire page, but I suspect that would occur entirely in the L1 cache, which would be fast.
So perhaps instead of copy on write we should try for copy on read on all appropriate plan nodes.
On a related note, I've also wondered if it would be useful to allow nodes to deal with more than one tuple at a time; the idea being that it's better to execute a smaller chunk of code over a bigger chunk of data instead of dribbling tuples through an entire execution tree one at a time. Perhaps that will only be useful if nodes are executing in parallel...
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Sun, Jun 5, 2011 at 8:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
We've occasionally seen problems with VACUUM getting stuck for failure
to acquire a cleanup lock due to, for example, a cursor holding a pin
on the buffer page. In the worst case, this can cause an undetected
deadlock, if the backend holding the buffer pin blocks trying to
acquire a heavyweight lock that is in turn blocked by VACUUM. A while
back, someone (Greg Stark? me?) floated the idea of not waiting for
the cleanup lock. If we can't get it immediately, or within some
short period of time, then we just skip the page and continue on.
Do we know if this is really a problem, though? The deadlock, for
example, can happen only when a backend tries to get a table-level
conflicting lock while holding the buffer pin and I am not sure if we
do that.
The contention issue probably matters only for small tables, because
for large to very large tables the probability that a backend and
vacuum would process the same page is quite low.
current default for vac_threshold, the small tables can get vacuumed
very frequently and if they are also heavily accessed, the cleanup
lock can become a bottleneck.
Another issue that might be worth paying attention to is the single
pass vacuum that I am currently working on. The design that we agreed
upon assumes that the index vacuum must clear index pointers to all
the dead line pointers. If we skip any page, we must at least collect
the existing dead line pointers and remove those index pointers. If we
create dead line pointers and we want to vacuum them later, we store
the LSN in the page and that may require defrag. Of course, we can
work around that, but I think it will be useful if we do some tests to
show that the cleanup lock is indeed a major bottleneck.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On 06.06.2011 09:35, Jim Nasby wrote:
I've had a related idea that I haven't looked into... if you're scanning a relation (ie: index scan, seq scan) I've wondered if it would be more efficient to deal with the entire page at once, possibly by making a copy of it. This would reduce the number of times you pin the page (often quite dramatically). I realize that means copying the entire page, but I suspect that would occur entirely in the L1 cache, which would be fast.
We already do that. When an index scan moves to an index page, the heap
tid pointers of all the matching index tuples are copied to
backend-private memory in one go, and the lock is released. And for a
seqscan, the visibility of all the tuples on the page is checked in one
go while holding the lock, then the lock is released but the pin is
kept. The pin is only released after all the tuples have been read.
There's no repeated pin-unpin for each tuple.
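Just to make that pattern concrete, here is a toy, self-contained sketch of
the seqscan half of it (invented types and data; the real logic is the
page-at-a-time path in heapam.c): take the content lock once, record which
tuples are visible, drop the lock, and keep only the pin while the tuples
are returned.

#include <stdbool.h>
#include <stdio.h>

#define MAX_TUPLES_PER_PAGE 4

typedef struct ToyPage
{
    int  ntuples;
    bool visible[MAX_TUPLES_PER_PAGE];  /* stand-in for real visibility checks */
} ToyPage;

int
main(void)
{
    ToyPage page = { 4, { true, false, true, true } };
    int     vistuples[MAX_TUPLES_PER_PAGE];
    int     nvis = 0;

    /* content lock held: one pass over the whole page */
    for (int off = 0; off < page.ntuples; off++)
        if (page.visible[off])
            vistuples[nvis++] = off;
    /* content lock released here; only the pin is kept */

    /* tuples are then returned one by one without retaking the lock */
    for (int i = 0; i < nvis; i++)
        printf("return tuple at offset %d\n", vistuples[i]);

    return 0;
}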
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sun, Jun 5, 2011 at 4:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
We've occasionally seen problems with VACUUM getting stuck for failure
to acquire a cleanup lock due to, for example, a cursor holding a pin
on the buffer page. In the worst case, this can cause an undetected
deadlock, if the backend holding the buffer pin blocks trying to
acquire a heavyweight lock that is in turn blocked by VACUUM. A while
back, someone (Greg Stark? me?) floated the idea of not waiting for
the cleanup lock. If we can't get it immediately, or within some
short period of time, then we just skip the page and continue on.
Today I had what might be a better idea: don't try to acquire a
cleanup lock at all. Instead, acquire an exclusive lock. After
having done so, observe the pin count. If there are no other buffer
pins, that means our exclusive lock is actually a cleanup lock, and we
proceed as now. If other buffer pins do exist, then we can't
defragment the page, but that doesn't mean no useful work can be done:
we can still mark used line pointers dead, or dead line pointers
unused. We cannot defragment, but that can be done either by the next
VACUUM or by a HOT cleanup. We can even arrange - using existing
mechanism - to leave behind a hint that the page is a good candidate
for a HOT cleanup, by setting pd_prune_xid to, say, FrozenXID.
Like the idea of skipping pages on which we can't acquire a cleanup
lock altogether, this should prevent VACUUM from getting stuck trying
to lock a heap page. While buffer pins can be held for extended
periods of time, I don't think there is any operation that holds a
buffer content lock more than very briefly. Furthermore, unlike the
idea of skipping the page altogether, we could use this approach even
during an anti-wraparound vacuum.
Thoughts?
Not waiting seems like a good idea.
Not returning to the block while it is in RAM or not cleaning the
block at all would cause different performance issues, which I would
wish to avoid.
Hot Standby has specific code to avoid this situation. Perhaps you
could copy that, not sure.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jun 6, 2011 at 2:36 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
Do we know if this is really a problem though ? The deadlock for
example, can happen only when a backend tries to get a table level
conflicting lock while holding the buffer pin and I am not sure if we
do that.
The deadlock isn't terribly common, because, as you say, you need the
process holding the buffer pin to try to take a lock on the relation
being vacuumed that is strong enough to conflict with
ShareUpdateExclusiveLock. That's a slightly unusual thing to do.
But the problem of vacuum stalling out because it can't get the
cleanup lock is a very real one. I've seen at least one customer hit
this in production, and it was pretty painful. Now, granted, you need
some bad application design, too: you have to leave a cursor lying
around instead of running it to completion and then stopping. But
supposing you do make that mistake, you might hope that it wouldn't
cause VACUUM starvation, which is what happens today. IOW, I'm less
worried about whether the cleanup lock is slowing vacuum down than I
am about eliminating the pathological cases where an autovacuum
worker gets pinned down, stuck waiting for a cleanup lock that never
arrives. Now the table doesn't get vacuumed (bad) and the system as a
whole is one AV worker short of what it's supposed to have (also bad).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Excerpts from Robert Haas's message of Mon Jun 06 08:06:10 -0400 2011:
But the problem of vacuum stalling out because it can't get the
cleanup lock is a very real one. I've seen at least one customer hit
this in production, and it was pretty painful. Now, granted, you need
some bad application design, too: you have to leave a cursor lying
around instead of running it to completion and then stopping. But
supposing you do make that mistake, you might hope that it wouldn't
cause VACUUM starvation, which is what happens today. IOW, I'm less
worried about whether the cleanup lock is slowing vacuum down than I
am about eliminating the pathological cases where an autovacuum
worker gets pinned down, stuck waiting for a cleanup lock that never
arrives. Now the table doesn't get vacuumed (bad) and the system as a
whole is one AV worker short of what it's supposed to have (also bad).
One of the good things about your proposal is that (AFAICS) you can
freeze tuples without the cleanup lock, so the antiwraparound cleanup
would still work.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Mon, Jun 6, 2011 at 12:49 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Excerpts from Robert Haas's message of Mon Jun 06 08:06:10 -0400 2011:
But the problem of vacuum stalling out because it can't get the
cleanup lock is a very real one. I've seen at least one customer hit
this in production, and it was pretty painful. Now, granted, you need
some bad application design, too: you have to leave a cursor lying
around instead of running it to completion and then stopping. But
supposing you do make that mistake, you might hope that it wouldn't
cause VACUUM starvation, which is what happens today. IOW, I'm less
worried about whether the cleanup lock is slowing vacuum down than I
am about eliminating the pathological cases where an autovacuum
worker gets pinned down, stuck waiting for a cleanup lock that never
arrives. Now the table doesn't get vacuumed (bad) and the system as a
whole is one AV worker short of what it's supposed to have (also bad).
One of the good things about your proposal is that (AFAICS) you can
freeze tuples without the cleanup lock, so the antiwraparound cleanup
would still work.
Yeah, I think that's a major selling point. VACUUM getting stuck is
Bad. Anti-wraparound VACUUM getting stuck is Really Bad.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 6, 2011 at 8:02 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 06.06.2011 09:35, Jim Nasby wrote:
I've had a related idea that I haven't looked into... if you're scanning a
relation (ie: index scan, seq scan) I've wondered if it would be more
efficient to deal with the entire page at once, possibly by making a copy of
it. This would reduce the number of times you pin the page (often quite
dramatically). I realize that means copying the entire page, but I suspect
that would occur entirely in the L1 cache, which would be fast.
We already do that. When an index scan moves to an index page, the heap tid
pointers of all the matching index tuples are copied to backend-private
memory in one go, and the lock is released. And for a seqscan, the
visibility of all the tuples on the page is checked in one go while holding
the lock, then the lock is released but the pin is kept. The pin is only
released after all the tuples have been read. There's no repeated pin-unpin
for each tuple.
But I think you've hit the important point here. The problem is not
whether VACUUM waits for the pin, it's that the pins can be held for
extended periods.
It makes more sense to try to limit pin hold times than it does to
come up with pin avoidance techniques.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jun 6, 2011 at 11:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
But I think you've hit the important point here. The problem is not
whether VACUUM waits for the pin, it's that the pins can be held for
extended periods.
Yes
It makes more sense to try to limit pin hold times than it does to
come up with pin avoidance techniques.
Well it's super-exclusive-vacuum-lock avoidance techniques. Why
shouldn't it make more sense to try to reduce the frequency and impact
of the single-purpose outlier in a non-critical-path instead of
burdening every other data reader with extra overhead?
I think Robert's plan is exactly right though I would phrase it
differently. We should get the exclusive lock, freeze/kill any xids
and line pointers, then if the pin-count is 1 do the compaction.
I'm really wishing we had more bits in the vm. It looks like we could use:
- contains not-all-visible tuples
- contains not-frozen xids
- in need of compaction
I'm sure we could find a use for one more page-level vm bit too.
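For illustration only, such a layout might look like the following; none of
these flag names exist in PostgreSQL, they just spell out the three states
being wished for:

/* Hypothetical per-page visibility-map flag bits (invented names). */
#define VM_NOT_ALL_VISIBLE    0x01    /* contains not-all-visible tuples */
#define VM_HAS_UNFROZEN_XIDS  0x02    /* contains not-frozen xids */
#define VM_NEEDS_COMPACTION   0x04    /* pruned but not yet defragmented */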
--
greg
On Tue, Jun 7, 2011 at 8:24 PM, Greg Stark <gsstark@mit.edu> wrote:
On Mon, Jun 6, 2011 at 11:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
But I think you've hit the important point here. The problem is not
whether VACUUM waits for the pin, it's that the pins can be held for
extended periods.
Yes
It makes more sense to try to limit pin hold times than it does to
come up with pin avoidance techniques.
Well it's super-exclusive-vacuum-lock avoidance techniques. Why
shouldn't it make more sense to try to reduce the frequency and impact
of the single-purpose outlier in a non-critical-path instead of
burdening every other data reader with extra overhead?
I think Robert's plan is exactly right though I would phrase it
differently. We should get the exclusive lock, freeze/kill any xids
and line pointers, then if the pin-count is 1 do the compaction.
Would that also be possible during recovery?
A similar problem exists with Hot Standby, so I'm worried fixing just
VACUUMs would be a kluge.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 7, 2011 at 3:43 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Tue, Jun 7, 2011 at 8:24 PM, Greg Stark <gsstark@mit.edu> wrote:
On Mon, Jun 6, 2011 at 11:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
But I think you've hit the important point here. The problem is not
whether VACUUM waits for the pin, it's that the pins can be held for
extended periods.
Yes
It makes more sense to try to limit pin hold times than it does to
come up with pin avoidance techniques.
Well it's super-exclusive-vacuum-lock avoidance techniques. Why
shouldn't it make more sense to try to reduce the frequency and impact
of the single-purpose outlier in a non-critical-path instead of
burdening every other data reader with extra overhead?
I think Robert's plan is exactly right though I would phrase it
differently. We should get the exclusive lock, freeze/kill any xids
and line pointers, then if the pin-count is 1 do the compaction.
Would that also be possible during recovery?
A similar problem exists with Hot Standby, so I'm worried fixing just
VACUUMs would be a kluge.
We have to do the same operation on both the master and standby, so if
the master decides to skip the compaction then the slave will skip it
as well (and need not worry about waiting for pin-count 1). But if
the master does the compaction then the slave will have to get a
matching cleanup lock, just as now.
Your idea of somehow adjusting things so that we don't hold the buffer
pin for a long period of time would be better in that regard, but I'm
not sure how to do it. Presumably we could rejigger things to copy
the tuples instead of holding a pin, but that would carry a
performance penalty for the (very common) case where there is no
conflict with VACUUM.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 7, 2011 at 3:24 PM, Greg Stark <gsstark@mit.edu> wrote:
Well it's super-exclusive-vacuum-lock avoidance techniques. Why
shouldn't it make more sense to try to reduce the frequency and impact
of the single-purpose outlier in a non-critical-path instead of
burdening every other data reader with extra overhead?
I think Robert's plan is exactly right though I would phrase it
differently. We should get the exclusive lock, freeze/kill any xids
and line pointers, then if the pin-count is 1 do the compaction.
I wrote a really neat patch to do this today... and then, as I
thought about it some more, I started to think that it's probably
unsafe. Here's the trouble: with this approach, we assume that it's
OK to change the contents of the line pointer while holding only an
exclusive lock on the buffer. But there is a good deal of code out
there that thinks it's OK to examine a line pointer with only a pin on
the buffer (no lock). You need a content lock to scan the item
pointer array, but once you've identified an item of interest, you're
entitled to assume that it won't be modified while you hold a buffer
pin. Now, if you've identified a particular tuple as being visible to
your scan, then you might think that VACUUM shouldn't be removing it
anyway. But I think that's only true for MVCC scans - for example,
what happens under SnapshotNow semantics? But then, on third
thought, if you've also got an MVCC snapshot taken before the start of
the SnapshotNow scan, you are probably OK, because your advertised
xmin should prevent anyone from removing anything anyway, so how do
you actually provoke a failure?
Anyway, I'm attaching the patch, in case anyone has any ideas on where
to go with this.
I'm really wishing we had more bits in the vm. It looks like we could use:
- contains not-all-visible tuples
- contains not-frozen xids
- in need of compaction
I'm sure we could find a use for one more page-level vm bit too.
We've got plenty of room for more page-level bits, if we need them.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
vacuum-unstick-v1.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b2d1901..b9a057e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3992,7 +3992,7 @@ log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
- TransactionId latestRemovedXid)
+ TransactionId latestRemovedXid, bool defragment)
{
xl_heap_clean xlrec;
uint8 info;
@@ -4005,6 +4005,7 @@ log_heap_clean(Relation reln, Buffer buffer,
xlrec.node = reln->rd_node;
xlrec.block = BufferGetBlockNumber(buffer);
xlrec.latestRemovedXid = latestRemovedXid;
+ xlrec.flags = defragment ? HEAP_CLEAN_DEFRAGMENT : 0;
xlrec.nredirected = nredirected;
xlrec.ndead = ndead;
@@ -4308,6 +4309,7 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
int ndead;
int nunused;
Size freespace;
+ bool defragment;
/*
* We're about to remove tuples. In Hot Standby mode, ensure that there's
@@ -4325,11 +4327,15 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
if (record->xl_info & XLR_BKP_BLOCK_1)
return;
+ defragment = (xlrec->flags & HEAP_CLEAN_DEFRAGMENT) != 0;
buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
if (!BufferIsValid(buffer))
return;
- LockBufferForCleanup(buffer);
+ if (defragment)
+ LockBufferForCleanup(buffer);
+ else
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
page = (Page) BufferGetPage(buffer);
if (XLByteLE(lsn, PageGetLSN(page)))
@@ -4351,7 +4357,8 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
heap_page_prune_execute(buffer,
redirected, nredirected,
nowdead, ndead,
- nowunused, nunused);
+ nowunused, nunused,
+ defragment);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 61f2ce4..f3fecce 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -108,15 +108,19 @@ heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin)
if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
{
- /* OK, try to get exclusive buffer lock */
+ /*
+ * Try to get buffer cleanup lock. There's little point in pruning
+ * the page unless we can also defragment.
+ */
if (!ConditionalLockBufferForCleanup(buffer))
return;
/*
* Now that we have buffer lock, get accurate information about the
* page's free space, and recheck the heuristic about whether to
- * prune. (We needn't recheck PageIsPrunable, since no one else could
- * have pruned while we hold pin.)
+ * prune. (We don't recheck PageIsPrunable(); if vacuum cleaned up
+ * the page despite our pin, it will necessarily have skipped
+ * the defrag step, so we may as well do it now.)
*/
if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
{
@@ -124,7 +128,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin)
* needed */
/* OK to prune */
- (void) heap_page_prune(relation, buffer, OldestXmin, true, &ignore);
+ (void) heap_page_prune(relation, buffer, OldestXmin, true, &ignore,
+ true);
}
/* And release buffer lock */
@@ -134,9 +139,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin)
/*
- * Prune and repair fragmentation in the specified page.
+ * Prune and (optionally) repair fragmentation in the specified page.
*
- * Caller must have pin and buffer cleanup lock on the page.
+ * Caller must have pin and exclusive lock on the page. To defragment, a
+ * cleanup lock is required.
*
* OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
* or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
@@ -151,7 +157,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin)
*/
int
heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
- bool report_stats, TransactionId *latestRemovedXid)
+ bool report_stats, TransactionId *latestRemovedXid,
+ bool defragment)
{
int ndeleted = 0;
Page page = BufferGetPage(buffer);
@@ -201,8 +208,9 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
/* Any error while applying the changes is critical */
START_CRIT_SECTION();
- /* Have we found any prunable items? */
- if (prstate.nredirected > 0 || prstate.ndead > 0 || prstate.nunused > 0)
+ /* Have we found any prunable items (or do we need to defrag anyway)? */
+ if (prstate.nredirected > 0 || prstate.ndead > 0 || prstate.nunused > 0
+ || (defragment && PageNeedsDefrag(page)))
{
/*
* Apply the planned item changes, then repair page fragmentation, and
@@ -211,7 +219,8 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
heap_page_prune_execute(buffer,
prstate.redirected, prstate.nredirected,
prstate.nowdead, prstate.ndead,
- prstate.nowunused, prstate.nunused);
+ prstate.nowunused, prstate.nunused,
+ defragment);
/*
* Update the page's pd_prune_xid field to either zero, or the lowest
@@ -220,11 +229,21 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
/*
- * Also clear the "page is full" flag, since there's no point in
- * repeating the prune/defrag process until something else happens to
- * the page.
+ * If we were able to defragment, also clear the "page is full" flag,
+ * since there's no point in repeating this work until something else happens
+ * to the page; also clear the "needs defrag" flag, if set.
+ *
+ * If we were not able to defragment, set the "needs defrag" flag so
+ * that the next vacuum will try to clean it up even if no new dead
+ * tuples have been created.
*/
- PageClearFull(page);
+ if (defragment)
+ {
+ PageClearFull(page);
+ PageClearNeedsDefrag(page);
+ }
+ else
+ PageSetNeedsDefrag(page);
MarkBufferDirty(buffer);
@@ -239,7 +258,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
prstate.redirected, prstate.nredirected,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused,
- prstate.latestRemovedXid);
+ prstate.latestRemovedXid, defragment);
PageSetLSN(BufferGetPage(buffer), recptr);
PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
@@ -252,15 +271,15 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
* pd_prune_xid field, update it and mark the buffer dirty. This is
* treated as a non-WAL-logged hint.
*
- * Also clear the "page is full" flag if it is set, since there's no
- * point in repeating the prune/defrag process until something else
- * happens to the page.
+ * Also clear the "page is full" flag, since there's no point in
+ * repeating this work until something else happens to the page.
*/
if (((PageHeader) page)->pd_prune_xid != prstate.new_prune_xid ||
PageIsFull(page))
{
((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
- PageClearFull(page);
+ if (defragment)
+ PageClearFull(page);
SetBufferCommitInfoNeedsSave(buffer);
}
}
@@ -643,7 +662,7 @@ void
heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
- OffsetNumber *nowunused, int nunused)
+ OffsetNumber *nowunused, int nunused, bool defragment)
{
Page page = (Page) BufferGetPage(buffer);
OffsetNumber *offnum;
@@ -684,7 +703,8 @@ heap_page_prune_execute(Buffer buffer,
* Finally, repair any fragmentation, and update the page's hint bit about
* whether it has free pointers.
*/
- PageRepairFragmentation(page);
+ if (defragment)
+ PageRepairFragmentation(page);
}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b197b45..c794dc2 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -124,7 +124,7 @@ static void lazy_cleanup_index(Relation indrel,
IndexBulkDeleteResult *stats,
LVRelStats *vacrelstats);
static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
- int tupindex, LVRelStats *vacrelstats);
+ int tupindex, LVRelStats *vacrelstats, bool defragment);
static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
static BlockNumber count_nondeletable_pages(Relation onerel,
LVRelStats *vacrelstats);
@@ -418,6 +418,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool all_visible_according_to_vm;
bool all_visible;
bool has_dead_tuples;
+ bool defragment;
if (blkno == next_not_all_visible_block)
{
@@ -485,8 +486,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vac_strategy);
- /* We need buffer cleanup lock so that we can prune HOT chains. */
- LockBufferForCleanup(buf);
+ /*
+ * If we can get a buffer cleanup lock, we'll prune and defragment
+ * the page. But if someone's holding a pin, we don't want to get
+ * stuck waiting for it, so we'll just prune and leave defragmentation
+ * for another time.
+ */
+ defragment = LockBufferForPossibleCleanup(buf);
page = BufferGetPage(buf);
@@ -567,7 +573,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* We count tuples removed by the pruning step as removed by VACUUM.
*/
tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, false,
- &vacrelstats->latestRemovedXid);
+ &vacrelstats->latestRemovedXid,
+ defragment);
/*
* Now scan the page to collect vacuumable items and check for tuples
@@ -759,7 +766,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
vacrelstats->num_dead_tuples > 0)
{
/* Remove tuples from heap */
- lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats);
+ lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, defragment);
/*
* Forget the now-vacuumed tuples, and press on, but be careful
@@ -772,8 +779,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* Update the all-visible flag on the page */
- if (!PageIsAllVisible(page) && all_visible)
+ /*
+ * Update the all-visible flag on the page. We skip this if we weren't
+ * able to defragment, so that the next vacuum will (hopefully) do so.
+ */
+ if (!PageIsAllVisible(page) && all_visible && defragment)
{
PageSetAllVisible(page);
SetBufferCommitInfoNeedsSave(buf);
@@ -926,21 +936,27 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
Buffer buf;
Page page;
Size freespace;
+ bool defragment;
vacuum_delay_point();
tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vac_strategy);
- LockBufferForCleanup(buf);
- tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats);
+ defragment = LockBufferForPossibleCleanup(buf);
+ tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
+ defragment);
- /* Now that we've compacted the page, record its available space */
- page = BufferGetPage(buf);
- freespace = PageGetHeapFreeSpace(page);
+ /* If we've compacted the page, record its available space */
+ if (defragment)
+ {
+ page = BufferGetPage(buf);
+ freespace = PageGetHeapFreeSpace(page);
+ }
UnlockReleaseBuffer(buf);
- RecordPageWithFreeSpace(onerel, tblk, freespace);
+ if (defragment)
+ RecordPageWithFreeSpace(onerel, tblk, freespace);
npages++;
}
@@ -964,7 +980,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
*/
static int
lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
- int tupindex, LVRelStats *vacrelstats)
+ int tupindex, LVRelStats *vacrelstats, bool defragment)
{
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxOffsetNumber];
@@ -987,7 +1003,8 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
unused[uncnt++] = toff;
}
- PageRepairFragmentation(page);
+ if (defragment)
+ PageRepairFragmentation(page);
MarkBufferDirty(buffer);
@@ -999,7 +1016,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
recptr = log_heap_clean(onerel, buffer,
NULL, 0, NULL, 0,
unused, uncnt,
- vacrelstats->latestRemovedXid);
+ vacrelstats->latestRemovedXid, defragment);
PageSetLSN(page, recptr);
PageSetTLI(page, ThisTimeLineID);
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e59af33..093f4a3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2564,6 +2564,50 @@ ConditionalLockBufferForCleanup(Buffer buffer)
return false;
}
+/*
+ * LockBufferForPossibleCleanup
+ *
+ * Exclusive lock the target buffer. If the pin count happens to be exactly
+ * one, then return true, indicating that we've acquired a cleanup lock. If
+ * not, return false, but retain the buffer lock.
+ */
+bool
+LockBufferForPossibleCleanup(Buffer buffer)
+{
+ volatile BufferDesc *bufHdr;
+ bool got_cleanup_lock;
+
+ Assert(BufferIsValid(buffer));
+ Assert(PinCountWaitBuf == NULL);
+
+ if (BufferIsLocal(buffer))
+ {
+ /* There should be exactly one pin */
+ if (LocalRefCount[-buffer - 1] != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ LocalRefCount[-buffer - 1]);
+ /* Nobody else to wait for */
+ return true;
+ }
+
+ /* There should be exactly one local pin */
+ if (PrivateRefCount[buffer - 1] != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ PrivateRefCount[buffer - 1]);
+
+ bufHdr = &BufferDescriptors[buffer - 1];
+
+ /* Get buffer lock. */
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Check whether it's a cleanup lock. */
+ LockBufHdr(bufHdr);
+ Assert(bufHdr->refcount > 0);
+ got_cleanup_lock = (bufHdr->refcount == 1);
+ UnlockBufHdr(bufHdr);
+
+ return got_cleanup_lock;
+}
/*
* Functions for buffer I/O handling
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 776ea5c..3a9417d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -133,7 +133,7 @@ extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused,
- TransactionId latestRemovedXid);
+ TransactionId latestRemovedXid, bool defragment);
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
TransactionId cutoff_xid,
OffsetNumber *offsets, int offcnt);
@@ -147,11 +147,13 @@ extern void heap_page_prune_opt(Relation relation, Buffer buffer,
TransactionId OldestXmin);
extern int heap_page_prune(Relation relation, Buffer buffer,
TransactionId OldestXmin,
- bool report_stats, TransactionId *latestRemovedXid);
+ bool report_stats, TransactionId *latestRemovedXid,
+ bool defragment);
extern void heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
- OffsetNumber *nowunused, int nunused);
+ OffsetNumber *nowunused, int nunused,
+ bool defragment);
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
diff --git a/src/include/access/htup.h b/src/include/access/htup.h
index 966e2d0..0854ca6 100644
--- a/src/include/access/htup.h
+++ b/src/include/access/htup.h
@@ -689,6 +689,7 @@ typedef struct xl_heap_clean
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
+ uint16 flags;
uint16 nredirected;
uint16 ndead;
/* OFFSET NUMBERS FOLLOW */
@@ -696,6 +697,9 @@ typedef struct xl_heap_clean
#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define HEAP_CLEAN_DEFRAGMENT 0x0001 /* defragment page during cleanup */
+
+
/*
* Cleanup_info is required in some cases during a lazy VACUUM.
* Used for reporting the results of HeapTupleHeaderAdvanceLatestRemovedXid()
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 49b5d31..ab568ca 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -208,6 +208,7 @@ extern void LockBuffer(Buffer buffer, int mode);
extern bool ConditionalLockBuffer(Buffer buffer);
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+extern bool LockBufferForPossibleCleanup(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
extern void AbortBufferIO(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 42d6b10..2d4c721 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -154,8 +154,9 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_NEEDS_DEFRAG 0x0008 /* page pruned without defrag */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000f /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -345,6 +346,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageNeedsDefrag(page) \
+ (((PageHeader) (page))->pd_flags & PD_NEEDS_DEFRAG)
+#define PageSetNeedsDefrag(page) \
+ (((PageHeader) (page))->pd_flags |= PD_NEEDS_DEFRAG)
+#define PageClearNeedsDefrag(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_NEEDS_DEFRAG)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
On Sun, Jun 5, 2011 at 4:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
We've occasionally seen problems with VACUUM getting stuck for failure
to acquire a cleanup lock due to, for example, a cursor holding a pin
on the buffer page. In the worst case, this can cause an undetected
deadlock, if the backend holding the buffer pin blocks trying to
acquire a heavyweight lock that is in turn blocked by VACUUM.
Those deadlocks can be detected in exactly the same way as is used for
Hot Standby.
The cleanup waiter registers interest in the pin, and anyone with a lock
request that must wait checks to see whether they hold a pin that would
cause a deadlock.
I'll look at doing a patch for that. Shouldn't take long.
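Roughly, the check would look something like this toy sketch (invented
names, not a patch; the real mechanism would piggyback on the existing
buffer-pin bookkeeping, much as recovery conflict handling already does):

#include <stdbool.h>
#include <stdio.h>

static int cleanup_wait_buffer = 7;   /* buffer id a VACUUM is waiting on; -1 if none */

/* Before sleeping on a heavyweight lock, see whether we hold the pin
 * that the cleanup waiter is blocked on. */
static bool
would_deadlock(const int *my_pins, int npins)
{
    for (int i = 0; i < npins; i++)
        if (my_pins[i] == cleanup_wait_buffer)
            return true;
    return false;
}

int
main(void)
{
    int my_pins[] = { 2, 7 };          /* e.g. an open cursor pins buffer 7 */

    if (would_deadlock(my_pins, 2))
        printf("deadlock: error out (or drop the pin) instead of sleeping\n");
    return 0;
}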
A while
back, someone (Greg Stark? me?) floated the idea of not waiting for
the cleanup lock. If we can't get it immediately, or within some
short period of time, then we just skip the page and continue on.
Separately, that sounds like a great idea and it's simple to implement
- patch attached.
One possible enhancement: I don't see any particular reason why
the heap pages need to be vacuumed in exactly sequential order. If
they are on disk, reading sequentially is useful, in which case nobody
has a pin and so we will continue. But if the blocks are already in
shared_buffers, then the sequential order doesn't matter at all. So we
could skip pages and then return to them later on.
Also, ISTM that LockBufferForCleanup() waits for just a single buffer,
but it could equally well wait for multiple buffers at the same time.
By this, we would then be able to simply register our interest in
multiple buffers and get woken as soon as one of them is free. That
way we could read the blocks sequentially, but lock and clean them out
of sequence if necessary. Do this in chunks, so it plays nicely with
buffer strategy. (Patch doesn't do that yet).
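A toy sketch of that scan order (invented helpers standing in for the real
locking and vacuuming calls): read the blocks sequentially, skip any block
whose cleanup lock isn't immediately available, and come back to the
skipped blocks in a second pass.

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 8

static bool
try_cleanup_lock(int blkno)
{
    return blkno != 3;                 /* pretend block 3 is pinned by a cursor */
}

static void
vacuum_page(int blkno)
{
    printf("vacuumed block %d\n", blkno);
}

int
main(void)
{
    int deferred[NBLOCKS];
    int ndeferred = 0;

    /* sequential pass: don't wait, just remember busy blocks */
    for (int blkno = 0; blkno < NBLOCKS; blkno++)
    {
        if (!try_cleanup_lock(blkno))
        {
            deferred[ndeferred++] = blkno;
            continue;
        }
        vacuum_page(blkno);
    }

    /* second chance for the skipped blocks, possibly waiting this time */
    for (int i = 0; i < ndeferred; i++)
        vacuum_page(deferred[i]);

    return 0;
}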
(Not sure if the patch handles vacuum map correctly if we skip the
page, but it's a reasonable prototype for discussion).
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
vacuum_skip_busy_pages.v1.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b2d1901..81422af 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3841,6 +3841,44 @@ recheck_xvac:
return changed;
}
+/*
+ * heap_tuple_needs_freeze
+ *
+ * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
+ * are older than the specified cutoff XID. If so, return TRUE.
+ *
+ * It doesn't matter whether the tuple is alive or dead, we are checking
+ * to see if a tuple needs to be removed or frozen to avoid wraparound.
+ */
+bool
+heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ Buffer buf)
+{
+ TransactionId xid;
+
+ xid = HeapTupleHeaderGetXmin(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ return true;
+
+ if (!(tuple->t_infomask & HEAP_XMAX_IS_MULTI))
+ {
+ xid = HeapTupleHeaderGetXmax(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ return true;
+ }
+
+ if (tuple->t_infomask & HEAP_MOVED)
+ {
+ xid = HeapTupleHeaderGetXvac(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ return true;
+ }
+
+ return false;
+}
/* ----------------
* heap_markpos - mark scan position
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index f42504c..d08ac65 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1054,7 +1054,7 @@ vacuum_rel(Oid relid, VacuumStmt *vacstmt, bool do_toast, bool for_wraparound)
vacstmt->freeze_min_age, vacstmt->freeze_table_age);
}
else
- lazy_vacuum_rel(onerel, vacstmt, vac_strategy);
+ lazy_vacuum_rel(onerel, vacstmt, vac_strategy, for_wraparound);
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b197b45..dd1ebcd 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -115,8 +115,9 @@ static BufferAccessStrategy vac_strategy;
/* non-export function prototypes */
static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool scan_all);
+ Relation *Irel, int nindexes, bool scan_all, bool for_wraparound);
static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
+static bool lazy_scan_page_for_wraparound(Buffer buf);
static void lazy_vacuum_index(Relation indrel,
IndexBulkDeleteResult **stats,
LVRelStats *vacrelstats);
@@ -146,7 +147,7 @@ static int vac_cmp_itemptr(const void *left, const void *right);
*/
void
lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool for_wraparound)
{
LVRelStats *vacrelstats;
Relation *Irel;
@@ -193,7 +194,7 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
vacrelstats->hasindex = (nindexes > 0);
/* Do the vacuuming */
- lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all);
+ lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all, for_wraparound);
/* Done with indexes */
vac_close_indexes(nindexes, Irel, NoLock);
@@ -327,7 +328,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
*/
static void
lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool scan_all)
+ Relation *Irel, int nindexes, bool scan_all, bool for_wraparound)
{
BlockNumber nblocks,
blkno;
@@ -486,7 +487,37 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RBM_NORMAL, vac_strategy);
/* We need buffer cleanup lock so that we can prune HOT chains. */
- LockBufferForCleanup(buf);
+ if (!ConditionalLockBufferForCleanup(buf))
+ {
+ /*
+ * It's OK to skip vacuuming a page, as long as it doesn't have
+ * data that needs to be cleaned for wraparound avoidance.
+ */
+ if (!for_wraparound)
+ {
+ ReleaseBuffer(buf);
+ continue;
+ }
+ else
+ {
+ /*
+ * If this is a wraparound checking vacuum, then we read
+ * the page with share lock to see if any xids need to be
+ * frozen. If the page doesn't need attention we just skip
+ * and continue. If it does, we wait for cleanup lock.
+ *
+ * We could defer the lock request further by remembering
+ * the page and coming back to it later, or we could even
+ * register ourselves for multiple buffers and then
+ * service whichever one is received first.
+ */
+ if (!lazy_scan_page_for_wraparound(buf))
+ continue;
+
+ LockBufferForCleanup(buf);
+ /* drop through to normal processing */
+ }
+ }
page = BufferGetPage(buf);
@@ -1010,6 +1041,60 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
}
/*
+ * lazy_scan_page_for_wraparound() -- scan page to see if any tuples
+ * need to be cleaned to avoid wraparound
+ *
+ * Caller must not hold a buffer lock on the buffer.
+ *
+ * Returns true if the page needs to be vacuumed using cleanup lock.
+ */
+static bool
+lazy_scan_page_for_wraparound(Buffer buf)
+{
+ Page page;
+ OffsetNumber offnum,
+ maxoff;
+ HeapTupleHeader tupleheader;
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page) || PageIsEmpty(page))
+ {
+ /* PageIsNew probably shouldn't happen... */
+ UnlockReleaseBuffer(buf);
+ return false;
+ }
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid;
+
+ itemid = PageGetItemId(page, offnum);
+
+ if (!ItemIdIsNormal(itemid))
+ continue;
+
+ tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ if (heap_tuple_needs_freeze(tupleheader, FreezeLimit, buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ /* Don't release buffer, we want to continue using it */
+ return true;
+ }
+ } /* scan along page */
+
+ UnlockReleaseBuffer(buf);
+ return false;
+}
+
+
+/*
* lazy_vacuum_index() -- vacuum one index relation.
*
* Delete all the index entries pointing to tuples listed in
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 776ea5c..85cbeb3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -111,6 +111,8 @@ extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
extern void heap_inplace_update(Relation relation, HeapTuple tuple);
extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
Buffer buf);
+extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ Buffer buf);
extern Oid simple_heap_insert(Relation relation, HeapTuple tup);
extern void simple_heap_delete(Relation relation, ItemPointer tid);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d8fd0ca..60f21f5 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -162,7 +162,7 @@ extern void vacuum_delay_point(void);
/* in commands/vacuumlazy.c */
extern void lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool for_wraparound);
/* in commands/analyze.c */
extern void analyze_rel(Oid relid, VacuumStmt *vacstmt,
On Thu, Nov 3, 2011 at 7:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
A while
back, someone (Greg Stark? me?) floated the idea of not waiting for
the cleanup lock. If we can't get it immediately, or within some
short period of time, then we just skip the page and continue on.
Separately, that sounds like a great idea and it's simple to implement
- patch attached.
Oh, that's kind of clever. I was thinking that you'd have to disable
this entirely for anti-wraparound vacuum, but the way you've done it
avoids that. You'll still have to wait if there's a competing pin on
a buffer that contains tuples actually in need of freezing, but that
should be relatively rare.
One possible enhancement: I don't see any particular reason why
Also, ISTM that LockBufferForCleanup() waits for just a single buffer,
but it could equally well wait for multiple buffers at the same time.
By this, we would then be able to simply register our interest in
multiple buffers and get woken as soon as one of them were free. That
way we could read the blocks sequentially, but lock and clean them out
of sequence if necessary. Do this in chunks, so it plays nicely with
buffer strategy. (Patch doesn't do that yet).
I doubt this would help much. The real issue is with open cursors,
and those can easily be left open for long enough that those
optimizations won't help. I think the patch as it stands probably
gets just about all of the benefit that can be had from this approach
while still being reasonably simple.
(Not sure if the patch handles vacuum map correctly if we skip the
page, but it's a reasonable prototype for discussion).
Yeah. I think that should be OK, but:
- It looks to me like you haven't done anything about the second heap
pass. That should probably get a similar fix.
- I think that this is going to screw up the reltuples calculation
unless we fudge it somehow. The number of scanned pages has already
been incremented by the time your new code is reached.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 3, 2011 at 1:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think that should be OK, but:
- It looks to me like you haven't done anything about the second heap
pass. That should probably get a similar fix.
I was assuming this worked with Pavan's patch to remove second pass.
Not in any rush to commit this, so will wait till that is thru.
- I think that this is going to screw up the reltuples calculation
unless we fudge it somehow. The number of scanned pages has already
been incremented by the time your new code is reached.
Yeh, I'll have a look at that in more detail. Thanks for the review.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 3, 2011 at 9:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Thu, Nov 3, 2011 at 1:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think that should be OK, but:
- It looks to me like you haven't done anything about the second heap
pass. That should probably get a similar fix.
I was assuming this worked with Pavan's patch to remove second pass.
It's not entirely certain that will make it into 9.2, so I would
rather get this done first. If you want I can pick up what you've
done and send you back a version that addresses this.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 3, 2011 at 2:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Nov 3, 2011 at 9:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Thu, Nov 3, 2011 at 1:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think that should be OK, but:
- It looks to me like you haven't done anything about the second heap
pass. That should probably get a similar fix.
I was assuming this worked with Pavan's patch to remove second pass.
It's not entirely certain that will make it into 9.2, so I would
rather get this done first. If you want I can pick up what you've
done and send you back a version that addresses this.
OK, that seems efficient. Thanks.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 3, 2011 at 12:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Thu, Nov 3, 2011 at 2:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Nov 3, 2011 at 9:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Thu, Nov 3, 2011 at 1:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think that should be OK, but:
- It looks to me like you haven't done anything about the second heap
pass. That should probably get a similar fix.
I was assuming this worked with Pavan's patch to remove second pass.
It's not entirely certain that will make it into 9.2, so I would
rather get this done first. If you want I can pick up what you've
done and send you back a version that addresses this.
OK, that seems efficient. Thanks.
Here's a new version. I fixed the second pass as discussed (which
turned out to be trivial). To address the concern about relpages, I
moved this pre-existing line to after we get the buffer lock:
+ vacrelstats->scanned_pages++;
That appears to do the right thing.
I found it kind of confusing that lazy_scan_page_for_wraparound()
returns with the pin either held or not held depending on the return
value, so I rearranged things very slightly so that it doesn't need to
do that. I'm wondering whether we should rename that function to
something like lazy_check_needs_freeze().
I tested this out and discovered that "VACUUM FREEZE" doesn't set the
for_wraparound flag. On further review, I think that we should
probably conditionalize the behavior on the scan_all flag that
lazy_vacuum_rel sets, rather than for_wraparound. Otherwise, there's
no way for the user to manually force relfrozenxid to advance, which
doesn't seem good. I haven't made that change in this version,
though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
vacuum_skip_busy_pages.v2.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b2d1901..81422af 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3841,6 +3841,44 @@ recheck_xvac:
return changed;
}
+/*
+ * heap_tuple_needs_freeze
+ *
+ * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
+ * are older than the specified cutoff XID. If so, return TRUE.
+ *
+ * It doesn't matter whether the tuple is alive or dead, we are checking
+ * to see if a tuple needs to be removed or frozen to avoid wraparound.
+ */
+bool
+heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ Buffer buf)
+{
+ TransactionId xid;
+
+ xid = HeapTupleHeaderGetXmin(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ return true;
+
+ if (!(tuple->t_infomask & HEAP_XMAX_IS_MULTI))
+ {
+ xid = HeapTupleHeaderGetXmax(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ return true;
+ }
+
+ if (tuple->t_infomask & HEAP_MOVED)
+ {
+ xid = HeapTupleHeaderGetXvac(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ return true;
+ }
+
+ return false;
+}
/* ----------------
* heap_markpos - mark scan position
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index f42504c..d08ac65 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1054,7 +1054,7 @@ vacuum_rel(Oid relid, VacuumStmt *vacstmt, bool do_toast, bool for_wraparound)
vacstmt->freeze_min_age, vacstmt->freeze_table_age);
}
else
- lazy_vacuum_rel(onerel, vacstmt, vac_strategy);
+ lazy_vacuum_rel(onerel, vacstmt, vac_strategy, for_wraparound);
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b197b45..8a09a01 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -115,8 +115,10 @@ static BufferAccessStrategy vac_strategy;
/* non-export function prototypes */
static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool scan_all);
+ Relation *Irel, int nindexes, bool scan_all,
+ bool for_wraparound);
static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
+static bool lazy_scan_page_for_wraparound(Buffer buf);
static void lazy_vacuum_index(Relation indrel,
IndexBulkDeleteResult **stats,
LVRelStats *vacrelstats);
@@ -146,7 +148,7 @@ static int vac_cmp_itemptr(const void *left, const void *right);
*/
void
lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool for_wraparound)
{
LVRelStats *vacrelstats;
Relation *Irel;
@@ -193,7 +195,8 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
vacrelstats->hasindex = (nindexes > 0);
/* Do the vacuuming */
- lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all);
+ lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all,
+ for_wraparound);
/* Done with indexes */
vac_close_indexes(nindexes, Irel, NoLock);
@@ -327,7 +330,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
*/
static void
lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool scan_all)
+ Relation *Irel, int nindexes, bool scan_all, bool for_wraparound)
{
BlockNumber nblocks,
blkno;
@@ -453,8 +456,6 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
vacuum_delay_point();
- vacrelstats->scanned_pages++;
-
/*
* If we are close to overrunning the available space for dead-tuple
* TIDs, pause and do a cycle of vacuuming before we tackle this page.
@@ -486,7 +487,41 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RBM_NORMAL, vac_strategy);
/* We need buffer cleanup lock so that we can prune HOT chains. */
- LockBufferForCleanup(buf);
+ if (!ConditionalLockBufferForCleanup(buf))
+ {
+ /*
+ * It's OK to skip vacuuming a page, as long as it's not got data
+ * that needs to be cleaned for wraparound avoidance.
+ */
+ if (!for_wraparound)
+ {
+ ReleaseBuffer(buf);
+ continue;
+ }
+
+ /*
+ * If this is a wraparound checking vacuum, then we read the page
+ * with share lock to see if any xids need to be frozen. If the
+ * page doesn't need attention we just skip and continue. If it
+ * does, we wait for cleanup lock.
+ *
+ * We could defer the lock request further by remembering the page
+ * and coming back to it later, or we could even register
+ * ourselves for multiple buffers and then service whichever one
+ * is received first. For now, this seems good enough.
+ */
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ if (!lazy_scan_page_for_wraparound(buf))
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBufferForCleanup(buf);
+ /* drop through to normal processing */
+ }
+
+ vacrelstats->scanned_pages++;
page = BufferGetPage(buf);
@@ -932,7 +967,8 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vac_strategy);
- LockBufferForCleanup(buf);
+ if (!ConditionalLockBufferForCleanup(buf))
+ continue;
tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats);
/* Now that we've compacted the page, record its available space */
@@ -1010,6 +1046,50 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
}
/*
+ * lazy_scan_page_for_wraparound() -- scan page to see if any tuples
+ * need to be cleaned to avoid wraparound
+ *
+ * Returns true if the page needs to be vacuumed using cleanup lock.
+ */
+static bool
+lazy_scan_page_for_wraparound(Buffer buf)
+{
+ Page page;
+ OffsetNumber offnum,
+ maxoff;
+ HeapTupleHeader tupleheader;
+
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page) || PageIsEmpty(page))
+ {
+ /* PageIsNew probably shouldn't happen... */
+ return false;
+ }
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid;
+
+ itemid = PageGetItemId(page, offnum);
+
+ if (!ItemIdIsNormal(itemid))
+ continue;
+
+ tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ if (heap_tuple_needs_freeze(tupleheader, FreezeLimit, buf))
+ return true;
+ } /* scan along page */
+
+ return false;
+}
+
+
+/*
* lazy_vacuum_index() -- vacuum one index relation.
*
* Delete all the index entries pointing to tuples listed in
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 776ea5c..85cbeb3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -111,6 +111,8 @@ extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
extern void heap_inplace_update(Relation relation, HeapTuple tuple);
extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
Buffer buf);
+extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ Buffer buf);
extern Oid simple_heap_insert(Relation relation, HeapTuple tup);
extern void simple_heap_delete(Relation relation, ItemPointer tid);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d8fd0ca..60f21f5 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -162,7 +162,7 @@ extern void vacuum_delay_point(void);
/* in commands/vacuumlazy.c */
extern void lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool for_wraparound);
/* in commands/analyze.c */
extern void analyze_rel(Oid relid, VacuumStmt *vacstmt,
On Fri, Nov 4, 2011 at 6:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Here's a new version. I fixed the second pass as discussed (which
turned out to be trivial). To address the concern about relpages, I
moved this pre-existing line to after we get the buffer lock:
+ vacrelstats->scanned_pages++;
That appears to do the right thing.
I think we need to count skipped pages also. I don't like the idea
that vacuum would just report fewer pages than there are in the table.
We'll just get requests to explain that.
I found it kind of confusing that lazy_scan_page_for_wraparound()
returns with the pin either held or not held depending on the return
value, so I rearranged things very slightly so that it doesn't need to
do that. I'm wondering whether we should rename that function to
something like lazy_check_needs_freeze().
OK
I tested this out and discovered that "VACUUM FREEZE" doesn't set the
for_wraparound flag. On further review, I think that we should
probably conditionalize the behavior on the scan_all flag that
lazy_vacuum_rel sets, rather than for_wraparound. Otherwise, there's
no way for the user to manually force relfrozenxid to advance, which
doesn't seem good. I haven't made that change in this version,
though.
Agreed, separate patch.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 4, 2011 at 3:12 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, Nov 4, 2011 at 6:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Here's a new version. I fixed the second pass as discussed (which
turned out to be trivial). To address the concern about relpages, I
moved this pre-existing line to after we get the buffer lock:
+ vacrelstats->scanned_pages++;
That appears to do the right thing.
I think we need to count skipped pages also. I don't like the idea
that vacuum would just report fewer pages than there are in the table.
We'll just get requests to explain that.
It's been skipping pages due to the visibility map since 8.4. This
seems like small potatoes by comparison, but we could add some
counters if you like.
I found it kind of confusing that lazy_scan_page_for_wraparound()
returns with the pin either held or not held depending on the return
value, so I rearranged things very slightly so that it doesn't need to
do that. I'm wondering whether we should rename that function to
something like lazy_check_needs_freeze().
OK
I tested this out and discovered that "VACUUM FREEZE" doesn't set the
for_wraparound flag. On further review, I think that we should
probably conditionalize the behavior on the scan_all flag that
lazy_vacuum_rel sets, rather than for_wraparound. Otherwise, there's
no way for the user to manually force relfrozenxid to advance, which
doesn't seem good. I haven't made that change in this version,
though.
Agreed, separate patch.
Doing that actually makes the patch simpler, so I went ahead and did
that in the attached version, along with the renaming mentioned above.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
vacuum_skip_busy_pages.v3.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b2d1901..81422af 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3841,6 +3841,44 @@ recheck_xvac:
return changed;
}
+/*
+ * heap_tuple_needs_freeze
+ *
+ * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
+ * are older than the specified cutoff XID. If so, return TRUE.
+ *
+ * It doesn't matter whether the tuple is alive or dead, we are checking
+ * to see if a tuple needs to be removed or frozen to avoid wraparound.
+ */
+bool
+heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ Buffer buf)
+{
+ TransactionId xid;
+
+ xid = HeapTupleHeaderGetXmin(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ return true;
+
+ if (!(tuple->t_infomask & HEAP_XMAX_IS_MULTI))
+ {
+ xid = HeapTupleHeaderGetXmax(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ return true;
+ }
+
+ if (tuple->t_infomask & HEAP_MOVED)
+ {
+ xid = HeapTupleHeaderGetXvac(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ return true;
+ }
+
+ return false;
+}
/* ----------------
* heap_markpos - mark scan position
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b197b45..bbf8b8d 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -117,6 +117,7 @@ static BufferAccessStrategy vac_strategy;
static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
Relation *Irel, int nindexes, bool scan_all);
static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
+static bool lazy_check_needs_freeze(Buffer buf);
static void lazy_vacuum_index(Relation indrel,
IndexBulkDeleteResult **stats,
LVRelStats *vacrelstats);
@@ -453,8 +454,6 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
vacuum_delay_point();
- vacrelstats->scanned_pages++;
-
/*
* If we are close to overrunning the available space for dead-tuple
* TIDs, pause and do a cycle of vacuuming before we tackle this page.
@@ -486,7 +485,41 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RBM_NORMAL, vac_strategy);
/* We need buffer cleanup lock so that we can prune HOT chains. */
- LockBufferForCleanup(buf);
+ if (!ConditionalLockBufferForCleanup(buf))
+ {
+ /*
+ * It's OK to skip vacuuming a page, as long as it's not got data
+ * that needs to be cleaned for wraparound avoidance.
+ */
+ if (!scan_all)
+ {
+ ReleaseBuffer(buf);
+ continue;
+ }
+
+ /*
+ * If this is a wraparound checking vacuum, then we read the page
+ * with share lock to see if any xids need to be frozen. If the
+ * page doesn't need attention we just skip and continue. If it
+ * does, we wait for cleanup lock.
+ *
+ * We could defer the lock request further by remembering the page
+ * and coming back to it later, or we could even register
+ * ourselves for multiple buffers and then service whichever one
+ * is received first. For now, this seems good enough.
+ */
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ if (!lazy_check_needs_freeze(buf))
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBufferForCleanup(buf);
+ /* drop through to normal processing */
+ }
+
+ vacrelstats->scanned_pages++;
page = BufferGetPage(buf);
@@ -932,7 +965,8 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vac_strategy);
- LockBufferForCleanup(buf);
+ if (!ConditionalLockBufferForCleanup(buf))
+ continue;
tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats);
/* Now that we've compacted the page, record its available space */
@@ -1010,6 +1044,50 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
}
/*
+ * lazy_check_needs_freeze() -- scan page to see if any tuples
+ * need to be cleaned to avoid wraparound
+ *
+ * Returns true if the page needs to be vacuumed using cleanup lock.
+ */
+static bool
+lazy_check_needs_freeze(Buffer buf)
+{
+ Page page;
+ OffsetNumber offnum,
+ maxoff;
+ HeapTupleHeader tupleheader;
+
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page) || PageIsEmpty(page))
+ {
+ /* PageIsNew probably shouldn't happen... */
+ return false;
+ }
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid;
+
+ itemid = PageGetItemId(page, offnum);
+
+ if (!ItemIdIsNormal(itemid))
+ continue;
+
+ tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ if (heap_tuple_needs_freeze(tupleheader, FreezeLimit, buf))
+ return true;
+ } /* scan along page */
+
+ return false;
+}
+
+
+/*
* lazy_vacuum_index() -- vacuum one index relation.
*
* Delete all the index entries pointing to tuples listed in
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 776ea5c..85cbeb3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -111,6 +111,8 @@ extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
extern void heap_inplace_update(Relation relation, HeapTuple tuple);
extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
Buffer buf);
+extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ Buffer buf);
extern Oid simple_heap_insert(Relation relation, HeapTuple tup);
extern void simple_heap_delete(Relation relation, ItemPointer tid);
On Fri, Nov 4, 2011 at 3:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Doing that actually makes the patch simpler, so I went ahead and did
that in the attached version, along with the renaming mentioned above.
Hearing no objections, I went ahead and committed this version.
Thanks for coding this up; I think this is a very useful improvement.
It would still be nice to fix the case where we need to freeze a tuple
that is on a page someone else has pinned, but I don't have any good
ideas for how to do that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 8, 2011 at 2:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
It would still be nice to fix the case where we need to freeze a tuple
that is on a page someone else has pinned, but I don't have any good
ideas for how to do that.
I think we need to avoid long pin hold times generally.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 8, 2011 at 2:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Tue, Nov 8, 2011 at 2:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
It would still be nice to fix the case where we need to freeze a tuple
that is on a page someone else has pinned, but I don't have any good
ideas for how to do that.
I think we need to avoid long pin hold times generally.
In the case of a suspended sequential scan, which is the case where
this has most recently bitten me on a production system, it actually
seems rather unnecessary to hold the pin for a long period of time.
If we release the buffer pin, then someone could vacuum the buffer. I
haven't looked in detail at the issues, but in theory that doesn't
seem like a huge problem: just remember which TIDs you've already
looked at and, when you re-acquire the buffer, pick up where you left
off. Any tuples that have been vacuumed away meanwhile weren't going
to be visible to your scan anyway.
But there's an efficiency argument against doing it that way. First,
if we release the pin then we'll have to reacquire the buffer, which
means taking and releasing a BufMappingLock, the buffer header
spinlock, and the buffer content lock. Second, instead of returning a
pointer to the data in the page, we'll have to copy the data out of
the buffer before releasing the pin.
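To make the cost side concrete, here is a minimal sketch of the copy-out
alternative; heap_copytuple(), ReleaseBuffer() and heap_freetuple() are real
backend calls, but the wrapper itself is invented for illustration, and the
per-tuple palloc/copy it does is exactly the overhead being weighed here:

    /* headers as of the 9.2-era tree */
    #include "postgres.h"
    #include "access/heapam.h"
    #include "storage/bufmgr.h"

    /*
     * Hypothetical helper: rather than handing the caller a pointer into
     * the shared buffer (which forces us to keep the pin until the next
     * fetch), copy the tuple into backend-local memory and drop the pin
     * immediately.
     */
    static HeapTuple
    copy_tuple_and_unpin(HeapTuple tuple, Buffer buf)
    {
        HeapTuple   copy;

        /* tuple->t_data still points into the pinned page here */
        copy = heap_copytuple(tuple);

        /* once the pin is gone, VACUUM can get its cleanup lock */
        ReleaseBuffer(buf);

        return copy;            /* caller must heap_freetuple() it later */
    }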
The situation is similar (perhaps even simpler) for index-only scans.
We could easily release the heap buffer pin after returning a tuple,
but it will make things much slower if the next heap fetch hits the
same page.
I wonder if we could arrange for a vacuum that's waiting for a cleanup
lock to signal the backends that could possibly be holding a
conflicting pin. Sort of like what the startup process does during
Hot Standby, except that instead of killing the people holding the
pins, we'd send them a signal that says "if at all possible, could you
please release those buffer pins right away?", and then the backends
would try to comply. Actually making that work though seems a bit
tricky, though, and getting it wrong would mean very, very rare,
nearly unreproducible bugs.
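As a rough illustration only, the sending side might look like the sketch
below. SendProcSignal() and PROCSIG_RECOVERY_CONFLICT_BUFFERPIN are the
existing Hot Standby machinery; a real version would need a new, purely
advisory reason code plus receiving-side handling, none of which exists,
and working out which backends actually hold the pin is glossed over
entirely:

    #include "postgres.h"
    #include "storage/procsignal.h"

    /*
     * Sketch: nudge one backend that we believe holds a conflicting pin.
     * Reusing the recovery-conflict reason here is only a placeholder --
     * that path cancels queries, whereas the idea above is a best-effort
     * "please drop the pin if you conveniently can" request.
     */
    static void
    nudge_pin_holder(pid_t pid, BackendId backendId)
    {
        (void) SendProcSignal(pid, PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
                              backendId);
    }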
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 8, 2011 at 1:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
But there's an efficiency argument against doing it that way. First,
if we release the pin then we'll have to reacquire the buffer, which
means taking and releasing a BufMappingLock, the buffer header
spinlock, and the buffer content lock. Second, instead of returning a
pointer to the data in the page, we'll have to copy the data out of
the buffer before releasing the pin.
The only way I can see this working is to optimise this in the
planner, so that when we have a nested loop within a loop, we avoid
having the row on the outer loop pinned while we perform the inner
loop.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 8, 2011 at 10:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Tue, Nov 8, 2011 at 1:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
But there's an efficiency argument against doing it that way. First,
if we release the pin then we'll have to reacquire the buffer, which
means taking and releasing a BufMappingLock, the buffer header
spinlock, and the buffer content lock. Second, instead of returning a
pointer to the data in the page, we'll have to copy the data out of
the buffer before releasing the pin.
The only way I can see this working is to optimise this in the
planner, so that when we have a nested loop within a loop, we avoid
having the row on the outer loop pinned while we perform the inner
loop.
Hmm. I've actually never run into a problem that involved that
particular situation.
In any case, I think the issues are basically the same: keeping the
pin improves performance; dropping it helps VACUUM. Both are
important.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Nov 8, 2011 at 2:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
I think we need to avoid long pin hold times generally.
In the case of a suspended sequential scan, which is the case where
this has most recently bitten me on a production system, it actually
seems rather unnecessary to hold the pin for a long period of time.
If we release the buffer pin, then someone could vacuum the buffer.
This seems unlikely to be a productive line of thought. The only way
you could release buffer pin is if you first copied all the tuples you
need out of the page, and that seems like an unacceptable performance
hit. We should not be penalizing foreground query operations for the
benefit of background maintenance like VACUUM. (The fact that we do
an equivalent thing in btree index scans isn't an argument for doing
it here, because the tradeoffs are very different. In the index case,
the amount of data to be copied is a great deal less; the length of
time the lock would have to be held is often a great deal more; and
releasing the lock quickly gives a performance benefit for other
foreground operations, not only background maintenance.)
It strikes me that the only case where vacuum now has to wait is where
it needs to freeze an old XID. Couldn't it do that without insisting on
exclusive access? We only need exclusive access if we're going to move
data around, but we could have a code path in vacuum that just replaces
old XIDs with FrozenXID without moving/deleting anything.
regards, tom lane
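A minimal sketch of what such a freeze-only path might look like, assuming
an ordinary exclusive content lock suffices because no line pointers move
(which is precisely the open question), and ignoring WAL logging and the
dead-tuple complication discussed below; heap_freeze_tuple() and the page
macros are real, the function itself is invented:

    /* headers as of the 9.2-era tree */
    #include "postgres.h"
    #include "access/heapam.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"

    /*
     * Hypothetical freeze-only pass over a page we could not get a cleanup
     * lock on: nothing is moved or deleted, only XID fields are replaced
     * in place.  WAL logging (log_heap_freeze) is left to the caller.
     */
    static void
    freeze_page_in_place(Buffer buf, TransactionId cutoff_xid)
    {
        Page        page = BufferGetPage(buf);
        OffsetNumber offnum,
                    maxoff;
        int         nfrozen = 0;

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

        maxoff = PageGetMaxOffsetNumber(page);
        for (offnum = FirstOffsetNumber;
             offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            ItemId      itemid = PageGetItemId(page, offnum);
            HeapTupleHeader tuple;

            if (!ItemIdIsNormal(itemid))
                continue;

            tuple = (HeapTupleHeader) PageGetItem(page, itemid);

            /* InvalidBuffer: we already hold the lock ourselves */
            if (heap_freeze_tuple(tuple, cutoff_xid, InvalidBuffer))
                nfrozen++;
        }

        if (nfrozen > 0)
            MarkBufferDirty(buf);   /* a real version would also WAL-log */

        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    }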
On Tue, Nov 8, 2011 at 10:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It strikes me that the only case where vacuum now has to wait is where
it needs to freeze an old XID. Couldn't it do that without insisting on
exclusive access? We only need exclusive access if we're going to move
data around, but we could have a code path in vacuum that just replaces
old XIDs with FrozenXID without moving/deleting anything.
Interesting idea. I think in general we insist that you must have a
buffer content lock to inspect the tuple visibility info, in which
case that would be safe. But I'm not sure we do that absolutely
everywhere. For instance, just last night I noticed this:
    /*
     * If xmin isn't what we're expecting, the slot must have been
     * recycled and reused for an unrelated tuple.  This implies that
     * the latest version of the row was deleted, so we need do
     * nothing.  (Should be safe to examine xmin without getting
     * buffer's content lock, since xmin never changes in an existing
     * tuple.)
     */
    if (!TransactionIdEquals(HeapTupleHeaderGetXmin(tuple.t_data),
                             priorXmax))
    {
        ReleaseBuffer(buffer);
        return NULL;
    }
Maybe we can convince ourselves that that case is OK, or fixable; I'm
not sure whether there are any others.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Interesting idea. I think in general we insist that you must have a
buffer content lock to inspect the tuple visibility info, in which
case that would be safe. But I'm not sure we do that absolutely
everywhere. For instance, just last night I noticed this:
    /*
     * If xmin isn't what we're expecting, the slot must have been
     * recycled and reused for an unrelated tuple.  This implies that
     * the latest version of the row was deleted, so we need do
     * nothing.  (Should be safe to examine xmin without getting
     * buffer's content lock, since xmin never changes in an existing
     * tuple.)
     */
    if
Hmm ... I think that code is OK but the comment needs work. Here we are
necessarily looking for a pretty recent value of xmin (it has to be
later than GlobalXmin), so there's no need to worry that it might get
changed to FrozenXID.
regards, tom lane
On Tue, Nov 8, 2011 at 10:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Interesting idea. I think in general we insist that you must have a
buffer content lock to inspect the tuple visibility info, in which
case that would be safe. But I'm not sure we do that absolutely
everywhere. For instance, just last night I noticed this:
    /*
     * If xmin isn't what we're expecting, the slot must have been
     * recycled and reused for an unrelated tuple.  This implies that
     * the latest version of the row was deleted, so we need do
     * nothing.  (Should be safe to examine xmin without getting
     * buffer's content lock, since xmin never changes in an existing
     * tuple.)
     */
    if
Hmm ... I think that code is OK but the comment needs work. Here we are
necessarily looking for a pretty recent value of xmin (it has to be
later than GlobalXmin), so there's no need to worry that it might get
changed to FrozenXID.
OK. Here's another possible concern: what happens if the page we're
freezing contains a dead tuple? It looks to me like
heap_freeze_tuple() is written so as not to require a cleanup lock -
indeed, the comments claim it's called when holding only a share lock
on the buffer, which doesn't appear to match what lazy_scan_heap() is
actually doing. But it does seem to assume that any tuples that still
exist are all-visible, which only works if vacuum has already pruned
the page.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 8, 2011 at 3:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Nov 8, 2011 at 2:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
I think we need to avoid long pin hold times generally.
In the case of a suspended sequential scan, which is the case where
this has most recently bitten me on a production system, it actually
seems rather unnecessary to hold the pin for a long period of time.
If we release the buffer pin, then someone could vacuum the buffer.
This seems unlikely to be a productive line of thought. The only way
you could release buffer pin is if you first copied all the tuples you
need out of the page, and that seems like an unacceptable performance
hit. We should not be penalizing foreground query operations for the
benefit of background maintenance like VACUUM. (The fact that we do
an equivalent thing in btree index scans isn't an argument for doing
it here, because the tradeoffs are very different. In the index case,
the amount of data to be copied is a great deal less; the length of
time the lock would have to be held is often a great deal more; and
releasing the lock quickly gives a performance benefit for other
foreground operations, not only background maintenance.)
It strikes me that the only case where vacuum now has to wait is where
it needs to freeze an old XID. Couldn't it do that without insisting on
exclusive access? We only need exclusive access if we're going to move
data around, but we could have a code path in vacuum that just replaces
old XIDs with FrozenXID without moving/deleting anything.
Holding buffer pins for a long time is a problem in Hot Standby also,
not just vacuum.
AFAIK seq scans already work page at a time for normal tables. So the
issue is when we *aren't* using a seq scan, e.g. nested loops joins.
Is there a way to solve that?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 9, 2011 at 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Holding buffer pins for a long time is a problem in Hot Standby also,
not just vacuum.
Agreed.
AFAIK seq scans already work page at a time for normal tables. So the
issue is when we *aren't* using a seq scan, e.g. nested loops joins.
Is there a way to solve that?
Well, I'm not sure of the details of how page-at-a-time mode works for
seq scans, but I am absolutely 100% sure that you can reproduce this
problem using a cursor over a sequential scan. Just do this:
create table test (a text);
insert into test values ('aaa'), ('bbb');
delete from test where a = 'aaa';
begin;
declare x cursor for select * from test;
fetch next from x;
Then switch to another session and run "VACUUM test". Prior to commit
bbb6e559c4ea0fb4c346beda76736451dc24eb4e, this would hang. Now, it
doesn't. But "VACUUM FREEZE test" still does.
As for what to do about all this, I think Tom's idea would work for
good tuples, but the current freezing code can't handle dead tuples;
it counts on those having been already removed. I wonder if we could
just set xmin = InvalidTransactionId and set HEAP_XMIN_INVALID, or
something like that. I'm worried that there might be code out there
that thinks InvalidTransactionId can never appear in a real tuple.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 9, 2011 at 9:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Well, I'm not sure of the details of how page-at-a-time mode works for
seq scans, but I am absolutely 100% sure that you can reproduce this
problem using a cursor over a sequential scan. Just do this:
create table test (a text);
insert into test values ('aaa'), ('bbb');
delete from test where a = 'aaa';
begin;
declare x cursor for select * from test;
fetch next from x;
That's a bug. heapam.c line 1202 says
/*
* we can use page-at-a-time mode if it's an MVCC-safe snapshot
*/
scan->rs_pageatatime = IsMVCCSnapshot(snapshot);
So either the comment or the code is wrong.
Can't see where, as yet.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 9, 2011 at 9:48 PM, simon <simon@2ndQuadrant.com> wrote:
On Wed, Nov 9, 2011 at 9:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Well, I'm not sure of the details of how page-at-a-time mode works for
seq scans, but I am absolutely 100% sure that you can reproduce this
problem using a cursor over a sequential scan. Just do this:
create table test (a text);
insert into test values ('aaa'), ('bbb');
delete from test where a = 'aaa';
begin;
declare x cursor for select * from test;
fetch next from x;
That's a bug.
No, I'm wrong. It's not a bug at all.
heapgetpage() gets a page and a pin, but holds the pin until it reads
the next page. Wow!
That is both annoying and very dumb. It should hold the pin long
enough to copy the data and then release the pin.
It looks inefficient from a memory access viewpoint and from a pin
longevity viewpoint. If we copied out relevant data it would be more
cache efficient without affecting visibility. Looking at a patch.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Robert Haas <robertmhaas@gmail.com> writes:
As for what to do about all this, I think Tom's idea would work for
good tuples, but the current freezing code can't handle dead tuples;
it counts on those having been already removed.
I have not gone back to look at the code, but are you worried about the
fact that it doesn't consider replacing xmax with FrozenTransactionId?
Surely we could do that if we wanted. It just never seemed necessary
before --- but if vacuum is to be allowed to punt repeatedly on the same
page, maybe we do need to cover the case.
regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes:
heapgetpage() gets a page and a pin, but holds the pin until it reads
the next page. Wow!
That is both annoying and very dumb. It should hold the pin long
enough to copy the data and then release the pin.
I don't find that anywhere near as obvious as you seem to. I think you
are trying to optimize for the wrong set of conditions.
I will also note that the behavior of holding pin for as long as we are
stopped on a particular tuple is not specific to seqscans.
regards, tom lane
On Wed, Nov 9, 2011 at 10:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
heapgetpage() gets a page and a pin, but holds the pin until it reads
the next page. Wow!
That is both annoying and very dumb. It should hold the pin long
enough to copy the data and then release the pin.
I don't find that anywhere near as obvious as you seem to. I think you
are trying to optimize for the wrong set of conditions.
ISTM we should optimise to access the cachelines in the buffer once.
Holding a pin and re-accessing the buffer via main memory seems like a
pretty bad plan to me. Which conditions are being optimised by doing that?
I will also note that the behavior of holding pin for as long as we are
stopped on a particular tuple is not specific to seqscans.
Agreed. Bad things may happen in more than one place.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 9, 2011 at 5:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
As for what to do about all this, I think Tom's idea would work for
good tuples, but the current freezing code can't handle dead tuples;
it counts on those having been already removed.
I have not gone back to look at the code, but are you worried about the
fact that it doesn't consider replacing xmax with FrozenTransactionId?
Surely we could do that if we wanted. It just never seemed necessary
before --- but if vacuum is to be allowed to punt repeatedly on the same
page, maybe we do need to cover the case.
No, I'm worried about the fact that that it does this:
    xid = HeapTupleHeaderGetXmin(tuple);
    if (TransactionIdIsNormal(xid) &&
        TransactionIdPrecedes(xid, cutoff_xid))
    {
        if (buf != InvalidBuffer)
        {
            /* trade in share lock for exclusive lock */
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
            buf = InvalidBuffer;
        }
        HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
Note that the ONLY thing we're checking with respect to the tuple xmin
is that it's a normal XID that precedes the cutoff. We are not
checking whether it's committed. So there had better not be any
tuples left on the page that are from inserting transactions that
aborted, or this is going to go boom.
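For illustration, a guarded variant might look like the following sketch;
TransactionIdDidCommit(), HEAP_XMIN_INVALID and the Get/Set macros are real,
but the function and the policy of stamping aborted xmins invalid are just
the idea under discussion, and whether other code copes with
InvalidTransactionId in a live page is unverified:

    /* headers as of the 9.2-era tree */
    #include "postgres.h"
    #include "access/htup.h"
    #include "access/transam.h"

    /*
     * Hypothetical: freeze one tuple's xmin even if the inserter aborted.
     * A committed old xmin becomes FrozenTransactionId as today; an
     * aborted one is stamped invalid instead of relying on the page
     * having been pruned first.
     */
    static void
    freeze_xmin_even_if_dead(HeapTupleHeader tuple, TransactionId cutoff_xid)
    {
        TransactionId xid = HeapTupleHeaderGetXmin(tuple);

        if (!TransactionIdIsNormal(xid) ||
            !TransactionIdPrecedes(xid, cutoff_xid))
            return;             /* nothing old enough to touch */

        if (TransactionIdDidCommit(xid))
            HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
        else
        {
            /* inserter aborted or crashed: the tuple is dead, say so */
            HeapTupleHeaderSetXmin(tuple, InvalidTransactionId);
            tuple->t_infomask |= HEAP_XMIN_INVALID;
        }
    }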
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 9, 2011 at 6:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Wed, Nov 9, 2011 at 10:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
heapgetpage() gets a page and a pin, but holds the pin until it reads
the next page. Wow!
That is both annoying and very dumb. It should hold the pin long
enough to copy the data and then release the pin.
I don't find that anywhere near as obvious as you seem to. I think you
are trying to optimize for the wrong set of conditions.
ISTM we should optimise to access the cachelines in the buffer once.
Holding a pin and re-accessing the buffer via main memory seems like a
pretty bad plan to me. Which conditions are being optimised by doing that?
I believe it reduces memory copying. But we can certainly test some
alternative you may have in mind and see how it shakes out.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Zombie thread arise!
I was searching for old threads on a specific problem and came across
this patch that was dropped due to concerns about SnapshotNow:
On Wed, Nov 2, 2011 at 3:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jun 7, 2011 at 3:24 PM, Greg Stark <gsstark@mit.edu> wrote:
Well it's super-exclusive-vacuum-lock avoidance techniques. Why
shouldn't it make more sense to try to reduce the frequency and impact
of the single-purpose outlier in a non-critical-path instead of
burdening every other data reader with extra overhead?
I think Robert's plan is exactly right though I would phrase it
differently. We should get the exclusive lock, freeze/kill any xids
and line pointers, then if the pin-count is 1 do the compaction.
I wrote a really neat patch to do this today... and then, as I
thought about it some more, I started to think that it's probably
unsafe. Here's the trouble: with this approach, we assume that it's
OK to change the contents of the line pointer while holding only an
exclusive lock on the buffer. But there is a good deal of code out
there that thinks it's OK to examine a line pointer with only a pin on
the buffer (no lock). You need a content lock to scan the item
pointer array, but once you've identified an item of interest, you're
entitled to assume that it won't be modified while you hold a buffer
pin. Now, if you've identified a particular tuple as being visible to
your scan, then you might think that VACUUM shouldn't be removing it
anyway. But I think that's only true for MVCC scans - for example,
what happens under SnapshotNow semantics?
Well now that you've done all that amazing work eliminating
SnapshotNow maybe this patch deserves another look? I see
anti-wraparound vacuums more and more often as database transaction
velocity increases and vacuum takes longer and longer as database
sizes increase. Having a faster vacuum that can skip vacuuming pages
but is still guaranteed to freeze every page is pretty tempting.
Of course the freeze epoch in the header obviating the need for
freezing is an even more attractive goal but AIUI we're not even sure
that's going to work and this is a nice easy win.
But then, on third
thought, if you've also got an MVCC snapshot taken before the start of
the SnapshotNow scan, you are probably OK, because your advertised
xmin should prevent anyone from removing anything anyway, so how do
you actually provoke a failure?
Anyway, I'm attaching the patch, in case anyone has any ideas on where
to go with this.
I'm really wishing we had more bits in the vm. It looks like we could use:
- contains not-all-visible tuples
- contains not-frozen xids
- in need of compaction
Another idea would be to store the upper 2-4 bits of the oldest xid in
the page. That would let you determine whether it's in need of an
anti-wraparound vacuum.
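Purely as an illustration of those two ideas (none of the names below exist
anywhere in PostgreSQL): with, say, a byte of visibility-map space per heap
page instead of a single bit, the map could carry the three states listed
above plus a coarse bucket of the page's oldest XID, so an anti-wraparound
scan could skip pages whose bucket is already past the freeze cutoff. The
comparison would have to be wraparound-aware, which is glossed over here:

    /* all names below are invented for illustration */
    #define VM_HAS_NOT_ALL_VISIBLE   0x01   /* some tuples not visible to everyone */
    #define VM_HAS_UNFROZEN_XIDS     0x02   /* some XIDs not yet frozen */
    #define VM_NEEDS_COMPACTION      0x04   /* page would benefit from pruning */

    /* top 3 bits of a 32-bit XID -> bucket 0..7 */
    #define XID_BUCKET(xid)          ((uint8) ((uint32) (xid) >> 29))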
--
greg
On Mon, Jun 30, 2014 at 7:55 PM, Greg Stark <stark@mit.edu> wrote:
On Wed, Nov 2, 2011 at 3:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jun 7, 2011 at 3:24 PM, Greg Stark <gsstark@mit.edu> wrote:
Well it's super-exclusive-vacuum-lock avoidance techniques. Why
shouldn't it make more sense to try to reduce the frequency and impact
of the single-purpose outlier in a non-critical-path instead of
burdening every other data reader with extra overhead?
I think Robert's plan is exactly right though I would phrase it
differently. We should get the exclusive lock, freeze/kill any xids
and line pointers, then if the pin-count is 1 do the compaction.
I wrote a really neat patch to do this today... and then, as I
thought about it some more, I started to think that it's probably
unsafe. Here's the trouble: with this approach, we assume that it's
OK to change the contents of the line pointer while holding only an
exclusive lock on the buffer. But there is a good deal of code out
there that thinks it's OK to examine a line pointer with only a pin on
the buffer (no lock). You need a content lock to scan the item
pointer array, but once you've identified an item of interest, you're
entitled to assume that it won't be modified while you hold a buffer
pin. Now, if you've identified a particular tuple as being visible to
your scan, then you might think that VACUUM shouldn't be removing it
anyway. But I think that's only true for MVCC scans - for example,
what happens under SnapshotNow semantics?
Well now that you've done all that amazing work eliminating
SnapshotNow ...
Thank you. :-)
... maybe this patch deserves another look? I see
anti-wraparound vacuums more and more often as database transaction
velocity increases and vacuum takes longer and longer as database
sizes increase. Having a faster vacuum that can skip vacuuming pages
but is still guaranteed to freeze every page is pretty tempting.
Of course the freeze epoch in the header obviating the need for
freezing is an even more attractive goal but AIUI we're not even sure
that's going to work and this is a nice easy win.
Well, there's still SnapshotSelf, SnapshotAny, SnapshotDirty, ...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company