Removing more vacuumlazy.c special cases, relfrozenxid optimizations

Started by Peter Geoghegan, over 4 years ago, 139 messages, pgsql-hackers
Attached WIP patch series significantly simplifies the definition of
scanned_pages inside vacuumlazy.c. Apart from making several very
tricky things a lot simpler, and moving more complex code outside of
the big "blkno" loop inside lazy_scan_heap (building on the Postgres
14 work), this refactoring directly facilitates 2 new optimizations
(also in the patch):

1. We now collect LP_DEAD items into the dead_tuples array for all
scanned pages -- even when we cannot get a cleanup lock.

2. We now don't give up on advancing relfrozenxid during a
non-aggressive VACUUM when we happen to be unable to get a cleanup
lock on a heap page.

Both optimizations are much more natural with the refactoring in
place. Especially #2, which can be thought of as making aggressive and
non-aggressive VACUUM behave similarly. Sure, we shouldn't wait for a
cleanup lock in a non-aggressive VACUUM (by definition) -- and we
still don't in the patch (obviously). But why wouldn't we at least
*check* if the page has tuples that need to be frozen in order for us
to advance relfrozenxid? Why give up on advancing relfrozenxid in a
non-aggressive VACUUM when there's no good reason to?

See the draft commit messages from the patch series for many more
details on the simplifications I am proposing.

I'm not sure how much value the second optimization has on its own.
But I am sure that the general idea of teaching non-aggressive VACUUM
to be conscious of the value of advancing relfrozenxid is a good one
-- and so #2 is a good start on that work, at least. I've discussed
this idea with Andres (CC'd) a few times before now. Maybe we'll need
another patch that makes VACUUM avoid setting heap pages to
all-visible without also setting them to all-frozen (and freezing as
necessary) in order to really get a benefit. Since, of course, a
non-aggressive VACUUM still won't be able to advance relfrozenxid when
it skipped over all-visible pages that are not also known to be
all-frozen.

Masahiko (CC'd) has expressed interest in working on opportunistic
freezing. This refactoring patch seems related to that general area,
too. At a high level, to me, this seems like the tuple freezing
equivalent of the Postgres 14 work on bypassing index vacuuming when
there are very few LP_DEAD items (interpret that as 0 LP_DEAD items,
which is close to the truth anyway). There are probably quite a few
interesting opportunities to make VACUUM better by not having such a
sharp distinction between aggressive and non-aggressive VACUUM. Why
should they be so different? A good medium term goal might be to
completely eliminate aggressive VACUUMs.

I have heard many stories about anti-wraparound/aggressive VACUUMs
where the cure (which suddenly made autovacuum workers
non-cancellable) was worse than the disease (not actually much danger
of wraparound failure). For example:

https://www.joyent.com/blog/manta-postmortem-7-27-2015

Yes, this problem report is from 2015, which is before we even had the
freeze map stuff. I still think that the point about aggressive
VACUUMs blocking DDL (leading to chaos) remains valid.

There is another interesting area of future optimization within
VACUUM, that also seems relevant to this patch: the general idea of
*avoiding* pruning during VACUUM, when it just doesn't make sense to
do so -- better to avoid dirtying the page for now. Needlessly pruning
inside lazy_scan_prune is hardly rare -- standard pgbench (maybe only
with heap fill factor reduced to 95) will have autovacuums that
*constantly* do it (granted, it may not matter so much there because
VACUUM is unlikely to re-dirty the page anyway). This patch seems
relevant to that area because it recognizes that pruning during VACUUM
is not necessarily special -- a new function called lazy_scan_noprune
may be used instead of lazy_scan_prune (though only when a cleanup
lock cannot be acquired). These pages are nevertheless considered
fully processed by VACUUM (this is perhaps 99% true, so it seems
reasonable to round up to 100% true).

I find it easy to imagine generalizing the same basic idea --
recognizing more ways in which pruning by VACUUM isn't necessarily
better than opportunistic pruning, at the level of each heap page. Of
course we *need* to prune sometimes (e.g., might be necessary to do so
to set the page all-visible in the visibility map), but why bother
when we don't, and when there is no reason to think that it'll help
anyway? Something to think about, at least.

--
Peter Geoghegan

Attachments:

v1-0002-Improve-log_autovacuum_min_duration-output.patch (+47 -9)
v1-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch (+500 -302)
#2 Andres Freund <andres@anarazel.de>
In reply to: Peter Geoghegan (#1)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

Hi,

On 2021-11-21 18:13:51 -0800, Peter Geoghegan wrote:

> I have heard many stories about anti-wraparound/aggressive VACUUMs
> where the cure (which suddenly made autovacuum workers
> non-cancellable) was worse than the disease (not actually much danger
> of wraparound failure). For example:
>
> https://www.joyent.com/blog/manta-postmortem-7-27-2015
>
> Yes, this problem report is from 2015, which is before we even had the
> freeze map stuff. I still think that the point about aggressive
> VACUUMs blocking DDL (leading to chaos) remains valid.

As I noted below, I think this is a bit of a separate issue than what your
changes address in this patch.

> There is another interesting area of future optimization within
> VACUUM, that also seems relevant to this patch: the general idea of
> *avoiding* pruning during VACUUM, when it just doesn't make sense to
> do so -- better to avoid dirtying the page for now. Needlessly pruning
> inside lazy_scan_prune is hardly rare -- standard pgbench (maybe only
> with heap fill factor reduced to 95) will have autovacuums that
> *constantly* do it (granted, it may not matter so much there because
> VACUUM is unlikely to re-dirty the page anyway).

Hm. I'm a bit doubtful that there's all that many cases where it's worth not
pruning during vacuum. However, it seems much more common for opportunistic
pruning during non-write accesses.

Perhaps checking whether we'd log an FPW would be a better criterion for
deciding whether to prune or not compared to whether we're dirtying the page?
IME the WAL volume impact of FPWs is a considerably bigger deal than
unnecessarily dirtying a page that has previously been dirtied in the same
checkpoint "cycle".

> This patch seems relevant to that area because it recognizes that pruning
> during VACUUM is not necessarily special -- a new function called
> lazy_scan_noprune may be used instead of lazy_scan_prune (though only when a
> cleanup lock cannot be acquired). These pages are nevertheless considered
> fully processed by VACUUM (this is perhaps 99% true, so it seems reasonable
> to round up to 100% true).

IDK, the potential of not having usable space on an overly fragmented page
doesn't seem that low. We can't just mark such pages as all-visible because
then we'll potentially never reclaim that space.

> Since any VACUUM (not just an aggressive VACUUM) can sometimes advance
> relfrozenxid, we now make non-aggressive VACUUMs work just a little
> harder in order to make that desirable outcome more likely in practice.
> Aggressive VACUUMs have long checked contended pages with only a shared
> lock, to avoid needlessly waiting on a cleanup lock (in the common case
> where the contended page has no tuples that need to be frozen anyway).
> We still don't make non-aggressive VACUUMs wait for a cleanup lock, of
> course -- if we did that they'd no longer be non-aggressive.

IMO the big difference between aggressive / non-aggressive isn't whether we
wait for a cleanup lock, but that we don't skip all-visible pages...

> But we now make the non-aggressive case notice that a failure to acquire a
> cleanup lock on one particular heap page does not in itself make it unsafe
> to advance relfrozenxid for the whole relation (which is what we usually see
> in the aggressive case already).

> This new relfrozenxid optimization might not be all that valuable on its
> own, but it may still facilitate future work that makes non-aggressive
> VACUUMs more conscious of the benefit of advancing relfrozenxid sooner
> rather than later. In general it would be useful for non-aggressive
> VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for
> a cleanup lock once or twice if needed).

What do you mean by "waiting once or twice"? A single wait may simply never
end on a busy page that's constantly pinned by a lot of backends...

> It would also be generally useful if aggressive VACUUMs were "less
> aggressive" opportunistically (e.g. by being responsive to query
> cancellations when the risk of wraparound failure is still very low).

Being cancelable is already a different concept than anti-wraparound
vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but
anti-wrap only at autovacuum_freeze_max_age. The problem is that the
autovacuum scheduling is way too naive for that to be a significant benefit -
nothing tries to schedule autovacuums so that they have a chance to complete
before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.

This is one of the most embarrassing issues around the whole anti-wrap
topic. We kind of define it as an emergency that there's an anti-wraparound
vacuum. But we have *absolutely no mechanism* to prevent them from occurring.

> We now also collect LP_DEAD items in the dead_tuples array in the case
> where we cannot immediately get a cleanup lock on the buffer. We cannot
> prune without a cleanup lock, but opportunistic pruning may well have
> left some LP_DEAD items behind in the past -- no reason to miss those.

This has become *much* more important with the changes around deciding when to
index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
items, it's that a previous vacuum is quite likely to have left them there,
because the previous vacuum decided not to perform index cleanup.

> Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
> technique is independently capable of cleaning up line pointer bloat),

One thing we could do around this, btw, would be to aggressively replace
LP_REDIRECT items with their target item. We can't do that in all situations
(somebody might be following a ctid chain), but I think we have all the
information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
state or something like that.

I think that'd be quite a win - we right now often "migrate" to other pages
for modifications not because we're out of space on a page, but because we run
out of itemids (for debatable reasons MaxHeapTuplesPerPage constrains the
number of line pointers, not just the number of actual tuples). Effectively
doubling the number of available line pointers in a number of
realistic / common scenarios would be quite the win.

> Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
> since there is barely any real practical sense in which we actually
> miss doing useful work for these pages. Besides, this information
> always seemed to have little practical value, even to Postgres hackers.

-0.5. I think it provides some value, and I don't see why the removal of the
information should be tied to this change. It's hard to diagnose why some dead
tuples aren't cleaned up - a common cause for that on smaller tables is that
nearly all pages are pinned nearly all the time.

I wonder if we could have a more restrained version of heap_page_prune() that
doesn't require a cleanup lock? Obviously we couldn't defragment the page, but
it's not immediately obvious that we need it if we constrain ourselves to only
modify tuple versions that cannot be visible to anybody.

Random note: I really dislike that we talk about cleanup locks in some parts
of the code, and super-exclusive locks in others :(.

+	/*
+	 * Aggressive VACUUM (which is the same thing as anti-wraparound
+	 * autovacuum for most practical purposes) exists so that we'll reliably
+	 * advance relfrozenxid and relminmxid sooner or later.  But we can often
+	 * opportunistically advance them even in a non-aggressive VACUUM.
+	 * Consider if that's possible now.

I don't agree with the "most practical purposes" bit. There's a huge
difference because manual VACUUMs end up aggressive but not anti-wrap once
older than vacuum_freeze_table_age.

+	 * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+	 * the rel_pages used by lazy_scan_prune, from before a possible relation
+	 * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
+	 */

I think it should be doable to add an isolation test for this path. There have
been quite a few bugs around the wider topic...

+	if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
+		!vacrel->freeze_cutoffs_valid)
+	{
+		/* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
+		Assert(!aggressive);
+		vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+							new_rel_allvisible, vacrel->nindexes > 0,
+							InvalidTransactionId, InvalidMultiXactId, false);
+	}
+	else
+	{
+		/* Can safely advance relfrozen and relminmxid, too */
+		Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+			   orig_rel_pages);
+		vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+							new_rel_allvisible, vacrel->nindexes > 0,
+							FreezeLimit, MultiXactCutoff, false);
+	}

I wonder if this whole logic wouldn't become easier and less fragile if we
just went for maintaining the "actually observed" horizon while scanning the
relation. If we skip a page via VM set the horizon to invalid. Otherwise we
can keep track of the accurate horizon and use that. No need to count pages
and stuff.

@@ -1050,18 +1046,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
bool all_visible_according_to_vm = false;
LVPagePruneState prunestate;

- /*
- * Consider need to skip blocks. See note above about forcing
- * scanning of last page.
- */
-#define FORCE_CHECK_PAGE() \
- (blkno == nblocks - 1 && should_attempt_truncation(vacrel))
-
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);

update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
blkno, InvalidOffsetNumber);

+		/*
+		 * Consider need to skip blocks
+		 */
if (blkno == next_unskippable_block)
{
/* Time to advance next_unskippable_block */
@@ -1110,13 +1102,19 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
else
{
/*
-			 * The current block is potentially skippable; if we've seen a
-			 * long enough run of skippable blocks to justify skipping it, and
-			 * we're not forced to check it, then go ahead and skip.
-			 * Otherwise, the page must be at least all-visible if not
-			 * all-frozen, so we can set all_visible_according_to_vm = true.
+			 * The current block can be skipped if we've seen a long enough
+			 * run of skippable blocks to justify skipping it.
+			 *
+			 * There is an exception: we will scan the table's last page to
+			 * determine whether it has tuples or not, even if it would
+			 * otherwise be skipped (unless it's clearly not worth trying to
+			 * truncate the table).  This avoids having lazy_truncate_heap()
+			 * take access-exclusive lock on the table to attempt a truncation
+			 * that just fails immediately because there are tuples in the
+			 * last page.
*/
-			if (skipping_blocks && !FORCE_CHECK_PAGE())
+			if (skipping_blocks &&
+				!(blkno == nblocks - 1 && should_attempt_truncation(vacrel)))
{
/*
* Tricky, tricky.  If this is in aggressive vacuum, the page

I find the FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like
mixing such changes within a larger change doing many other things.

@@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)

buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vacrel->bstrategy);
+		page = BufferGetPage(buf);
+		vacrel->scanned_pages++;

I don't particularly like doing BufferGetPage() before holding a lock on the
page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
be good to have a crosscheck that BufferGetPage() is only allowed when holding
a page level lock.

/*
-		 * We need buffer cleanup lock so that we can prune HOT chains and
-		 * defragment the page.
+		 * We need a buffer cleanup lock to prune HOT chains and defragment
+		 * the page in lazy_scan_prune.  But when it's not possible to acquire
+		 * a cleanup lock right away, we may be able to settle for reduced
+		 * processing in lazy_scan_noprune.
*/

s/in lazy_scan_noprune/via lazy_scan_noprune/?

if (!ConditionalLockBufferForCleanup(buf))
{
bool hastup;

-			/*
-			 * If we're not performing an aggressive scan to guard against XID
-			 * wraparound, and we don't want to forcibly check the page, then
-			 * it's OK to skip vacuuming pages we get a lock conflict on. They
-			 * will be dealt with in some future vacuum.
-			 */
-			if (!aggressive && !FORCE_CHECK_PAGE())
+			LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+			/* Check for new or empty pages before lazy_scan_noprune call */
+			if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
+									   vmbuffer))
{
-				ReleaseBuffer(buf);
-				vacrel->pinskipped_pages++;
+				/* Lock and pin released for us */
+				continue;
+			}

Why isn't this done in lazy_scan_noprune()?

+			if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
+			{
+				/* No need to wait for cleanup lock for this page */
+				UnlockReleaseBuffer(buf);
+				if (hastup)
+					vacrel->nonempty_pages = blkno + 1;
continue;
}

Do we really need all of buf, blkno, page for both of these functions? Quite
possible that yes; if so, could we add an assertion that
BufferGetBlockNumber(buf) == blkno?

+		/* Check for new or empty pages before lazy_scan_prune call */
+		if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
{

Maybe worth a note mentioning that we need to redo this even in the aggressive
case, because we didn't continually hold a lock on the page?

+/*
+ * Empty pages are not really a special case -- they're just heap pages that
+ * have no allocated tuples (including even LP_UNUSED items).  You might
+ * wonder why we need to handle them here all the same.  It's only necessary
+ * because of a rare corner-case involving a hard crash during heap relation
+ * extension.  If we ever make relation-extension crash safe, then it should
+ * no longer be necessary to deal with empty pages here (or new pages, for
+ * that matter).

I don't think it's actually that rare - the window for this is huge. You just
need to crash / immediate shutdown at any time between the relation having
been extended and the new page contents being written out (checkpoint or
buffer replacement / ring writeout). That's often many minutes.

I don't really see that as a realistic thing to ever reliably avoid, FWIW. I
think the overhead would be prohibitive. We'd need to do synchronous WAL
logging while holding the extension lock I think. Um, not fun.

+ * Caller can either hold a buffer cleanup lock on the buffer, or a simple
+ * shared lock.
+ */

Kinda sounds like it'd be incorrect to call this with an exclusive lock, which
made me wonder why that could be true. Perhaps just say that it needs to be
called with at least a shared lock?

+static bool
+lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
+					   Page page, bool sharelock, Buffer vmbuffer)

It'd be good to document the return value - for me it's not a case where it's
so obvious that it's not worth it.

+/*
+ *	lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
+ *
+ * Caller need only hold a pin and share lock on the buffer, unlike
+ * lazy_scan_prune, which requires a full cleanup lock.

I'd add somethign like "returns whether a cleanup lock is required". Having to
read multiple paragraphs to understand the basic meaning of the return value
isn't great.

+		if (ItemIdIsRedirected(itemid))
+		{
+			*hastup = true;		/* page won't be truncatable */
+			continue;
+		}

It's not really new, but this comment is now a bit confusing, because it can
be understood to be about PageTruncateLinePointerArray().

+			case HEAPTUPLE_DEAD:
+			case HEAPTUPLE_RECENTLY_DEAD:
+
+				/*
+				 * We count DEAD and RECENTLY_DEAD tuples in new_dead_tuples.
+				 *
+				 * lazy_scan_prune only does this for RECENTLY_DEAD tuples,
+				 * and never has to deal with DEAD tuples directly (they
+				 * reliably become LP_DEAD items through pruning).  Our
+				 * approach to DEAD tuples is a bit arbitrary, but it seems
+				 * better than totally ignoring them.
+				 */
+				new_dead_tuples++;
+				break;

Why does it make sense to track DEAD tuples this way? Isn't that going to lead
to counting them over-and-over again? I think it's quite misleading to include
them in "dead but not yet removable".

+	/*
+	 * Now save details of the LP_DEAD items from the page in the dead_tuples
+	 * array iff VACUUM uses two-pass strategy case
+	 */

Do we really need to have separate code for this in lazy_scan_prune() and
lazy_scan_noprune()?

+	}
+	else
+	{
+		/*
+		 * We opt to skip FSM processing for the page on the grounds that it
+		 * is probably being modified by concurrent DML operations.  Seems
+		 * best to assume that the space is best left behind for future
+		 * updates of existing tuples.  This matches what opportunistic
+		 * pruning does.

Why can we assume that there concurrent DML rather than concurrent read-only
operations? IME it's much more common for read-only operations to block
cleanup locks than read-write ones (partially because the frequency makes it
easier, partially because cursors allow long-held pins, partially because the
EXCLUSIVE lock of a r/w operation wouldn't let us get here)

I think this is a change mostly in the right direction. But as formulated this
commit does *WAY* too much at once.

Greetings,

Andres Freund

#3 Peter Geoghegan
In reply to: Andres Freund (#2)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Mon, Nov 22, 2021 at 11:29 AM Andres Freund <andres@anarazel.de> wrote:

> Hm. I'm a bit doubtful that there's all that many cases where it's worth not
> pruning during vacuum. However, it seems much more common for opportunistic
> pruning during non-write accesses.

Fair enough. I just wanted to suggest an exploratory conversation
about pruning (among several other things). I'm mostly saying: hey,
pruning during VACUUM isn't actually that special, at least not with
this refactoring patch in place. So maybe it makes sense to go
further, in light of that general observation about pruning in VACUUM.

Maybe it wasn't useful to even mention this aspect now. I would rather
focus on freezing optimizations for now -- that's much more promising.

> Perhaps checking whether we'd log an FPW would be a better criterion for
> deciding whether to prune or not compared to whether we're dirtying the page?
> IME the WAL volume impact of FPWs is a considerably bigger deal than
> unnecessarily dirtying a page that has previously been dirtied in the same
> checkpoint "cycle".

Agreed. (I tend to say the former when I really mean the latter, which
I should try to avoid.)

> IDK, the potential of not having usable space on an overly fragmented page
> doesn't seem that low. We can't just mark such pages as all-visible because
> then we'll potentially never reclaim that space.

Don't get me started on this - because I'll never stop.

It makes zero sense that we don't think about free space holistically,
using the whole context of what changed in the recent past. As I think
you know already, a higher level concept (like open and closed pages)
seems like the right direction to me -- because it isn't sensible to
treat X bytes of free space in one heap page as essentially
interchangeable with any other space on any other heap page. That
misses an enormous amount of things that matter. The all-visible
status of a page is just one such thing.

> IMO the big difference between aggressive / non-aggressive isn't whether we
> wait for a cleanup lock, but that we don't skip all-visible pages...

I know what you mean by that, of course. But FWIW that definition
seems too focused on what actually happens today, rather than what is
essential given the invariants we have for VACUUM. And so I personally
prefer to define it as "a VACUUM that *reliably* advances
relfrozenxid". This looser definition will probably "age" well (ahem).

> > This new relfrozenxid optimization might not be all that valuable on its
> > own, but it may still facilitate future work that makes non-aggressive
> > VACUUMs more conscious of the benefit of advancing relfrozenxid sooner
> > rather than later. In general it would be useful for non-aggressive
> > VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for
> > a cleanup lock once or twice if needed).
>
> What do you mean by "waiting once or twice"? A single wait may simply never
> end on a busy page that's constantly pinned by a lot of backends...

I was speculating about future work again. I think that you've taken
my words too literally. This is just a draft commit message, just a
way of framing what I'm really trying to do.

Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a
non-aggressive VACUUM -- so "at least waiting for one or two pins
during non-aggressive VACUUM" might not have been the best way of
expressing the idea that I wanted to express. The important point is
that _we can make a choice_ about stuff like this dynamically, based
on the observed characteristics of the table, and some general ideas
about the costs and benefits (of waiting or not waiting, or of how
long we want to wait in total, whatever might be important). This
probably just means adding some heuristics that are pretty sensitive
to any reason to not do more work in a non-aggressive VACUUM, without
*completely* balking at doing even a tiny bit more work.

For example, we can definitely afford to wait a few more milliseconds
to get a cleanup lock just once, especially if we're already pretty
sure that that's all the extra work that it would take to ultimately
be able to advance relfrozenxid in the ongoing (non-aggressive) VACUUM
-- it's easy to make that case. Once you agree that it makes sense
under these favorable circumstances, you've already made
"aggressiveness" a continuous thing conceptually, at a high level.

The current binary definition of "aggressive" is needlessly
restrictive -- that much seems clear to me. I'm much less sure of what
specific alternative should replace it.

I've already prototyped advancing relfrozenxid using a dynamically
determined value, so that our final relfrozenxid is just about the
most recent safe value (not the original FreezeLimit). That's been
interesting. Consider this log output from an autovacuum with the
prototype patch (also uses my new instrumentation), based on standard
pgbench (just tuned heap fill factor a bit):

LOG: automatic vacuum of table "regression.public.pgbench_accounts":
index scans: 0
pages: 0 removed, 909091 remain, 33559 skipped using visibility map
(3.69% of total)
tuples: 297113 removed, 50090880 remain, 90880 are dead but not yet removable
removal cutoff: oldest xmin was 29296744, which is now 203341 xact IDs behind
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
I/O timings: read: 55.574 ms, write: 0.000 ms
avg read rate: 17.805 MB/s, avg write rate: 4.389 MB/s
buffer usage: 1728273 hits, 23150 misses, 5706 dirtied
WAL usage: 594211 records, 0 full page images, 35065032 bytes
system usage: CPU: user: 6.85 s, system: 0.08 s, elapsed: 10.15 s

All of the autovacuums against the accounts table look similar to this
one -- you don't see anything about relfrozenxid being advanced
(because it isn't). Whereas for the smaller pgbench tables, every
single VACUUM successfully advances relfrozenxid to a fairly recent
XID (without there ever being an aggressive VACUUM) -- just because
VACUUM needs to visit every page for the smaller tables. The accounts
table, meanwhile, doesn't generally need to have 100% of all pages
touched by VACUUM -- it's more like 95% there. Does that really make
sense, though?

I'm pretty sure that less aggressive VACUUMing (e.g. higher
scale_factor setting) would lead to more aggressive setting of
relfrozenxid here. I'm always suspicious when I see insignificant
differences that lead to significant behavioral differences. Am I
worried over nothing here? Perhaps -- we don't really need to advance
relfrozenxid early with this table/workload anyway. But I'm not so
sure.

Again, my point is that there is a good chance that redefining
aggressiveness in some way will be helpful. A more creative, flexible
definition might be just what we need. The details are very much up in
the air, though.

> > It would also be generally useful if aggressive VACUUMs were "less
> > aggressive" opportunistically (e.g. by being responsive to query
> > cancellations when the risk of wraparound failure is still very low).
>
> Being cancelable is already a different concept than anti-wraparound
> vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but
> anti-wrap only at autovacuum_freeze_max_age.

You know what I meant. Also, did *you* mean "being cancelable is
already a different concept to *aggressive* vacuums"? :-)

> The problem is that the
> autovacuum scheduling is way too naive for that to be a significant benefit -
> nothing tries to schedule autovacuums so that they have a chance to complete
> before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
> to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.

Not sure what you mean about scheduling, since vacuum_freeze_table_age
is only in place to make overnight (off hours low activity scripted
VACUUMs) freeze tuples before any autovacuum worker gets the chance
(since the latter may run at a much less convenient time). Sure,
vacuum_freeze_table_age might also force a regular autovacuum worker
to do an aggressive VACUUM -- but I think it's mostly intended for a
manual overnight VACUUM. Not usually very helpful, but also not
harmful.

Oh, wait. I think that you're talking about how autovacuum workers in
particular tend to be affected by this. We launch an av worker that
wants to clean up bloat, but it ends up being aggressive (and maybe
taking way longer), perhaps quite randomly, only due to
vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is
that it?

> This is one of the most embarrassing issues around the whole anti-wrap
> topic. We kind of define it as an emergency that there's an anti-wraparound
> vacuum. But we have *absolutely no mechanism* to prevent them from occurring.

What do you mean? Only an autovacuum worker can do an anti-wraparound
VACUUM (which is not quite the same thing as an aggressive VACUUM).

I agree that anti-wraparound autovacuum is way too unfriendly, though.

> > We now also collect LP_DEAD items in the dead_tuples array in the case
> > where we cannot immediately get a cleanup lock on the buffer. We cannot
> > prune without a cleanup lock, but opportunistic pruning may well have
> > left some LP_DEAD items behind in the past -- no reason to miss those.
>
> This has become *much* more important with the changes around deciding when to
> index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
> items, it's that a previous vacuum is quite likely to have left them there,
> because the previous vacuum decided not to perform index cleanup.

I haven't seen any evidence of that myself (with the optimization
added to Postgres 14 by commit 5100010ee4). I still don't understand
why you doubted that work so much. I'm not saying that you're wrong
to; I'm saying that I don't think that I understand your perspective
on it.

What I have seen in my own tests (particularly with BenchmarkSQL) is
that most individual tables either never apply the optimization even
once (because the table reliably has heap pages with many more LP_DEAD
items than the 2%-of-relpages threshold), or will never need to
(because there are precisely zero LP_DEAD items anyway). Remaining
tables that *might* use the optimization tend to not go very long
without actually getting a round of index vacuuming. It's just too
easy for updates (and even aborted xact inserts) to introduce new
LP_DEAD items for us to go long without doing index vacuuming.

If you can be more concrete about a problem you've seen, then I might
be able to help. It's not like there are no options in this already. I
already thought about introducing a small degree of randomness into
the process of deciding to skip or to not skip (in the
consider_bypass_optimization path of lazy_vacuum() on Postgres 14).
The optimization is mostly valuable because it allows us to do more
useful work in VACUUM -- not because it allows us to do less useless
work in VACUUM. In particular, it allows us to tune
autovacuum_vacuum_insert_scale_factor very aggressively with an
append-only table, without useless index vacuuming making it all but
impossible for autovacuum to get to the useful work.
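For reference, the shape of that bypass heuristic (a paraphrase of the lazy_vacuum() decision added by commit 5100010ee4, with approximate names and a caller-supplied cap standing in for the real 32MB TID-array limit -- not the exact code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 2%-of-relpages threshold discussed above */
#define BYPASS_THRESHOLD_PAGES 0.02

/*
 * Sketch of the Postgres 14 "bypass index vacuuming" decision: skip the
 * index pass only when very few heap pages have LP_DEAD items, and the
 * total number of dead items is small enough that leaving them for a
 * future VACUUM to remember is cheap.
 */
static bool
bypass_index_vacuuming(int64_t rel_pages, int64_t lpdead_item_pages,
                       int64_t lpdead_items, int64_t max_bypass_items)
{
    double threshold = (double) rel_pages * BYPASS_THRESHOLD_PAGES;

    return lpdead_item_pages < threshold &&
           lpdead_items < max_bypass_items;
}
```

Both conditions matter: the page-count test bounds future line pointer bloat, while the item-count test bounds how much dead-TID state gets deferred.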

Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),

One thing we could do around this, btw, would be to aggressively replace
LP_REDIRECT items with their target item. We can't do that in all situations
(somebody might be following a ctid chain), but I think we have all the
information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
state or something like that.

Another idea is to truncate the line pointer during pruning (including
opportunistic pruning). Matthias van de Meent has a patch for that.

I am not aware of a specific workload where the patch helps, but that
doesn't mean that there isn't one, or that it doesn't matter. It's
subtle enough that I might have just missed something. I *expect* the
true damage over time to be very hard to model or understand -- I
imagine the potential for weird feedback loops is there.

I think that'd be quite a win - we right now often "migrate" to other pages
for modifications not because we're out of space on a page, but because we run
out of itemids (for debatable reasons MaxHeapTuplesPerPage constrains the
number of line pointers, not just the number of actual tuples). Effectively
doubling the number of available line pointers in a number of
realistic / common scenarios would be quite the win.

I believe Masahiko is working on this in the current cycle. It would
be easier if we had a better sense of how increasing
MaxHeapTuplesPerPage will affect tidbitmap.c. But the idea of
increasing that seems sound to me.
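For context, the cap in question comes out of simple arithmetic (a standalone sketch of the MaxHeapTuplesPerPage definition from access/htup_details.h, assuming the default 8kB block size and 8-byte MAXALIGN; the macro names here are local stand-ins):

```c
#include <assert.h>

#define BLCKSZ 8192
#define SIZE_OF_PAGE_HEADER 24       /* SizeOfPageHeaderData */
#define SIZEOF_HEAP_TUPLE_HEADER 23  /* SizeofHeapTupleHeader */
#define SIZEOF_ITEM_ID 4             /* sizeof(ItemIdData) */
#define MY_MAXALIGN(x) (((x) + 7) & ~7)

/*
 * MaxHeapTuplesPerPage assumes every tuple needs a line pointer plus a
 * MAXALIGN'd minimal tuple header -- which is why it also bounds the
 * number of line pointers, even though LP_DEAD/LP_REDIRECT items take
 * only 4 bytes each.
 */
static int
max_heap_tuples_per_page(void)
{
    return (BLCKSZ - SIZE_OF_PAGE_HEADER) /
           (MY_MAXALIGN(SIZEOF_HEAP_TUPLE_HEADER) + SIZEOF_ITEM_ID);
}
```

With the default 8kB pages this works out to 291, even though a page full of nothing but dead line pointers could physically hold many more.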

Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
since there is barely any real practical sense in which we actually
miss doing useful work for these pages. Besides, this information
always seemed to have little practical value, even to Postgres hackers.

-0.5. I think it provides some value, and I don't see why the removal of the
information should be tied to this change. It's hard to diagnose why some dead
tuples aren't cleaned up - a common cause for that on smaller tables is that
nearly all pages are pinned nearly all the time.

Is that still true, though? If it turns out that we need to leave it
in, then I can do that. But I'd prefer to wait until we have more
information before making a final decision. Remember, the high level
idea of this whole patch is that we do as much work as possible for
any scanned_pages, which now includes pages that we never successfully
acquired a cleanup lock on. And so we're justified in assuming that
they're exactly equivalent to pages that we did get a cleanup on --
that's now the working assumption. I know that that's not literally
true, but that doesn't mean it's not a useful fiction -- it should be
very close to the truth.

Also, I would like to put more information (much more useful
information) in the same log output. Perhaps that will be less
controversial if I take something useless away first.

I wonder if we could have a more restrained version of heap_page_prune() that
doesn't require a cleanup lock? Obviously we couldn't defragment the page, but
it's not immediately obvious that we need it if we constrain ourselves to only
modify tuple versions that cannot be visible to anybody.

Random note: I really dislike that we talk about cleanup locks in some parts
of the code, and super-exclusive locks in others :(.

Somebody should normalize that.
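For what it's worth, the two terms name the same condition. A toy model (this is not the real BufferDesc/LWLock machinery, just an illustration of the invariant):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A "cleanup lock" (a.k.a. super-exclusive lock) is an exclusive
 * content lock held while the caller's pin is the only pin on the
 * buffer.  That guarantees nobody else can be mid-scan of the page's
 * line pointer array, which is what makes defragmentation safe.
 */
typedef struct ToyBuffer
{
    int  pin_count;          /* total pins across all backends */
    bool exclusively_locked; /* content lock held exclusively */
} ToyBuffer;

static bool
have_cleanup_lock(const ToyBuffer *buf)
{
    return buf->exclusively_locked && buf->pin_count == 1;
}
```

This is why a concurrent pin (e.g. a cursor parked on the page) is enough to defeat pruning: the exclusive lock alone doesn't suffice.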

+     /*
+      * Aggressive VACUUM (which is the same thing as anti-wraparound
+      * autovacuum for most practical purposes) exists so that we'll reliably
+      * advance relfrozenxid and relminmxid sooner or later.  But we can often
+      * opportunistically advance them even in a non-aggressive VACUUM.
+      * Consider if that's possible now.

I don't agree with the "most practical purposes" bit. There's a huge
difference because manual VACUUMs end up aggressive but not anti-wrap once
older than vacuum_freeze_table_age.

Okay.

+      * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+      * the rel_pages used by lazy_scan_prune, from before a possible relation
+      * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
+      */

I think it should be doable to add an isolation test for this path. There have
been quite a few bugs around the wider topic...

I would argue that we already have one -- vacuum-reltuples.spec. I had
to update its expected output in the patch. I would argue that the
behavioral change (count tuples on a pinned-by-cursor heap page) that
necessitated updating the expected output for the test is an
improvement overall.

+     {
+             /* Can safely advance relfrozenxid and relminmxid, too */
+             Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+                        orig_rel_pages);
+             vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+                                                     new_rel_allvisible, vacrel->nindexes > 0,
+                                                     FreezeLimit, MultiXactCutoff, false);
+     }

I wonder if this whole logic wouldn't become easier and less fragile if we
just went for maintaining the "actually observed" horizon while scanning the
relation. If we skip a page via VM set the horizon to invalid. Otherwise we
can keep track of the accurate horizon and use that. No need to count pages
and stuff.

There is no question that that makes sense as an optimization -- my
prototype convinced me of that already. But I don't think that it can
simplify anything (not even the call to vac_update_relstats itself, to
actually update relfrozenxid at the end). Fundamentally, this will
only work if we decide to only skip all-frozen pages, which (by
definition) only happens within aggressive VACUUMs. Isn't it that
simple?

You recently said (on the heap-pruning-14-bug thread) that you don't
think it would be practical to always set a page all-frozen when we
see that we're going to set it all-visible -- apparently you feel that
we could never opportunistically freeze early such that all-visible
but not all-frozen pages practically cease to exist. I'm still not
sure why you believe that (though you may be right, or I might have
misunderstood, since it's complicated). It would certainly benefit
this dynamic relfrozenxid business if it was possible, though. If we
could somehow make that work, then almost every VACUUM would be able
to advance relfrozenxid, independently of aggressive-ness -- because
we wouldn't have any all-visible-but-not-all-frozen pages to skip
(that important detail wouldn't be left to chance).
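The "observed horizon" bookkeeping Andres suggests could be sketched like this (hypothetical names; XIDs treated as plain integers, ignoring wraparound-aware comparisons, purely to show the shape of the idea):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Track a candidate relfrozenxid while scanning: take the minimum of
 * the oldest unfrozen XID on each scanned page.  Skipping any page
 * that isn't known all-frozen invalidates the candidate, since that
 * page might hold an older XID we never looked at.
 */
typedef struct HorizonTracker
{
    bool          valid;
    TransactionId oldest_xid;
} HorizonTracker;

static void
observe_page_xid(HorizonTracker *t, TransactionId oldest_on_page)
{
    if (oldest_on_page < t->oldest_xid)
        t->oldest_xid = oldest_on_page;
}

static void
skipped_page(HorizonTracker *t, bool all_frozen)
{
    /* skipping an all-frozen page cannot hold back the horizon */
    if (!all_frozen)
        t->valid = false;
}
```

As the discussion notes, the candidate survives either when no pages are skipped at all, or when the only skipped pages were all-frozen -- which is why the interaction with all-visible-but-not-all-frozen pages matters so much.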

-                     if (skipping_blocks && !FORCE_CHECK_PAGE())
+                     if (skipping_blocks &&
+                             !(blkno == nblocks - 1 && should_attempt_truncation(vacrel)))
{
/*
* Tricky, tricky.  If this is in aggressive vacuum, the page

I find the FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like
mixing such changes within a larger change doing many other things.

I got rid of FORCE_CHECK_PAGE() itself in this patch (not a later
patch) because the patch also removes the only other
FORCE_CHECK_PAGE() call -- and the latter change is very much in scope
for the big patch (can't be broken down into smaller changes, I
think). And so this felt natural to me. But if you prefer, I can break
it out into a separate commit.

@@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)

buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vacrel->bstrategy);
+             page = BufferGetPage(buf);
+             vacrel->scanned_pages++;

I don't particularly like doing BufferGetPage() before holding a lock on the
page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
be good to have a crosscheck that BufferGetPage() is only allowed when holding
a page level lock.

I have occasionally wondered if the whole idea of reading heap pages
with only a pin (and having cleanup locks in VACUUM) is really worth
it -- alternative designs seem possible. Obviously that's a BIG
discussion, and not one to have right now. But it seems kind of
relevant.

Since it is often legit to read a heap page without a buffer lock
(only a pin), I can't see why BufferGetPage() without a buffer lock
shouldn't also be okay -- if anything it seems safer. I think that I
would agree with you if it wasn't for that inconsistency (which is
rather a big "if", to be sure -- even for me).

+                     /* Check for new or empty pages before lazy_scan_noprune call */
+                     if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
+                                                                        vmbuffer))
{
-                             ReleaseBuffer(buf);
-                             vacrel->pinskipped_pages++;
+                             /* Lock and pin released for us */
+                             continue;
+                     }

Why isn't this done in lazy_scan_noprune()?

No reason, really -- could be done that way (we'd then also give
lazy_scan_prune the same treatment). I thought that it made a certain
amount of sense to keep some of this in the main loop, but I can
change it if you want.

+                     if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
+                     {
+                             /* No need to wait for cleanup lock for this page */
+                             UnlockReleaseBuffer(buf);
+                             if (hastup)
+                                     vacrel->nonempty_pages = blkno + 1;
continue;
}

Do we really need all of buf, blkno, page for both of these functions? Quite
possible that yes, if so, could we add an assertion that
BufferGetBlockNumber(buf) == blkno?

This just matches the existing lazy_scan_prune function (which doesn't
mean all that much, since it was only added in Postgres 14). Will add
the assertion to both.

+             /* Check for new or empty pages before lazy_scan_prune call */
+             if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
{

Maybe worth a note mentioning that we need to redo this even in the aggressive
case, because we didn't continually hold a lock on the page?

Isn't that obvious? Either way it isn't the kind of thing that I'd try
to optimize away. It's such a narrow issue.

+/*
+ * Empty pages are not really a special case -- they're just heap pages that
+ * have no allocated tuples (including even LP_UNUSED items).  You might
+ * wonder why we need to handle them here all the same.  It's only necessary
+ * because of a rare corner-case involving a hard crash during heap relation
+ * extension.  If we ever make relation-extension crash safe, then it should
+ * no longer be necessary to deal with empty pages here (or new pages, for
+ * that matter).

I don't think it's actually that rare - the window for this is huge.

I can just remove the comment, though it still makes sense to me.

I don't really see that as a realistic thing to ever reliably avoid, FWIW. I
think the overhead would be prohibitive. We'd need to do synchronous WAL
logging while holding the extension lock I think. Um, not fun.

My long term goal for the FSM (the lease based design I talked about
earlier this year) includes soft ownership of free space from
preallocated pages by individual xacts -- the smgr layer itself
becomes transactional and crash safe (at least to a limited degree).
This includes bulk extension of relations, to make up for the new
overhead implied by crash safe rel extension. I don't think that we
should require VACUUM (or anything else) to be cool with random
uninitialized pages -- to me that just seems backwards.

We can't do true bulk extension right now (just an inferior version
that doesn't give specific pages to specific backends) because the
risk of losing a bunch of empty pages for way too long is not
acceptable. But that doesn't seem fundamental to me -- that's one of
the things we'd be fixing at the same time (through what I call soft
ownership semantics). I think we'd come out ahead on performance, and
*also* have a more robust approach to relation extension.

+ * Caller can either hold a buffer cleanup lock on the buffer, or a simple
+ * shared lock.
+ */

Kinda sounds like it'd be incorrect to call this with an exclusive lock, which
made me wonder why that could be true. Perhaps just say that it needs to be
called with at least a shared lock?

Okay.

+static bool
+lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
+                                        Page page, bool sharelock, Buffer vmbuffer)

It'd be good to document the return value - for me it's not a case where it's
so obvious that it's not worth it.

Okay.

+/*
+ *   lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
+ *
+ * Caller need only hold a pin and share lock on the buffer, unlike
+ * lazy_scan_prune, which requires a full cleanup lock.

I'd add somethign like "returns whether a cleanup lock is required". Having to
read multiple paragraphs to understand the basic meaning of the return value
isn't great.

Will fix.

+             if (ItemIdIsRedirected(itemid))
+             {
+                     *hastup = true;         /* page won't be truncatable */
+                     continue;
+             }

It's not really new, but this comment is now a bit confusing, because it can
be understood to be about PageTruncateLinePointerArray().

I didn't think of that. Will address it in the next version.

Why does it make sense to track DEAD tuples this way? Isn't that going to lead
to counting them over-and-over again? I think it's quite misleading to include
them in "dead but not yet removable".

Compared to what? Do we really want to invent a new kind of DEAD tuple
(e.g., to report on), just to handle this rare case?

I accept that this code is lying about the tuples being RECENTLY_DEAD,
kind of. But isn't it still strictly closer to the truth, compared to
HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not
counting it at all.

Note that we don't remember LP_DEAD items here, either (not here, in
lazy_scan_noprune, and not in lazy_scan_prune on HEAD). Because we
pretty much interpret LP_DEAD items as "future LP_UNUSED items"
instead -- we make a soft assumption that we're going to go on to mark
the same items LP_UNUSED during a second pass over the heap. My point
is that there is no natural way to count "fully DEAD tuple that
autovacuum didn't deal with" -- and so I picked RECENTLY_DEAD.

+     /*
+      * Now save details of the LP_DEAD items from the page in the dead_tuples
+      * array, iff VACUUM uses the two-pass strategy
+      */

Do we really need to have separate code for this in lazy_scan_prune() and
lazy_scan_noprune()?

There is hardly any repetition, though.

+     }
+     else
+     {
+             /*
+              * We opt to skip FSM processing for the page on the grounds that it
+              * is probably being modified by concurrent DML operations.  Seems
+              * best to assume that the space is best left behind for future
+              * updates of existing tuples.  This matches what opportunistic
+              * pruning does.

Why can we assume that there is concurrent DML rather than concurrent read-only
operations? IME it's much more common for read-only operations to block
cleanup locks than read-write ones (partially because the frequency makes it
easier, partially because cursors allow long-held pins, partially because the
EXCLUSIVE lock of a r/w operation wouldn't let us get here)

I actually agree. It still probably isn't worth dealing with the FSM
here, though. It's just too much mechanism for too little benefit in a
very rare case. What do you think?

--
Peter Geoghegan

#4 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#3)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

Hi,

On 2021-11-22 17:07:46 -0800, Peter Geoghegan wrote:

Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a
non-aggressive VACUUM -- so "at least waiting for one or two pins
during non-aggressive VACUUM" might not have been the best way of
expressing the idea that I wanted to express. The important point is
that _we can make a choice_ about stuff like this dynamically, based
on the observed characteristics of the table, and some general ideas
about the costs and benefits (of waiting or not waiting, or of how
long we want to wait in total, whatever might be important). This
probably just means adding some heuristics that are pretty sensitive
to any reason to not do more work in a non-aggressive VACUUM, without
*completely* balking at doing even a tiny bit more work.

For example, we can definitely afford to wait a few more milliseconds
to get a cleanup lock just once

We currently have no infrastructure to wait for an lwlock or pincount for a
limited time. And at least for the former it'd not be easy to add. It may be
worth adding that at some point, but I'm doubtful this is sufficient reason
for nontrivial new infrastructure in very performance sensitive areas.

All of the autovacuums against the accounts table look similar to this
one -- you don't see anything about relfrozenxid being advanced
(because it isn't). Whereas for the smaller pgbench tables, every
single VACUUM successfully advances relfrozenxid to a fairly recent
XID (without there ever being an aggressive VACUUM) -- just because
VACUUM needs to visit every page for the smaller tables. While the
accounts table doesn't generally need to have 100% of all pages
touched by VACUUM -- it's more like 95% there. Does that really make
sense, though?

Does what really make sense?

I'm pretty sure that less aggressive VACUUMing (e.g. higher
scale_factor setting) would lead to more aggressive setting of
relfrozenxid here. I'm always suspicious when I see insignificant
differences that lead to significant behavioral differences. Am I
worried over nothing here? Perhaps -- we don't really need to advance
relfrozenxid early with this table/workload anyway. But I'm not so
sure.

I think pgbench_accounts is just a really poor showcase. Most importantly,
there are no even slightly longer-running transactions that hold down the xid
horizon. But in real workloads that's incredibly common IME. It's also quite
uncommon in real workloads to have huge tables in which all records are
updated. It's more common to have value ranges that are nearly static, and a
more heavily changing range.

I think the most interesting cases where using the "measured" horizon will be
advantageous is anti-wrap vacuums. Those obviously have to happen for rarely
modified tables, including completely static ones, too. Using the "measured"
horizon will allow us to reduce the frequency of anti-wrap autovacuums on old
tables, because we'll be able to set a much more recent relfrozenxid.

This is becoming more common with the increased use of partitioning.

The problem is that the
autovacuum scheduling is way too naive for that to be a significant benefit -
nothing tries to schedule autovacuums so that they have a chance to complete
before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.

Not sure what you mean about scheduling, since vacuum_freeze_table_age
is only in place to make overnight VACUUMs (off-hours, low-activity
scripted runs) freeze tuples before any autovacuum worker gets the chance
(since the latter may run at a much less convenient time). Sure,
vacuum_freeze_table_age might also force a regular autovacuum worker
to do an aggressive VACUUM -- but I think it's mostly intended for a
manual overnight VACUUM. Not usually very helpful, but also not
harmful.

Oh, wait. I think that you're talking about how autovacuum workers in
particular tend to be affected by this. We launch an av worker that
wants to clean up bloat, but it ends up being aggressive (and maybe
taking way longer), perhaps quite randomly, only due to
vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is
that it?

No, not quite. We treat anti-wraparound vacuums as an emergency (including
logging messages, not cancelling). But the only mechanism we have against
anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's
not really a "real" mechanism, because it requires an "independent" reason to
vacuum a table.

I've seen cases where anti-wraparound vacuums weren't a problem / never
happened for important tables for a long time, because there always was an
"independent" reason for autovacuum to start doing its thing before the table
got to be autovacuum_freeze_max_age old. But at some point the important
tables started to be big enough that autovacuum didn't schedule vacuums that
got promoted to aggressive via vacuum_freeze_table_age before the anti-wrap
vacuums. Then things started to burn, because of the unpaced anti-wrap vacuums
clogging up all IO, or maybe it was the vacuums not cancelling - I don't quite
remember the details.

Behaviour that leads to a "sudden" falling over, rather than getting gradually
worse, is bad - it somehow tends to happen on Friday evenings :).

This is one of the most embarrassing issues around the whole anti-wrap
topic. We kind of define it as an emergency that there's an anti-wraparound
vacuum. But we have *absolutely no mechanism* to prevent them from occurring.

What do you mean? Only an autovacuum worker can do an anti-wraparound
VACUUM (which is not quite the same thing as an aggressive VACUUM).

Just that autovacuum should have a mechanism to trigger aggressive vacuums
(i.e. ones that are guaranteed to be able to increase relfrozenxid unless
cancelled) before getting to the "emergency"-ish anti-wraparound state.

Or alternatively that we should have a separate threshold for the "harsher"
anti-wraparound measures.

We now also collect LP_DEAD items in the dead_tuples array in the case
where we cannot immediately get a cleanup lock on the buffer. We cannot
prune without a cleanup lock, but opportunistic pruning may well have
left some LP_DEAD items behind in the past -- no reason to miss those.

This has become *much* more important with the changes around deciding when to
index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
items, it's that a previous vacuum is quite likely to have left them there,
because the previous vacuum decided not to perform index cleanup.

I haven't seen any evidence of that myself (with the optimization
added to Postgres 14 by commit 5100010ee4). I still don't understand
why you doubted that work so much. I'm not saying that you're wrong
to; I'm saying that I don't think that I understand your perspective
on it.

I didn't (nor do) doubt that it can be useful - to the contrary, I think the
unconditional index pass was a huge practical issue. I do however think that
there are cases where it can cause trouble. The comment above wasn't meant as
a criticism - just that it seems worth pointing out that one reason we might
encounter a lot of LP_DEAD items is previous vacuums that didn't perform index
cleanup.

What I have seen in my own tests (particularly with BenchmarkSQL) is
that most individual tables either never apply the optimization even
once (because the table reliably has heap pages with many more LP_DEAD
items than the 2%-of-relpages threshold), or will never need to
(because there are precisely zero LP_DEAD items anyway). Remaining
tables that *might* use the optimization tend to not go very long
without actually getting a round of index vacuuming. It's just too
easy for updates (and even aborted xact inserts) to introduce new
LP_DEAD items for us to go long without doing index vacuuming.

I think real workloads are a bit more varied than the realistic set of
benchmarks that one person can run by themselves.

I gave you examples of cases that I see as likely being bitten by this,
e.g. when the skipped index cleanup prevents IOS scans. When both the
likely-to-be-modified and likely-to-be-queried value ranges are a small subset
of the entire data, the 2% threshold can prevent vacuum from cleaning up
LP_DEAD entries for a long time. Or when all index scans are bitmap index
scans, and nothing ends up cleaning up the dead index entries in certain
ranges, and even an explicit vacuum doesn't fix the issue. Even a relatively
small rollback / non-HOT update rate can start to be really painful.

Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),

One thing we could do around this, btw, would be to aggressively replace
LP_REDIRECT items with their target item. We can't do that in all situations
(somebody might be following a ctid chain), but I think we have all the
information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
state or something like that.

Another idea is to truncate the line pointer during pruning (including
opportunistic pruning). Matthias van de Meent has a patch for that.

I'm a bit doubtful that's as important (which is not to say that it's not
worth doing). For a heavily updated table the max space usage of the line
pointer array just isn't as big a factor as ending up with only half the
usable line pointers.

Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
since there is barely any real practical sense in which we actually
miss doing useful work for these pages. Besides, this information
always seemed to have little practical value, even to Postgres hackers.

-0.5. I think it provides some value, and I don't see why the removal of the
information should be tied to this change. It's hard to diagnose why some dead
tuples aren't cleaned up - a common cause for that on smaller tables is that
nearly all pages are pinned nearly all the time.

Is that still true, though? If it turns out that we need to leave it
in, then I can do that. But I'd prefer to wait until we have more
information before making a final decision. Remember, the high level
idea of this whole patch is that we do as much work as possible for
any scanned_pages, which now includes pages that we never successfully
acquired a cleanup lock on. And so we're justified in assuming that
they're exactly equivalent to pages that we did get a cleanup on --
that's now the working assumption. I know that that's not literally
true, but that doesn't mean it's not a useful fiction -- it should be
very close to the truth.

IDK, it seems misleading to me. Small tables with a lot of churn - quite
common - are highly reliant on LP_DEAD entries getting removed or the tiny
table suddenly isn't so tiny anymore. And it's harder to diagnose why the
cleanup isn't happening without knowledge that pages needing cleanup couldn't
be cleaned up due to pins.

If you want to improve the logic so that we only count pages that would have
something to clean up, I'd be happy as well. It doesn't have to mean exactly
what it means today.

+      * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+      * the rel_pages used by lazy_scan_prune, from before a possible relation
+      * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
+      */

I think it should be doable to add an isolation test for this path. There have
been quite a few bugs around the wider topic...

I would argue that we already have one -- vacuum-reltuples.spec. I had
to update its expected output in the patch. I would argue that the
behavioral change (count tuples on a pinned-by-cursor heap page) that
necessitated updating the expected output for the test is an
improvement overall.

I was thinking of truncations, which I don't think vacuum-reltuples.spec
tests.

+     {
+             /* Can safely advance relfrozenxid and relminmxid, too */
+             Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+                        orig_rel_pages);
+             vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+                                                     new_rel_allvisible, vacrel->nindexes > 0,
+                                                     FreezeLimit, MultiXactCutoff, false);
+     }

I wonder if this whole logic wouldn't become easier and less fragile if we
just went for maintaining the "actually observed" horizon while scanning the
relation. If we skip a page via VM set the horizon to invalid. Otherwise we
can keep track of the accurate horizon and use that. No need to count pages
and stuff.

There is no question that that makes sense as an optimization -- my
prototype convinced me of that already. But I don't think that it can
simplify anything (not even the call to vac_update_relstats itself, to
actually update relfrozenxid at the end).

Maybe. But we've had quite a few bugs because we ended up changing some detail
of what is excluded in one of the counters, leading to wrong determination
about whether we scanned everything or not.

Fundamentally, this will only work if we decide to only skip all-frozen
pages, which (by definition) only happens within aggressive VACUUMs.

Hm? Or if there are just no runs of all-visible pages of sufficient length, so
we don't end up skipping at all.

You recently said (on the heap-pruning-14-bug thread) that you don't
think it would be practical to always set a page all-frozen when we
see that we're going to set it all-visible -- apparently you feel that
we could never opportunistically freeze early such that all-visible
but not all-frozen pages practically cease to exist. I'm still not
sure why you believe that (though you may be right, or I might have
misunderstood, since it's complicated).

Yes, I think it may not work out to do that. But it's not a very strongly held
opinion.

On reason for my doubt is the following:

We can set all-visible on a page without a FPW image (well, as long as hint
bits aren't logged). There's a significant difference between needing to WAL
log FPIs for every heap page or not, and it's not that rare for data to live
shorter than autovacuum_freeze_max_age or that limit never being reached.

On a table with 40 million individually inserted rows, fully hintbitted via
reads, I see a first VACUUM taking 1.6s and generating 11MB of WAL. A
subsequent VACUUM FREEZE takes 5s and generates 500MB of WAL. That's quite a
large multiplier...

If we ever managed to not have a per-page all-visible flag this'd get even
more extreme, because we'd then not even need to dirty the page for
insert-only pages. But if we want to freeze, we'd need to (unless we just got
rid of freezing).

It would certainly benefit this dynamic relfrozenxid business if it was
possible, though. If we could somehow make that work, then almost every
VACUUM would be able to advance relfrozenxid, independently of
aggressive-ness -- because we wouldn't have any
all-visible-but-not-all-frozen pages to skip (that important detail wouldn't
be left to chance).

Perhaps we can have most of the benefit even without that. If we were to
freeze whenever it didn't cause an additional FPWing, and perhaps didn't skip
all-visible but not !all-frozen pages if they were less than x% of the
to-be-scanned data, we should be able to still increase relfrozenxid in a
lot of cases?

I don't particularly like doing BufferGetPage() before holding a lock on the
page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
be good to have a crosscheck that BufferGetPage() is only allowed when holding
a page level lock.

I have occasionally wondered if the whole idea of reading heap pages
with only a pin (and having cleanup locks in VACUUM) is really worth
it -- alternative designs seem possible. Obviously that's a BIG
discussion, and not one to have right now. But it seems kind of
relevant.

With 'reading' do you mean reads-from-os, or just references to buffer
contents?

Since it is often legit to read a heap page without a buffer lock
(only a pin), I can't see why BufferGetPage() without a buffer lock
shouldn't also be okay -- if anything it seems safer. I think that I
would agree with you if it wasn't for that inconsistency (which is
rather a big "if", to be sure -- even for me).

At least for heap it's rarely legit to read buffer contents via
BufferGetPage() without a lock. It's legit to read data at already-determined
offsets, but you can't look at much other than the tuple contents.

Why does it make sense to track DEAD tuples this way? Isn't that going to lead
to counting them over-and-over again? I think it's quite misleading to include
them in "dead but not yet removable".

Compared to what? Do we really want to invent a new kind of DEAD tuple
(e.g., to report on), just to handle this rare case?

When looking at logs I use the
"tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"
line to see whether the user is likely to have issues around an old
transaction / slot / prepared xact preventing cleanup. If new_dead_tuples
doesn't identify those cases anymore that's not reliable anymore.

I accept that this code is lying about the tuples being RECENTLY_DEAD,
kind of. But isn't it still strictly closer to the truth, compared to
HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not
counting it at all.

I don't see how it's closer at all. There's imo a significant difference
between not being able to remove tuples because of the xmin horizon, and not
being able to remove them because we couldn't get a cleanup lock.

Greetings,

Andres Freund

In reply to: Andres Freund (#4)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Mon, Nov 22, 2021 at 9:49 PM Andres Freund <andres@anarazel.de> wrote:

For example, we can definitely afford to wait a few more milliseconds
to get a cleanup lock just once

We currently have no infrastructure to wait for an lwlock or pincount for a
limited time. And at least for the former it'd not be easy to add. It may be
worth adding that at some point, but I'm doubtful this is sufficient reason
for nontrivial new infrastructure in very performance sensitive areas.

It was a hypothetical example. To be more practical about it: it seems
likely that we won't really benefit from waiting some amount of time
(not forever) for a cleanup lock in non-aggressive VACUUM, once we
have some of the relfrozenxid stuff we've talked about in place. In a
world where we're smarter about advancing relfrozenxid in
non-aggressive VACUUMs, the choice between waiting for a cleanup lock,
and not waiting (but also not advancing relfrozenxid at all) matters
less -- it's no longer a binary choice.

It's no longer a binary choice because we will have done away with the
current rigid way in which our new relfrozenxid for the relation is
either FreezeLimit, or nothing at all. So far we've only talked about
the case where we can update relfrozenxid with a value that happens to
be much newer than FreezeLimit. If we can do that, that's great. But
what about setting relfrozenxid to an *older* value than FreezeLimit
instead (in a non-aggressive VACUUM)? That's also pretty good! There
is still a decent chance that the final "suboptimal" relfrozenxid that
we determine can be safely set in pg_class at the end of our VACUUM
will still be far more recent than the preexisting relfrozenxid.
Especially with larger tables.

Advancing relfrozenxid should be thought of as a totally independent
thing to freezing tuples, at least in vacuumlazy.c itself. That's
kinda the case today, even, but *explicitly* decoupling advancing
relfrozenxid from actually freezing tuples seems like a good high
level goal for this project.

Remember, FreezeLimit is derived from vacuum_freeze_min_age in the
obvious way: OldestXmin for the VACUUM, minus vacuum_freeze_min_age
GUC/reloption setting. I'm pretty sure that this means that making
autovacuum freeze tuples more aggressively (by reducing
vacuum_freeze_min_age) could have the perverse effect of making
non-aggressive VACUUMs less likely to advance relfrozenxid -- which is
exactly backwards. This effect could easily be missed, even by expert
users, since there is no convenient instrumentation that shows how and
when relfrozenxid is advanced.

All of the autovacuums against the accounts table look similar to this
one -- you don't see anything about relfrozenxid being advanced
(because it isn't).

Does that really make
sense, though?

Does what really make sense?

Well, my accounts table example wasn't a particularly good one (it was
a conveniently available example). I am now sure that you got the
point I was trying to make here already, based on what you go on to
say about non-aggressive VACUUMs optionally *not* skipping
all-visible-not-all-frozen heap pages in the hopes of advancing
relfrozenxid earlier (more on that idea below, in my response).

On reflection, the simplest way of expressing the same idea is what I
just said about decoupling (decoupling advancing relfrozenxid from
freezing).

I think pgbench_accounts is just a really poor showcase. Most importantly,
there are no even slightly longer-running transactions that hold down the xid
horizon. But in real workloads that's incredibly common IME. It's also quite
uncommon in real workloads to have huge tables in which all records are
updated. It's more common to have value ranges that are nearly static, and a
more heavily changing range.

I agree.

I think the most interesting cases where using the "measured" horizon will be
advantageous is anti-wrap vacuums. Those obviously have to happen for rarely
modified tables, including completely static ones, too. Using the "measured"
horizon will allow us to reduce the frequency of anti-wrap autovacuums on old
tables, because we'll be able to set a much more recent relfrozenxid.

That's probably true in practice -- but who knows these days, with the
autovacuum_vacuum_insert_scale_factor stuff? Either way I see no
reason to emphasize that case in the design itself. The "decoupling"
concept now seems like the key design-level concept -- everything else
follows naturally from that.

This is becoming more common with the increased use of partitioning.

Also with bulk loading. There could easily be a tiny number of
distinct XIDs that are close together in time, for many many rows --
practically one XID, or even exactly one XID.

No, not quite. We treat anti-wraparound vacuums as an emergency (including
logging messages, not cancelling). But the only mechanism we have against
anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's
not really a "real" mechanism, because it requires an "independent" reason to
vacuum a table.

Got it.

I've seen cases where anti-wraparound vacuums weren't a problem / never
happened for important tables for a long time, because there always was an
"independent" reason for autovacuum to start doing its thing before the table
got to be autovacuum_freeze_max_age old. But at some point the important
tables started to be big enough that autovacuum didn't schedule vacuums that
got promoted to aggressive via vacuum_freeze_table_age before the anti-wrap
vacuums.

Right. Not just because they were big; also because autovacuum runs at
geometric intervals -- the final reltuples from last time is used to
determine the point at which av runs this time. This might make sense,
or it might not make any sense -- it all depends (mostly on index
stuff).

Then things started to burn, because of the unpaced anti-wrap vacuums
clogging up all IO, or maybe it was the vacuums not cancelling - I don't quite
remember the details.

Non-cancelling anti-wraparound VACUUMs that (all of a sudden) cause
chaos because they interact badly with automated DDL is one I've seen
several times -- I'm sure you have too. That was what the Manta/Joyent
blogpost I referenced upthread went into.

Behaviour that leads to a "sudden" falling over, rather than getting gradually
worse, is bad - it somehow tends to happen on Friday evenings :).

These are among our most important challenges IMV.

Just that autovacuum should have a mechanism to trigger aggressive vacuums
(i.e. ones that are guaranteed to be able to increase relfrozenxid unless
cancelled) before getting to the "emergency"-ish anti-wraparound state.

Maybe, but that runs into the problem of needing another GUC that
nobody will ever be able to remember the name of. I consider the idea
of adding a variety of measures that make non-aggressive VACUUM much
more likely to advance relfrozenxid in practice to be far more
promising.

Or alternatively that we should have a separate threshold for the "harsher"
anti-wraparound measures.

Or maybe just raise the default of autovacuum_freeze_max_age, which
many people don't change? That might be a lot safer than it once was.
Or will be, once we manage to teach VACUUM to advance relfrozenxid
more often in non-aggressive VACUUMs on Postgres 15. Imagine a world
in which we have that stuff in place, as well as related enhancements
added in earlier releases: autovacuum_vacuum_insert_scale_factor, the
freezemap, and the wraparound failsafe.

These add up to a lot; with all of that in place, the risk we'd be
introducing by increasing the default value of
autovacuum_freeze_max_age would be *far* lower than the risk of making
the same change back in 2006. I bring up 2006 because it was the year
that commit 48188e1621 added autovacuum_freeze_max_age -- the default
hasn't changed since that time.

I think workloads are a bit more varied than a realistic set of benchmarks
that one person can run themselves.

No question. I absolutely accept that I only have to miss one
important detail with something like this -- that just goes with the
territory. Just saying that I have yet to see any evidence that the
bypass-indexes behavior really hurt anything. I do take the idea that
I might have missed something very seriously, despite all this.

I gave you examples of cases that I see as likely being bitten by this,
e.g. when the skipped index cleanup prevents IOS scans. When both the
likely-to-be-modified and likely-to-be-queried value ranges are a small subset
of the entire data, the 2% threshold can prevent vacuum from cleaning up
LP_DEAD entries for a long time. Or when all index scans are bitmap index
scans, and nothing ends up cleaning up the dead index entries in certain
ranges, and even an explicit vacuum doesn't fix the issue. Even a relatively
small rollback / non-HOT update rate can start to be really painful.

That does seem possible. But I consider it very unlikely to appear as
a regression caused by the bypass mechanism itself -- not in any way
that was consistent over time. As far as I can tell, autovacuum
scheduling just doesn't operate at that level of precision, and never
has.

I have personally observed that ANALYZE does a very bad job at
noticing LP_DEAD items in tables/workloads where LP_DEAD items (not
DEAD tuples) tend to concentrate [1]. The whole idea that ANALYZE
should count these items as if they were normal tuples seems pretty
bad to me.

Put it this way: imagine you run into trouble with the bypass thing,
and then you opt to disable it on that table (using the INDEX_CLEANUP
reloption). Why should this step solve the problem on its own? In
order for that to work, VACUUM would have to know to be very
aggressive about these LP_DEAD items. But there is good reason to
believe that it just won't ever notice them, as long as ANALYZE is
expected to provide reliable statistics that drive autovacuum --
they're just too concentrated for the block-based approach to truly
work.

I'm not minimizing the risk. Just telling you my thoughts on this.

I'm a bit doubtful that's as important (which is not to say that it's not
worth doing). For a heavily updated table the max space usage of the line
pointer array just isn't as big a factor as ending up with only half the
usable line pointers.

Agreed; by far the best chance we have of improving the line pointer
bloat situation is preventing it in the first place, by increasing
MaxHeapTuplesPerPage. Once we actually do that, our remaining options
are going to be much less helpful -- then it really is mostly just up
to VACUUM.

And it's harder to diagnose why the
cleanup isn't happening without knowledge that pages needing cleanup couldn't
be cleaned up due to pins.

If you want to improve the logic so that we only count pages that would have
something to clean up, I'd be happy as well. It doesn't have to mean exactly
what it means today.

It seems like what you really care about here are remaining cases
where our inability to acquire a cleanup lock has real consequences --
you want to hear about it when it happens, however unlikely it may be.
In other words, you want to keep something in log_autovacuum_* that
indicates that "less than the expected amount of work was completed"
due to an inability to acquire a cleanup lock. And so for you, this is
a question of keeping instrumentation that might still be useful, not
a question of how we define things fundamentally, at the design level.

Sound right?

If so, then this proposal might be acceptable to you:

* Remaining DEAD tuples with storage (though not LP_DEAD items from
previous opportunistic pruning) will get counted separately in the
lazy_scan_noprune (no cleanup lock) path. Also count the total number
of distinct pages that were found to contain one or more such DEAD
tuples.

* These two new counters will be reported on their own line in the log
output, though only in the cases where we actually have any such
tuples -- which will presumably be much rarer than simply failing to
get a cleanup lock (that's now no big deal at all, because we now
consistently do certain cleanup steps, and because FreezeLimit isn't
the only viable thing that we can set relfrozenxid to, at least in the
non-aggressive case).

* There is still a limited sense in which the same items get counted
as RECENTLY_DEAD -- though just those aspects that make the overall
design simpler. So the helpful aspects of this are still preserved.

We only need to tell pgstat_report_vacuum() that these items are
"deadtuples" (remaining dead tuples). That can work by having its
caller add a new int64 counter (same new tuple-based counter used for
the new log line) to vacrel->new_dead_tuples. We'd also add the same
new tuple counter in about the same way at the point where we
determine a final vacrel->new_rel_tuples.

So we wouldn't really be treating anything as RECENTLY_DEAD anymore --
pgstat_report_vacuum() and vacrel->new_dead_tuples don't specifically
expect anything about RECENTLY_DEAD-ness already.

I was thinking of truncations, which I don't think vacuum-reltuples.spec
tests.

Got it. I'll look into that for v2.

Maybe. But we've had quite a few bugs because we ended up changing some detail
of what is excluded in one of the counters, leading to a wrong determination
about whether we scanned everything or not.

Right. But let me just point out that my whole approach is to make
that impossible, by not needing to count pages, except in
scanned_pages (and in frozenskipped_pages + rel_pages). The processing
performed for any page that we actually read during VACUUM should be
uniform (or practically uniform), by definition. With minimal fudging
in the cleanup lock case (because we mostly do the same work there
too).

There should be no reason for any more page counters now, except for
non-critical instrumentation. For example, if you want to get the
total number of pages skipped via the visibility map (not just
all-frozen pages), then you simply subtract scanned_pages from
rel_pages.

Fundamentally, this will only work if we decide to only skip all-frozen
pages, which (by definition) only happens within aggressive VACUUMs.

Hm? Or if there are just no runs of all-visible pages of sufficient length, so
we don't end up skipping at all.

Of course. But my point was: who knows when that'll happen?

One reason for my doubt is the following:

We can set all-visible on a page without a FPW image (well, as long as hint
bits aren't logged). There's a significant difference between needing to WAL
log FPIs for every heap page or not, and it's not that rare for data to live
for less than autovacuum_freeze_max_age, or for that limit never to be reached.

This sounds like an objection to one specific heuristic, and not an
objection to the general idea. The only essential part is
"opportunistic freezing during vacuum, when the cost is clearly very
low, and the benefit is probably high". And so it now seems you were
making a far more limited statement than I first believed.

Obviously many variations are possible -- there is a spectrum.
Example: a heuristic that makes VACUUM notice when it is going to
freeze at least one tuple on a page, iff the page will be marked
all-visible in any case -- we should instead freeze every tuple on the
page, and mark the page all-frozen, batching work (could account for
LP_DEAD items here too, not counting them on the assumption that
they'll become LP_UNUSED during the second heap pass later on).

If we see these conditions, then the likely explanation is that the
tuples on the heap page happen to have XIDs that are "split" by the
not-actually-important FreezeLimit cutoff, despite being essentially
similar in any way that matters.

If you want to make the same heuristic more conservative: only do this
when no existing tuples are frozen, since that could be taken as a
sign of the original heuristic not quite working on the same heap page
at an earlier stage.

I suspect that even very conservative versions of the same basic idea
would still help a lot.

Perhaps we can have most of the benefit even without that. If we were to
freeze whenever it didn't cause an additional FPWing, and perhaps didn't skip
all-visible but not !all-frozen pages if they were less than x% of the
to-be-scanned data, we should be able to still increase relfrozenxid in a
lot of cases?

I bet that's true. I like that idea.

If we had this policy, then the number of "extra"
visited-in-non-aggressive-vacuum pages (all-visible but not yet
all-frozen pages) could be managed over time through more
opportunistic freezing. This might make it work even better.

These all-visible (but not all-frozen) heap pages could be considered
"tenured", since they have survived at least one full VACUUM cycle
without being unset. So why not also freeze them based on the
assumption that they'll probably stay that way forever? There won't be
so many of the pages when we do this anyway, by definition -- since
we'd have a heuristic that limited the total number (say to no more
than 10% of the total relation size, something like that).

We're smoothing out the work that currently takes place all together
during an aggressive VACUUM this way.

Moreover, there is perhaps a good chance that the total number of
all-visible-not all-frozen heap pages will *stay* low over time, as a
result of this policy actually working -- there may be a virtuous
cycle that totally prevents us from getting an aggressive VACUUM even
once.

I have occasionally wondered if the whole idea of reading heap pages
with only a pin (and having cleanup locks in VACUUM) is really worth
it -- alternative designs seem possible. Obviously that's a BIG
discussion, and not one to have right now. But it seems kind of
relevant.

With 'reading' do you mean reads-from-os, or just references to buffer
contents?

The latter.

[1]: /messages/by-id/CAH2-Wz=9R83wcwZcPUH4FVPeDM4znzbzMvp3rt21+XhQWMU8+g@mail.gmail.com
--
Peter Geoghegan

#6 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#5)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

Hi,

On 2021-11-23 17:01:20 -0800, Peter Geoghegan wrote:

One reason for my doubt is the following:

We can set all-visible on a page without a FPW image (well, as long as hint
bits aren't logged). There's a significant difference between needing to WAL
log FPIs for every heap page or not, and it's not that rare for data to live
for less than autovacuum_freeze_max_age, or for that limit never to be reached.

This sounds like an objection to one specific heuristic, and not an
objection to the general idea.

I understood you to propose that we do not have separate frozen and
all-visible states. Which I think will be problematic, because of scenarios
like the above.

The only essential part is "opportunistic freezing during vacuum, when the
cost is clearly very low, and the benefit is probably high". And so it now
seems you were making a far more limited statement than I first believed.

I'm on board with freezing when we already dirty the page anyway, and when doing
so doesn't cause an additional FPI. And I don't think I've argued against that
in the past.

These all-visible (but not all-frozen) heap pages could be considered
"tenured", since they have survived at least one full VACUUM cycle
without being unset. So why not also freeze them based on the
assumption that they'll probably stay that way forever?

Because it's a potentially massive increase in write volume? E.g. if you have
a insert-only workload, and you discard old data by dropping old partitions,
this will often add yet another rewrite, despite your data likely never
getting old enough to need to be frozen.

Given that we often immediately need to start another vacuum just when one
finished, because the vacuum took long enough to reach thresholds of vacuuming
again, I don't think the (auto-)vacuum count is a good proxy.

Maybe you meant this as a more limited concept, i.e. only doing so when the
percentage of all-visible but not all-frozen pages is small?

We could perhaps do better if we had information about the system-wide
rate of xid throughput and how often / how long past vacuums of a table took.

Greetings,

Andres Freund

In reply to: Peter Geoghegan (#5)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Tue, Nov 23, 2021 at 5:01 PM Peter Geoghegan <pg@bowt.ie> wrote:

Behaviour that leads to a "sudden" falling over, rather than getting gradually
worse, is bad - it somehow tends to happen on Friday evenings :).

These are among our most important challenges IMV.

I haven't had time to work through any of your feedback just yet --
though it's certainly a priority for me. I won't get to it until I return
home from PGConf NYC next week.

Even still, here is a rebased v2, just to fix the bitrot. This is just
a courtesy to anybody interested in the patch.

--
Peter Geoghegan

Attachments:

v2-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch (+500 -302)
In reply to: Peter Geoghegan (#7)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Tue, Nov 30, 2021 at 11:52 AM Peter Geoghegan <pg@bowt.ie> wrote:

I haven't had time to work through any of your feedback just yet --
though it's certainly a priority for me. I won't get to it until I return
home from PGConf NYC next week.

Attached is v3, which works through most of your (Andres') feedback.

Changes in v3:

* While the first patch still gets rid of the "pinskipped_pages"
instrumentation, the second patch adds back a replacement that's
better targeted: it tracks and reports "missed_dead_tuples". This
means that log output will show the number of fully DEAD tuples with
storage that could not be pruned away due to the fact that that would
have required waiting for a cleanup lock. But we *don't* generally
report the number of pages that we couldn't get a cleanup lock on,
because that in itself doesn't mean that we skipped any useful work
(which is very much the point of all of the refactoring in the first
patch).

* We now have FSM processing in the lazy_scan_noprune case, which more
or less matches the standard lazy_scan_prune case.

* Many small tweaks, based on suggestions from Andres, and other
things that I noticed.

* Further simplification of the "consider skipping pages using
visibility map" logic -- now we always don't skip the last block in
the relation, without calling should_attempt_truncation() to make sure
we have a reason.

Note that this means that we'll always read the final page during
VACUUM, even when doing so is provably unhelpful. I'd prefer to keep
the code that deals with skipping pages using the visibility map as
simple as possible. There isn't much downside to always doing that
once my refactoring is in place: there is no risk that we'll wait for
a cleanup lock (on the final page in the rel) for no good reason.
We're only wasting one page access, at most.

(I'm not 100% sure that this is the right trade-off, actually, but
it's at least worth considering.)

Not included in v3:

* Still haven't added the isolation test for rel truncation, though
it's on my TODO list.

* I'm still working on the optimization that we discussed on this
thread: the optimization that allows the final relfrozenxid (that we
set in pg_class) to be determined dynamically, based on the actual
XIDs we observed in the table (we don't just naively use FreezeLimit).

I'm not ready to post that today, but it shouldn't take too much
longer to be good enough to review.

Thanks
--
Peter Geoghegan

Attachments:

v3-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch (+541 -325)
v3-0002-Improve-log_autovacuum_min_duration-output.patch (+84 -31)
In reply to: Peter Geoghegan (#8)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Fri, Dec 10, 2021 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote:

* I'm still working on the optimization that we discussed on this
thread: the optimization that allows the final relfrozenxid (that we
set in pg_class) to be determined dynamically, based on the actual
XIDs we observed in the table (we don't just naively use FreezeLimit).

Attached is v4 of the patch series, which now includes this
optimization, broken out into its own patch. In addition, it includes
a prototype of opportunistic freezing.

My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.

There is a kind of virtuous circle here, where successive
non-aggressive autovacuums never fall behind on freezing, and so never
fail to advance relfrozenxid (there are never any
all_visible-but-not-all_frozen pages, and we can cope with not
acquiring a cleanup lock quite well). When VACUUM chooses to freeze a
tuple opportunistically, the frozen XIDs naturally cannot hold back
the final safe relfrozenxid for the relation. Opportunistic freezing
avoids setting all_visible (without setting all_frozen) in the
visibility map. It's impossible for VACUUM to just set a page to
all_visible now, which seems like an essential part of making a decent
amount of relfrozenxid advancement take place in almost every VACUUM
operation.

Here is an example of what I'm calling a virtuous circle -- all
pgbench_history autovacuums look like this with the patch applied:

LOG: automatic vacuum of table "regression.public.pgbench_history":
index scans: 0
pages: 0 removed, 35503 remain, 31930 skipped using visibility map
(89.94% of total)
tuples: 0 removed, 5568687 remain (547976 newly frozen), 0 are
dead but not yet removable
removal cutoff: oldest xmin was 5570281, which is now 1177 xact IDs behind
relfrozenxid: advanced by 546618 xact IDs, new value: 5565226
index scan not needed: 0 pages from table (0.00% of total) had 0
dead item identifiers removed
I/O timings: read: 0.003 ms, write: 0.000 ms
avg read rate: 0.068 MB/s, avg write rate: 0.068 MB/s
buffer usage: 7169 hits, 1 misses, 1 dirtied
WAL usage: 7043 records, 1 full page images, 6974928 bytes
system usage: CPU: user: 0.10 s, system: 0.00 s, elapsed: 0.11 s

Note that relfrozenxid is almost the same as oldest xmin here. Note also
that the log output shows the number of tuples newly frozen. I see the
same general trends with *every* pgbench_history autovacuum. Actually,
with every autovacuum. The history table tends to have ultra-recent
relfrozenxid values, which isn't always what we see, but that
difference may not matter. As far as I can tell, we can expect
practically every table to have a relfrozenxid that would (at least
traditionally) be considered very safe/recent. Barring weird
application issues that make it totally impossible to advance
relfrozenxid (e.g., idle cursors that hold onto a buffer pin forever),
it seems as if relfrozenxid will now steadily march forward. Sure,
relfrozenxid advancement might be held back by the occasional inability to
acquire a cleanup lock, but the effect isn't noticeable over time;
what are the chances that a cleanup lock won't be available on the
same page (with the same old XID) more than once or twice? The odds of
that happening become astronomically tiny, long before there is any
real danger (barring pathological cases).

In the past, we've always talked about opportunistic freezing as a way
of avoiding re-dirtying heap pages during successive VACUUM operations
-- especially as a way of lowering the total volume of WAL. While I
agree that that's important, I have deliberately ignored it for now,
preferring to focus on the relfrozenxid stuff, and smoothing out the
cost of freezing (avoiding big shocks from aggressive/anti-wraparound
autovacuums). I care more about stable performance than absolute
throughput, but even still I believe that the approach I've taken to
opportunistic freezing is probably too aggressive. But it's dead
simple, which will make it easier to understand and discuss the issue
of central importance. It may be possible to optimize the WAL-logging
used during freezing, getting the cost down to the point where
freezing early just isn't a concern. The current prototype adds extra
WAL overhead, to be sure, but even that's not wildly unreasonable (you
make some of it back on FPIs, depending on the workload -- especially
with tables like pgbench_history, where delaying freezing is a total loss).

--
Peter Geoghegan

Attachments:

v4-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch (+516 -308)
v4-0002-Improve-log_autovacuum_min_duration-output.patch (+91 -32)
v4-0003-Simplify-vacuum_set_xid_limits-signature.patch (+79 -97)
v4-0004-Decouple-advancing-relfrozenxid-from-freezing.patch (+228 -79)
v4-0005-Prototype-of-opportunistic-freezing.patch (+79 -9)
#10 Masahiko Sawada <sawada.mshk@gmail.com>
In reply to: Peter Geoghegan (#9)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Thu, Dec 16, 2021 at 5:27 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Fri, Dec 10, 2021 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote:

* I'm still working on the optimization that we discussed on this
thread: the optimization that allows the final relfrozenxid (that we
set in pg_class) to be determined dynamically, based on the actual
XIDs we observed in the table (we don't just naively use FreezeLimit).

Attached is v4 of the patch series, which now includes this
optimization, broken out into its own patch. In addition, it includes
a prototype of opportunistic freezing.

My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.

Great!

I've looked at 0001 patch and here are some comments:

@@ -535,8 +540,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
                                                    xidFullScanLimit);
        aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
                                                  mxactFullScanLimit);
+       skipwithvm = true;
        if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+       {
+               /*
+                * Force aggressive mode, and disable skipping blocks using the
+                * visibility map (even those set all-frozen)
+                */
                aggressive = true;
+               skipwithvm = false;
+       }

        vacrel = (LVRelState *) palloc0(sizeof(LVRelState));

@@ -544,6 +557,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
        vacrel->rel = rel;
        vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
                         &vacrel->indrels);
+       vacrel->aggressive = aggressive;
        vacrel->failsafe_active = false;
        vacrel->consider_bypass_optimization = true;

How about adding skipwithvm to LVRelState too?

---
                        /*
-                        * The current block is potentially skippable; if we've seen a
-                        * long enough run of skippable blocks to justify skipping it, and
-                        * we're not forced to check it, then go ahead and skip.
-                        * Otherwise, the page must be at least all-visible if not
-                        * all-frozen, so we can set all_visible_according_to_vm = true.
+                        * The current page can be skipped if we've seen a long enough run
+                        * of skippable blocks to justify skipping it -- provided it's not
+                        * the last page in the relation (according to rel_pages/nblocks).
+                        *
+                        * We always scan the table's last page to determine whether it
+                        * has tuples or not, even if it would otherwise be skipped
+                        * (unless we're skipping every single page in the relation). This
+                        * avoids having lazy_truncate_heap() take access-exclusive lock
+                        * on the table to attempt a truncation that just fails
+                        * immediately because there are tuples on the last page.
                         */
-                       if (skipping_blocks && !FORCE_CHECK_PAGE())
+                       if (skipping_blocks && blkno < nblocks - 1)
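Restated as a standalone function (an illustrative paraphrase of the new test, with made-up names; not code from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/*
 * Illustrative paraphrase of the patched skip test: a block inside a long
 * enough skippable run is skipped, except that the relation's final block
 * is always scanned, so lazy_truncate_heap() never takes an
 * access-exclusive lock only to find tuples on the last page.
 */
static bool
scan_this_block(bool skipping_blocks, BlockNumber blkno, BlockNumber nblocks)
{
	if (skipping_blocks && blkno < nblocks - 1)
		return false;			/* skip using the visibility map */
	return true;				/* scan it; always includes the last block */
}
```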

Why do we always need to scan the last page even if heap truncation is
disabled (or in the failsafe mode)?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#11 Peter Geoghegan <pg@bowt.ie>
In reply to: Masahiko Sawada (#10)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.

Great!

Maybe this is a good time to revisit basic questions about VACUUM. I
wonder if we can get rid of some of the GUCs for VACUUM now.

Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of
vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for
physical blocks, but we use logical units (XIDs).

We probably shouldn't be using any units, but using XIDs "feels wrong"
to me. Even with my patch, it is theoretically possible that we won't
be able to advance relfrozenxid very much, because we cannot get a
cleanup lock on one single heap page with one old XID. But even in
this extreme case, how relevant is the "age" of this old XID, really?
What really matters is whether or not we can advance relfrozenxid in
time (with time to spare). And so the wraparound risk of the system is
not affected all that much by the age of the single oldest XID. The
risk mostly comes from how much total work we still need to do to
advance relfrozenxid. If the single old XID is quite old indeed (~1.5
billion XIDs), but there is only one, then we just have to freeze one
tuple to be able to safely advance relfrozenxid (maybe advance it by a
huge amount!). How long can it take to freeze one tuple, with the
freeze map, etc?
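The "determined dynamically" idea mentioned upthread can be modeled in a few lines (names here are illustrative and wraparound is ignored; the real code must use wraparound-aware XID comparisons):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Toy model of tracking the final relfrozenxid dynamically: start from the
 * most optimistic value (OldestXmin), then ratchet the candidate back for
 * every unfrozen XID that remains behind in the table. Plain '<' stands in
 * for TransactionIdPrecedes() here, for simplicity.
 */
typedef struct
{
	TransactionId NewRelfrozenXid;	/* candidate value for pg_class */
} ScanState;

static void
scan_init(ScanState *state, TransactionId OldestXmin)
{
	state->NewRelfrozenXid = OldestXmin;
}

static void
observe_unfrozen_xid(ScanState *state, TransactionId xid)
{
	if (xid < state->NewRelfrozenXid)
		state->NewRelfrozenXid = xid;
}
```

In this model, one ancient XID drags the candidate far back, but freezing that one tuple lets relfrozenxid jump forward by the same amount.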

On the other hand, the risk may be far greater if we have *many*
tuples that are still unfrozen, whose XIDs are only "middle aged"
right now. The idea behind vacuum_freeze_min_age seems to be to be
lazy about work (tuple freezing) in the hope that we'll never need to
do it, but that seems obsolete now. (It probably made a little more
sense before the visibility map.)

Using XIDs makes sense for things like autovacuum_freeze_max_age,
because there we have to worry about wraparound and relfrozenxid
(whether or not we like it). But with this patch, and with everything
else (the failsafe, insert-driven autovacuums, everything we've done
over the last several years) I think that it might be time to increase
the autovacuum_freeze_max_age default. Maybe even to something as high
as 800 million transaction IDs, but certainly to 400 million. What do
you think? (Maybe don't answer just yet, something to think about.)

+ vacrel->aggressive = aggressive;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;

How about adding skipwithvm to LVRelState too?

Agreed -- it's slightly better that way. Will change this.

*/
-                       if (skipping_blocks && !FORCE_CHECK_PAGE())
+                       if (skipping_blocks && blkno < nblocks - 1)

Why do we always need to scan the last page even if heap truncation is
disabled (or in the failsafe mode)?

My goal here was to keep the behavior from commit e8429082, "Avoid
useless truncation attempts during VACUUM", while simplifying things
around skipping heap pages via the visibility map (including removing
the FORCE_CHECK_PAGE() macro). Of course you're right that this
particular change that you have highlighted does change the behavior a
little -- now we will always treat the final page as a "scanned page",
except perhaps when 100% of all pages in the relation are skipped
using the visibility map.

This was a deliberate choice (and perhaps even a good choice!). I
think that avoiding accessing the last heap page like this isn't worth
the complexity. Note that we may already access heap pages (making
them "scanned pages") despite the fact that we know it's unnecessary:
the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't
even try to avoid wasting CPU cycles on these
not-skipped-but-skippable pages). So I think that the performance cost
for the last page isn't going to be noticeable.

However, now that I think about it, I wonder...what do you think of
SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today?
SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly
after the original visibility map implementation was committed in
2009. The idea that it helps us to advance relfrozenxid outside of
aggressive VACUUMs (per commit message from bf136cf6e3) seems like it
might no longer matter with the patch -- because now we won't ever set
a page all-visible but not all-frozen. Plus the idea that we need to
do all this work just to get readahead from the OS
seems...questionable.
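For reference, the rule being questioned reduces to a single threshold test (SKIP_PAGES_THRESHOLD is the real constant; the function name is illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Unchanged since commit bf136cf6e3 (2009) */
#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)

/*
 * Only skip a run of VM-skippable blocks when the run is at least
 * SKIP_PAGES_THRESHOLD long; shorter runs are scanned anyway, on the
 * theory that sequential reads plus OS readahead beat seeking past them.
 */
static bool
run_justifies_skipping(BlockNumber skippable_run_len)
{
	return skippable_run_len >= SKIP_PAGES_THRESHOLD;
}
```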

--
Peter Geoghegan

#12 Masahiko Sawada <sawada.mshk@gmail.com>
In reply to: Peter Geoghegan (#11)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Sat, Dec 18, 2021 at 11:29 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.

Great!

Maybe this is a good time to revisit basic questions about VACUUM. I
wonder if we can get rid of some of the GUCs for VACUUM now.

Can we fully get rid of vacuum_freeze_table_age?

Does it mean that a vacuum always is an aggressive vacuum? If
opportunistic freezing works well on all tables, we might no longer
need vacuum_freeze_table_age. But I’m not sure that’s true since the
cost of freezing tuples is not 0.

We probably shouldn't be using any units, but using XIDs "feels wrong"
to me. Even with my patch, it is theoretically possible that we won't
be able to advance relfrozenxid very much, because we cannot get a
cleanup lock on one single heap page with one old XID. But even in
this extreme case, how relevant is the "age" of this old XID, really?
What really matters is whether or not we can advance relfrozenxid in
time (with time to spare). And so the wraparound risk of the system is
not affected all that much by the age of the single oldest XID. The
risk mostly comes from how much total work we still need to do to
advance relfrozenxid. If the single old XID is quite old indeed (~1.5
billion XIDs), but there is only one, then we just have to freeze one
tuple to be able to safely advance relfrozenxid (maybe advance it by a
huge amount!). How long can it take to freeze one tuple, with the
freeze map, etc?

I think that that's true for (mostly) static tables. But consider
constantly-updated tables: autovacuum runs based on the number of
garbage (or inserted) tuples and on how old relfrozenxid is. So if an
autovacuum could not advance relfrozenxid because it could not get a
cleanup lock on the page that has the single oldest XID, it's likely
that the next autovacuum will have to process other pages too, since
the table will have accumulated enough changes by then.

It might be a good idea that we remember pages where we could not get
a cleanup lock somewhere and revisit them after index cleanup. While
revisiting the pages, we don’t prune the page but only freeze tuples.

On the other hand, the risk may be far greater if we have *many*
tuples that are still unfrozen, whose XIDs are only "middle aged"
right now. The idea behind vacuum_freeze_min_age seems to be to be
lazy about work (tuple freezing) in the hope that we'll never need to
do it, but that seems obsolete now. (It probably made a little more
sense before the visibility map.)

Why is it obsolete now? I guess that it's still valid depending on the
cases, for example, heavily updated tables.

Using XIDs makes sense for things like autovacuum_freeze_max_age,
because there we have to worry about wraparound and relfrozenxid
(whether or not we like it). But with this patch, and with everything
else (the failsafe, insert-driven autovacuums, everything we've done
over the last several years) I think that it might be time to increase
the autovacuum_freeze_max_age default. Maybe even to something as high
as 800 million transaction IDs, but certainly to 400 million. What do
you think? (Maybe don't answer just yet, something to think about.)

I don’t have an objection to increasing autovacuum_freeze_max_age for
now. One of my concerns with anti-wraparound vacuums is that too many
tables (or several large tables) will reach autovacuum_freeze_max_age
at once, using up autovacuum slots and preventing autovacuums from
being launched on tables that are heavily being updated. Given these
works, expanding the gap between vacuum_freeze_table_age and
autovacuum_freeze_max_age would have better chances for the tables to
advance its relfrozenxid by an aggressive vacuum instead of an
anti-wraparound-aggressive vacuum. 400 million seems to be a good
start.

+ vacrel->aggressive = aggressive;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;

How about adding skipwithvm to LVRelState too?

Agreed -- it's slightly better that way. Will change this.

*/
-                       if (skipping_blocks && !FORCE_CHECK_PAGE())
+                       if (skipping_blocks && blkno < nblocks - 1)

Why do we always need to scan the last page even if heap truncation is
disabled (or in the failsafe mode)?

My goal here was to keep the behavior from commit e8429082, "Avoid
useless truncation attempts during VACUUM", while simplifying things
around skipping heap pages via the visibility map (including removing
the FORCE_CHECK_PAGE() macro). Of course you're right that this
particular change that you have highlighted does change the behavior a
little -- now we will always treat the final page as a "scanned page",
except perhaps when 100% of all pages in the relation are skipped
using the visibility map.

This was a deliberate choice (and perhaps even a good choice!). I
think that avoiding accessing the last heap page like this isn't worth
the complexity. Note that we may already access heap pages (making
them "scanned pages") despite the fact that we know it's unnecessary:
the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't
even try to avoid wasting CPU cycles on these
not-skipped-but-skippable pages). So I think that the performance cost
for the last page isn't going to be noticeable.

Agreed.

However, now that I think about it, I wonder...what do you think of
SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today?
SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly
after the original visibility map implementation was committed in
2009. The idea that it helps us to advance relfrozenxid outside of
aggressive VACUUMs (per commit message from bf136cf6e3) seems like it
might no longer matter with the patch -- because now we won't ever set
a page all-visible but not all-frozen. Plus the idea that we need to
do all this work just to get readahead from the OS
seems...questionable.

Given opportunistic freezing, that's true, but I'm concerned about
whether opportunistic freezing always works well on all tables, since
freezing tuples is not free.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#13 Peter Geoghegan <pg@bowt.ie>
In reply to: Masahiko Sawada (#12)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Mon, Dec 20, 2021 at 8:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Can we fully get rid of vacuum_freeze_table_age?

Does it mean that a vacuum always is an aggressive vacuum?

No. Just somewhat more like one. Still no waiting for cleanup locks,
though. Also, autovacuum is still cancelable (that's technically from
anti-wraparound VACUUM, but you know what I mean). And there shouldn't
be a noticeable difference in terms of how many blocks can be skipped
using the VM.

If opportunistic freezing works well on all tables, we might no longer
need vacuum_freeze_table_age. But I’m not sure that’s true since the
cost of freezing tuples is not 0.

That's true, of course, but right now the only goal of opportunistic
freezing is to advance relfrozenxid in every VACUUM. It needs to be
shown to be worth it, of course. But let's assume that it is worth it,
for a moment (perhaps only because we optimize freezing itself in
passing) -- then there is little use for vacuum_freeze_table_age, that
I can see.

I think that that's true for (mostly) static tables. But consider
constantly-updated tables: autovacuum runs based on the number of
garbage (or inserted) tuples and on how old relfrozenxid is. So if an
autovacuum could not advance relfrozenxid because it could not get a
cleanup lock on the page that has the single oldest XID, it's likely
that the next autovacuum will have to process other pages too, since
the table will have accumulated enough changes by then.

I'm not arguing that the age of the single oldest XID is *totally*
irrelevant. Just that it's typically much less important than the
total amount of work we'd have to do (freezing) to be able to advance
relfrozenxid.

In any case, the extreme case where we just cannot get a cleanup lock
on one particular page with an old XID is probably very rare.

It might be a good idea that we remember pages where we could not get
a cleanup lock somewhere and revisit them after index cleanup. While
revisiting the pages, we don’t prune the page but only freeze tuples.

Maybe, but I think that it would make more sense to not use
FreezeLimit for that at all. In an aggressive VACUUM (where we might
actually have to wait for a cleanup lock), why should we wait once the
age is over vacuum_freeze_min_age (usually 50 million XIDs)? The
official answer is "because we need to advance relfrozenxid". But why
not accept a much older relfrozenxid that is still sufficiently
young/safe, in order to avoid waiting for a cleanup lock?

In other words, what if our approach of "being diligent about
advancing relfrozenxid" makes the relfrozenxid problem worse, not
better? The problem with "being diligent" is that it is defined by
FreezeLimit (which is more or less the same thing as
vacuum_freeze_min_age), which is supposed to be about which tuples we
will freeze. That's a very different thing to how old relfrozenxid
should be or can be (after an aggressive VACUUM finishes).

On the other hand, the risk may be far greater if we have *many*
tuples that are still unfrozen, whose XIDs are only "middle aged"
right now. The idea behind vacuum_freeze_min_age seems to be to be
lazy about work (tuple freezing) in the hope that we'll never need to
do it, but that seems obsolete now. (It probably made a little more
sense before the visibility map.)

Why is it obsolete now? I guess that it's still valid depending on the
cases, for example, heavily updated tables.

Because after the 9.6 freezemap work we'll often set the all-visible
bit in the VM, but not the all-frozen bit (unless we have the
opportunistic freezing patch applied, which specifically avoids that).
When that happens, affected heap pages will still have
older-than-vacuum_freeze_min_age-XIDs after VACUUM runs, until we get
to an aggressive VACUUM. There could be many VACUUMs before the
aggressive VACUUM.

This "freezing cliff" seems like it might be a big problem, in
general. That's what I'm trying to address here.

Either way, the system doesn't really respect vacuum_freeze_min_age in
the way that it did before 9.6 -- which is what I meant by "obsolete".
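The mechanics behind this "freezing cliff" come down to the two per-page visibility map bits (bit values as in visibilitymapdefs.h; the predicate names are illustrative, not actual functions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Two bits per heap page, as in src/include/access/visibilitymapdefs.h */
#define VISIBILITYMAP_ALL_VISIBLE	0x01
#define VISIBILITYMAP_ALL_FROZEN	0x02

/*
 * A non-aggressive VACUUM may skip any all-visible page, while an
 * aggressive VACUUM may only skip all-frozen pages. Pages that are
 * all-visible but not all-frozen therefore accumulate unfrozen XIDs
 * until one aggressive VACUUM must revisit them all at once.
 */
static bool
nonaggressive_can_skip(uint8_t vmbits)
{
	return (vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0;
}

static bool
aggressive_can_skip(uint8_t vmbits)
{
	return (vmbits & VISIBILITYMAP_ALL_FROZEN) != 0;
}
```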

I don’t have an objection to increasing autovacuum_freeze_max_age for
now. One of my concerns with anti-wraparound vacuums is that too many
tables (or several large tables) will reach autovacuum_freeze_max_age
at once, using up autovacuum slots and preventing autovacuums from
being launched on tables that are heavily being updated.

I think that the patch helps with that, actually -- there tends to be
"natural variation" in the relfrozenxid age of each table, which comes
from per-table workload characteristics.

Given these
works, expanding the gap between vacuum_freeze_table_age and
autovacuum_freeze_max_age would have better chances for the tables to
advance its relfrozenxid by an aggressive vacuum instead of an
anti-wraparound-aggressive vacuum. 400 million seems to be a good
start.

The idea behind getting rid of vacuum_freeze_table_age (not to be
confused by the other idea about getting rid of vacuum_freeze_min_age)
is this: with the patch series, we only tend to get an anti-wraparound
VACUUM in extreme and relatively rare cases. For example, we will get
aggressive anti-wraparound VACUUMs on tables that *never* grow, but
constantly get HOT updates (e.g. the pgbench_accounts table with heap
fill factor reduced to 90). We won't really be able to use the VM when
this happens, either.

With tables like this -- tables that still get aggressive VACUUMs --
maybe the patch doesn't make a huge difference. But that's truly the
extreme case -- that is true only because there is already zero chance
of there being a non-aggressive VACUUM. We'll get aggressive
anti-wraparound VACUUMs every time we reach autovacuum_freeze_max_age,
again and again -- no change, really.

But since it's only these extreme cases that continue to get
aggressive VACUUMs, why do we still need vacuum_freeze_table_age? It
helps right now (without the patch) by "escalating" a regular VACUUM
to an aggressive one. But the cases that we still expect an aggressive
VACUUM (with the patch) are the cases where there is zero chance of
that happening. Almost by definition.

Given opportunistic freezing, that's true, but I'm concerned about
whether opportunistic freezing always works well on all tables, since
freezing tuples is not free.

That is the big question for this patch.

--
Peter Geoghegan

#14 Peter Geoghegan <pg@bowt.ie>
In reply to: Peter Geoghegan (#13)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Mon, Dec 20, 2021 at 9:35 PM Peter Geoghegan <pg@bowt.ie> wrote:

Given opportunistic freezing, that's true, but I'm concerned about
whether opportunistic freezing always works well on all tables, since
freezing tuples is not free.

That is the big question for this patch.

Attached is a mechanical rebase of the patch series. This new version
just fixes bitrot, caused by Masahiko's recent lazyvacuum.c
refactoring work. In other words, this revision has no significant
changes compared to the v4 that I posted back in late December -- just
want to keep CFTester green.

I still have plenty of work to do here. Especially with the final
patch (the v5-0005-* "freeze early" patch), which is generally more
speculative than the other patches. I'm playing catch-up now, since I
just returned from vacation.

--
Peter Geoghegan

Attachments:

v5-0002-Improve-log_autovacuum_min_duration-output.patchapplication/x-patch; name=v5-0002-Improve-log_autovacuum_min_duration-output.patchDownload+91-32
v5-0004-Loosen-coupling-between-relfrozenxid-and-tuple-fr.patchapplication/x-patch; name=v5-0004-Loosen-coupling-between-relfrozenxid-and-tuple-fr.patchDownload+231-82
v5-0005-Freeze-tuples-early-to-advance-relfrozenxid.patchapplication/x-patch; name=v5-0005-Freeze-tuples-early-to-advance-relfrozenxid.patchDownload+88-9
v5-0003-Simplify-vacuum_set_xid_limits-signature.patchapplication/x-patch; name=v5-0003-Simplify-vacuum_set_xid_limits-signature.patchDownload+79-97
v5-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patchapplication/x-patch; name=v5-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patchDownload+518-307
#15 Robert Haas <robertmhaas@gmail.com>
In reply to: Peter Geoghegan (#11)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Fri, Dec 17, 2021 at 9:30 PM Peter Geoghegan <pg@bowt.ie> wrote:

Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of
vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for
physical blocks, but we use logical units (XIDs).

I don't see how we can get rid of these. We know that catastrophe will
ensue if we fail to freeze old XIDs for a sufficiently long time ---
where sufficiently long has to do with the number of XIDs that have
been subsequently consumed. So it's natural to decide whether or not
we're going to wait for cleanup locks on pages on the basis of how old
the XIDs they contain actually are. Admittedly, that decision doesn't
need to be made at the start of the vacuum, as we do today. We could
happily skip waiting for a cleanup lock on pages that contain only
newer XIDs, but if there is a page that both contains an old XID and
stays pinned for a long time, we eventually have to sit there and wait
for that pin to be released. And the best way to decide when to switch
to that strategy is really based on the age of that XID, at least as I
see it, because it is the age of that XID reaching 2 billion that is
going to kill us.
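The 2-billion figure falls out of how XID comparisons work: XIDs live on a modulo-2^32 circle, so one XID can only be meaningfully "older" than another within a 2^31 window. A sketch in the style of TransactionIdPrecedes():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Wraparound-aware comparison in the style of TransactionIdPrecedes():
 * interpret the unsigned difference as signed, so an XID counts as
 * "older" only within a 2^31 (~2.1 billion) window. Once an unfrozen
 * XID falls further behind than that, comparisons would invert --
 * hence the hard limit that freezing must never let us approach.
 */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
	int32_t		diff = (int32_t) (id1 - id2);

	return diff < 0;
}
```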

I think vacuum_freeze_min_age also serves a useful purpose: it
prevents us from freezing data that's going to be modified again or
even deleted in the near future. Since we can't know the future, we
must base our decision on the assumption that the future will be like
the past: if the page hasn't been modified for a while, then we should
assume it's not likely to be modified again soon; otherwise not. If we
knew the time at which the page had last been modified, it would be
very reasonable to use that here - say, freeze the XIDs if the page
hasn't been touched in an hour, or whatever. But since we lack such
timestamps the XID age is the closest proxy we have.

The
risk mostly comes from how much total work we still need to do to
advance relfrozenxid. If the single old XID is quite old indeed (~1.5
billion XIDs), but there is only one, then we just have to freeze one
tuple to be able to safely advance relfrozenxid (maybe advance it by a
huge amount!). How long can it take to freeze one tuple, with the
freeze map, etc?

I don't really see any reason for optimism here. There could be a lot
of unfrozen pages in the relation, and we'd have to troll through all
of those in order to find that single old XID. Moreover, there is
nothing whatsoever to focus autovacuum's attention on that single old
XID rather than anything else. Nothing in the autovacuum algorithm
will cause it to focus its efforts on that single old XID at a time
when there's no pin on the page, or at a time when that XID becomes
the thing that's holding back vacuuming throughout the cluster. A lot
of vacuum problems that users experience today would be avoided if
autovacuum had perfect knowledge of what it ought to be prioritizing
at any given time, or even some knowledge. But it doesn't, and is
often busy fiddling while Rome burns.

IOW, the time that it takes to freeze that one tuple *in theory* might
be small. But in practice it may be very large, because we won't
necessarily get around to it on any meaningful time frame.

--
Robert Haas
EDB: http://www.enterprisedb.com

#16 Peter Geoghegan <pg@bowt.ie>
In reply to: Robert Haas (#15)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Thu, Jan 6, 2022 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Dec 17, 2021 at 9:30 PM Peter Geoghegan <pg@bowt.ie> wrote:

Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of
vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for
physical blocks, but we use logical units (XIDs).

I don't see how we can get rid of these. We know that catastrophe will
ensue if we fail to freeze old XIDs for a sufficiently long time ---
where sufficiently long has to do with the number of XIDs that have
been subsequently consumed.

I don't really disagree with anything you've said, I think. There are
a few subtleties here. I'll try to tease them apart.

I agree that we cannot do without something like vacrel->FreezeLimit
for the foreseeable future -- but the closely related GUC
(vacuum_freeze_min_age) is another matter. Although everything you've
said in favor of the GUC seems true, the GUC is not a particularly
effective (or natural) way of constraining the problem. It just
doesn't make sense as a tunable.

One obvious reason for this is that the opportunistic freezing stuff
is expected to be the thing that usually forces freezing -- not
vacuum_freeze_min_age, nor FreezeLimit, nor any other XID-based
cutoff. As you more or less pointed out yourself, we still need
FreezeLimit as a backstop mechanism. But the value of FreezeLimit can
just come from autovacuum_freeze_max_age/2 in all cases (no separate
GUC), or something along those lines. We don't particularly expect the
value of FreezeLimit to matter, at least most of the time. It should
only noticeably affect our behavior during anti-wraparound VACUUMs,
which become rare with the patch (e.g. my pgbench_accounts example
upthread). Most individual tables will never get even one
anti-wraparound VACUUM -- it just doesn't ever come for most tables in
practice.
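If FreezeLimit really were derived from autovacuum_freeze_max_age alone, the computation might look something like this (a hypothetical sketch of the suggestion above, not committed code; clamping to FirstNormalTransactionId is omitted):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Hypothetical backstop: derive FreezeLimit as half of
 * autovacuum_freeze_max_age behind the next XID, with no separate
 * vacuum_freeze_min_age GUC. Unsigned subtraction handles the
 * modulo-2^32 wraparound; skipping over the special reserved XIDs
 * (FirstNormalTransactionId) is omitted for brevity.
 */
static TransactionId
compute_freeze_limit(TransactionId next_xid, uint32_t autovac_freeze_max_age)
{
	return next_xid - autovac_freeze_max_age / 2;
}
```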

My big issue with vacuum_freeze_min_age is that it doesn't really work
with the freeze map work in 9.6, which creates problems that I'm
trying to address by freezing early and so on. After all, HEAD (and
all stable branches) can easily set a page to all-visible (but not
all-frozen) in the VM, meaning that the page's tuples won't be
considered for freezing until the next aggressive VACUUM. This means
that vacuum_freeze_min_age is already frequently ignored by the
implementation -- it's conditioned on other things that are practically
impossible to predict.

Curious about your thoughts on this existing issue with
vacuum_freeze_min_age. I am concerned about the "freezing cliff" that
it creates.

So it's natural to decide whether or not
we're going to wait for cleanup locks on pages on the basis of how old
the XIDs they contain actually are.

I agree, but again, it's only a backstop. With the patch we'd have to
be rather unlucky to ever need to wait like this.

What are the chances that we keep failing to freeze an old XID from
one particular page, again and again? My testing indicates that it's a
negligible concern in practice (barring pathological cases with idle
cursors, etc).

I think vacuum_freeze_min_age also serves a useful purpose: it
prevents us from freezing data that's going to be modified again or
even deleted in the near future. Since we can't know the future, we
must base our decision on the assumption that the future will be like
the past: if the page hasn't been modified for a while, then we should
assume it's not likely to be modified again soon; otherwise not.

But the "freeze early" heuristics work a bit like that anyway. We
won't freeze all the tuples on a whole heap page early if we won't
otherwise set the heap page to all-visible (not all-frozen) in the VM
anyway.

If we
knew the time at which the page had last been modified, it would be
very reasonable to use that here - say, freeze the XIDs if the page
hasn't been touched in an hour, or whatever. But since we lack such
timestamps the XID age is the closest proxy we have.

XID age is a *terrible* proxy. The age of an XID in a tuple header may
advance quickly, even when nobody modifies the same table at all.
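A toy illustration of the point (not PostgreSQL code): age(xmin) is nextXID minus xmin, so it grows with cluster-wide XID consumption even when the tuple's own table is never touched again.

```python
def xid_age(next_xid, xmin):
    # Real XIDs wrap around at 2^32; ignored here for clarity.
    return next_xid - xmin

xmin = 1_000  # tuple inserted once, then left alone forever
print(xid_age(1_000, xmin))  # 0 at insertion time

# After 50 million XIDs of unrelated activity elsewhere in the cluster,
# the untouched tuple crosses the vacuum_freeze_min_age default (50
# million) without its page ever having been modified again:
print(xid_age(50_001_000, xmin))  # 50000000
```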

I concede that it is true that we are (in some sense) "gambling" by
freezing early -- we may end up freezing a tuple that we subsequently
update anyway. But aren't we also "gambling" by *not* freezing early?
By not freezing, we risk getting into "freezing debt" that will have
to be paid off in one ruinously large installment. I would much rather
"gamble" on something where we can tolerate consistently "losing" than
gamble on something where I cannot ever afford to lose (even if it's
much less likely that I'll lose during any given VACUUM operation).

Besides all this, I think that we have a rather decent chance of
coming out ahead in practice by freezing early. In practice the
marginal cost of freezing early is consistently pretty low.
Cost-control-driven (as opposed to need-driven) freezing is *supposed*
to be cheaper, of course. And like it or not, freezing is really just part of
the cost of storing data using Postgres (for the time being, at least).

The
risk mostly comes from how much total work we still need to do to
advance relfrozenxid. If the single old XID is quite old indeed (~1.5
billion XIDs), but there is only one, then we just have to freeze one
tuple to be able to safely advance relfrozenxid (maybe advance it by a
huge amount!). How long can it take to freeze one tuple, with the
freeze map, etc?

I don't really see any reason for optimism here.

IOW, the time that it takes to freeze that one tuple *in theory* might
be small. But in practice it may be very large, because we won't
necessarily get around to it on any meaningful time frame.

On second thought I agree that my specific example of 1.5 billion XIDs
was a little too optimistic of me. But 50 million XIDs (i.e. the
vacuum_freeze_min_age default) is too pessimistic. The important point
is that FreezeLimit could plausibly become nothing more than a
backstop mechanism, with the design from the patch series -- something
that typically has no effect on what tuples actually get frozen.

--
Peter Geoghegan

#17Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#16)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Thu, Jan 6, 2022 at 2:45 PM Peter Geoghegan <pg@bowt.ie> wrote:

But the "freeze early" heuristics work a bit like that anyway. We
won't freeze all the tuples on a whole heap page early if we won't
otherwise set the heap page to all-visible (not all-frozen) in the VM
anyway.

I believe that applications tend to update rows according to
predictable patterns. Andy Pavlo made an observation about this at one
point:

https://youtu.be/AD1HW9mLlrg?t=3202

I think that we don't do a good enough job of keeping logically
related tuples (tuples inserted around the same time) together, on the
same original heap page, which motivated a lot of my experiments with
the FSM from last year. Even still, it seems like a good idea for us
to err in the direction of assuming that tuples on the same heap page
are logically related. The tuples should all be frozen together when
possible. And *not* frozen early when the heap page as a whole can't
be frozen (barring cases with one *much* older XID before
FreezeLimit).

--
Peter Geoghegan

#18Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#16)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Thu, Jan 6, 2022 at 5:46 PM Peter Geoghegan <pg@bowt.ie> wrote:

One obvious reason for this is that the opportunistic freezing stuff
is expected to be the thing that usually forces freezing -- not
vacuum_freeze_min_age, nor FreezeLimit, nor any other XID-based
cutoff. As you more or less pointed out yourself, we still need
FreezeLimit as a backstop mechanism. But the value of FreezeLimit can
just come from autovacuum_freeze_max_age/2 in all cases (no separate
GUC), or something along those lines. We don't particularly expect the
value of FreezeLimit to matter, at least most of the time. It should
only noticeably affect our behavior during anti-wraparound VACUUMs,
which become rare with the patch (e.g. my pgbench_accounts example
upthread). Most individual tables will never get even one
anti-wraparound VACUUM -- it just doesn't ever come for most tables in
practice.

This seems like a weak argument. Sure, you COULD hard-code the limit
to be autovacuum_freeze_max_age/2 rather than making it a separate
tunable, but I don't think it's better. I am generally very skeptical
about the idea of using the same GUC value for multiple purposes,
because it often turns out that the optimal value for one purpose is
different than the optimal value for some other purpose. For example,
the optimal amount of memory for a hash table is likely different than
the optimal amount for a sort, which is why we now have
hash_mem_multiplier. When it's not even the same value that's being
used in both places, but the original value in one place and a value
derived from some formula in the other, the chances of things working
out are even less.

I feel generally that a lot of the argument you're making here
supposes that tables are going to get vacuumed regularly. I agree that
IF tables are being vacuumed on a regular basis, and if as part of
that we always push relfrozenxid forward as far as we can, we will
rarely have a situation where aggressive strategies to avoid
wraparound are required. However, I disagree strongly with the idea
that we can assume that tables will get vacuumed regularly. That can
fail to happen for all sorts of reasons. One of the common ones is a
poor choice of autovacuum configuration. The most common problem in my
experience is a cost limit that is too low to permit the amount of
vacuuming that is actually required, but other kinds of problems like
not enough workers (so tables get starved), too many workers (so the
cost limit is being shared between many processes), autovacuum=off
either globally or on one table (because of ... reasons),
autovacuum_vacuum_insert_threshold = -1 plus not many updates (so
nothing ever triggers the vacuum), autovacuum_naptime=1d (actually seen
in the real world! ... and, no, it didn't work well), or stats
collector problems are all possible. We can *hope* that there are
going to be regular vacuums of the table long before wraparound
becomes a danger, but realistically, we better not assume that in our
choice of algorithms, because the real world is a messy place where
all sorts of crazy things happen.

Now, I agree with you in part: I don't think it's obvious that it's
useful to tune vacuum_freeze_table_age. When I advise customers on how
to fix vacuum problems, I am usually telling them to increase
autovacuum_vacuum_cost_limit, possibly also with an increase in
autovacuum_workers; or to increase or decrease
autovacuum_freeze_max_age depending on which problem they have; or
occasionally to adjust settings like autovacuum_naptime. It doesn't
often seem to be necessary to change vacuum_freeze_table_age or, for
that matter, vacuum_freeze_min_age. But if we remove them and then
discover scenarios where tuning them would have been useful, we'll
have no options for fixing PostgreSQL systems in the field. Waiting
for the next major release in such a scenario, or even the next minor
release, is not good. We should be VERY conservative about removing
existing settings if there's any chance that somebody could use them
to tune their way out of trouble.

My big issue with vacuum_freeze_min_age is that it doesn't really work
with the freeze map work in 9.6, which creates problems that I'm
trying to address by freezing early and so on. After all, HEAD (and
all stable branches) can easily set a page to all-visible (but not
all-frozen) in the VM, meaning that the page's tuples won't be
considered for freezing until the next aggressive VACUUM. This means
that vacuum_freeze_min_age is already frequently ignored by the
implementation -- it's conditioned on other things that are practically
impossible to predict.

Curious about your thoughts on this existing issue with
vacuum_freeze_min_age. I am concerned about the "freezing cliff" that
it creates.

So, let's see: if we see a page where the tuples are all-visible and
we seize the opportunity to freeze it, we can spare ourselves the need
to ever visit that page again (unless it gets modified). But if we
only mark it all-visible and leave the freezing for later, the next
aggressive vacuum will have to scan and dirty the page. I'm prepared
to believe that it's worth the cost of freezing the page in that
scenario. We've already dirtied the page and written some WAL and
maybe generated an FPW, so doing the rest of the work now rather than
saving it until later seems likely to be a win. I think it's OK to
behave, in this situation, as if vacuum_freeze_min_age=0.

There's another situation in which vacuum_freeze_min_age could apply,
though: suppose the page isn't all-visible yet. I'd argue that in that
case we don't want to run around freezing stuff unless it's quite old
- like older than vacuum_freeze_table_age, say. Because we know we're
going to have to revisit this page in the next vacuum anyway, and
expending effort to freeze tuples that may be about to be modified
again doesn't seem prudent. So, hmm, on further reflection, maybe it's
OK to remove vacuum_freeze_min_age. But if we do, then I think we had
better carefully distinguish between the case where the page can
thereby be marked all-frozen and the case where it cannot. I guess you
say the same, further down.
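The two cases distinguished above can be sketched as a single page-level decision. This is a hypothetical Python sketch of the policy being discussed, not the patch's code; the names and the plain age threshold are assumptions.

```python
def should_freeze_page(page_will_be_all_visible, oldest_xid_age,
                       backstop_age):
    """Decide whether to freeze all eligible tuples on a heap page.

    Case 1: the page is about to be marked all-visible anyway -- we've
    already dirtied it and written WAL, so finishing the job lets us
    mark it all-frozen and never revisit it.  Equivalent to behaving
    as if vacuum_freeze_min_age = 0 for this page.

    Case 2: the page can't be marked all-visible, so we'll revisit it
    next VACUUM regardless; only freeze if some XID is quite old."""
    if page_will_be_all_visible:
        return True
    return oldest_xid_age >= backstop_age

print(should_freeze_page(True, 0, 100_000_000))             # True
print(should_freeze_page(False, 10_000, 100_000_000))       # False
print(should_freeze_page(False, 150_000_000, 100_000_000))  # True
```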

So it's natural to decide whether or not
we're going to wait for cleanup locks on pages on the basis of how old
the XIDs they contain actually are.

I agree, but again, it's only a backstop. With the patch we'd have to
be rather unlucky to ever need to wait like this.

What are the chances that we keep failing to freeze an old XID from
one particular page, again and again? My testing indicates that it's a
negligible concern in practice (barring pathological cases with idle
cursors, etc).

I mean, those kinds of pathological cases happen *all the time*. Sure,
there are plenty of users who don't leave cursors open. But the ones
who do don't leave them around for short periods of time on randomly
selected pages of the table. They are disproportionately likely to
leave them on the same table pages over and over, just like data can't
in general be assumed to be uniformly accessed. And not uncommonly,
they leave them around until the snow melts.

And we need to worry about those kinds of users, actually much more
than we need to worry about users doing normal things. Honestly,
autovacuum on a system where things are mostly "normal" - no
long-running transactions, adequate resources for autovacuum to do its
job, reasonable configuration settings - isn't that bad. It's true
that there are people who get surprised by an aggressive autovacuum
kicking off unexpectedly, but it's usually the first one during the
cluster lifetime (which is typically the biggest, since the initial
load tends to be bigger than later ones) and it's usually annoying but
survivable. The places where autovacuum becomes incredibly frustrating
are the pathological cases. When insufficient resources are available
to complete the work in a timely fashion, or difficult trade-offs have
to be made, autovacuum is too dumb to make the right choices. And even
if you call your favorite PostgreSQL support provider and they provide
an expert, once it gets behind, autovacuum isn't very tractable: it
will insist on vacuuming everything, right now, in an order that it
chooses, and it's not going to listen to any nonsense from some
human being who thinks they might have some useful advice to provide!

But the "freeze early" heuristics work a bit like that anyway. We
won't freeze all the tuples on a whole heap page early if we won't
otherwise set the heap page to all-visible (not all-frozen) in the VM
anyway.

Hmm, I didn't realize that we had that. Is that an existing thing or
something new you're proposing to do? If existing, where is it?

IOW, the time that it takes to freeze that one tuple *in theory* might
be small. But in practice it may be very large, because we won't
necessarily get around to it on any meaningful time frame.

On second thought I agree that my specific example of 1.5 billion XIDs
was a little too optimistic of me. But 50 million XIDs (i.e. the
vacuum_freeze_min_age default) is too pessimistic. The important point
is that FreezeLimit could plausibly become nothing more than a
backstop mechanism, with the design from the patch series -- something
that typically has no effect on what tuples actually get frozen.

I agree that it's OK for this to become a purely backstop mechanism
... but again, I think that the design of such backstop mechanisms
should be done as carefully as we know how, because users seem to hit
the backstop all the time. We want it to be made of, you know, nylon
twine, rather than, say, sharp nails. :-)

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#18)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Fri, Jan 7, 2022 at 12:24 PM Robert Haas <robertmhaas@gmail.com> wrote:

This seems like a weak argument. Sure, you COULD hard-code the limit
to be autovacuum_freeze_max_age/2 rather than making it a separate
tunable, but I don't think it's better. I am generally very skeptical
about the idea of using the same GUC value for multiple purposes,
because it often turns out that the optimal value for one purpose is
different than the optimal value for some other purpose.

I thought I was being conservative by suggesting
autovacuum_freeze_max_age/2. My first thought was to teach VACUUM to
make its FreezeLimit "OldestXmin - autovacuum_freeze_max_age". To me
these two concepts really *are* the same thing: vacrel->FreezeLimit
becomes a backstop, just as anti-wraparound autovacuum (the
autovacuum_freeze_max_age cutoff) becomes a backstop.

Of course, an anti-wraparound VACUUM will do early freezing in the
same way as any other VACUUM will (with the patch series). So even
when the FreezeLimit backstop XID cutoff actually affects the behavior
of a given VACUUM operation, it may well not be the reason why most
individual tuples that we freeze get frozen. That is, most individual
heap pages will probably have tuples frozen for some other reason.
Though it depends on workload characteristics, most individual heap
pages will typically be frozen as a group, even here. This is a
logical consequence of the fact that tuple freezing and advancing
relfrozenxid are now only loosely coupled -- it's about as loose as
the current relfrozenxid invariant will allow.
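The invariant being relied on here is that relfrozenxid must not exceed the oldest unfrozen XID remaining in the table. A hypothetical Python sketch of the loose coupling (not the patch's code; names and the simplified XID model are assumptions): VACUUM tracks the oldest XID it leaves behind and sets relfrozenxid to exactly that, rather than to the FreezeLimit cutoff.

```python
def new_relfrozenxid(pages, freeze_limit, oldest_xmin):
    """pages: list of per-page lists of tuple xmin values (pre-freeze).

    Tuples older than freeze_limit get frozen, so they no longer
    constrain relfrozenxid; the result is the oldest XID that remains,
    bounded above by OldestXmin (nothing newer can need freezing)."""
    oldest_remaining = oldest_xmin
    for tuples in pages:
        for xmin in tuples:
            if xmin < freeze_limit:
                continue  # frozen during this VACUUM
            oldest_remaining = min(oldest_remaining, xmin)
    return oldest_remaining

pages = [[100, 5_000], [120_000, 90_000]]
print(new_relfrozenxid(pages, freeze_limit=50_000, oldest_xmin=200_000))
# 90000: freezing just two old tuples let relfrozenxid advance far
# past the freeze_limit cutoff of 50000
```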

I feel generally that a lot of the argument you're making here
supposes that tables are going to get vacuumed regularly.

I agree that
IF tables are being vacuumed on a regular basis, and if as part of
that we always push relfrozenxid forward as far as we can, we will
rarely have a situation where aggressive strategies to avoid
wraparound are required.

It's all relative. We hope that (with the patch) cases that only ever
get anti-wraparound VACUUMs are limited to tables where nothing else
drives VACUUM, for sensible reasons related to workload
characteristics (like the pgbench_accounts example upthread). It's
inevitable that some users will misconfigure the system, though -- no
question about that.

I don't see why users that misconfigure the system in this way should
be any worse off than they would be today. They probably won't do
substantially less freezing (usually somewhat more), and will advance
pg_class.relfrozenxid in exactly the same way as today (usually a bit
better, actually). What have I missed?

Admittedly the design of the "Freeze tuples early to advance
relfrozenxid" patch (i.e. v5-0005-*patch) is still unsettled; I need
to verify that my claims about it are really robust. But as far as I
know they are. Reviewers should certainly look at that with a critical
eye.

Now, I agree with you in part: I don't think it's obvious that it's
useful to tune vacuum_freeze_table_age.

That's definitely the easier argument to make. After all,
vacuum_freeze_table_age will do nothing unless VACUUM runs before the
anti-wraparound threshold (autovacuum_freeze_max_age) is reached. The
patch series should be strictly better than that. Primarily because
it's "continuous", and so isn't limited to cases where the table age
falls within the "vacuum_freeze_table_age - autovacuum_freeze_max_age"
goldilocks age range.

We should be VERY conservative about removing
existing settings if there's any chance that somebody could use them
to tune their way out of trouble.

I agree, I suppose, but right now I honestly can't think of a reason
why they would be useful.

If I am wrong about this then I'm probably also wrong about some basic
facet of the high-level design, in which case I should change course
altogether. In other words, removing the GUCs is not an incidental
thing. It's possible that I would never have pursued this project if I
didn't first notice how wrong-headed the GUCs are.

So, let's see: if we see a page where the tuples are all-visible and
we seize the opportunity to freeze it, we can spare ourselves the need
to ever visit that page again (unless it gets modified). But if we
only mark it all-visible and leave the freezing for later, the next
aggressive vacuum will have to scan and dirty the page. I'm prepared
to believe that it's worth the cost of freezing the page in that
scenario.

That's certainly the most compelling reason to perform early freezing.
It's not completely free of downsides, but it's pretty close.

There's another situation in which vacuum_freeze_min_age could apply,
though: suppose the page isn't all-visible yet. I'd argue that in that
case we don't want to run around freezing stuff unless it's quite old
- like older than vacuum_freeze_table_age, say. Because we know we're
going to have to revisit this page in the next vacuum anyway, and
expending effort to freeze tuples that may be about to be modified
again doesn't seem prudent. So, hmm, on further reflection, maybe it's
OK to remove vacuum_freeze_min_age. But if we do, then I think we had
better carefully distinguish between the case where the page can
thereby be marked all-frozen and the case where it cannot. I guess you
say the same, further down.

I do. Although v5-0005-*patch still freezes early when the page is
dirtied by pruning, I have my doubts about that particular "freeze
early" criterion. I believe that everything I just said about
misconfigured autovacuums doesn't rely on anything more than the "most
compelling scenario for early freezing" mechanism that arranges to
make us set the all-frozen bit (not just the all-visible bit).

I mean, those kinds of pathological cases happen *all the time*. Sure,
there are plenty of users who don't leave cursors open. But the ones
who do don't leave them around for short periods of time on randomly
selected pages of the table. They are disproportionately likely to
leave them on the same table pages over and over, just like data can't
in general be assumed to be uniformly accessed. And not uncommonly,
they leave them around until the snow melts.

And we need to worry about those kinds of users, actually much more
than we need to worry about users doing normal things.

I couldn't agree more. In fact, I was mostly thinking about how to
*help* these users. Insisting on waiting for a cleanup lock before it
becomes strictly necessary (when the table age is only 50
million/vacuum_freeze_min_age) is actually a big part of the problem
for these users. vacuum_freeze_min_age enforces a false dichotomy on
aggressive VACUUMs that just isn't helpful. Why should waiting on a
cleanup lock fix anything?

Even in the extreme case where we are guaranteed to eventually have a
wraparound failure in the end (due to an idle cursor in an
unsupervised database), the user is still much better off, I think. We
will have at least managed to advance relfrozenxid to the exact oldest
XID on the one heap page that somebody holds an idle cursor
(conflicting buffer pin) on. And we'll usually have frozen most of the
tuples that need to be frozen. Sure, the user may need to use
single-user mode to run a manual VACUUM, but at least this process
only needs to freeze approximately one tuple to get the system back
online again.

If the DBA notices the problem before the database starts to refuse to
allocate XIDs, then they'll have a much better chance of avoiding a
wraparound failure through simple intervention (like killing the
backend with the idle cursor). We can pay down 99.9% of the "freeze
debt" independently of this intractable problem of something holding
onto an idle cursor.

Honestly,
autovacuum on a system where things are mostly "normal" - no
long-running transactions, adequate resources for autovacuum to do its
job, reasonable configuration settings - isn't that bad.

Right. Autovacuum is "too big to fail".

But the "freeze early" heuristics work a bit like that anyway. We
won't freeze all the tuples on a whole heap page early if we won't
otherwise set the heap page to all-visible (not all-frozen) in the VM
anyway.

Hmm, I didn't realize that we had that. Is that an existing thing or
something new you're proposing to do? If existing, where is it?

It's part of v5-0005-*patch. Still in flux to some degree, because
it's necessary to balance a few things. That shouldn't undermine the
arguments I've made here.

I agree that it's OK for this to become a purely backstop mechanism
... but again, I think that the design of such backstop mechanisms
should be done as carefully as we know how, because users seem to hit
the backstop all the time. We want it to be made of, you know, nylon
twine, rather than, say, sharp nails. :-)

Absolutely. But if autovacuum can only ever run due to
age(relfrozenxid) reaching autovacuum_freeze_max_age, then I can't see
a downside.

Again, the v5-0005-*patch needs to meet the standard that I've laid
out. If it doesn't then I've messed up already.

--
Peter Geoghegan

#20Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#19)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Fri, Jan 7, 2022 at 5:20 PM Peter Geoghegan <pg@bowt.ie> wrote:

I thought I was being conservative by suggesting
autovacuum_freeze_max_age/2. My first thought was to teach VACUUM to
make its FreezeLimit "OldestXmin - autovacuum_freeze_max_age". To me
these two concepts really *are* the same thing: vacrel->FreezeLimit
becomes a backstop, just as anti-wraparound autovacuum (the
autovacuum_freeze_max_age cutoff) becomes a backstop.

I can't follow this. If the idea is that we're going to
opportunistically freeze a page whenever that allows us to mark it
all-visible, then the remaining question is what XID age we should use
to force freezing when that rule doesn't apply. It seems to me that
there is a rebuttable presumption that that case ought to work just as
it does today - and I think I hear you saying that it should NOT work
as it does today, but should use some other threshold. Yet I can't
understand why you think that.

I couldn't agree more. In fact, I was mostly thinking about how to
*help* these users. Insisting on waiting for a cleanup lock before it
becomes strictly necessary (when the table age is only 50
million/vacuum_freeze_min_age) is actually a big part of the problem
for these users. vacuum_freeze_min_age enforces a false dichotomy on
aggressive VACUUMs that just isn't helpful. Why should waiting on a
cleanup lock fix anything?

Because waiting on a lock means that we'll acquire it as soon as it's
available. If you repeatedly call your local Pizzeria Uno's and ask
whether there is a wait, and head to the restaurant only when the
answer is in the negative, you may never get there, because they may
be busy every time you call - especially if you always call around
lunch or dinner time. Even if you eventually get there, it may take
multiple days before you find a time when a table is immediately
available, whereas if you had just gone over there and stood in line,
you likely would have been seated in under an hour and savoring the
goodness of quality deep-dish pizza not too long thereafter. The same
principle applies here.

I do think that waiting for a cleanup lock when the age of the page is
only vacuum_freeze_min_age seems like it might be too aggressive, but
I don't think that's how it works. AFAICS, it's based on whether the
vacuum is marked as aggressive, which has to do with
vacuum_freeze_table_age, not vacuum_freeze_min_age. Let's turn the
question around: if the age of the oldest XID on the page is >150
million transactions and the buffer cleanup lock is not available now,
what makes you think that it's any more likely to be available when
the XID age reaches 200 million or 300 million or 700 million? There
is perhaps an argument for some kind of tunable that eventually shoots
the other session in the head (if we can identify it, anyway) but it
seems to me that regardless of what threshold we pick, polling is
strictly less likely to find a time when the page is available than
waiting for the cleanup lock. It has the counterbalancing advantage of
allowing the autovacuum worker to do other useful work in the meantime
and that is indeed a significant upside, but at some point you're
going to have to give up and admit that polling is a failed strategy,
and it's unclear why 150 million XIDs - or probably even 50 million
XIDs - isn't long enough to say that we're not getting the job done
with half measures.
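The wait-versus-poll distinction can be modeled compactly. In this toy Python sketch (not PostgreSQL's actual buffer-manager API), a threading.Lock stands in for the buffer cleanup lock: a non-aggressive VACUUM tries it once and moves on, while an aggressive VACUUM would block until it is granted, roughly as ConditionalLockBufferForCleanup versus LockBufferForCleanup do in the real code.

```python
import threading

def try_process_page(cleanup_lock, aggressive):
    """Returns True if the page was pruned/frozen, False if skipped."""
    if cleanup_lock.acquire(blocking=False):  # conditional attempt
        # ... prune and freeze the page here ...
        cleanup_lock.release()
        return True
    if aggressive:
        cleanup_lock.acquire()  # block until the pin holder goes away
        cleanup_lock.release()
        return True
    return False  # skipped; "polled" again by some future VACUUM

lock = threading.Lock()
print(try_process_page(lock, aggressive=False))  # True: lock was free

lock.acquire()  # simulate an idle cursor pinning the page
print(try_process_page(lock, aggressive=False))  # False: skip, no wait
```

The poll-forever failure mode described above corresponds to the False branch being taken on the same page in VACUUM after VACUUM.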

--
Robert Haas
EDB: http://www.enterprisedb.com
