Removing more vacuumlazy.c special cases, relfrozenxid optimizations

Started by Peter Geoghegan, over 4 years ago, 139 messages, pgsql-hackers
Attached WIP patch series significantly simplifies the definition of
scanned_pages inside vacuumlazy.c. Apart from making several very
tricky things a lot simpler, and moving more complex code outside of
the big "blkno" loop inside lazy_scan_heap (building on the Postgres
14 work), this refactoring directly facilitates 2 new optimizations
(also in the patch):

1. We now collect LP_DEAD items into the dead_tuples array for all
scanned pages -- even when we cannot get a cleanup lock.

2. We now don't give up on advancing relfrozenxid during a
non-aggressive VACUUM when we happen to be unable to get a cleanup
lock on a heap page.

Both optimizations are much more natural with the refactoring in
place. Especially #2, which can be thought of as making aggressive and
non-aggressive VACUUM behave similarly. Sure, we shouldn't wait for a
cleanup lock in a non-aggressive VACUUM (by definition) -- and we
still don't in the patch (obviously). But why wouldn't we at least
*check* if the page has tuples that need to be frozen in order for us
to advance relfrozenxid? Why give up on advancing relfrozenxid in a
non-aggressive VACUUM when there's no good reason to?

See the draft commit messages from the patch series for many more
details on the simplifications I am proposing.

I'm not sure how much value the second optimization has on its own.
But I am sure that the general idea of teaching non-aggressive VACUUM
to be conscious of the value of advancing relfrozenxid is a good one
-- and so #2 is a good start on that work, at least. I've discussed
this idea with Andres (CC'd) a few times before now. Maybe we'll need
another patch that makes VACUUM avoid setting heap pages to
all-visible without also setting them to all-frozen (and freezing as
necessary) in order to really get a benefit. Since, of course, a
non-aggressive VACUUM still won't be able to advance relfrozenxid when
it skipped over all-visible pages that are not also known to be
all-frozen.

Masahiko (CC'd) has expressed interest in working on opportunistic
freezing. This refactoring patch seems related to that general area,
too. At a high level, to me, this seems like the tuple freezing
equivalent of the Postgres 14 work on bypassing index vacuuming when
there are very few LP_DEAD items (interpret that as 0 LP_DEAD items,
which is close to the truth anyway). There are probably quite a few
interesting opportunities to make VACUUM better by not having such a
sharp distinction between aggressive and non-aggressive VACUUM. Why
should they be so different? A good medium term goal might be to
completely eliminate aggressive VACUUMs.

I have heard many stories about anti-wraparound/aggressive VACUUMs
where the cure (which suddenly made autovacuum workers
non-cancellable) was worse than the disease (not actually much danger
of wraparound failure). For example:

https://www.joyent.com/blog/manta-postmortem-7-27-2015

Yes, this problem report is from 2015, which is before we even had the
freeze map stuff. I still think that the point about aggressive
VACUUMs blocking DDL (leading to chaos) remains valid.

There is another interesting area of future optimization within
VACUUM, that also seems relevant to this patch: the general idea of
*avoiding* pruning during VACUUM, when it just doesn't make sense to
do so -- better to avoid dirtying the page for now. Needlessly pruning
inside lazy_scan_prune is hardly rare -- standard pgbench (maybe only
with heap fill factor reduced to 95) will have autovacuums that
*constantly* do it (granted, it may not matter so much there because
VACUUM is unlikely to re-dirty the page anyway). This patch seems
relevant to that area because it recognizes that pruning during VACUUM
is not necessarily special -- a new function called lazy_scan_noprune
may be used instead of lazy_scan_prune (though only when a cleanup
lock cannot be acquired). These pages are nevertheless considered
fully processed by VACUUM (this is perhaps 99% true, so it seems
reasonable to round up to 100% true).

I find it easy to imagine generalizing the same basic idea --
recognizing more ways in which pruning by VACUUM isn't necessarily
better than opportunistic pruning, at the level of each heap page. Of
course we *need* to prune sometimes (e.g., might be necessary to do so
to set the page all-visible in the visibility map), but why bother
when we don't, and when there is no reason to think that it'll help
anyway? Something to think about, at least.

--
Peter Geoghegan

Attachments:

v1-0002-Improve-log_autovacuum_min_duration-output.patch (+47 -9)
v1-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch (+500 -302)
#2 Andres Freund <andres@anarazel.de>
In reply to: Peter Geoghegan (#1)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

Hi,

On 2021-11-21 18:13:51 -0800, Peter Geoghegan wrote:

> I have heard many stories about anti-wraparound/aggressive VACUUMs
> where the cure (which suddenly made autovacuum workers
> non-cancellable) was worse than the disease (not actually much danger
> of wraparound failure). For example:
>
> https://www.joyent.com/blog/manta-postmortem-7-27-2015
>
> Yes, this problem report is from 2015, which is before we even had the
> freeze map stuff. I still think that the point about aggressive
> VACUUMs blocking DDL (leading to chaos) remains valid.

As I noted below, I think this is a bit of a separate issue than what your
changes address in this patch.

> There is another interesting area of future optimization within
> VACUUM, that also seems relevant to this patch: the general idea of
> *avoiding* pruning during VACUUM, when it just doesn't make sense to
> do so -- better to avoid dirtying the page for now. Needlessly pruning
> inside lazy_scan_prune is hardly rare -- standard pgbench (maybe only
> with heap fill factor reduced to 95) will have autovacuums that
> *constantly* do it (granted, it may not matter so much there because
> VACUUM is unlikely to re-dirty the page anyway).

Hm. I'm a bit doubtful that there's all that many cases where it's worth not
pruning during vacuum. However, it seems much more common for opportunistic
pruning during non-write accesses.

Perhaps checking whether we'd log an FPW would be a better criterion for
deciding whether to prune or not compared to whether we're dirtying the page?
IME the WAL volume impact of FPWs is a considerably bigger deal than
unnecessarily dirtying a page that has previously been dirtied in the same
checkpoint "cycle".

> This patch seems relevant to that area because it recognizes that pruning
> during VACUUM is not necessarily special -- a new function called
> lazy_scan_noprune may be used instead of lazy_scan_prune (though only when a
> cleanup lock cannot be acquired). These pages are nevertheless considered
> fully processed by VACUUM (this is perhaps 99% true, so it seems reasonable
> to round up to 100% true).

IDK, the potential of not having usable space on an overly fragmented page
doesn't seem that low. We can't just mark such pages as all-visible because
then we'll potentially never reclaim that space.

> Since any VACUUM (not just an aggressive VACUUM) can sometimes advance
> relfrozenxid, we now make non-aggressive VACUUMs work just a little
> harder in order to make that desirable outcome more likely in practice.
> Aggressive VACUUMs have long checked contended pages with only a shared
> lock, to avoid needlessly waiting on a cleanup lock (in the common case
> where the contended page has no tuples that need to be frozen anyway).
> We still don't make non-aggressive VACUUMs wait for a cleanup lock, of
> course -- if we did that they'd no longer be non-aggressive.

IMO the big difference between aggressive / non-aggressive isn't whether we
wait for a cleanup lock, but that we don't skip all-visible pages...

> But we now make the non-aggressive case notice that a failure to acquire a
> cleanup lock on one particular heap page does not in itself make it unsafe
> to advance relfrozenxid for the whole relation (which is what we usually see
> in the aggressive case already).

> This new relfrozenxid optimization might not be all that valuable on its
> own, but it may still facilitate future work that makes non-aggressive
> VACUUMs more conscious of the benefit of advancing relfrozenxid sooner
> rather than later. In general it would be useful for non-aggressive
> VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for
> a cleanup lock once or twice if needed).

What do you mean by "waiting once or twice"? A single wait may simply never
end on a busy page that's constantly pinned by a lot of backends...

> It would also be generally useful if aggressive VACUUMs were "less
> aggressive" opportunistically (e.g. by being responsive to query
> cancellations when the risk of wraparound failure is still very low).

Being cancelable is already a different concept than anti-wraparound
vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but
anti-wrap only at autovacuum_freeze_max_age. The problem is that the
autovacuum scheduling is way too naive for that to be a significant benefit -
nothing tries to schedule autovacuums so that they have a chance to complete
before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.

This is one of the most embarrassing issues around the whole anti-wrap
topic. We kind of define it as an emergency that there's an anti-wraparound
vacuum. But we have *absolutely no mechanism* to prevent them from occurring.

> We now also collect LP_DEAD items in the dead_tuples array in the case
> where we cannot immediately get a cleanup lock on the buffer. We cannot
> prune without a cleanup lock, but opportunistic pruning may well have
> left some LP_DEAD items behind in the past -- no reason to miss those.

This has become *much* more important with the changes around deciding when to
index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
items, it's that a previous vacuum is quite likely to have left them there,
because the previous vacuum decided not to perform index cleanup.

> Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
> technique is independently capable of cleaning up line pointer bloat),

One thing we could do around this, btw, would be to aggressively replace
LP_REDIRECT items with their target item. We can't do that in all situations
(somebody might be following a ctid chain), but I think we have all the
information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
state or something like that.

I think that'd be quite a win - we right now often "migrate" to other pages
for modifications not because we're out of space on a page, but because we run
out of itemids (for debatable reasons MaxHeapTuplesPerPage constrains the
number of line pointers, not just the number of actual tuples). Effectively
doubling the number of available line pointers in a number of
realistic / common scenarios would be quite the win.

> Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
> since there is barely any real practical sense in which we actually
> miss doing useful work for these pages. Besides, this information
> always seemed to have little practical value, even to Postgres hackers.

-0.5. I think it provides some value, and I don't see why the removal of the
information should be tied to this change. It's hard to diagnose why some dead
tuples aren't cleaned up - a common cause for that on smaller tables is that
nearly all pages are pinned nearly all the time.

I wonder if we could have a more restrained version of heap_page_prune() that
doesn't require a cleanup lock? Obviously we couldn't defragment the page, but
it's not immediately obvious that we need it if we constrain ourselves to only
modify tuple versions that cannot be visible to anybody.

Random note: I really dislike that we talk about cleanup locks in some parts
of the code, and super-exclusive locks in others :(.

+	/*
+	 * Aggressive VACUUM (which is the same thing as anti-wraparound
+	 * autovacuum for most practical purposes) exists so that we'll reliably
+	 * advance relfrozenxid and relminmxid sooner or later.  But we can often
+	 * opportunistically advance them even in a non-aggressive VACUUM.
+	 * Consider if that's possible now.

I don't agree with the "most practical purposes" bit. There's a huge
difference because manual VACUUMs end up aggressive but not anti-wrap once
older than vacuum_freeze_table_age.

+	 * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+	 * the rel_pages used by lazy_scan_prune, from before a possible relation
+	 * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
+	 */

I think it should be doable to add an isolation test for this path. There have
been quite a few bugs around the wider topic...

+	if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
+		!vacrel->freeze_cutoffs_valid)
+	{
+		/* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
+		Assert(!aggressive);
+		vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+							new_rel_allvisible, vacrel->nindexes > 0,
+							InvalidTransactionId, InvalidMultiXactId, false);
+	}
+	else
+	{
+		/* Can safely advance relfrozen and relminmxid, too */
+		Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+			   orig_rel_pages);
+		vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+							new_rel_allvisible, vacrel->nindexes > 0,
+							FreezeLimit, MultiXactCutoff, false);
+	}

I wonder if this whole logic wouldn't become easier and less fragile if we
just went for maintaining the "actually observed" horizon while scanning the
relation. If we skip a page via VM set the horizon to invalid. Otherwise we
can keep track of the accurate horizon and use that. No need to count pages
and stuff.

@@ -1050,18 +1046,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
bool all_visible_according_to_vm = false;
LVPagePruneState prunestate;

- /*
- * Consider need to skip blocks. See note above about forcing
- * scanning of last page.
- */
-#define FORCE_CHECK_PAGE() \
- (blkno == nblocks - 1 && should_attempt_truncation(vacrel))
-
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);

update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
blkno, InvalidOffsetNumber);

+		/*
+		 * Consider need to skip blocks
+		 */
if (blkno == next_unskippable_block)
{
/* Time to advance next_unskippable_block */
@@ -1110,13 +1102,19 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
else
{
/*
-			 * The current block is potentially skippable; if we've seen a
-			 * long enough run of skippable blocks to justify skipping it, and
-			 * we're not forced to check it, then go ahead and skip.
-			 * Otherwise, the page must be at least all-visible if not
-			 * all-frozen, so we can set all_visible_according_to_vm = true.
+			 * The current block can be skipped if we've seen a long enough
+			 * run of skippable blocks to justify skipping it.
+			 *
+			 * There is an exception: we will scan the table's last page to
+			 * determine whether it has tuples or not, even if it would
+			 * otherwise be skipped (unless it's clearly not worth trying to
+			 * truncate the table).  This avoids having lazy_truncate_heap()
+			 * take access-exclusive lock on the table to attempt a truncation
+			 * that just fails immediately because there are tuples in the
+			 * last page.
*/
-			if (skipping_blocks && !FORCE_CHECK_PAGE())
+			if (skipping_blocks &&
+				!(blkno == nblocks - 1 && should_attempt_truncation(vacrel)))
{
/*
* Tricky, tricky.  If this is in aggressive vacuum, the page

I find the FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like
mixing such changes within a larger change doing many other things.

@@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)

buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vacrel->bstrategy);
+		page = BufferGetPage(buf);
+		vacrel->scanned_pages++;

I don't particularly like doing BufferGetPage() before holding a lock on the
page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
be good to have a crosscheck that BufferGetPage() is only allowed when holding
a page level lock.

/*
-		 * We need buffer cleanup lock so that we can prune HOT chains and
-		 * defragment the page.
+		 * We need a buffer cleanup lock to prune HOT chains and defragment
+		 * the page in lazy_scan_prune.  But when it's not possible to acquire
+		 * a cleanup lock right away, we may be able to settle for reduced
+		 * processing in lazy_scan_noprune.
*/

s/in lazy_scan_noprune/via lazy_scan_noprune/?

if (!ConditionalLockBufferForCleanup(buf))
{
bool hastup;

-			/*
-			 * If we're not performing an aggressive scan to guard against XID
-			 * wraparound, and we don't want to forcibly check the page, then
-			 * it's OK to skip vacuuming pages we get a lock conflict on. They
-			 * will be dealt with in some future vacuum.
-			 */
-			if (!aggressive && !FORCE_CHECK_PAGE())
+			LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+			/* Check for new or empty pages before lazy_scan_noprune call */
+			if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
+									   vmbuffer))
{
-				ReleaseBuffer(buf);
-				vacrel->pinskipped_pages++;
+				/* Lock and pin released for us */
+				continue;
+			}

Why isn't this done in lazy_scan_noprune()?

+			if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
+			{
+				/* No need to wait for cleanup lock for this page */
+				UnlockReleaseBuffer(buf);
+				if (hastup)
+					vacrel->nonempty_pages = blkno + 1;
continue;
}

Do we really need all of buf, blkno, page for both of these functions? Quite
possible that yes; if so, could we add an assertion that
BufferGetBlockNumber(buf) == blkno?

+		/* Check for new or empty pages before lazy_scan_prune call */
+		if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
{

Maybe worth a note mentioning that we need to redo this even in the aggressive
case, because we didn't continually hold a lock on the page?

+/*
+ * Empty pages are not really a special case -- they're just heap pages that
+ * have no allocated tuples (including even LP_UNUSED items).  You might
+ * wonder why we need to handle them here all the same.  It's only necessary
+ * because of a rare corner-case involving a hard crash during heap relation
+ * extension.  If we ever make relation-extension crash safe, then it should
+ * no longer be necessary to deal with empty pages here (or new pages, for
+ * that matter).

I don't think it's actually that rare - the window for this is huge. You just
need to crash / immediate shutdown at any time between the relation having
been extended and the new page contents being written out (checkpoint or
buffer replacement / ring writeout). That's often many minutes.

I don't really see that as a realistic thing to ever reliably avoid, FWIW. I
think the overhead would be prohibitive. We'd need to do synchronous WAL
logging while holding the extension lock I think. Um, not fun.

+ * Caller can either hold a buffer cleanup lock on the buffer, or a simple
+ * shared lock.
+ */

Kinda sounds like it'd be incorrect to call this with an exclusive lock, which
made me wonder why that could be true. Perhaps just say that it needs to be
called with at least a shared lock?

+static bool
+lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
+					   Page page, bool sharelock, Buffer vmbuffer)

It'd be good to document the return value - for me it's not a case where it's
so obvious that it's not worth it.

+/*
+ *	lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
+ *
+ * Caller need only hold a pin and share lock on the buffer, unlike
+ * lazy_scan_prune, which requires a full cleanup lock.

I'd add somethign like "returns whether a cleanup lock is required". Having to
read multiple paragraphs to understand the basic meaning of the return value
isn't great.

+		if (ItemIdIsRedirected(itemid))
+		{
+			*hastup = true;		/* page won't be truncatable */
+			continue;
+		}

It's not really new, but this comment is now a bit confusing, because it can
be understood to be about PageTruncateLinePointerArray().

+			case HEAPTUPLE_DEAD:
+			case HEAPTUPLE_RECENTLY_DEAD:
+
+				/*
+				 * We count DEAD and RECENTLY_DEAD tuples in new_dead_tuples.
+				 *
+				 * lazy_scan_prune only does this for RECENTLY_DEAD tuples,
+				 * and never has to deal with DEAD tuples directly (they
+				 * reliably become LP_DEAD items through pruning).  Our
+				 * approach to DEAD tuples is a bit arbitrary, but it seems
+				 * better than totally ignoring them.
+				 */
+				new_dead_tuples++;
+				break;

Why does it make sense to track DEAD tuples this way? Isn't that going to lead
to counting them over-and-over again? I think it's quite misleading to include
them in "dead but not yet removable".

+	/*
+	 * Now save details of the LP_DEAD items from the page in the dead_tuples
+	 * array iff VACUUM uses two-pass strategy case
+	 */

Do we really need to have separate code for this in lazy_scan_prune() and
lazy_scan_noprune()?

+	}
+	else
+	{
+		/*
+		 * We opt to skip FSM processing for the page on the grounds that it
+		 * is probably being modified by concurrent DML operations.  Seems
+		 * best to assume that the space is best left behind for future
+		 * updates of existing tuples.  This matches what opportunistic
+		 * pruning does.

Why can we assume that there concurrent DML rather than concurrent read-only
operations? IME it's much more common for read-only operations to block
cleanup locks than read-write ones (partially because the frequency makes it
easier, partially because cursors allow long-held pins, partially because the
EXCLUSIVE lock of a r/w operation wouldn't let us get here)

I think this is a change mostly in the right direction. But as formulated this
commit does *WAY* too much at once.

Greetings,

Andres Freund

#3 Peter Geoghegan
In reply to: Andres Freund (#2)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Mon, Nov 22, 2021 at 11:29 AM Andres Freund <andres@anarazel.de> wrote:

> Hm. I'm a bit doubtful that there's all that many cases where it's worth not
> pruning during vacuum. However, it seems much more common for opportunistic
> pruning during non-write accesses.

Fair enough. I just wanted to suggest an exploratory conversation
about pruning (among several other things). I'm mostly saying: hey,
pruning during VACUUM isn't actually that special, at least not with
this refactoring patch in place. So maybe it makes sense to go
further, in light of that general observation about pruning in VACUUM.

Maybe it wasn't useful to even mention this aspect now. I would rather
focus on freezing optimizations for now -- that's much more promising.

> Perhaps checking whether we'd log an FPW would be a better criterion for
> deciding whether to prune or not compared to whether we're dirtying the page?
> IME the WAL volume impact of FPWs is a considerably bigger deal than
> unnecessarily dirtying a page that has previously been dirtied in the same
> checkpoint "cycle".

Agreed. (I tend to say the former when I really mean the latter, which
I should try to avoid.)

> IDK, the potential of not having usable space on an overly fragmented page
> doesn't seem that low. We can't just mark such pages as all-visible because
> then we'll potentially never reclaim that space.

Don't get me started on this - because I'll never stop.

It makes zero sense that we don't think about free space holistically,
using the whole context of what changed in the recent past. As I think
you know already, a higher level concept (like open and closed pages)
seems like the right direction to me -- because it isn't sensible to
treat X bytes of free space in one heap page as essentially
interchangeable with any other space on any other heap page. That
misses an enormous amount of things that matter. The all-visible
status of a page is just one such thing.

> IMO the big difference between aggressive / non-aggressive isn't whether we
> wait for a cleanup lock, but that we don't skip all-visible pages...

I know what you mean by that, of course. But FWIW that definition
seems too focused on what actually happens today, rather than what is
essential given the invariants we have for VACUUM. And so I personally
prefer to define it as "a VACUUM that *reliably* advances
relfrozenxid". This looser definition will probably "age" well (ahem).

> > This new relfrozenxid optimization might not be all that valuable on its
> > own, but it may still facilitate future work that makes non-aggressive
> > VACUUMs more conscious of the benefit of advancing relfrozenxid sooner
> > rather than later. In general it would be useful for non-aggressive
> > VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for
> > a cleanup lock once or twice if needed).
>
> What do you mean by "waiting once or twice"? A single wait may simply never
> end on a busy page that's constantly pinned by a lot of backends...

I was speculating about future work again. I think that you've taken
my words too literally. This is just a draft commit message, just a
way of framing what I'm really trying to do.

Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a
non-aggressive VACUUM -- so "at least waiting for one or two pins
during non-aggressive VACUUM" might not have been the best way of
expressing the idea that I wanted to express. The important point is
that _we can make a choice_ about stuff like this dynamically, based
on the observed characteristics of the table, and some general ideas
about the costs and benefits (of waiting or not waiting, or of how
long we want to wait in total, whatever might be important). This
probably just means adding some heuristics that are pretty sensitive
to any reason to not do more work in a non-aggressive VACUUM, without
*completely* balking at doing even a tiny bit more work.

For example, we can definitely afford to wait a few more milliseconds
to get a cleanup lock just once, especially if we're already pretty
sure that that's all the extra work that it would take to ultimately
be able to advance relfrozenxid in the ongoing (non-aggressive) VACUUM
-- it's easy to make that case. Once you agree that it makes sense
under these favorable circumstances, you've already made
"aggressiveness" a continuous thing conceptually, at a high level.

The current binary definition of "aggressive" is needlessly
restrictive -- that much seems clear to me. I'm much less sure of what
specific alternative should replace it.

I've already prototyped advancing relfrozenxid using a dynamically
determined value, so that our final relfrozenxid is just about the
most recent safe value (not the original FreezeLimit). That's been
interesting. Consider this log output from an autovacuum with the
prototype patch (also uses my new instrumentation), based on standard
pgbench (just tuned heap fill factor a bit):

LOG: automatic vacuum of table "regression.public.pgbench_accounts":
index scans: 0
pages: 0 removed, 909091 remain, 33559 skipped using visibility map
(3.69% of total)
tuples: 297113 removed, 50090880 remain, 90880 are dead but not yet removable
removal cutoff: oldest xmin was 29296744, which is now 203341 xact IDs behind
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
I/O timings: read: 55.574 ms, write: 0.000 ms
avg read rate: 17.805 MB/s, avg write rate: 4.389 MB/s
buffer usage: 1728273 hits, 23150 misses, 5706 dirtied
WAL usage: 594211 records, 0 full page images, 35065032 bytes
system usage: CPU: user: 6.85 s, system: 0.08 s, elapsed: 10.15 s

All of the autovacuums against the accounts table look similar to this
one -- you don't see anything about relfrozenxid being advanced
(because it isn't). Whereas for the smaller pgbench tables, every
single VACUUM successfully advances relfrozenxid to a fairly recent
XID (without there ever being an aggressive VACUUM) -- just because
VACUUM needs to visit every page for the smaller tables. The accounts
table, meanwhile, doesn't generally need to have 100% of all pages
touched by VACUUM -- it's more like 95% there. Does that really make
sense, though?

I'm pretty sure that less aggressive VACUUMing (e.g. higher
scale_factor setting) would lead to more aggressive setting of
relfrozenxid here. I'm always suspicious when I see insignificant
differences that lead to significant behavioral differences. Am I
worried over nothing here? Perhaps -- we don't really need to advance
relfrozenxid early with this table/workload anyway. But I'm not so
sure.

Again, my point is that there is a good chance that redefining
aggressiveness in some way will be helpful. A more creative, flexible
definition might be just what we need. The details are very much up in
the air, though.

> > It would also be generally useful if aggressive VACUUMs were "less
> > aggressive" opportunistically (e.g. by being responsive to query
> > cancellations when the risk of wraparound failure is still very low).
>
> Being cancelable is already a different concept than anti-wraparound
> vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but
> anti-wrap only at autovacuum_freeze_max_age.

You know what I meant. Also, did *you* mean "being cancelable is
already a different concept to *aggressive* vacuums"? :-)

> The problem is that the
> autovacuum scheduling is way too naive for that to be a significant benefit -
> nothing tries to schedule autovacuums so that they have a chance to complete
> before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
> to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.

Not sure what you mean about scheduling, since vacuum_freeze_table_age
is only in place to make overnight (off hours low activity scripted
VACUUMs) freeze tuples before any autovacuum worker gets the chance
(since the latter may run at a much less convenient time). Sure,
vacuum_freeze_table_age might also force a regular autovacuum worker
to do an aggressive VACUUM -- but I think it's mostly intended for a
manual overnight VACUUM. Not usually very helpful, but also not
harmful.

Oh, wait. I think that you're talking about how autovacuum workers in
particular tend to be affected by this. We launch an av worker that
wants to clean up bloat, but it ends up being aggressive (and maybe
taking way longer), perhaps quite randomly, only due to
vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is
that it?

> This is one of the most embarrassing issues around the whole anti-wrap
> topic. We kind of define it as an emergency that there's an anti-wraparound
> vacuum. But we have *absolutely no mechanism* to prevent them from occurring.

What do you mean? Only an autovacuum worker can do an anti-wraparound
VACUUM (which is not quite the same thing as an aggressive VACUUM).

I agree that anti-wraparound autovacuum is way too unfriendly, though.

> > We now also collect LP_DEAD items in the dead_tuples array in the case
> > where we cannot immediately get a cleanup lock on the buffer. We cannot
> > prune without a cleanup lock, but opportunistic pruning may well have
> > left some LP_DEAD items behind in the past -- no reason to miss those.
>
> This has become *much* more important with the changes around deciding when to
> index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
> items, it's that a previous vacuum is quite likely to have left them there,
> because the previous vacuum decided not to perform index cleanup.

I haven't seen any evidence of that myself (with the optimization
added to Postgres 14 by commit 5100010ee4). I still don't understand
why you doubted that work so much. I'm not saying that you're wrong
to; I'm saying that I don't think that I understand your perspective
on it.

What I have seen in my own tests (particularly with BenchmarkSQL) is
that most individual tables either never apply the optimization even
once (because the table reliably has heap pages with many more LP_DEAD
items than the 2%-of-relpages threshold), or will never need to
(because there are precisely zero LP_DEAD items anyway). Remaining
tables that *might* use the optimization tend to not go very long
without actually getting a round of index vacuuming. It's just too
easy for updates (and even aborted xact inserts) to introduce new
LP_DEAD items for us to go long without doing index vacuuming.

If you can be more concrete about a problem you've seen, then I might
be able to help. It's not like there are no options in this already. I
already thought about introducing a small degree of randomness into
the process of deciding to skip or to not skip (in the
consider_bypass_optimization path of lazy_vacuum() on Postgres 14).
The optimization is mostly valuable because it allows us to do more
useful work in VACUUM -- not because it allows us to do less useless
work in VACUUM. In particular, it allows us to tune
autovacuum_vacuum_insert_scale_factor very aggressively with an
append-only table, without useless index vacuuming making it all but
impossible for autovacuum to get to the useful work.
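For reference, the shape of that bypass heuristic (a paraphrase of the lazy_vacuum() decision added by commit 5100010ee4, with approximate names and a caller-supplied cap standing in for the real 32MB TID-array limit -- not the exact code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 2%-of-relpages threshold discussed above */
#define BYPASS_THRESHOLD_PAGES 0.02

/*
 * Sketch of the Postgres 14 "bypass index vacuuming" decision: skip the
 * index pass only when very few heap pages have LP_DEAD items, and the
 * total number of dead items is small enough that leaving them for a
 * future VACUUM to remember is cheap.
 */
static bool
bypass_index_vacuuming(int64_t rel_pages, int64_t lpdead_item_pages,
                       int64_t lpdead_items, int64_t max_bypass_items)
{
    double threshold = (double) rel_pages * BYPASS_THRESHOLD_PAGES;

    return lpdead_item_pages < threshold &&
           lpdead_items < max_bypass_items;
}
```

Both conditions matter: the page-count test bounds future line pointer bloat, while the item-count test bounds how much dead-TID state gets deferred.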

Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),

One thing we could do around this, btw, would be to aggressively replace
LP_REDIRECT items with their target item. We can't do that in all situations
(somebody might be following a ctid chain), but I think we have all the
information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
state or something like that.

Another idea is to truncate the line pointer during pruning (including
opportunistic pruning). Matthias van de Meent has a patch for that.

I am not aware of a specific workload where the patch helps, but that
doesn't mean that there isn't one, or that it doesn't matter. It's
subtle enough that I might have just missed something. I *expect* the
true damage over time to be very hard to model or understand -- I
imagine the potential for weird feedback loops is there.

I think that'd be quite a win - we right now often "migrate" to other pages
for modifications not because we're out of space on a page, but because we run
out of itemids (for debatable reasons MaxHeapTuplesPerPage constrains the
number of line pointers, not just the number of actual tuples). Effectively
doubling the number of available line pointers in a number of
realistic / common scenarios would be quite the win.

I believe Masahiko is working on this in the current cycle. It would
be easier if we had a better sense of how increasing
MaxHeapTuplesPerPage will affect tidbitmap.c. But the idea of
increasing that seems sound to me.
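For context, the cap in question comes out of simple arithmetic (a standalone sketch of the MaxHeapTuplesPerPage definition from access/htup_details.h, assuming the default 8kB block size and 8-byte MAXALIGN; the macro names here are local stand-ins):

```c
#include <assert.h>

#define BLCKSZ 8192
#define SIZE_OF_PAGE_HEADER 24       /* SizeOfPageHeaderData */
#define SIZEOF_HEAP_TUPLE_HEADER 23  /* SizeofHeapTupleHeader */
#define SIZEOF_ITEM_ID 4             /* sizeof(ItemIdData) */
#define MY_MAXALIGN(x) (((x) + 7) & ~7)

/*
 * MaxHeapTuplesPerPage assumes every tuple needs a line pointer plus a
 * MAXALIGN'd minimal tuple header -- which is why it also bounds the
 * number of line pointers, even though LP_DEAD/LP_REDIRECT items take
 * only 4 bytes each.
 */
static int
max_heap_tuples_per_page(void)
{
    return (BLCKSZ - SIZE_OF_PAGE_HEADER) /
           (MY_MAXALIGN(SIZEOF_HEAP_TUPLE_HEADER) + SIZEOF_ITEM_ID);
}
```

With the default 8kB pages this works out to 291, even though a page full of nothing but dead line pointers could physically hold many more.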

Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
since there is barely any real practical sense in which we actually
miss doing useful work for these pages. Besides, this information
always seemed to have little practical value, even to Postgres hackers.

-0.5. I think it provides some value, and I don't see why the removal of the
information should be tied to this change. It's hard to diagnose why some dead
tuples aren't cleaned up - a common cause for that on smaller tables is that
nearly all pages are pinned nearly all the time.

Is that still true, though? If it turns out that we need to leave it
in, then I can do that. But I'd prefer to wait until we have more
information before making a final decision. Remember, the high level
idea of this whole patch is that we do as much work as possible for
any scanned_pages, which now includes pages that we never successfully
acquired a cleanup lock on. And so we're justified in assuming that
they're exactly equivalent to pages that we did get a cleanup on --
that's now the working assumption. I know that that's not literally
true, but that doesn't mean it's not a useful fiction -- it should be
very close to the truth.

Also, I would like to put more information (much more useful
information) in the same log output. Perhaps that will be less
controversial if I take something useless away first.

I wonder if we could have a more restrained version of heap_page_prune() that
doesn't require a cleanup lock? Obviously we couldn't defragment the page, but
it's not immediately obvious that we need it if we constrain ourselves to only
modify tuple versions that cannot be visible to anybody.

Random note: I really dislike that we talk about cleanup locks in some parts
of the code, and super-exclusive locks in others :(.

Somebody should normalize that.
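For what it's worth, the two terms name the same condition. A toy model (this is not the real BufferDesc/LWLock machinery, just an illustration of the invariant):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A "cleanup lock" (a.k.a. super-exclusive lock) is an exclusive
 * content lock held while the caller's pin is the only pin on the
 * buffer.  That guarantees nobody else can be mid-scan of the page's
 * line pointer array, which is what makes defragmentation safe.
 */
typedef struct ToyBuffer
{
    int  pin_count;          /* total pins across all backends */
    bool exclusively_locked; /* content lock held exclusively */
} ToyBuffer;

static bool
have_cleanup_lock(const ToyBuffer *buf)
{
    return buf->exclusively_locked && buf->pin_count == 1;
}
```

This is why a concurrent pin (e.g. a cursor parked on the page) is enough to defeat pruning: the exclusive lock alone doesn't suffice.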

+     /*
+      * Aggressive VACUUM (which is the same thing as anti-wraparound
+      * autovacuum for most practical purposes) exists so that we'll reliably
+      * advance relfrozenxid and relminmxid sooner or later.  But we can often
+      * opportunistically advance them even in a non-aggressive VACUUM.
+      * Consider if that's possible now.

I don't agree with the "most practical purposes" bit. There's a huge
difference because manual VACUUMs end up aggressive but not anti-wrap once
older than vacuum_freeze_table_age.

Okay.

+      * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+      * the rel_pages used by lazy_scan_prune, from before a possible relation
+      * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
+      */

I think it should be doable to add an isolation test for this path. There have
been quite a few bugs around the wider topic...

I would argue that we already have one -- vacuum-reltuples.spec. I had
to update its expected output in the patch. I would argue that the
behavioral change (count tuples on a pinned-by-cursor heap page) that
necessitated updating the expected output for the test is an
improvement overall.

+     {
+             /* Can safely advance relfrozenxid and relminmxid, too */
+             Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+                        orig_rel_pages);
+             vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+                                                     new_rel_allvisible, vacrel->nindexes > 0,
+                                                     FreezeLimit, MultiXactCutoff, false);
+     }

I wonder if this whole logic wouldn't become easier and less fragile if we
just went for maintaining the "actually observed" horizon while scanning the
relation. If we skip a page via VM set the horizon to invalid. Otherwise we
can keep track of the accurate horizon and use that. No need to count pages
and stuff.

There is no question that that makes sense as an optimization -- my
prototype convinced me of that already. But I don't think that it can
simplify anything (not even the call to vac_update_relstats itself, to
actually update relfrozenxid at the end). Fundamentally, this will
only work if we decide to only skip all-frozen pages, which (by
definition) only happens within aggressive VACUUMs. Isn't it that
simple?

You recently said (on the heap-pruning-14-bug thread) that you don't
think it would be practical to always set a page all-frozen when we
see that we're going to set it all-visible -- apparently you feel that
we could never opportunistically freeze early such that all-visible
but not all-frozen pages practically cease to exist. I'm still not
sure why you believe that (though you may be right, or I might have
misunderstood, since it's complicated). It would certainly benefit
this dynamic relfrozenxid business if it was possible, though. If we
could somehow make that work, then almost every VACUUM would be able
to advance relfrozenxid, independently of aggressive-ness -- because
we wouldn't have any all-visible-but-not-all-frozen pages to skip
(that important detail wouldn't be left to chance).
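The "observed horizon" bookkeeping Andres suggests could be sketched like this (hypothetical names; XIDs treated as plain integers, ignoring wraparound-aware comparisons, purely to show the shape of the idea):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Track a candidate relfrozenxid while scanning: take the minimum of
 * the oldest unfrozen XID on each scanned page.  Skipping any page
 * that isn't known all-frozen invalidates the candidate, since that
 * page might hold an older XID we never looked at.
 */
typedef struct HorizonTracker
{
    bool          valid;
    TransactionId oldest_xid;
} HorizonTracker;

static void
observe_page_xid(HorizonTracker *t, TransactionId oldest_on_page)
{
    if (oldest_on_page < t->oldest_xid)
        t->oldest_xid = oldest_on_page;
}

static void
skipped_page(HorizonTracker *t, bool all_frozen)
{
    /* skipping an all-frozen page cannot hold back the horizon */
    if (!all_frozen)
        t->valid = false;
}
```

As the discussion notes, the candidate survives either when no pages are skipped at all, or when the only skipped pages were all-frozen -- which is why the interaction with all-visible-but-not-all-frozen pages matters so much.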

-                     if (skipping_blocks && !FORCE_CHECK_PAGE())
+                     if (skipping_blocks &&
+                             !(blkno == nblocks - 1 && should_attempt_truncation(vacrel)))
{
/*
* Tricky, tricky.  If this is in aggressive vacuum, the page

I find the FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like
mixing such changes within a larger change doing many other things.

I got rid of FORCE_CHECK_PAGE() itself in this patch (not a later
patch) because the patch also removes the only other
FORCE_CHECK_PAGE() call -- and the latter change is very much in scope
for the big patch (can't be broken down into smaller changes, I
think). And so this felt natural to me. But if you prefer, I can break
it out into a separate commit.

@@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)

buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vacrel->bstrategy);
+             page = BufferGetPage(buf);
+             vacrel->scanned_pages++;

I don't particularly like doing BufferGetPage() before holding a lock on the
page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
be good to have a crosscheck that BufferGetPage() is only allowed when holding
a page level lock.

I have occasionally wondered if the whole idea of reading heap pages
with only a pin (and having cleanup locks in VACUUM) is really worth
it -- alternative designs seem possible. Obviously that's a BIG
discussion, and not one to have right now. But it seems kind of
relevant.

Since it is often legit to read a heap page without a buffer lock
(only a pin), I can't see why BufferGetPage() without a buffer lock
shouldn't also be okay -- if anything it seems safer. I think that I
would agree with you if it wasn't for that inconsistency (which is
rather a big "if", to be sure -- even for me).

+                     /* Check for new or empty pages before lazy_scan_noprune call */
+                     if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
+                                                                        vmbuffer))
{
-                             ReleaseBuffer(buf);
-                             vacrel->pinskipped_pages++;
+                             /* Lock and pin released for us */
+                             continue;
+                     }

Why isn't this done in lazy_scan_noprune()?

No reason, really -- could be done that way (we'd then also give
lazy_scan_prune the same treatment). I thought that it made a certain
amount of sense to keep some of this in the main loop, but I can
change it if you want.

+                     if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
+                     {
+                             /* No need to wait for cleanup lock for this page */
+                             UnlockReleaseBuffer(buf);
+                             if (hastup)
+                                     vacrel->nonempty_pages = blkno + 1;
continue;
}

Do we really need all of buf, blkno, page for both of these functions? Quite
possible that yes, if so, could we add an assertion that
BufferGetBlockNumber(buf) == blkno?

This just matches the existing lazy_scan_prune function (which doesn't
mean all that much, since it was only added in Postgres 14). Will add
the assertion to both.

+             /* Check for new or empty pages before lazy_scan_prune call */
+             if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
{

Maybe worth a note mentioning that we need to redo this even in the aggressive
case, because we didn't continually hold a lock on the page?

Isn't that obvious? Either way it isn't the kind of thing that I'd try
to optimize away. It's such a narrow issue.

+/*
+ * Empty pages are not really a special case -- they're just heap pages that
+ * have no allocated tuples (including even LP_UNUSED items).  You might
+ * wonder why we need to handle them here all the same.  It's only necessary
+ * because of a rare corner-case involving a hard crash during heap relation
+ * extension.  If we ever make relation-extension crash safe, then it should
+ * no longer be necessary to deal with empty pages here (or new pages, for
+ * that matter).

I don't think it's actually that rare - the window for this is huge.

I can just remove the comment, though it still makes sense to me.

I don't really see that as a realistic thing to ever reliably avoid, FWIW. I
think the overhead would be prohibitive. We'd need to do synchronous WAL
logging while holding the extension lock I think. Um, not fun.

My long term goal for the FSM (the lease based design I talked about
earlier this year) includes soft ownership of free space from
preallocated pages by individual xacts -- the smgr layer itself
becomes transactional and crash safe (at least to a limited degree).
This includes bulk extension of relations, to make up for the new
overhead implied by crash safe rel extension. I don't think that we
should require VACUUM (or anything else) to be cool with random
uninitialized pages -- to me that just seems backwards.

We can't do true bulk extension right now (just an inferior version
that doesn't give specific pages to specific backends) because the
risk of losing a bunch of empty pages for way too long is not
acceptable. But that doesn't seem fundamental to me -- that's one of
the things we'd be fixing at the same time (through what I call soft
ownership semantics). I think we'd come out ahead on performance, and
*also* have a more robust approach to relation extension.

+ * Caller can either hold a buffer cleanup lock on the buffer, or a simple
+ * shared lock.
+ */

Kinda sounds like it'd be incorrect to call this with an exclusive lock, which
made me wonder why that could be true. Perhaps just say that it needs to be
called with at least a shared lock?

Okay.

+static bool
+lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
+                                        Page page, bool sharelock, Buffer vmbuffer)

It'd be good to document the return value - for me it's not a case where it's
so obvious that it's not worth it.

Okay.

+/*
+ *   lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
+ *
+ * Caller need only hold a pin and share lock on the buffer, unlike
+ * lazy_scan_prune, which requires a full cleanup lock.

I'd add somethign like "returns whether a cleanup lock is required". Having to
read multiple paragraphs to understand the basic meaning of the return value
isn't great.

Will fix.

+             if (ItemIdIsRedirected(itemid))
+             {
+                     *hastup = true;         /* page won't be truncatable */
+                     continue;
+             }

It's not really new, but this comment is now a bit confusing, because it can
be understood to be about PageTruncateLinePointerArray().

I didn't think of that. Will address it in the next version.

Why does it make sense to track DEAD tuples this way? Isn't that going to lead
to counting them over-and-over again? I think it's quite misleading to include
them in "dead but not yet removable".

Compared to what? Do we really want to invent a new kind of DEAD tuple
(e.g., to report on), just to handle this rare case?

I accept that this code is lying about the tuples being RECENTLY_DEAD,
kind of. But isn't it still strictly closer to the truth, compared to
HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not
counting it at all.

Note that we don't remember LP_DEAD items here, either (not here, in
lazy_scan_noprune, and not in lazy_scan_prune on HEAD). Because we
pretty much interpret LP_DEAD items as "future LP_UNUSED items"
instead -- we make a soft assumption that we're going to go on to mark
the same items LP_UNUSED during a second pass over the heap. My point
is that there is no natural way to count "fully DEAD tuple that
autovacuum didn't deal with" -- and so I picked RECENTLY_DEAD.

+     /*
+      * Now save details of the LP_DEAD items from the page in the dead_tuples
+      * array, iff VACUUM uses the two-pass strategy
+      */

Do we really need to have separate code for this in lazy_scan_prune() and
lazy_scan_noprune()?

There is hardly any repetition, though.

+     }
+     else
+     {
+             /*
+              * We opt to skip FSM processing for the page on the grounds that it
+              * is probably being modified by concurrent DML operations.  Seems
+              * best to assume that the space is best left behind for future
+              * updates of existing tuples.  This matches what opportunistic
+              * pruning does.

Why can we assume that there is concurrent DML rather than concurrent read-only
operations? IME it's much more common for read-only operations to block
cleanup locks than read-write ones (partially because the frequency makes it
easier, partially because cursors allow long-held pins, partially because the
EXCLUSIVE lock of a r/w operation wouldn't let us get here)

I actually agree. It still probably isn't worth dealing with the FSM
here, though. It's just too much mechanism for too little benefit in a
very rare case. What do you think?

--
Peter Geoghegan

#4 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#3)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

Hi,

On 2021-11-22 17:07:46 -0800, Peter Geoghegan wrote:

Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a
non-aggressive VACUUM -- so "at least waiting for one or two pins
during non-aggressive VACUUM" might not have been the best way of
expressing the idea that I wanted to express. The important point is
that _we can make a choice_ about stuff like this dynamically, based
on the observed characteristics of the table, and some general ideas
about the costs and benefits (of waiting or not waiting, or of how
long we want to wait in total, whatever might be important). This
probably just means adding some heuristics that are pretty sensitive
to any reason to not do more work in a non-aggressive VACUUM, without
*completely* balking at doing even a tiny bit more work.

For example, we can definitely afford to wait a few more milliseconds
to get a cleanup lock just once

We currently have no infrastructure to wait for an lwlock or pincount for a
limited time. And at least for the former it'd not be easy to add. It may be
worth adding that at some point, but I'm doubtful this is sufficient reason
for nontrivial new infrastructure in very performance sensitive areas.

All of the autovacuums against the accounts table look similar to this
one -- you don't see anything about relfrozenxid being advanced
(because it isn't). Whereas for the smaller pgbench tables, every
single VACUUM successfully advances relfrozenxid to a fairly recent
XID (without there ever being an aggressive VACUUM) -- just because
VACUUM needs to visit every page for the smaller tables. While the
accounts table doesn't generally need to have 100% of all pages
touched by VACUUM -- it's more like 95% there. Does that really make
sense, though?

Does what really make sense?

I'm pretty sure that less aggressive VACUUMing (e.g. higher
scale_factor setting) would lead to more aggressive setting of
relfrozenxid here. I'm always suspicious when I see insignificant
differences that lead to significant behavioral differences. Am I
worried over nothing here? Perhaps -- we don't really need to advance
relfrozenxid early with this table/workload anyway. But I'm not so
sure.

I think pgbench_accounts is just a really poor showcase. Most importantly,
there are no even slightly longer-running transactions that hold down the xid
horizon. But in real workloads that's incredibly common IME. It's also quite
uncommon in real workloads to have huge tables in which all records are
updated. It's more common to have value ranges that are nearly static, and a
more heavily changing range.

I think the most interesting cases where using the "measured" horizon will be
advantageous is anti-wrap vacuums. Those obviously have to happen for rarely
modified tables, including completely static ones, too. Using the "measured"
horizon will allow us to reduce the frequency of anti-wrap autovacuums on old
tables, because we'll be able to set a much more recent relfrozenxid.

This is becoming more common with the increased use of partitioning.

The problem is that the
autovacuum scheduling is way too naive for that to be a significant benefit -
nothing tries to schedule autovacuums so that they have a chance to complete
before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.

Not sure what you mean about scheduling, since vacuum_freeze_table_age
is only in place to make overnight VACUUMs (off-hours, low-activity
scripted runs) freeze tuples before any autovacuum worker gets the chance
(since the latter may run at a much less convenient time). Sure,
vacuum_freeze_table_age might also force a regular autovacuum worker
to do an aggressive VACUUM -- but I think it's mostly intended for a
manual overnight VACUUM. Not usually very helpful, but also not
harmful.

Oh, wait. I think that you're talking about how autovacuum workers in
particular tend to be affected by this. We launch an av worker that
wants to clean up bloat, but it ends up being aggressive (and maybe
taking way longer), perhaps quite randomly, only due to
vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is
that it?

No, not quite. We treat anti-wraparound vacuums as an emergency (including
logging messages, not cancelling). But the only mechanism we have against
anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's
not really a "real" mechanism, because it requires an "independent" reason to
vacuum a table.

I've seen cases where anti-wraparound vacuums weren't a problem / never
happened for important tables for a long time, because there always was an
"independent" reason for autovacuum to start doing its thing before the table
got to be autovacuum_freeze_max_age old. But at some point the important
tables started to be big enough that autovacuum didn't schedule vacuums that
got promoted to aggressive via vacuum_freeze_table_age before the anti-wrap
vacuums. Then things started to burn, because of the unpaced anti-wrap vacuums
clogging up all IO, or maybe it was the vacuums not cancelling - I don't quite
remember the details.

Behaviour that leads to a "sudden" falling over, rather than getting gradually
worse, is bad - it somehow tends to happen on Friday evenings :).

This is one of the most embarrassing issues around the whole anti-wrap
topic. We kind of define it as an emergency that there's an anti-wraparound
vacuum. But we have *absolutely no mechanism* to prevent them from occurring.

What do you mean? Only an autovacuum worker can do an anti-wraparound
VACUUM (which is not quite the same thing as an aggressive VACUUM).

Just that autovacuum should have a mechanism to trigger aggressive vacuums
(i.e. ones that are guaranteed to be able to increase relfrozenxid unless
cancelled) before getting to the "emergency"-ish anti-wraparound state.

Or alternatively that we should have a separate threshold for the "harsher"
anti-wraparound measures.

We now also collect LP_DEAD items in the dead_tuples array in the case
where we cannot immediately get a cleanup lock on the buffer. We cannot
prune without a cleanup lock, but opportunistic pruning may well have
left some LP_DEAD items behind in the past -- no reason to miss those.

This has become *much* more important with the changes around deciding when to
index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
items, it's that a previous vacuum is quite likely to have left them there,
because the previous vacuum decided not to perform index cleanup.

I haven't seen any evidence of that myself (with the optimization
added to Postgres 14 by commit 5100010ee4). I still don't understand
why you doubted that work so much. I'm not saying that you're wrong
to; I'm saying that I don't think that I understand your perspective
on it.

I didn't (nor do) doubt that it can be useful - to the contrary, I think the
unconditional index pass was a huge practical issue. I do however think that
there are cases where it can cause trouble. The comment above wasn't meant as
a criticism - just that it seems worth pointing out that one reason we might
encounter a lot of LP_DEAD items is previous vacuums that didn't perform index
cleanup.

What I have seen in my own tests (particularly with BenchmarkSQL) is
that most individual tables either never apply the optimization even
once (because the table reliably has heap pages with many more LP_DEAD
items than the 2%-of-relpages threshold), or will never need to
(because there are precisely zero LP_DEAD items anyway). Remaining
tables that *might* use the optimization tend to not go very long
without actually getting a round of index vacuuming. It's just too
easy for updates (and even aborted xact inserts) to introduce new
LP_DEAD items for us to go long without doing index vacuuming.

I think real workloads are a bit more varied than the realistic set of
benchmarks that one person can run by themselves.

I gave you examples of cases that I see as likely being bitten by this,
e.g. when the skipped index cleanup prevents IOS scans. When both the
likely-to-be-modified and likely-to-be-queried value ranges are a small subset
of the entire data, the 2% threshold can prevent vacuum from cleaning up
LP_DEAD entries for a long time. Or when all index scans are bitmap index
scans, and nothing ends up cleaning up the dead index entries in certain
ranges, and even an explicit vacuum doesn't fix the issue. Even a relatively
small rollback / non-HOT update rate can start to be really painful.

Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),

One thing we could do around this, btw, would be to aggressively replace
LP_REDIRECT items with their target item. We can't do that in all situations
(somebody might be following a ctid chain), but I think we have all the
information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
state or something like that.

Another idea is to truncate the line pointer during pruning (including
opportunistic pruning). Matthias van de Meent has a patch for that.

I'm a bit doubtful that's as important (which is not to say that it's not
worth doing). For a heavily updated table the max space usage of the line
pointer array just isn't as big a factor as ending up with only half the
usable line pointers.

Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
since there is barely any real practical sense in which we actually
miss doing useful work for these pages. Besides, this information
always seemed to have little practical value, even to Postgres hackers.

-0.5. I think it provides some value, and I don't see why the removal of the
information should be tied to this change. It's hard to diagnose why some dead
tuples aren't cleaned up - a common cause for that on smaller tables is that
nearly all pages are pinned nearly all the time.

Is that still true, though? If it turns out that we need to leave it
in, then I can do that. But I'd prefer to wait until we have more
information before making a final decision. Remember, the high level
idea of this whole patch is that we do as much work as possible for
any scanned_pages, which now includes pages that we never successfully
acquired a cleanup lock on. And so we're justified in assuming that
they're exactly equivalent to pages that we did get a cleanup on --
that's now the working assumption. I know that that's not literally
true, but that doesn't mean it's not a useful fiction -- it should be
very close to the truth.

IDK, it seems misleading to me. Small tables with a lot of churn - quite
common - are highly reliant on LP_DEAD entries getting removed or the tiny
table suddenly isn't so tiny anymore. And it's harder to diagnose why the
cleanup isn't happening without knowledge that pages needing cleanup couldn't
be cleaned up due to pins.

If you want to improve the logic so that we only count pages that would have
something to clean up, I'd be happy as well. It doesn't have to mean exactly
what it means today.

+      * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+      * the rel_pages used by lazy_scan_prune, from before a possible relation
+      * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
+      */

I think it should be doable to add an isolation test for this path. There have
been quite a few bugs around the wider topic...

I would argue that we already have one -- vacuum-reltuples.spec. I had
to update its expected output in the patch. I would argue that the
behavioral change (count tuples on a pinned-by-cursor heap page) that
necessitated updating the expected output for the test is an
improvement overall.

I was thinking of truncations, which I don't think vacuum-reltuples.spec
tests.

+     {
+             /* Can safely advance relfrozenxid and relminmxid, too */
+             Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+                        orig_rel_pages);
+             vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+                                                     new_rel_allvisible, vacrel->nindexes > 0,
+                                                     FreezeLimit, MultiXactCutoff, false);
+     }

I wonder if this whole logic wouldn't become easier and less fragile if we
just went for maintaining the "actually observed" horizon while scanning the
relation. If we skip a page via VM set the horizon to invalid. Otherwise we
can keep track of the accurate horizon and use that. No need to count pages
and stuff.

There is no question that that makes sense as an optimization -- my
prototype convinced me of that already. But I don't think that it can
simplify anything (not even the call to vac_update_relstats itself, to
actually update relfrozenxid at the end).

Maybe. But we've had quite a few bugs because we ended up changing some detail
of what is excluded in one of the counters, leading to wrong determination
about whether we scanned everything or not.

Fundamentally, this will only work if we decide to only skip all-frozen
pages, which (by definition) only happens within aggressive VACUUMs.

Hm? Or if there are just no runs of all-visible pages of sufficient length, so
we don't end up skipping at all.

You recently said (on the heap-pruning-14-bug thread) that you don't
think it would be practical to always set a page all-frozen when we
see that we're going to set it all-visible -- apparently you feel that
we could never opportunistically freeze early such that all-visible
but not all-frozen pages practically cease to exist. I'm still not
sure why you believe that (though you may be right, or I might have
misunderstood, since it's complicated).

Yes, I think it may not work out to do that. But it's not a very strongly held
opinion.

On reason for my doubt is the following:

We can set all-visible on a page without a FPW image (well, as long as hint
bits aren't logged). There's a significant difference between needing to WAL
log FPIs for every heap page or not, and it's not that rare for data to live
shorter than autovacuum_freeze_max_age or that limit never being reached.

On a table with 40 million individually inserted rows, fully hintbitted via
reads, I see a first VACUUM taking 1.6s and generating 11MB of WAL. A
subsequent VACUUM FREEZE takes 5s and generates 500MB of WAL. That's quite a
large multiplier...

If we ever managed to not have a per-page all-visible flag this'd get even
more extreme, because we'd then not even need to dirty the page for
insert-only pages. But if we want to freeze, we'd need to (unless we just got
rid of freezing).

It would certainly benefit this dynamic relfrozenxid business if it was
possible, though. If we could somehow make that work, then almost every
VACUUM would be able to advance relfrozenxid, independently of
aggressive-ness -- because we wouldn't have any
all-visible-but-not-all-frozen pages to skip (that important detail wouldn't
be left to chance).

Perhaps we can have most of the benefit even without that. If we were to
freeze whenever it didn't cause an additional FPWing, and perhaps didn't skip
all-visible but not !all-frozen pages if they were less than x% of the
to-be-scanned data, we should be able to still increase relfrozenxid in a
lot of cases?

I don't particularly like doing BufferGetPage() before holding a lock on the
page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
be good to have a crosscheck that BufferGetPage() is only allowed when holding
a page level lock.

I have occasionally wondered if the whole idea of reading heap pages
with only a pin (and having cleanup locks in VACUUM) is really worth
it -- alternative designs seem possible. Obviously that's a BIG
discussion, and not one to have right now. But it seems kind of
relevant.

With 'reading' do you mean reads-from-os, or just references to buffer
contents?

Since it is often legit to read a heap page without a buffer lock
(only a pin), I can't see why BufferGetPage() without a buffer lock
shouldn't also be okay -- if anything it seems safer. I think that I
would agree with you if it wasn't for that inconsistency (which is
rather a big "if", to be sure -- even for me).

At least for heap it's rarely legit to read buffer contents via
BufferGetPage() without a lock. It's legit to read data at already-determined
offsets, but you can't look at much other than the tuple contents.

Why does it make sense to track DEAD tuples this way? Isn't that going to lead
to counting them over-and-over again? I think it's quite misleading to include
them in "dead but not yet removable".

Compared to what? Do we really want to invent a new kind of DEAD tuple
(e.g., to report on), just to handle this rare case?

When looking at logs I use the
"tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"
line to see whether the user is likely to have issues around an old
transaction / slot / prepared xact preventing cleanup. If new_dead_tuples
doesn't identify those cases anymore that's not reliable anymore.

I accept that this code is lying about the tuples being RECENTLY_DEAD,
kind of. But isn't it still strictly closer to the truth, compared to
HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not
counting it at all.

I don't see how it's closer at all. There's imo a significant difference
between not being able to remove tuples because of the xmin horizon, and not
being able to remove them because we couldn't get a cleanup lock.

Greetings,

Andres Freund

In reply to: Andres Freund (#4)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Mon, Nov 22, 2021 at 9:49 PM Andres Freund <andres@anarazel.de> wrote:

For example, we can definitely afford to wait a few more milliseconds
to get a cleanup lock just once

We currently have no infrastructure to wait for an lwlock or pincount for a
limited time. And at least for the former it'd not be easy to add. It may be
worth adding that at some point, but I'm doubtful this is sufficient reason
for nontrivial new infrastructure in very performance sensitive areas.

It was a hypothetical example. To be more practical about it: it seems
likely that we won't really benefit from waiting some amount of time
(not forever) for a cleanup lock in non-aggressive VACUUM, once we
have some of the relfrozenxid stuff we've talked about in place. In a
world where we're smarter about advancing relfrozenxid in
non-aggressive VACUUMs, the choice between waiting for a cleanup lock,
and not waiting (but also not advancing relfrozenxid at all) matters
less -- it's no longer a binary choice.

It's no longer a binary choice because we will have done away with the
current rigid way in which our new relfrozenxid for the relation is
either FreezeLimit, or nothing at all. So far we've only talked about
the case where we can update relfrozenxid with a value that happens to
be much newer than FreezeLimit. If we can do that, that's great. But
what about setting relfrozenxid to an *older* value than FreezeLimit
instead (in a non-aggressive VACUUM)? That's also pretty good! There
is still a decent chance that the final "suboptimal" relfrozenxid that
we determine can be safely set in pg_class at the end of our VACUUM
will still be far more recent than the preexisting relfrozenxid.
Especially with larger tables.

Advancing relfrozenxid should be thought of as a totally independent
thing to freezing tuples, at least in vacuumlazy.c itself. That's
kinda the case today, even, but *explicitly* decoupling advancing
relfrozenxid from actually freezing tuples seems like a good high
level goal for this project.

Remember, FreezeLimit is derived from vacuum_freeze_min_age in the
obvious way: OldestXmin for the VACUUM, minus vacuum_freeze_min_age
GUC/reloption setting. I'm pretty sure that this means that making
autovacuum freeze tuples more aggressively (by reducing
vacuum_freeze_min_age) could have the perverse effect of making
non-aggressive VACUUMs less likely to advance relfrozenxid -- which is
exactly backwards. This effect could easily be missed, even by expert
users, since there is no convenient instrumentation that shows how and
when relfrozenxid is advanced.

All of the autovacuums against the accounts table look similar to this
one -- you don't see anything about relfrozenxid being advanced
(because it isn't).

Does that really make
sense, though?

Does what really make sense?

Well, my accounts table example wasn't a particularly good one (it was
a conveniently available example). I am now sure that you got the
point I was trying to make here already, based on what you go on to
say about non-aggressive VACUUMs optionally *not* skipping
all-visible-not-all-frozen heap pages in the hopes of advancing
relfrozenxid earlier (more on that idea below, in my response).

On reflection, the simplest way of expressing the same idea is what I
just said about decoupling (decoupling advancing relfrozenxid from
freezing).

I think pgbench_accounts is just a really poor showcase. Most importantly,
there are no even slightly longer-running transactions that hold down the xid
horizon. But in real workloads that's incredibly common IME. It's also quite
uncommon in real workloads to have huge tables in which all records are
updated. It's more common to have value ranges that are nearly static, and a
more heavily changing range.

I agree.

I think the most interesting cases where using the "measured" horizon will be
advantageous is anti-wrap vacuums. Those obviously have to happen for rarely
modified tables, including completely static ones, too. Using the "measured"
horizon will allow us to reduce the frequency of anti-wrap autovacuums on old
tables, because we'll be able to set a much more recent relfrozenxid.

That's probably true in practice -- but who knows these days, with the
autovacuum_vacuum_insert_scale_factor stuff? Either way I see no
reason to emphasize that case in the design itself. The "decoupling"
concept now seems like the key design-level concept -- everything else
follows naturally from that.

This is becoming more common with the increased use of partitioning.

Also with bulk loading. There could easily be a tiny number of
distinct XIDs that are close together in time, for many many rows --
practically one XID, or even exactly one XID.

No, not quite. We treat anti-wraparound vacuums as an emergency (including
logging messages, not cancelling). But the only mechanism we have against
anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's
not really a "real" mechanism, because it requires an "independent" reason to
vacuum a table.

Got it.

I've seen cases where anti-wraparound vacuums weren't a problem / never
happened for important tables for a long time, because there always was an
"independent" reason for autovacuum to start doing its thing before the table
got to be autovacuum_freeze_max_age old. But at some point the important
tables started to be big enough that autovacuum didn't schedule vacuums that
got promoted to aggressive via vacuum_freeze_table_age before the anti-wrap
vacuums.

Right. Not just because they were big; also because autovacuum runs at
geometric intervals -- the final reltuples from last time is used to
determine the point at which av runs this time. This might make sense,
or it might not make any sense -- it all depends (mostly on index
stuff).

Then things started to burn, because of the unpaced anti-wrap vacuums
clogging up all IO, or maybe it was the vacuums not cancelling - I don't quite
remember the details.

Non-cancelling anti-wraparound VACUUMs that (all of a sudden) cause
chaos because they interact badly with automated DDL is one I've seen
several times -- I'm sure you have too. That was what the Manta/Joyent
blogpost I referenced upthread went into.

Behaviour that leads to a "sudden" falling over, rather than getting gradually
worse, is bad - it somehow tends to happen on Friday evenings :).

These are among our most important challenges IMV.

Just that autovacuum should have a mechanism to trigger aggressive vacuums
(i.e. ones that are guaranteed to be able to increase relfrozenxid unless
cancelled) before getting to the "emergency"-ish anti-wraparound state.

Maybe, but that runs into the problem of needing another GUC that
nobody will ever be able to remember the name of. I consider the idea
of adding a variety of measures that make non-aggressive VACUUM much
more likely to advance relfrozenxid in practice to be far more
promising.

Or alternatively that we should have a separate threshold for the "harsher"
anti-wraparound measures.

Or maybe just raise the default of autovacuum_freeze_max_age, which
many people don't change? That might be a lot safer than it once was.
Or will be, once we manage to teach VACUUM to advance relfrozenxid
more often in non-aggressive VACUUMs on Postgres 15. Imagine a world
in which we have that stuff in place, as well as related enhancements
added in earlier releases: autovacuum_vacuum_insert_scale_factor, the
freezemap, and the wraparound failsafe.

These add up to a lot; with all of that in place, the risk we'd be
introducing by increasing the default value of
autovacuum_freeze_max_age would be *far* lower than the risk of making
the same change back in 2006. I bring up 2006 because it was the year
that commit 48188e1621 added autovacuum_freeze_max_age -- the default
hasn't changed since that time.

I think workloads are a bit more varied than a realistic set of benchmarks
that one person can run themselves.

No question. I absolutely accept that I only have to miss one
important detail with something like this -- that just goes with the
territory. Just saying that I have yet to see any evidence that the
bypass-indexes behavior really hurt anything. I do take the idea that
I might have missed something very seriously, despite all this.

I gave you examples of cases that I see as likely being bitten by this,
e.g. when the skipped index cleanup prevents IOS scans. When both the
likely-to-be-modified and likely-to-be-queried value ranges are a small subset
of the entire data, the 2% threshold can prevent vacuum from cleaning up
LP_DEAD entries for a long time. Or when all index scans are bitmap index
scans, and nothing ends up cleaning up the dead index entries in certain
ranges, and even an explicit vacuum doesn't fix the issue. Even a relatively
small rollback / non-HOT update rate can start to be really painful.

That does seem possible. But I consider it very unlikely to appear as
a regression caused by the bypass mechanism itself -- not in any way
that was consistent over time. As far as I can tell, autovacuum
scheduling just doesn't operate at that level of precision, and never
has.

I have personally observed that ANALYZE does a very bad job at
noticing LP_DEAD items in tables/workloads where LP_DEAD items (not
DEAD tuples) tend to concentrate [1]. The whole idea that ANALYZE
should count these items as if they were normal tuples seems pretty
bad to me.

Put it this way: imagine you run into trouble with the bypass thing,
and then you opt to disable it on that table (using the INDEX_CLEANUP
reloption). Why should this step solve the problem on its own? In
order for that to work, VACUUM would have to know to be very
aggressive about these LP_DEAD items. But there is good reason to
believe that it just won't ever notice them, as long as ANALYZE is
expected to provide reliable statistics that drive autovacuum --
they're just too concentrated for the block-based approach to truly
work.

I'm not minimizing the risk. Just telling you my thoughts on this.

I'm a bit doubtful that's as important (which is not to say that it's not
worth doing). For a heavily updated table the max space usage of the line
pointer array just isn't as big a factor as ending up with only half the
usable line pointers.

Agreed; by far the best chance we have of improving the line pointer
bloat situation is preventing it in the first place, by increasing
MaxHeapTuplesPerPage. Once we actually do that, our remaining options
are going to be much less helpful -- then it really is mostly just up
to VACUUM.

And it's harder to diagnose why the
cleanup isn't happening without knowledge that pages needing cleanup couldn't
be cleaned up due to pins.

If you want to improve the logic so that we only count pages that would have
something to clean up, I'd be happy as well. It doesn't have to mean exactly
what it means today.

It seems like what you really care about here are remaining cases
where our inability to acquire a cleanup lock has real consequences --
you want to hear about it when it happens, however unlikely it may be.
In other words, you want to keep something in log_autovacuum_* that
indicates that "less than the expected amount of work was completed"
due to an inability to acquire a cleanup lock. And so for you, this is
a question of keeping instrumentation that might still be useful, not
a question of how we define things fundamentally, at the design level.

Sound right?

If so, then this proposal might be acceptable to you:

* Remaining DEAD tuples with storage (though not LP_DEAD items from
previous opportunistic pruning) will get counted separately in the
lazy_scan_noprune (no cleanup lock) path. Also count the total number
of distinct pages that were found to contain one or more such DEAD
tuples.

* These two new counters will be reported on their own line in the log
output, though only in the cases where we actually have any such
tuples -- which will presumably be much rarer than simply failing to
get a cleanup lock (that's now no big deal at all, because we now
consistently do certain cleanup steps, and because FreezeLimit isn't
the only viable thing that we can set relfrozenxid to, at least in the
non-aggressive case).

* There is still a limited sense in which the same items get counted
as RECENTLY_DEAD -- though just those aspects that make the overall
design simpler. So the helpful aspects of this are still preserved.

We only need to tell pgstat_report_vacuum() that these items are
"deadtuples" (remaining dead tuples). That can work by having its
caller add a new int64 counter (same new tuple-based counter used for
the new log line) to vacrel->new_dead_tuples. We'd also add the same
new tuple counter in about the same way at the point where we
determine a final vacrel->new_rel_tuples.

So we wouldn't really be treating anything as RECENTLY_DEAD anymore --
pgstat_report_vacuum() and vacrel->new_dead_tuples don't specifically
expect anything about RECENTLY_DEAD-ness already.

I was thinking of truncations, which I don't think vacuum-reltuples.spec
tests.

Got it. I'll look into that for v2.

Maybe. But we've had quite a few bugs because we ended up changing some detail
of what is excluded in one of the counters, leading to a wrong determination
about whether we scanned everything or not.

Right. But let me just point out that my whole approach is to make
that impossible, by not needing to count pages, except in
scanned_pages (and in frozenskipped_pages + rel_pages). The processing
performed for any page that we actually read during VACUUM should be
uniform (or practically uniform), by definition. With minimal fudging
in the cleanup lock case (because we mostly do the same work there
too).

There should be no reason for any more page counters now, except for
non-critical instrumentation. For example, if you want to get the
total number of pages skipped via the visibility map (not just
all-frozen pages), then you simply subtract scanned_pages from
rel_pages.

Fundamentally, this will only work if we decide to only skip all-frozen
pages, which (by definition) only happens within aggressive VACUUMs.

Hm? Or if there are just no runs of all-visible pages of sufficient length, so
we don't end up skipping at all.

Of course. But my point was: who knows when that'll happen?

One reason for my doubt is the following:

We can set all-visible on a page without a FPW image (well, as long as hint
bits aren't logged). There's a significant difference between needing to WAL
log FPIs for every heap page or not, and it's not that rare for data to live
for less than autovacuum_freeze_max_age, or for that limit never to be reached.

This sounds like an objection to one specific heuristic, and not an
objection to the general idea. The only essential part is
"opportunistic freezing during vacuum, when the cost is clearly very
low, and the benefit is probably high". And so it now seems you were
making a far more limited statement than I first believed.

Obviously many variations are possible -- there is a spectrum.
Example: a heuristic that makes VACUUM notice when it is going to
freeze at least one tuple on a page, iff the page will be marked
all-visible in any case -- we should instead freeze every tuple on the
page, and mark the page all-frozen, batching work (could account for
LP_DEAD items here too, not counting them on the assumption that
they'll become LP_UNUSED during the second heap pass later on).

If we see these conditions, then the likely explanation is that the
tuples on the heap page happen to have XIDs that are "split" by the
not-actually-important FreezeLimit cutoff, despite being essentially
similar in any way that matters.

If you want to make the same heuristic more conservative: only do this
when no existing tuples are frozen, since that could be taken as a
sign of the original heuristic not quite working on the same heap page
at an earlier stage.

I suspect that even very conservative versions of the same basic idea
would still help a lot.

Perhaps we can have most of the benefit even without that. If we were to
freeze whenever it didn't cause an additional FPWing, and perhaps didn't skip
all-visible but not !all-frozen pages if they were less than x% of the
to-be-scanned data, we should be able to still increase relfrozenxid in a
lot of cases?

I bet that's true. I like that idea.

If we had this policy, then the number of "extra"
visited-in-non-aggressive-vacuum pages (all-visible but not yet
all-frozen pages) could be managed over time through more
opportunistic freezing. This might make it work even better.

These all-visible (but not all-frozen) heap pages could be considered
"tenured", since they have survived at least one full VACUUM cycle
without being unset. So why not also freeze them based on the
assumption that they'll probably stay that way forever? There won't be
so many of the pages when we do this anyway, by definition -- since
we'd have a heuristic that limited the total number (say to no more
than 10% of the total relation size, something like that).

We're smoothing out the work that currently takes place all together
during an aggressive VACUUM this way.

Moreover, there is perhaps a good chance that the total number of
all-visible-not all-frozen heap pages will *stay* low over time, as a
result of this policy actually working -- there may be a virtuous
cycle that totally prevents us from getting an aggressive VACUUM even
once.

I have occasionally wondered if the whole idea of reading heap pages
with only a pin (and having cleanup locks in VACUUM) is really worth
it -- alternative designs seem possible. Obviously that's a BIG
discussion, and not one to have right now. But it seems kind of
relevant.

With 'reading' do you mean reads-from-os, or just references to buffer
contents?

The latter.

[1]: /messages/by-id/CAH2-Wz=9R83wcwZcPUH4FVPeDM4znzbzMvp3rt21+XhQWMU8+g@mail.gmail.com
--
Peter Geoghegan

#6 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#5)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

Hi,

On 2021-11-23 17:01:20 -0800, Peter Geoghegan wrote:

One reason for my doubt is the following:

We can set all-visible on a page without a FPW image (well, as long as hint
bits aren't logged). There's a significant difference between needing to WAL
log FPIs for every heap page or not, and it's not that rare for data to live
for less than autovacuum_freeze_max_age, or for that limit never to be reached.

This sounds like an objection to one specific heuristic, and not an
objection to the general idea.

I understood you to propose that we do not have separate frozen and
all-visible states. Which I think will be problematic, because of scenarios
like the above.

The only essential part is "opportunistic freezing during vacuum, when the
cost is clearly very low, and the benefit is probably high". And so it now
seems you were making a far more limited statement than I first believed.

I'm on board with freezing when we already dirty the page anyway, and when doing
so doesn't cause an additional FPI. And I don't think I've argued against that
in the past.

These all-visible (but not all-frozen) heap pages could be considered
"tenured", since they have survived at least one full VACUUM cycle
without being unset. So why not also freeze them based on the
assumption that they'll probably stay that way forever?

Because it's a potentially massive increase in write volume? E.g. if you have
a insert-only workload, and you discard old data by dropping old partitions,
this will often add yet another rewrite, despite your data likely never
getting old enough to need to be frozen.

Given that we often immediately need to start another vacuum just when one
finished, because the vacuum took long enough to reach thresholds of vacuuming
again, I don't think the (auto-)vacuum count is a good proxy.

Maybe you meant this as a more limited concept, i.e. only doing so when the
percentage of all-visible but not all-frozen pages is small?

We could perhaps do better if we had information about the system-wide
rate of xid throughput and how often / how long past vacuums of a table took.

Greetings,

Andres Freund

In reply to: Peter Geoghegan (#5)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Tue, Nov 23, 2021 at 5:01 PM Peter Geoghegan <pg@bowt.ie> wrote:

Behaviour that leads to a "sudden" falling over, rather than getting gradually
worse, is bad - it somehow tends to happen on Friday evenings :).

These are among our most important challenges IMV.

I haven't had time to work through any of your feedback just yet --
though it's certainly a priority for me. I won't get to it until I return
home from PGConf NYC next week.

Even still, here is a rebased v2, just to fix the bitrot. This is just
a courtesy to anybody interested in the patch.

--
Peter Geoghegan

Attachments:

v2-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch (+500 -302)
In reply to: Peter Geoghegan (#7)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Tue, Nov 30, 2021 at 11:52 AM Peter Geoghegan <pg@bowt.ie> wrote:

I haven't had time to work through any of your feedback just yet --
though it's certainly a priority for me. I won't get to it until I return
home from PGConf NYC next week.

Attached is v3, which works through most of your (Andres') feedback.

Changes in v3:

* While the first patch still gets rid of the "pinskipped_pages"
instrumentation, the second patch adds back a replacement that's
better targeted: it tracks and reports "missed_dead_tuples". This
means that log output will show the number of fully DEAD tuples with
storage that could not be pruned away due to the fact that that would
have required waiting for a cleanup lock. But we *don't* generally
report the number of pages that we couldn't get a cleanup lock on,
because that in itself doesn't mean that we skipped any useful work
(which is very much the point of all of the refactoring in the first
patch).

* We now have FSM processing in the lazy_scan_noprune case, which more
or less matches the standard lazy_scan_prune case.

* Many small tweaks, based on suggestions from Andres, and other
things that I noticed.

* Further simplification of the "consider skipping pages using
visibility map" logic -- now we always don't skip the last block in
the relation, without calling should_attempt_truncation() to make sure
we have a reason.

Note that this means that we'll always read the final page during
VACUUM, even when doing so is provably unhelpful. I'd prefer to keep
the code that deals with skipping pages using the visibility map as
simple as possible. There isn't much downside to always doing that
once my refactoring is in place: there is no risk that we'll wait for
a cleanup lock (on the final page in the rel) for no good reason.
We're only wasting one page access, at most.

(I'm not 100% sure that this is the right trade-off, actually, but
it's at least worth considering.)

Not included in v3:

* Still haven't added the isolation test for rel truncation, though
it's on my TODO list.

* I'm still working on the optimization that we discussed on this
thread: the optimization that allows the final relfrozenxid (that we
set in pg_class) to be determined dynamically, based on the actual
XIDs we observed in the table (we don't just naively use FreezeLimit).

I'm not ready to post that today, but it shouldn't take too much
longer to be good enough to review.

Thanks
--
Peter Geoghegan

Attachments:

v3-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch (+541 -325)
v3-0002-Improve-log_autovacuum_min_duration-output.patch (+84 -31)
In reply to: Peter Geoghegan (#8)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Fri, Dec 10, 2021 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote:

* I'm still working on the optimization that we discussed on this
thread: the optimization that allows the final relfrozenxid (that we
set in pg_class) to be determined dynamically, based on the actual
XIDs we observed in the table (we don't just naively use FreezeLimit).

Attached is v4 of the patch series, which now includes this
optimization, broken out into its own patch. In addition, it includes
a prototype of opportunistic freezing.

My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.

There is a kind of virtuous circle here, where successive
non-aggressive autovacuums never fall behind on freezing, and so never
fail to advance relfrozenxid (there are never any
all_visible-but-not-all_frozen pages, and we can cope with not
acquiring a cleanup lock quite well). When VACUUM chooses to freeze a
tuple opportunistically, the frozen XIDs naturally cannot hold back
the final safe relfrozenxid for the relation. Opportunistic freezing
avoids setting all_visible (without setting all_frozen) in the
visibility map. It's impossible for VACUUM to just set a page to
all_visible now, which seems like an essential part of making a decent
amount of relfrozenxid advancement take place in almost every VACUUM
operation.

Here is an example of what I'm calling a virtuous circle -- all
pgbench_history autovacuums look like this with the patch applied:

LOG: automatic vacuum of table "regression.public.pgbench_history":
index scans: 0
pages: 0 removed, 35503 remain, 31930 skipped using visibility map
(89.94% of total)
tuples: 0 removed, 5568687 remain (547976 newly frozen), 0 are
dead but not yet removable
removal cutoff: oldest xmin was 5570281, which is now 1177 xact IDs behind
relfrozenxid: advanced by 546618 xact IDs, new value: 5565226
index scan not needed: 0 pages from table (0.00% of total) had 0
dead item identifiers removed
I/O timings: read: 0.003 ms, write: 0.000 ms
avg read rate: 0.068 MB/s, avg write rate: 0.068 MB/s
buffer usage: 7169 hits, 1 misses, 1 dirtied
WAL usage: 7043 records, 1 full page images, 6974928 bytes
system usage: CPU: user: 0.10 s, system: 0.00 s, elapsed: 0.11 s

Note that relfrozenxid is almost the same as oldest xmin here. Note also
that the log output shows the number of tuples newly frozen. I see the
same general trends with *every* pgbench_history autovacuum. Actually,
with every autovacuum. The history table tends to have ultra-recent
relfrozenxid values, which isn't always what we see, but that
difference may not matter. As far as I can tell, we can expect
practically every table to have a relfrozenxid that would (at least
traditionally) be considered very safe/recent. Barring weird
application issues that make it totally impossible to advance
relfrozenxid (e.g., idle cursors that hold onto a buffer pin forever),
it seems as if relfrozenxid will now steadily march forward. Sure,
relfrozenxid advancement might be held back by the occasional inability to
acquire a cleanup lock, but the effect isn't noticeable over time;
what are the chances that a cleanup lock won't be available on the
same page (with the same old XID) more than once or twice? The odds of
that happening become astronomically tiny, long before there is any
real danger (barring pathological cases).

In the past, we've always talked about opportunistic freezing as a way
of avoiding re-dirtying heap pages during successive VACUUM operations
-- especially as a way of lowering the total volume of WAL. While I
agree that that's important, I have deliberately ignored it for now,
preferring to focus on the relfrozenxid stuff, and smoothing out the
cost of freezing (avoiding big shocks from aggressive/anti-wraparound
autovacuums). I care more about stable performance than absolute
throughput, but even still I believe that the approach I've taken to
opportunistic freezing is probably too aggressive. But it's dead
simple, which will make it easier to understand and discuss the issue
of central importance. It may be possible to optimize the WAL-logging
used during freezing, getting the cost down to the point where
freezing early just isn't a concern. The current prototype adds extra
WAL overhead, to be sure, but even that's not wildly unreasonable (you
make some of it back on FPIs, depending on the workload -- especially
with tables like pgbench_history, where delaying freezing is a total loss).

--
Peter Geoghegan

Attachments:

v4-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch (+516 -308)
v4-0002-Improve-log_autovacuum_min_duration-output.patch (+91 -32)
v4-0003-Simplify-vacuum_set_xid_limits-signature.patch (+79 -97)
v4-0004-Decouple-advancing-relfrozenxid-from-freezing.patch (+228 -79)
v4-0005-Prototype-of-opportunistic-freezing.patch (+79 -9)
#10 Masahiko Sawada <sawada.mshk@gmail.com>
In reply to: Peter Geoghegan (#9)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Thu, Dec 16, 2021 at 5:27 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Fri, Dec 10, 2021 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote:

* I'm still working on the optimization that we discussed on this
thread: the optimization that allows the final relfrozenxid (that we
set in pg_class) to be determined dynamically, based on the actual
XIDs we observed in the table (we don't just naively use FreezeLimit).

Attached is v4 of the patch series, which now includes this
optimization, broken out into its own patch. In addition, it includes
a prototype of opportunistic freezing.

My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.

Great!

I've looked at 0001 patch and here are some comments:

@@ -535,8 +540,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
                                                    xidFullScanLimit);
        aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
                                                  mxactFullScanLimit);
+       skipwithvm = true;
        if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+       {
+               /*
+                * Force aggressive mode, and disable skipping blocks using the
+                * visibility map (even those set all-frozen)
+                */
                aggressive = true;
+               skipwithvm = false;
+       }

        vacrel = (LVRelState *) palloc0(sizeof(LVRelState));

@@ -544,6 +557,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
        vacrel->rel = rel;
        vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
                         &vacrel->indrels);
+       vacrel->aggressive = aggressive;
        vacrel->failsafe_active = false;
        vacrel->consider_bypass_optimization = true;

How about adding skipwithvm to LVRelState too?

---
                        /*
-                        * The current block is potentially skippable; if we've seen a
-                        * long enough run of skippable blocks to justify skipping it, and
-                        * we're not forced to check it, then go ahead and skip.
-                        * Otherwise, the page must be at least all-visible if not
-                        * all-frozen, so we can set all_visible_according_to_vm = true.
+                        * The current page can be skipped if we've seen a long enough run
+                        * of skippable blocks to justify skipping it -- provided it's not
+                        * the last page in the relation (according to rel_pages/nblocks).
+                        *
+                        * We always scan the table's last page to determine whether it
+                        * has tuples or not, even if it would otherwise be skipped
+                        * (unless we're skipping every single page in the relation). This
+                        * avoids having lazy_truncate_heap() take access-exclusive lock
+                        * on the table to attempt a truncation that just fails
+                        * immediately because there are tuples on the last page.
                         */
-                       if (skipping_blocks && !FORCE_CHECK_PAGE())
+                       if (skipping_blocks && blkno < nblocks - 1)
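Restated as a standalone function (an illustrative paraphrase of the new test, with made-up names; not code from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/*
 * Illustrative paraphrase of the patched skip test: a block inside a long
 * enough skippable run is skipped, except that the relation's final block
 * is always scanned, so lazy_truncate_heap() never takes an
 * access-exclusive lock only to find tuples on the last page.
 */
static bool
scan_this_block(bool skipping_blocks, BlockNumber blkno, BlockNumber nblocks)
{
	if (skipping_blocks && blkno < nblocks - 1)
		return false;			/* skip using the visibility map */
	return true;				/* scan it; always includes the last block */
}
```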

Why do we always need to scan the last page even if heap truncation is
disabled (or in the failsafe mode)?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#11 Peter Geoghegan <pg@bowt.ie>
In reply to: Masahiko Sawada (#10)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.

Great!

Maybe this is a good time to revisit basic questions about VACUUM. I
wonder if we can get rid of some of the GUCs for VACUUM now.

Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of
vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for
physical blocks, but we use logical units (XIDs).

We probably shouldn't be using any units, but using XIDs "feels wrong"
to me. Even with my patch, it is theoretically possible that we won't
be able to advance relfrozenxid very much, because we cannot get a
cleanup lock on one single heap page with one old XID. But even in
this extreme case, how relevant is the "age" of this old XID, really?
What really matters is whether or not we can advance relfrozenxid in
time (with time to spare). And so the wraparound risk of the system is
not affected all that much by the age of the single oldest XID. The
risk mostly comes from how much total work we still need to do to
advance relfrozenxid. If the single old XID is quite old indeed (~1.5
billion XIDs), but there is only one, then we just have to freeze one
tuple to be able to safely advance relfrozenxid (maybe advance it by a
huge amount!). How long can it take to freeze one tuple, with the
freeze map, etc?
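The "determined dynamically" idea mentioned upthread can be modeled in a few lines (names here are illustrative and wraparound is ignored; the real code must use wraparound-aware XID comparisons):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Toy model of tracking the final relfrozenxid dynamically: start from the
 * most optimistic value (OldestXmin), then ratchet the candidate back for
 * every unfrozen XID that remains behind in the table. Plain '<' stands in
 * for TransactionIdPrecedes() here, for simplicity.
 */
typedef struct
{
	TransactionId NewRelfrozenXid;	/* candidate value for pg_class */
} ScanState;

static void
scan_init(ScanState *state, TransactionId OldestXmin)
{
	state->NewRelfrozenXid = OldestXmin;
}

static void
observe_unfrozen_xid(ScanState *state, TransactionId xid)
{
	if (xid < state->NewRelfrozenXid)
		state->NewRelfrozenXid = xid;
}
```

In this model, one ancient XID drags the candidate far back, but freezing that one tuple lets relfrozenxid jump forward by the same amount.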

On the other hand, the risk may be far greater if we have *many*
tuples that are still unfrozen, whose XIDs are only "middle aged"
right now. The idea behind vacuum_freeze_min_age seems to be to be
lazy about work (tuple freezing) in the hope that we'll never need to
do it, but that seems obsolete now. (It probably made a little more
sense before the visibility map.)

Using XIDs makes sense for things like autovacuum_freeze_max_age,
because there we have to worry about wraparound and relfrozenxid
(whether or not we like it). But with this patch, and with everything
else (the failsafe, insert-driven autovacuums, everything we've done
over the last several years) I think that it might be time to increase
the autovacuum_freeze_max_age default. Maybe even to something as high
as 800 million transaction IDs, but certainly to 400 million. What do
you think? (Maybe don't answer just yet, something to think about.)

+ vacrel->aggressive = aggressive;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;

How about adding skipwithvm to LVRelState too?

Agreed -- it's slightly better that way. Will change this.

*/
-                       if (skipping_blocks && !FORCE_CHECK_PAGE())
+                       if (skipping_blocks && blkno < nblocks - 1)

Why do we always need to scan the last page even if heap truncation is
disabled (or in the failsafe mode)?

My goal here was to keep the behavior from commit e8429082, "Avoid
useless truncation attempts during VACUUM", while simplifying things
around skipping heap pages via the visibility map (including removing
the FORCE_CHECK_PAGE() macro). Of course you're right that this
particular change that you have highlighted does change the behavior a
little -- now we will always treat the final page as a "scanned page",
except perhaps when 100% of all pages in the relation are skipped
using the visibility map.

This was a deliberate choice (and perhaps even a good choice!). I
think that avoiding accessing the last heap page like this isn't worth
the complexity. Note that we may already access heap pages (making
them "scanned pages") despite the fact that we know it's unnecessary:
the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't
even try to avoid wasting CPU cycles on these
not-skipped-but-skippable pages). So I think that the performance cost
for the last page isn't going to be noticeable.

However, now that I think about it, I wonder...what do you think of
SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today?
SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly
after the original visibility map implementation was committed in
2009. The idea that it helps us to advance relfrozenxid outside of
aggressive VACUUMs (per commit message from bf136cf6e3) seems like it
might no longer matter with the patch -- because now we won't ever set
a page all-visible but not all-frozen. Plus the idea that we need to
do all this work just to get readahead from the OS
seems...questionable.
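For reference, the rule being questioned reduces to a single threshold test (SKIP_PAGES_THRESHOLD is the real constant; the function name is illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Unchanged since commit bf136cf6e3 (2009) */
#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)

/*
 * Only skip a run of VM-skippable blocks when the run is at least
 * SKIP_PAGES_THRESHOLD long; shorter runs are scanned anyway, on the
 * theory that sequential reads plus OS readahead beat seeking past them.
 */
static bool
run_justifies_skipping(BlockNumber skippable_run_len)
{
	return skippable_run_len >= SKIP_PAGES_THRESHOLD;
}
```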

--
Peter Geoghegan

#12 Masahiko Sawada <sawada.mshk@gmail.com>
In reply to: Peter Geoghegan (#11)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Sat, Dec 18, 2021 at 11:29 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.

Great!

Maybe this is a good time to revisit basic questions about VACUUM. I
wonder if we can get rid of some of the GUCs for VACUUM now.

Can we fully get rid of vacuum_freeze_table_age?

Does it mean that a vacuum always is an aggressive vacuum? If
opportunistic freezing works well on all tables, we might no longer
need vacuum_freeze_table_age. But I’m not sure that’s true since the
cost of freezing tuples is not 0.

We probably shouldn't be using any units, but using XIDs "feels wrong"
to me. Even with my patch, it is theoretically possible that we won't
be able to advance relfrozenxid very much, because we cannot get a
cleanup lock on one single heap page with one old XID. But even in
this extreme case, how relevant is the "age" of this old XID, really?
What really matters is whether or not we can advance relfrozenxid in
time (with time to spare). And so the wraparound risk of the system is
not affected all that much by the age of the single oldest XID. The
risk mostly comes from how much total work we still need to do to
advance relfrozenxid. If the single old XID is quite old indeed (~1.5
billion XIDs), but there is only one, then we just have to freeze one
tuple to be able to safely advance relfrozenxid (maybe advance it by a
huge amount!). How long can it take to freeze one tuple, with the
freeze map, etc?

I think that that's true for (mostly) static tables. But consider
constantly-updated tables: autovacuum runs based on the number of
garbage (or inserted) tuples and on how old relfrozenxid is. So if an
autovacuum could not advance relfrozenxid because it could not get a
cleanup lock on the page that has the single oldest XID, it's likely
that the next autovacuum will have to process other pages too, since
the table will have accumulated enough changes by then.

It might be a good idea that we remember pages where we could not get
a cleanup lock somewhere and revisit them after index cleanup. While
revisiting the pages, we don’t prune the page but only freeze tuples.

On the other hand, the risk may be far greater if we have *many*
tuples that are still unfrozen, whose XIDs are only "middle aged"
right now. The idea behind vacuum_freeze_min_age seems to be to be
lazy about work (tuple freezing) in the hope that we'll never need to
do it, but that seems obsolete now. (It probably made a little more
sense before the visibility map.)

Why is it obsolete now? I guess that it's still valid depending on the
cases, for example, heavily updated tables.

Using XIDs makes sense for things like autovacuum_freeze_max_age,
because there we have to worry about wraparound and relfrozenxid
(whether or not we like it). But with this patch, and with everything
else (the failsafe, insert-driven autovacuums, everything we've done
over the last several years) I think that it might be time to increase
the autovacuum_freeze_max_age default. Maybe even to something as high
as 800 million transaction IDs, but certainly to 400 million. What do
you think? (Maybe don't answer just yet, something to think about.)

I don’t have an objection to increasing autovacuum_freeze_max_age for
now. One of my concerns with anti-wraparound vacuums is that too many
tables (or several large tables) will reach autovacuum_freeze_max_age
at once, using up autovacuum slots and preventing autovacuums from
being launched on tables that are heavily being updated. Given these
works, expanding the gap between vacuum_freeze_table_age and
autovacuum_freeze_max_age would have better chances for the tables to
advance its relfrozenxid by an aggressive vacuum instead of an
anti-wraparound-aggressive vacuum. 400 million seems to be a good
start.

+ vacrel->aggressive = aggressive;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;

How about adding skipwithvm to LVRelState too?

Agreed -- it's slightly better that way. Will change this.

*/
-                       if (skipping_blocks && !FORCE_CHECK_PAGE())
+                       if (skipping_blocks && blkno < nblocks - 1)

Why do we always need to scan the last page even if heap truncation is
disabled (or in the failsafe mode)?

My goal here was to keep the behavior from commit e8429082, "Avoid
useless truncation attempts during VACUUM", while simplifying things
around skipping heap pages via the visibility map (including removing
the FORCE_CHECK_PAGE() macro). Of course you're right that this
particular change that you have highlighted does change the behavior a
little -- now we will always treat the final page as a "scanned page",
except perhaps when 100% of all pages in the relation are skipped
using the visibility map.

This was a deliberate choice (and perhaps even a good choice!). I
think that avoiding accessing the last heap page like this isn't worth
the complexity. Note that we may already access heap pages (making
them "scanned pages") despite the fact that we know it's unnecessary:
the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't
even try to avoid wasting CPU cycles on these
not-skipped-but-skippable pages). So I think that the performance cost
for the last page isn't going to be noticeable.

Agreed.

However, now that I think about it, I wonder...what do you think of
SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today?
SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly
after the original visibility map implementation was committed in
2009. The idea that it helps us to advance relfrozenxid outside of
aggressive VACUUMs (per commit message from bf136cf6e3) seems like it
might no longer matter with the patch -- because now we won't ever set
a page all-visible but not all-frozen. Plus the idea that we need to
do all this work just to get readahead from the OS
seems...questionable.

Given opportunistic freezing, that's true, but I'm concerned about
whether opportunistic freezing always works well on all tables, since
freezing tuples is not free.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#13 Peter Geoghegan <pg@bowt.ie>
In reply to: Masahiko Sawada (#12)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Mon, Dec 20, 2021 at 8:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Can we fully get rid of vacuum_freeze_table_age?

Does it mean that a vacuum always is an aggressive vacuum?

No. Just somewhat more like one. Still no waiting for cleanup locks,
though. Also, autovacuum is still cancelable (that's technically from
anti-wraparound VACUUM, but you know what I mean). And there shouldn't
be a noticeable difference in terms of how many blocks can be skipped
using the VM.

If opportunistic freezing works well on all tables, we might no longer
need vacuum_freeze_table_age. But I’m not sure that’s true since the
cost of freezing tuples is not 0.

That's true, of course, but right now the only goal of opportunistic
freezing is to advance relfrozenxid in every VACUUM. It needs to be
shown to be worth it, of course. But let's assume that it is worth it,
for a moment (perhaps only because we optimize freezing itself in
passing) -- then there is little use for vacuum_freeze_table_age, that
I can see.

I think that that's true for (mostly) static tables. But consider
constantly-updated tables: autovacuum runs based on the number of
garbage (or inserted) tuples and on how old relfrozenxid is. So if an
autovacuum could not advance relfrozenxid because it could not get a
cleanup lock on the page that has the single oldest XID, it's likely
that the next autovacuum will have to process other pages too, since
the table will have accumulated enough changes by then.

I'm not arguing that the age of the single oldest XID is *totally*
irrelevant. Just that it's typically much less important than the
total amount of work we'd have to do (freezing) to be able to advance
relfrozenxid.

In any case, the extreme case where we just cannot get a cleanup lock
on one particular page with an old XID is probably very rare.

It might be a good idea that we remember pages where we could not get
a cleanup lock somewhere and revisit them after index cleanup. While
revisiting the pages, we don’t prune the page but only freeze tuples.

Maybe, but I think that it would make more sense to not use
FreezeLimit for that at all. In an aggressive VACUUM (where we might
actually have to wait for a cleanup lock), why should we wait once the
age is over vacuum_freeze_min_age (usually 50 million XIDs)? The
official answer is "because we need to advance relfrozenxid". But why
not accept a much older relfrozenxid that is still sufficiently
young/safe, in order to avoid waiting for a cleanup lock?

In other words, what if our approach of "being diligent about
advancing relfrozenxid" makes the relfrozenxid problem worse, not
better? The problem with "being diligent" is that it is defined by
FreezeLimit (which is more or less the same thing as
vacuum_freeze_min_age), which is supposed to be about which tuples we
will freeze. That's a very different thing to how old relfrozenxid
should be or can be (after an aggressive VACUUM finishes).

On the other hand, the risk may be far greater if we have *many*
tuples that are still unfrozen, whose XIDs are only "middle aged"
right now. The idea behind vacuum_freeze_min_age seems to be to be
lazy about work (tuple freezing) in the hope that we'll never need to
do it, but that seems obsolete now. (It probably made a little more
sense before the visibility map.)

Why is it obsolete now? I guess that it's still valid depending on the
cases, for example, heavily updated tables.

Because after the 9.6 freezemap work we'll often set the all-visible
bit in the VM, but not the all-frozen bit (unless we have the
opportunistic freezing patch applied, which specifically avoids that).
When that happens, affected heap pages will still have
older-than-vacuum_freeze_min_age-XIDs after VACUUM runs, until we get
to an aggressive VACUUM. There could be many VACUUMs before the
aggressive VACUUM.

This "freezing cliff" seems like it might be a big problem, in
general. That's what I'm trying to address here.

Either way, the system doesn't really respect vacuum_freeze_min_age in
the way that it did before 9.6 -- which is what I meant by "obsolete".
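The mechanics behind this "freezing cliff" come down to the two per-page visibility map bits (bit values as in visibilitymapdefs.h; the predicate names are illustrative, not actual functions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Two bits per heap page, as in src/include/access/visibilitymapdefs.h */
#define VISIBILITYMAP_ALL_VISIBLE	0x01
#define VISIBILITYMAP_ALL_FROZEN	0x02

/*
 * A non-aggressive VACUUM may skip any all-visible page, while an
 * aggressive VACUUM may only skip all-frozen pages. Pages that are
 * all-visible but not all-frozen therefore accumulate unfrozen XIDs
 * until one aggressive VACUUM must revisit them all at once.
 */
static bool
nonaggressive_can_skip(uint8_t vmbits)
{
	return (vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0;
}

static bool
aggressive_can_skip(uint8_t vmbits)
{
	return (vmbits & VISIBILITYMAP_ALL_FROZEN) != 0;
}
```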

I don’t have an objection to increasing autovacuum_freeze_max_age for
now. One of my concerns with anti-wraparound vacuums is that too many
tables (or several large tables) will reach autovacuum_freeze_max_age
at once, using up autovacuum slots and preventing autovacuums from
being launched on tables that are heavily being updated.

I think that the patch helps with that, actually -- there tends to be
"natural variation" in the relfrozenxid age of each table, which comes
from per-table workload characteristics.

Given these
works, expanding the gap between vacuum_freeze_table_age and
autovacuum_freeze_max_age would have better chances for the tables to
advance its relfrozenxid by an aggressive vacuum instead of an
anti-wraparound-aggressive vacuum. 400 million seems to be a good
start.

The idea behind getting rid of vacuum_freeze_table_age (not to be
confused by the other idea about getting rid of vacuum_freeze_min_age)
is this: with the patch series, we only tend to get an anti-wraparound
VACUUM in extreme and relatively rare cases. For example, we will get
aggressive anti-wraparound VACUUMs on tables that *never* grow, but
constantly get HOT updates (e.g. the pgbench_accounts table with heap
fill factor reduced to 90). We won't really be able to use the VM when
this happens, either.

With tables like this -- tables that still get aggressive VACUUMs --
maybe the patch doesn't make a huge difference. But that's truly the
extreme case -- that is true only because there is already zero chance
of there being a non-aggressive VACUUM. We'll get aggressive
anti-wraparound VACUUMs every time we reach autovacuum_freeze_max_age,
again and again -- no change, really.

But since it's only these extreme cases that continue to get
aggressive VACUUMs, why do we still need vacuum_freeze_table_age? It
helps right now (without the patch) by "escalating" a regular VACUUM
to an aggressive one. But the cases that we still expect an aggressive
VACUUM (with the patch) are the cases where there is zero chance of
that happening. Almost by definition.

Given opportunistic freezing, that's true, but I'm concerned about
whether opportunistic freezing always works well on all tables, since
freezing tuples is not free.

That is the big question for this patch.

--
Peter Geoghegan

#14 Peter Geoghegan <pg@bowt.ie>
In reply to: Peter Geoghegan (#13)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Mon, Dec 20, 2021 at 9:35 PM Peter Geoghegan <pg@bowt.ie> wrote:

Given opportunistic freezing, that's true, but I'm concerned about
whether opportunistic freezing always works well on all tables, since
freezing tuples is not free.

That is the big question for this patch.

Attached is a mechanical rebase of the patch series. This new version
just fixes bitrot, caused by Masahiko's recent lazyvacuum.c
refactoring work. In other words, this revision has no significant
changes compared to the v4 that I posted back in late December -- just
want to keep CFTester green.

I still have plenty of work to do here. Especially with the final
patch (the v5-0005-* "freeze early" patch), which is generally more
speculative than the other patches. I'm playing catch-up now, since I
just returned from vacation.

--
Peter Geoghegan

Attachments:

v5-0002-Improve-log_autovacuum_min_duration-output.patchapplication/x-patch; name=v5-0002-Improve-log_autovacuum_min_duration-output.patchDownload+91-32
v5-0004-Loosen-coupling-between-relfrozenxid-and-tuple-fr.patchapplication/x-patch; name=v5-0004-Loosen-coupling-between-relfrozenxid-and-tuple-fr.patchDownload+231-82
v5-0005-Freeze-tuples-early-to-advance-relfrozenxid.patchapplication/x-patch; name=v5-0005-Freeze-tuples-early-to-advance-relfrozenxid.patchDownload+88-9
v5-0003-Simplify-vacuum_set_xid_limits-signature.patchapplication/x-patch; name=v5-0003-Simplify-vacuum_set_xid_limits-signature.patchDownload+79-97
v5-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patchapplication/x-patch; name=v5-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patchDownload+518-307
#15 Robert Haas <robertmhaas@gmail.com>
In reply to: Peter Geoghegan (#11)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Fri, Dec 17, 2021 at 9:30 PM Peter Geoghegan <pg@bowt.ie> wrote:

Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of
vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for
physical blocks, but we use logical units (XIDs).

I don't see how we can get rid of these. We know that catastrophe will
ensue if we fail to freeze old XIDs for a sufficiently long time ---
where sufficiently long has to do with the number of XIDs that have
been subsequently consumed. So it's natural to decide whether or not
we're going to wait for cleanup locks on pages on the basis of how old
the XIDs they contain actually are. Admittedly, that decision doesn't
need to be made at the start of the vacuum, as we do today. We could
happily skip waiting for a cleanup lock on pages that contain only
newer XIDs, but if there is a page that both contains an old XID and
stays pinned for a long time, we eventually have to sit there and wait
for that pin to be released. And the best way to decide when to switch
to that strategy is really based on the age of that XID, at least as I
see it, because it is the age of that XID reaching 2 billion that is
going to kill us.
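The 2-billion figure falls out of how XID comparisons work: XIDs live on a modulo-2^32 circle, so one XID can only be meaningfully "older" than another within a 2^31 window. A sketch in the style of TransactionIdPrecedes():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Wraparound-aware comparison in the style of TransactionIdPrecedes():
 * interpret the unsigned difference as signed, so an XID counts as
 * "older" only within a 2^31 (~2.1 billion) window. Once an unfrozen
 * XID falls further behind than that, comparisons would invert --
 * hence the hard limit that freezing must never let us approach.
 */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
	int32_t		diff = (int32_t) (id1 - id2);

	return diff < 0;
}
```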

I think vacuum_freeze_min_age also serves a useful purpose: it
prevents us from freezing data that's going to be modified again or
even deleted in the near future. Since we can't know the future, we
must base our decision on the assumption that the future will be like
the past: if the page hasn't been modified for a while, then we should
assume it's not likely to be modified again soon; otherwise not. If we
knew the time at which the page had last been modified, it would be
very reasonable to use that here - say, freeze the XIDs if the page
hasn't been touched in an hour, or whatever. But since we lack such
timestamps the XID age is the closest proxy we have.

The
risk mostly comes from how much total work we still need to do to
advance relfrozenxid. If the single old XID is quite old indeed (~1.5
billion XIDs), but there is only one, then we just have to freeze one
tuple to be able to safely advance relfrozenxid (maybe advance it by a
huge amount!). How long can it take to freeze one tuple, with the
freeze map, etc?

I don't really see any reason for optimism here. There could be a lot
of unfrozen pages in the relation, and we'd have to troll through all
of those in order to find that single old XID. Moreover, there is
nothing whatsoever to focus autovacuum's attention on that single old
XID rather than anything else. Nothing in the autovacuum algorithm
will cause it to focus its efforts on that single old XID at a time
when there's no pin on the page, or at a time when that XID becomes
the thing that's holding back vacuuming throughout the cluster. A lot
of vacuum problems that users experience today would be avoided if
autovacuum had perfect knowledge of what it ought to be prioritizing
at any given time, or even some knowledge. But it doesn't, and is
often busy fiddling while Rome burns.

IOW, the time that it takes to freeze that one tuple *in theory* might
be small. But in practice it may be very large, because we won't
necessarily get around to it on any meaningful time frame.

--
Robert Haas
EDB: http://www.enterprisedb.com

#16 Peter Geoghegan <pg@bowt.ie>
In reply to: Robert Haas (#15)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Thu, Jan 6, 2022 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Dec 17, 2021 at 9:30 PM Peter Geoghegan <pg@bowt.ie> wrote:

Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of
vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for
physical blocks, but we use logical units (XIDs).

I don't see how we can get rid of these. We know that catastrophe will
ensue if we fail to freeze old XIDs for a sufficiently long time ---
where sufficiently long has to do with the number of XIDs that have
been subsequently consumed.

I don't really disagree with anything you've said, I think. There are
a few subtleties here. I'll try to tease them apart.

I agree that we cannot do without something like vacrel->FreezeLimit
for the foreseeable future -- but the closely related GUC
(vacuum_freeze_min_age) is another matter. Although everything you've
said in favor of the GUC seems true, the GUC is not a particularly
effective (or natural) way of constraining the problem. It just
doesn't make sense as a tunable.

One obvious reason for this is that the opportunistic freezing stuff
is expected to be the thing that usually forces freezing -- not
vacuum_freeze_min_age, nor FreezeLimit, nor any other XID-based
cutoff. As you more or less pointed out yourself, we still need
FreezeLimit as a backstop mechanism. But the value of FreezeLimit can
just come from autovacuum_freeze_max_age/2 in all cases (no separate
GUC), or something along those lines. We don't particularly expect the
value of FreezeLimit to matter, at least most of the time. It should
only noticeably affect our behavior during anti-wraparound VACUUMs,
which become rare with the patch (e.g. my pgbench_accounts example
upthread). Most individual tables will never get even one
anti-wraparound VACUUM -- it just doesn't ever come for most tables in
practice.
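If FreezeLimit really were derived from autovacuum_freeze_max_age alone, the computation might look something like this (a hypothetical sketch of the suggestion above, not committed code; clamping to FirstNormalTransactionId is omitted):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Hypothetical backstop: derive FreezeLimit as half of
 * autovacuum_freeze_max_age behind the next XID, with no separate
 * vacuum_freeze_min_age GUC. Unsigned subtraction handles the
 * modulo-2^32 wraparound; skipping over the special reserved XIDs
 * (FirstNormalTransactionId) is omitted for brevity.
 */
static TransactionId
compute_freeze_limit(TransactionId next_xid, uint32_t autovac_freeze_max_age)
{
	return next_xid - autovac_freeze_max_age / 2;
}
```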

My big issue with vacuum_freeze_min_age is that it doesn't really work
with the freeze map work in 9.6, which creates problems that I'm
trying to address by freezing early and so on. After all, HEAD (and
all stable branches) can easily set a page to all-visible (but not
all-frozen) in the VM, meaning that the page's tuples won't be
considered for freezing until the next aggressive VACUUM. This means
that vacuum_freeze_min_age is already frequently ignored by the
implementation -- it's conditioned on other things that are practically
impossible to predict.

Curious about your thoughts on this existing issue with
vacuum_freeze_min_age. I am concerned about the "freezing cliff" that
it creates.

So it's natural to decide whether or not
we're going to wait for cleanup locks on pages on the basis of how old
the XIDs they contain actually are.

I agree, but again, it's only a backstop. With the patch we'd have to
be rather unlucky to ever need to wait like this.

What are the chances that we keep failing to freeze an old XID from
one particular page, again and again? My testing indicates that it's a
negligible concern in practice (barring pathological cases with idle
cursors, etc).

I think vacuum_freeze_min_age also serves a useful purpose: it
prevents us from freezing data that's going to be modified again or
even deleted in the near future. Since we can't know the future, we
must base our decision on the assumption that the future will be like
the past: if the page hasn't been modified for a while, then we should
assume it's not likely to be modified again soon; otherwise not.

But the "freeze early" heuristics work a bit like that anyway. We
won't freeze all the tuples on a whole heap page early if we won't
otherwise set the heap page to all-visible (not all-frozen) in the VM
anyway.

If we
knew the time at which the page had last been modified, it would be
very reasonable to use that here - say, freeze the XIDs if the page
hasn't been touched in an hour, or whatever. But since we lack such
timestamps the XID age is the closest proxy we have.

XID age is a *terrible* proxy. The age of an XID in a tuple header may
advance quickly, even when nobody modifies the same table at all.
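A toy illustration of the point (not PostgreSQL code): age(xmin) is nextXID minus xmin, so it grows with cluster-wide XID consumption even when the tuple's own table is never touched again.

```python
def xid_age(next_xid, xmin):
    # Real XIDs wrap around at 2^32; ignored here for clarity.
    return next_xid - xmin

xmin = 1_000  # tuple inserted once, then left alone forever
print(xid_age(1_000, xmin))  # 0 at insertion time

# After 50 million XIDs of unrelated activity elsewhere in the cluster,
# the untouched tuple crosses the vacuum_freeze_min_age default (50
# million) without its page ever having been modified again:
print(xid_age(50_001_000, xmin))  # 50000000
```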

I concede that it is true that we are (in some sense) "gambling" by
freezing early -- we may end up freezing a tuple that we subsequently
update anyway. But aren't we also "gambling" by *not* freezing early?
By not freezing, we risk getting into "freezing debt" that will have
to be paid off in one ruinously large installment. I would much rather
"gamble" on something where we can tolerate consistently "losing" than
gamble on something where I cannot ever afford to lose (even if it's
much less likely that I'll lose during any given VACUUM operation).

Besides all this, I think that we have a rather decent chance of
coming out ahead in practice by freezing early. In practice the
marginal cost of freezing early is consistently pretty low.
Cost-control-driven (as opposed to need-driven) freezing is *supposed*
to be cheaper, of course. And like it or not, freezing is really just part of
the cost of storing data using Postgres (for the time being, at least).

The
risk mostly comes from how much total work we still need to do to
advance relfrozenxid. If the single old XID is quite old indeed (~1.5
billion XIDs), but there is only one, then we just have to freeze one
tuple to be able to safely advance relfrozenxid (maybe advance it by a
huge amount!). How long can it take to freeze one tuple, with the
freeze map, etc?

I don't really see any reason for optimism here.

IOW, the time that it takes to freeze that one tuple *in theory* might
be small. But in practice it may be very large, because we won't
necessarily get around to it on any meaningful time frame.

On second thought I agree that my specific example of 1.5 billion XIDs
was a little too optimistic of me. But 50 million XIDs (i.e. the
vacuum_freeze_min_age default) is too pessimistic. The important point
is that FreezeLimit could plausibly become nothing more than a
backstop mechanism, with the design from the patch series -- something
that typically has no effect on what tuples actually get frozen.

--
Peter Geoghegan

#17Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#16)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Thu, Jan 6, 2022 at 2:45 PM Peter Geoghegan <pg@bowt.ie> wrote:

But the "freeze early" heuristics work a bit like that anyway. We
won't freeze all the tuples on a whole heap page early if we won't
otherwise set the heap page to all-visible (not all-frozen) in the VM
anyway.

I believe that applications tend to update rows according to
predictable patterns. Andy Pavlo made an observation about this at one
point:

https://youtu.be/AD1HW9mLlrg?t=3202

I think that we don't do a good enough job of keeping logically
related tuples (tuples inserted around the same time) together, on the
same original heap page, which motivated a lot of my experiments with
the FSM from last year. Even still, it seems like a good idea for us
to err in the direction of assuming that tuples on the same heap page
are logically related. The tuples should all be frozen together when
possible. And *not* frozen early when the heap page as a whole can't
be frozen (barring cases with one *much* older XID before
FreezeLimit).

--
Peter Geoghegan

#18Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#16)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Thu, Jan 6, 2022 at 5:46 PM Peter Geoghegan <pg@bowt.ie> wrote:

One obvious reason for this is that the opportunistic freezing stuff
is expected to be the thing that usually forces freezing -- not
vacuum_freeze_min_age, nor FreezeLimit, nor any other XID-based
cutoff. As you more or less pointed out yourself, we still need
FreezeLimit as a backstop mechanism. But the value of FreezeLimit can
just come from autovacuum_freeze_max_age/2 in all cases (no separate
GUC), or something along those lines. We don't particularly expect the
value of FreezeLimit to matter, at least most of the time. It should
only noticeably affect our behavior during anti-wraparound VACUUMs,
which become rare with the patch (e.g. my pgbench_accounts example
upthread). Most individual tables will never get even one
anti-wraparound VACUUM -- it just doesn't ever come for most tables in
practice.

This seems like a weak argument. Sure, you COULD hard-code the limit
to be autovacuum_freeze_max_age/2 rather than making it a separate
tunable, but I don't think it's better. I am generally very skeptical
about the idea of using the same GUC value for multiple purposes,
because it often turns out that the optimal value for one purpose is
different than the optimal value for some other purpose. For example,
the optimal amount of memory for a hash table is likely different than
the optimal amount for a sort, which is why we now have
hash_mem_multiplier. When it's not even the same value that's being
used in both places, but the original value in one place and a value
derived from some formula in the other, the chances of things working
out are even less.

I feel generally that a lot of the argument you're making here
supposes that tables are going to get vacuumed regularly. I agree that
IF tables are being vacuumed on a regular basis, and if as part of
that we always push relfrozenxid forward as far as we can, we will
rarely have a situation where aggressive strategies to avoid
wraparound are required. However, I disagree strongly with the idea
that we can assume that tables will get vacuumed regularly. That can
fail to happen for all sorts of reasons. One of the common ones is a
poor choice of autovacuum configuration. The most common problem in my
experience is a cost limit that is too low to permit the amount of
vacuuming that is actually required, but other kinds of problems like
not enough workers (so tables get starved), too many workers (so the
cost limit is being shared between many processes), autovacuum=off
either globally or on one table (because of ... reasons),
autovacuum_vacuum_insert_threshold = -1 plus not many updates (so
nothing ever triggers the vacuum), autovacuum_naptime=1d (actually seen
in the real world! ... and, no, it didn't work well), or stats
collector problems are all possible. We can *hope* that there are
going to be regular vacuums of the table long before wraparound
becomes a danger, but realistically, we better not assume that in our
choice of algorithms, because the real world is a messy place where
all sorts of crazy things happen.

Now, I agree with you in part: I don't think it's obvious that it's
useful to tune vacuum_freeze_table_age. When I advise customers on how
to fix vacuum problems, I am usually telling them to increase
autovacuum_vacuum_cost_limit, possibly also with an increase in
autovacuum_workers; or to increase or decrease
autovacuum_freeze_max_age depending on which problem they have; or
occasionally to adjust settings like autovacuum_naptime. It doesn't
often seem to be necessary to change vacuum_freeze_table_age or, for
that matter, vacuum_freeze_min_age. But if we remove them and then
discover scenarios where tuning them would have been useful, we'll
have no options for fixing PostgreSQL systems in the field. Waiting
for the next major release in such a scenario, or even the next minor
release, is not good. We should be VERY conservative about removing
existing settings if there's any chance that somebody could use them
to tune their way out of trouble.

My big issue with vacuum_freeze_min_age is that it doesn't really work
with the freeze map work in 9.6, which creates problems that I'm
trying to address by freezing early and so on. After all, HEAD (and
all stable branches) can easily set a page to all-visible (but not
all-frozen) in the VM, meaning that the page's tuples won't be
considered for freezing until the next aggressive VACUUM. This means
that vacuum_freeze_min_age is already frequently ignored by the
implementation -- it's conditioned on other things that are practically
impossible to predict.

Curious about your thoughts on this existing issue with
vacuum_freeze_min_age. I am concerned about the "freezing cliff" that
it creates.

So, let's see: if we see a page where the tuples are all-visible and
we seize the opportunity to freeze it, we can spare ourselves the need
to ever visit that page again (unless it gets modified). But if we
only mark it all-visible and leave the freezing for later, the next
aggressive vacuum will have to scan and dirty the page. I'm prepared
to believe that it's worth the cost of freezing the page in that
scenario. We've already dirtied the page and written some WAL and
maybe generated an FPW, so doing the rest of the work now rather than
saving it until later seems likely to be a win. I think it's OK to
behave, in this situation, as if vacuum_freeze_min_age=0.

There's another situation in which vacuum_freeze_min_age could apply,
though: suppose the page isn't all-visible yet. I'd argue that in that
case we don't want to run around freezing stuff unless it's quite old
- like older than vacuum_freeze_table_age, say. Because we know we're
going to have to revisit this page in the next vacuum anyway, and
expending effort to freeze tuples that may be about to be modified
again doesn't seem prudent. So, hmm, on further reflection, maybe it's
OK to remove vacuum_freeze_min_age. But if we do, then I think we had
better carefully distinguish between the case where the page can
thereby be marked all-frozen and the case where it cannot. I guess you
say the same, further down.
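The two cases distinguished above can be sketched as a single page-level decision. This is a hypothetical Python sketch of the policy being discussed, not the patch's code; the names and the plain age threshold are assumptions.

```python
def should_freeze_page(page_will_be_all_visible, oldest_xid_age,
                       backstop_age):
    """Decide whether to freeze all eligible tuples on a heap page.

    Case 1: the page is about to be marked all-visible anyway -- we've
    already dirtied it and written WAL, so finishing the job lets us
    mark it all-frozen and never revisit it.  Equivalent to behaving
    as if vacuum_freeze_min_age = 0 for this page.

    Case 2: the page can't be marked all-visible, so we'll revisit it
    next VACUUM regardless; only freeze if some XID is quite old."""
    if page_will_be_all_visible:
        return True
    return oldest_xid_age >= backstop_age

print(should_freeze_page(True, 0, 100_000_000))             # True
print(should_freeze_page(False, 10_000, 100_000_000))       # False
print(should_freeze_page(False, 150_000_000, 100_000_000))  # True
```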

So it's natural to decide whether or not
we're going to wait for cleanup locks on pages on the basis of how old
the XIDs they contain actually are.

I agree, but again, it's only a backstop. With the patch we'd have to
be rather unlucky to ever need to wait like this.

What are the chances that we keep failing to freeze an old XID from
one particular page, again and again? My testing indicates that it's a
negligible concern in practice (barring pathological cases with idle
cursors, etc).

I mean, those kinds of pathological cases happen *all the time*. Sure,
there are plenty of users who don't leave cursors open. But the ones
who do don't leave them around for short periods of time on randomly
selected pages of the table. They are disproportionately likely to
leave them on the same table pages over and over, just like data can't
in general be assumed to be uniformly accessed. And not uncommonly,
they leave them around until the snow melts.

And we need to worry about those kinds of users, actually much more
than we need to worry about users doing normal things. Honestly,
autovacuum on a system where things are mostly "normal" - no
long-running transactions, adequate resources for autovacuum to do its
job, reasonable configuration settings - isn't that bad. It's true
that there are people who get surprised by an aggressive autovacuum
kicking off unexpectedly, but it's usually the first one during the
cluster lifetime (which is typically the biggest, since the initial
load tends to be bigger than later ones) and it's usually annoying but
survivable. The places where autovacuum becomes incredibly frustrating
are the pathological cases. When insufficient resources are available
to complete the work in a timely fashion, or difficult trade-offs have
to be made, autovacuum is too dumb to make the right choices. And even
if you call your favorite PostgreSQL support provider and they provide
an expert, once it gets behind, autovacuum isn't very tractable: it
will insist on vacuuming everything, right now, in an order that it
chooses, and it's not going to listen to any nonsense from some
human being who thinks they might have some useful advice to provide!

But the "freeze early" heuristics work a bit like that anyway. We
won't freeze all the tuples on a whole heap page early if we won't
otherwise set the heap page to all-visible (not all-frozen) in the VM
anyway.

Hmm, I didn't realize that we had that. Is that an existing thing or
something new you're proposing to do? If existing, where is it?

IOW, the time that it takes to freeze that one tuple *in theory* might
be small. But in practice it may be very large, because we won't
necessarily get around to it on any meaningful time frame.

On second thought I agree that my specific example of 1.5 billion XIDs
was a little too optimistic of me. But 50 million XIDs (i.e. the
vacuum_freeze_min_age default) is too pessimistic. The important point
is that FreezeLimit could plausibly become nothing more than a
backstop mechanism, with the design from the patch series -- something
that typically has no effect on what tuples actually get frozen.

I agree that it's OK for this to become a purely backstop mechanism
... but again, I think that the design of such backstop mechanisms
should be done as carefully as we know how, because users seem to hit
the backstop all the time. We want it to be made of, you know, nylon
twine, rather than, say, sharp nails. :-)

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#18)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Fri, Jan 7, 2022 at 12:24 PM Robert Haas <robertmhaas@gmail.com> wrote:

This seems like a weak argument. Sure, you COULD hard-code the limit
to be autovacuum_freeze_max_age/2 rather than making it a separate
tunable, but I don't think it's better. I am generally very skeptical
about the idea of using the same GUC value for multiple purposes,
because it often turns out that the optimal value for one purpose is
different than the optimal value for some other purpose.

I thought I was being conservative by suggesting
autovacuum_freeze_max_age/2. My first thought was to teach VACUUM to
make its FreezeLimit "OldestXmin - autovacuum_freeze_max_age". To me
these two concepts really *are* the same thing: vacrel->FreezeLimit
becomes a backstop, just as anti-wraparound autovacuum (the
autovacuum_freeze_max_age cutoff) becomes a backstop.

Of course, an anti-wraparound VACUUM will do early freezing in the
same way as any other VACUUM will (with the patch series). So even
when the FreezeLimit backstop XID cutoff actually affects the behavior
of a given VACUUM operation, it may well not be the reason why most
individual tuples that we freeze get frozen. That is, most individual
heap pages will probably have tuples frozen for some other reason.
Though it depends on workload characteristics, most individual heap
pages will typically be frozen as a group, even here. This is a
logical consequence of the fact that tuple freezing and advancing
relfrozenxid are now only loosely coupled -- it's about as loose as
the current relfrozenxid invariant will allow.
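The invariant being relied on here is that relfrozenxid must not exceed the oldest unfrozen XID remaining in the table. A hypothetical Python sketch of the loose coupling (not the patch's code; names and the simplified XID model are assumptions): VACUUM tracks the oldest XID it leaves behind and sets relfrozenxid to exactly that, rather than to the FreezeLimit cutoff.

```python
def new_relfrozenxid(pages, freeze_limit, oldest_xmin):
    """pages: list of per-page lists of tuple xmin values (pre-freeze).

    Tuples older than freeze_limit get frozen, so they no longer
    constrain relfrozenxid; the result is the oldest XID that remains,
    bounded above by OldestXmin (nothing newer can need freezing)."""
    oldest_remaining = oldest_xmin
    for tuples in pages:
        for xmin in tuples:
            if xmin < freeze_limit:
                continue  # frozen during this VACUUM
            oldest_remaining = min(oldest_remaining, xmin)
    return oldest_remaining

pages = [[100, 5_000], [120_000, 90_000]]
print(new_relfrozenxid(pages, freeze_limit=50_000, oldest_xmin=200_000))
# 90000: freezing just two old tuples let relfrozenxid advance far
# past the freeze_limit cutoff of 50000
```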

I feel generally that a lot of the argument you're making here
supposes that tables are going to get vacuumed regularly.

I agree that
IF tables are being vacuumed on a regular basis, and if as part of
that we always push relfrozenxid forward as far as we can, we will
rarely have a situation where aggressive strategies to avoid
wraparound are required.

It's all relative. We hope that (with the patch) cases that only ever
get anti-wraparound VACUUMs are limited to tables where nothing else
drives VACUUM, for sensible reasons related to workload
characteristics (like the pgbench_accounts example upthread). It's
inevitable that some users will misconfigure the system, though -- no
question about that.

I don't see why users that misconfigure the system in this way should
be any worse off than they would be today. They probably won't do
substantially less freezing (usually somewhat more), and will advance
pg_class.relfrozenxid in exactly the same way as today (usually a bit
better, actually). What have I missed?

Admittedly the design of the "Freeze tuples early to advance
relfrozenxid" patch (i.e. v5-0005-*patch) is still unsettled; I need
to verify that my claims about it are really robust. But as far as I
know they are. Reviewers should certainly look at that with a critical
eye.

Now, I agree with you in part: I don't think it's obvious that it's
useful to tune vacuum_freeze_table_age.

That's definitely the easier argument to make. After all,
vacuum_freeze_table_age will do nothing unless VACUUM runs before the
anti-wraparound threshold (autovacuum_freeze_max_age) is reached. The
patch series should be strictly better than that. Primarily because
it's "continuous", and so isn't limited to cases where the table age
falls within the "vacuum_freeze_table_age - autovacuum_freeze_max_age"
goldilocks age range.

We should be VERY conservative about removing
existing settings if there's any chance that somebody could use them
to tune their way out of trouble.

I agree, I suppose, but right now I honestly can't think of a reason
why they would be useful.

If I am wrong about this then I'm probably also wrong about some basic
facet of the high-level design, in which case I should change course
altogether. In other words, removing the GUCs is not an incidental
thing. It's possible that I would never have pursued this project if I
didn't first notice how wrong-headed the GUCs are.

So, let's see: if we see a page where the tuples are all-visible and
we seize the opportunity to freeze it, we can spare ourselves the need
to ever visit that page again (unless it gets modified). But if we
only mark it all-visible and leave the freezing for later, the next
aggressive vacuum will have to scan and dirty the page. I'm prepared
to believe that it's worth the cost of freezing the page in that
scenario.

That's certainly the most compelling reason to perform early freezing.
It's not completely free of downsides, but it's pretty close.

There's another situation in which vacuum_freeze_min_age could apply,
though: suppose the page isn't all-visible yet. I'd argue that in that
case we don't want to run around freezing stuff unless it's quite old
- like older than vacuum_freeze_table_age, say. Because we know we're
going to have to revisit this page in the next vacuum anyway, and
expending effort to freeze tuples that may be about to be modified
again doesn't seem prudent. So, hmm, on further reflection, maybe it's
OK to remove vacuum_freeze_min_age. But if we do, then I think we had
better carefully distinguish between the case where the page can
thereby be marked all-frozen and the case where it cannot. I guess you
say the same, further down.

I do. Although v5-0005-*patch still freezes early when the page is
dirtied by pruning, I have my doubts about that particular "freeze
early" criterion. I believe that everything I just said about
misconfigured autovacuums doesn't rely on anything more than the "most
compelling scenario for early freezing" mechanism that arranges to
make us set the all-frozen bit (not just the all-visible bit).

I mean, those kinds of pathological cases happen *all the time*. Sure,
there are plenty of users who don't leave cursors open. But the ones
who do don't leave them around for short periods of time on randomly
selected pages of the table. They are disproportionately likely to
leave them on the same table pages over and over, just like data can't
in general be assumed to be uniformly accessed. And not uncommonly,
they leave them around until the snow melts.

And we need to worry about those kinds of users, actually much more
than we need to worry about users doing normal things.

I couldn't agree more. In fact, I was mostly thinking about how to
*help* these users. Insisting on waiting for a cleanup lock before it
becomes strictly necessary (when the table age is only 50
million/vacuum_freeze_min_age) is actually a big part of the problem
for these users. vacuum_freeze_min_age enforces a false dichotomy on
aggressive VACUUMs that just isn't helpful. Why should waiting on a
cleanup lock fix anything?

Even in the extreme case where we are guaranteed to eventually have a
wraparound failure in the end (due to an idle cursor in an
unsupervised database), the user is still much better off, I think. We
will have at least managed to advance relfrozenxid to the exact oldest
XID on the one heap page that somebody holds an idle cursor
(conflicting buffer pin) on. And we'll usually have frozen most of the
tuples that need to be frozen. Sure, the user may need to use
single-user mode to run a manual VACUUM, but at least this process
only needs to freeze approximately one tuple to get the system back
online again.

If the DBA notices the problem before the database starts to refuse to
allocate XIDs, then they'll have a much better chance of avoiding a
wraparound failure through simple intervention (like killing the
backend with the idle cursor). We can pay down 99.9% of the "freeze
debt" independently of this intractable problem of something holding
onto an idle cursor.

Honestly,
autovacuum on a system where things are mostly "normal" - no
long-running transactions, adequate resources for autovacuum to do its
job, reasonable configuration settings - isn't that bad.

Right. Autovacuum is "too big to fail".

But the "freeze early" heuristics work a bit like that anyway. We
won't freeze all the tuples on a whole heap page early if we won't
otherwise set the heap page to all-visible (not all-frozen) in the VM
anyway.

Hmm, I didn't realize that we had that. Is that an existing thing or
something new you're proposing to do? If existing, where is it?

It's part of v5-0005-*patch. Still in flux to some degree, because
it's necessary to balance a few things. That shouldn't undermine the
arguments I've made here.

I agree that it's OK for this to become a purely backstop mechanism
... but again, I think that the design of such backstop mechanisms
should be done as carefully as we know how, because users seem to hit
the backstop all the time. We want it to be made of, you know, nylon
twine, rather than, say, sharp nails. :-)

Absolutely. But if autovacuum can only ever run due to
age(relfrozenxid) reaching autovacuum_freeze_max_age, then I can't see
a downside.

Again, the v5-0005-*patch needs to meet the standard that I've laid
out. If it doesn't then I've messed up already.

--
Peter Geoghegan

#20Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#19)
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

On Fri, Jan 7, 2022 at 5:20 PM Peter Geoghegan <pg@bowt.ie> wrote:

I thought I was being conservative by suggesting
autovacuum_freeze_max_age/2. My first thought was to teach VACUUM to
make its FreezeLimit "OldestXmin - autovacuum_freeze_max_age". To me
these two concepts really *are* the same thing: vacrel->FreezeLimit
becomes a backstop, just as anti-wraparound autovacuum (the
autovacuum_freeze_max_age cutoff) becomes a backstop.

I can't follow this. If the idea is that we're going to
opportunistically freeze a page whenever that allows us to mark it
all-visible, then the remaining question is what XID age we should use
to force freezing when that rule doesn't apply. It seems to me that
there is a rebuttable presumption that that case ought to work just as
it does today - and I think I hear you saying that it should NOT work
as it does today, but should use some other threshold. Yet I can't
understand why you think that.

I couldn't agree more. In fact, I was mostly thinking about how to
*help* these users. Insisting on waiting for a cleanup lock before it
becomes strictly necessary (when the table age is only 50
million/vacuum_freeze_min_age) is actually a big part of the problem
for these users. vacuum_freeze_min_age enforces a false dichotomy on
aggressive VACUUMs that just isn't helpful. Why should waiting on a
cleanup lock fix anything?

Because waiting on a lock means that we'll acquire it as soon as it's
available. If you repeatedly call your local Pizzeria Uno's and ask
whether there is a wait, and head to the restaurant only when the
answer is in the negative, you may never get there, because they may
be busy every time you call - especially if you always call around
lunch or dinner time. Even if you eventually get there, it may take
multiple days before you find a time when a table is immediately
available, whereas if you had just gone over there and stood in line,
you likely would have been seated in under an hour and savoring the
goodness of quality deep-dish pizza not too long thereafter. The same
principle applies here.

I do think that waiting for a cleanup lock when the age of the page is
only vacuum_freeze_min_age seems like it might be too aggressive, but
I don't think that's how it works. AFAICS, it's based on whether the
vacuum is marked as aggressive, which has to do with
vacuum_freeze_table_age, not vacuum_freeze_min_age. Let's turn the
question around: if the age of the oldest XID on the page is >150
million transactions and the buffer cleanup lock is not available now,
what makes you think that it's any more likely to be available when
the XID age reaches 200 million or 300 million or 700 million? There
is perhaps an argument for some kind of tunable that eventually shoots
the other session in the head (if we can identify it, anyway) but it
seems to me that regardless of what threshold we pick, polling is
strictly less likely to find a time when the page is available than
waiting for the cleanup lock. It has the counterbalancing advantage of
allowing the autovacuum worker to do other useful work in the meantime
and that is indeed a significant upside, but at some point you're
going to have to give up and admit that polling is a failed strategy,
and it's unclear why 150 million XIDs - or probably even 50 million
XIDs - isn't long enough to say that we're not getting the job done
with half measures.
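The wait-versus-poll distinction can be modeled compactly. In this toy Python sketch (not PostgreSQL's actual buffer-manager API), a threading.Lock stands in for the buffer cleanup lock: a non-aggressive VACUUM tries it once and moves on, while an aggressive VACUUM would block until it is granted, roughly as ConditionalLockBufferForCleanup versus LockBufferForCleanup do in the real code.

```python
import threading

def try_process_page(cleanup_lock, aggressive):
    """Returns True if the page was pruned/frozen, False if skipped."""
    if cleanup_lock.acquire(blocking=False):  # conditional attempt
        # ... prune and freeze the page here ...
        cleanup_lock.release()
        return True
    if aggressive:
        cleanup_lock.acquire()  # block until the pin holder goes away
        cleanup_lock.release()
        return True
    return False  # skipped; "polled" again by some future VACUUM

lock = threading.Lock()
print(try_process_page(lock, aggressive=False))  # True: lock was free

lock.acquire()  # simulate an idle cursor pinning the page
print(try_process_page(lock, aggressive=False))  # False: skip, no wait
```

The poll-forever failure mode described above corresponds to the False branch being taken on the same page in VACUUM after VACUUM.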

--
Robert Haas
EDB: http://www.enterprisedb.com
