New strategies for freezing, advancing relfrozenxid early

Started by Peter Geoghegan · over 3 years ago · 140 messages · pgsql-hackers

Attached patch series is a completely overhauled version of earlier
work on freezing. Related work from the Postgres 15 cycle became
commits 0b018fab, f3c15cbe, and 44fa8488.

Recap
=====

The main high level goal of this work is to avoid painful, disruptive
antiwraparound autovacuums (and other aggressive VACUUMs) that do way
too much "catch up" freezing, all at once, causing significant
disruption to production workloads. The patches teach VACUUM to care
about how far behind it is on freezing for each table -- the number of
unfrozen all-visible pages that have accumulated so far is directly
and explicitly kept under control over time. Unfrozen pages can be
seen as debt. There isn't necessarily anything wrong with getting into
debt (getting into debt to a small degree is all but inevitable), but
debt can be dangerous when it isn't managed carefully. Accumulating
large amounts of debt doesn't always end badly, but it does seem to
reliably create the *risk* that things will end badly.

Right now, a standard append-only table could easily do *all* freezing
in aggressive/antiwraparound VACUUM, without any earlier
non-aggressive VACUUM operations triggered by
autovacuum_vacuum_insert_threshold doing any freezing at all (unless
the user goes out of their way to tune vacuum_freeze_min_age). There
is currently no natural limit on the number of unfrozen all-visible
pages that can accumulate -- unless you count age(relfrozenxid), the
triggering condition for antiwraparound autovacuum. But relfrozenxid
age predicts almost nothing about how much freezing is required (or
will be required later on). The overall result is that it often takes
far too long for freezing to finally happen, even when the table
receives plenty of autovacuums (they all could freeze something, but
in practice just don't freeze anything). It's very hard to avoid that
through tuning, because what we really care about is something pretty
closely related to (if not exactly) the number of unfrozen heap pages
in the system. XID age is fundamentally "the wrong unit" here -- the
physical cost of freezing is the most important thing, by far.

In short, the goal of the patch series/project is to make autovacuum
scheduling much more predictable over time. Especially with very large
append-only tables. The patches improve the performance stability of
VACUUM by managing costs holistically, over time. What happens in one
single VACUUM operation is much less important than the behavior of
successive VACUUM operations over time.

What's new: freezing/skipping strategies
========================================

This newly overhauled version introduces the concept of
per-VACUUM-operation strategies, which we decide on once per VACUUM,
at the very start. There are 2 choices to be made at this point (right
after we acquire OldestXmin and similar cutoffs):

1) Do we scan all-visible pages, or do we skip instead? (Added by
second patch, involves a trade-off between eagerness and laziness.)
2) How should we freeze -- eagerly or lazily? (Added by third patch)

The strategy-based approach can be thought of as something that blurs
the distinction between aggressive and non-aggressive VACUUM, giving
VACUUM more freedom to do either more or less work, based on known
costs and benefits. This doesn't completely supersede
aggressive/antiwraparound VACUUMs, but should make them much rarer
with larger tables, where controlling freeze debt actually matters.
There is a need to keep laziness and eagerness in balance here. We try
to get the benefit of lazy behaviors/strategies, but will still course
correct when it doesn't work out.

A new GUC/reloption called vacuum_freeze_strategy_threshold is added
to control freezing strategy (also influences our choice of skipping
strategy). It defaults to 4GB, so tables smaller than that cutoff
(which are usually the majority of all tables) will continue to freeze
in much the same way as today by default. Our current lazy approach to
freezing makes sense there, and should be preserved for its own sake.

Compatibility
=============

Structuring the new freezing behavior as an explicit user-configurable
strategy is also useful as a bridge between the old and new freezing
behaviors. It makes it fairly easy to get the old/current behavior
where that's preferred -- which, I must admit, is something that
wasn't well thought through last time around. The
vacuum_freeze_strategy_threshold GUC is effectively (though not
explicitly) a compatibility option. Users that want something close to
the old/current behavior can use the GUC or reloption to more or less
opt-out of the new freezing behavior, and can do so selectively. The
GUC should be easy for users to understand, too -- it's just a table
size cutoff.

Skipping pages using a snapshot of the visibility map
=====================================================

We now take a copy of the visibility map at the point that VACUUM
begins, and work off of that when skipping, instead of working off of
the mutable/authoritative VM -- this is a visibility map snapshot.
This new infrastructure helps us to decide on a skipping strategy.
Every non-aggressive VACUUM operation now has a choice to make: Which
skipping strategy should it use? (This was introduced as
item/question #1 a moment ago.)

The decision on skipping strategy is a decision about our priorities
for this table, at this time: Is it more important to advance
relfrozenxid early (be eager), or to skip all-visible pages instead
(be lazy)? If it's the former, then we must scan every single page
that isn't all-frozen according to the VM snapshot (including every
all-visible page). If it's the latter, we'll scan exactly 0
all-visible pages. Either way, once a decision has been made, we don't
leave much to chance -- we commit. ISTM that this is the only approach
that really makes sense. Fundamentally, we advance relfrozenxid a
table at a time, and at most once per VACUUM operation. And for larger
tables it's just impossible as a practical matter to have frequent
VACUUM operations. We ought to be *somewhat* biased in the direction
of advancing relfrozenxid by *some* amount during each VACUUM, even
when relfrozenxid isn't all that old right now.

A strategy (whether for skipping or for freezing) is a big, up-front
decision -- and there are certain kinds of risks that naturally
accompany that approach. The information driving the decision had
better be fairly reliable! By using a VM snapshot, we can choose our
skipping strategy based on precise information about how many *extra*
pages we will have to scan if we go with eager scanning/relfrozenxid
advancement. Concurrent activity cannot change what we scan and what
we skip, either -- everything is locked in from the start. That seems
important to me. It justifies trying to advance relfrozenxid early,
just because the added cost of scanning any all-visible pages happens
to be low.

This is quite a big shift for VACUUM, at least in some ways. The patch
adds a DETAIL to the "starting vacuuming" INFO message shown by VACUUM
VERBOSE. The VERBOSE output is already supposed to work as a
rudimentary progress indicator (at least when it is run at the
database level), so it now shows the final scanned_pages up-front,
before the physical scan of the heap even begins:

regression=# vacuum verbose tenk1;
INFO: vacuuming "regression.public.tenk1"
DETAIL: total table size is 486 pages, 3 pages (0.62% of total) must be scanned
INFO: finished vacuuming "regression.public.tenk1": index scans: 0
pages: 0 removed, 486 remain, 3 scanned (0.62% of total)
*** SNIP ***
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM

I included this VERBOSE tweak in the second patch because it became
natural with VM snapshots, and not because it felt particularly
compelling -- scanned_pages just works like this now (an assertion
verifies that our initial scanned_pages is always an exact match to
what happened during the physical scan, in fact).

There are many things that VM snapshots might also enable that aren't
particularly related to freeze debt. VM snapshotting has the potential
to enable more flexible behavior by VACUUM. I'm thinking of things
like suspend-and-resume for VACUUM/autovacuum, or even autovacuum
scheduling that coordinates autovacuum workers before and during
processing by vacuumlazy.c. Locking-in scanned_pages up-front avoids
the main downside that comes with throttling VACUUM right now: the
fact that simply taking our time during VACUUM will tend to increase
the number of concurrently modified pages that we end up scanning.
These pages are bound to mostly just contain "recently dead" tuples
that the ongoing VACUUM can't do much about anyway -- we could dirty a
lot more heap pages as a result, for little to no benefit.

New patch to avoid allocating MultiXacts
========================================

The fourth and final patch is also new. It corrects an undesirable
consequence of the work done by the earlier patches: it makes VACUUM
avoid allocating new MultiXactIds (unless it's fundamentally
impossible, like in a VACUUM FREEZE). With just the first 3 patches
applied, VACUUM will naively process xmax using a cutoff XID that
comes from OldestXmin (and not FreezeLimit, which is how it works on
HEAD). But with the fourth patch in place VACUUM applies an XID cutoff
of either OldestXmin or FreezeLimit selectively, based on the costs
and benefits for any given xmax.

Just like in lazy_scan_noprune, the low level xmax-freezing code can
pick and choose as it goes, within certain reasonable constraints. We
must accept an older final relfrozenxid/relminmxid value for the rel's
authoritative pg_class tuple as a consequence of avoiding xmax
processing, of course, but that shouldn't matter at all (it's
definitely better than the alternative).

Reducing the WAL space overhead of freezing
===========================================

Not included in this new v1 are other patches that control the
overhead of added freezing -- my focus since joining AWS has been on
getting these more strategic patches into shape, and on telling the
right story about what I'm trying to do here. I'm going to say a
little on the
patches that I have in the pipeline here, though. Getting the
low-level/mechanical overhead of freezing under control will probably
require a few complementary techniques, not just high-level strategies
(though the strategy stuff is the most important piece).

The really interesting omitted-in-v1 patch adds deduplication of
xl_heap_freeze_page WAL records. This reduces the space overhead of
WAL records used to freeze by ~5x in most cases. It works in the
obvious way: we just store the 12 byte freeze plans that appear in
each xl_heap_freeze_page record only once, and then store an array of
item offset numbers for each entry (rather than naively storing a full
12 bytes per tuple frozen per page-level WAL record). This means that
we only need an "extra" ~2 bytes of WAL space per "extra" tuple frozen
(2 bytes for an OffsetNumber) once we decide to freeze something on
the same page. The *marginal* cost can be much lower than it is today,
which makes page-based batching of freezing much more compelling IMV.

Thoughts?
--
Peter Geoghegan

Attachments:

v1-0001-Add-page-level-freezing-to-VACUUM.patch (+230 -135)
v1-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch (+35 -15)
v1-0003-Add-eager-freezing-strategy-to-VACUUM.patch (+155 -19)
v1-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch (+367 -126)
#2 Jeremy Schneider
schnjere@amazon.com
In reply to: Peter Geoghegan (#1)
Re: New strategies for freezing, advancing relfrozenxid early

On 8/25/22 2:21 PM, Peter Geoghegan wrote:

New patch to avoid allocating MultiXacts
========================================

The fourth and final patch is also new. It corrects an undesirable
consequence of the work done by the earlier patches: it makes VACUUM
avoid allocating new MultiXactIds (unless it's fundamentally
impossible, like in a VACUUM FREEZE). With just the first 3 patches
applied, VACUUM will naively process xmax using a cutoff XID that
comes from OldestXmin (and not FreezeLimit, which is how it works on
HEAD). But with the fourth patch in place VACUUM applies an XID cutoff
of either OldestXmin or FreezeLimit selectively, based on the costs
and benefits for any given xmax.

Just like in lazy_scan_noprune, the low level xmax-freezing code can
pick and choose as it goes, within certain reasonable constraints. We
must accept an older final relfrozenxid/relminmxid value for the rel's
authoritative pg_class tuple as a consequence of avoiding xmax
processing, of course, but that shouldn't matter at all (it's
definitely better than the alternative).

We should be careful here. IIUC, the current autovac behavior helps
bound the "spread" or range of active multixact IDs in the system, which
directly determines the number of distinct pages that contain those
multixacts. If the proposed change herein causes the spread/range of
MXIDs to significantly increase, then it will increase the number of
blocks and increase the probability of thrashing on the SLRUs for these
data structures. There may be another separate thread or two about
issues with SLRUs already?

-Jeremy

PS. see also
/messages/by-id/247e3ce4-ae81-d6ad-f54d-7d3e0409a950@ardentperf.com

--
Jeremy Schneider
Database Engineer
Amazon Web Services

#3 Peter Geoghegan
pg@bowt.ie
In reply to: Jeremy Schneider (#2)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Aug 25, 2022 at 3:35 PM Jeremy Schneider <schnjere@amazon.com> wrote:

We should be careful here. IIUC, the current autovac behavior helps
bound the "spread" or range of active multixact IDs in the system, which
directly determines the number of distinct pages that contain those
multixacts. If the proposed change herein causes the spread/range of
MXIDs to significantly increase, then it will increase the number of
blocks and increase the probability of thrashing on the SLRUs for these
data structures.

As a general rule VACUUM will tend to do more eager freezing with the
patch set compared to HEAD, though it should never do less eager
freezing. Not even in corner cases -- never.

With the patch, VACUUM pretty much uses the most aggressive possible
XID-wise/MXID-wise cutoffs in almost all cases (though only when we
actually decide to freeze a page at all, which is now a separate
question). The fourth patch in the patch series introduces a very
limited exception, where we use the same cutoffs that we'll always use
on HEAD (FreezeLimit + MultiXactCutoff) instead of the aggressive
variants (OldestXmin and OldestMxact). This isn't just *any* xmax
containing a MultiXact: it's a Multi that contains *some* XIDs that
*need* to go away during the ongoing VACUUM, and others that *cannot*
go away. Oh, and there usually has to be a need to keep two or more
XIDs for this to happen -- if there is only one XID then we can
usually swap xmax with that XID without any fuss.

PS. see also
/messages/by-id/247e3ce4-ae81-d6ad-f54d-7d3e0409a950@ardentperf.com

I think that the problem you describe here is very real, though I
suspect that it needs to be addressed by making opportunistic cleanup
of Multis happen more reliably. Running VACUUM more often just isn't
practical once a table reaches a certain size. In general, any kind of
processing that is time sensitive probably shouldn't be happening
solely during VACUUM -- it's just too risky. VACUUM might take a
relatively long time to get to the affected page. It might not even be
that long in wall clock time or whatever -- just too long to reliably
avoid the problem.

--
Peter Geoghegan

#4 Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#3)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Aug 25, 2022 at 4:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

As a general rule VACUUM will tend to do more eager freezing with the
patch set compared to HEAD, though it should never do less eager
freezing. Not even in corner cases -- never.

Come to think of it, I don't think that that's quite true. Though the
fourth patch isn't particularly related to the problem.

It *is* true that VACUUM will do at least as much freezing of XID
based tuple header fields as before. That just leaves MXIDs. It's even
true that we will do just as much freezing as before if you go pure on
MultiXact-age. But I'm the one that likes to point out that age is
altogether the wrong approach for stuff like this -- so that won't cut
it.

More concretely, I think that the patch series will fail to do certain
inexpensive eager processing of tuple xmax, that will happen today,
regardless of what the user has set vacuum_freeze_min_age or
vacuum_multixact_freeze_min_age to. Although we currently only care
about XID age when processing simple XIDs, we already sometimes make
trade-offs similar to the trade-off I propose to make in the fourth
patch for Multis.

In other words, on HEAD, we promise to process any MXID <
MultiXactCutoff inside FreezeMultiXactId(). But we also manage to do
"eager processing of xmax" when it's cheap and easy to do so, without
caring about MultiXactCutoff at all -- this is likely the common case,
even. This preexisting eager processing of Multis is likely important
in many applications.

The problem that I think I've created is that page-level freezing as
implemented in lazy_scan_prune by the third patch doesn't know that it
might be a good idea to execute a subset of freeze plans, in order to
remove a multi from a page right away. It mostly has the right idea by
holding off on freezing until it looks like a good idea at the level
of the whole page, but I think that this is a plausible exception,
just because we're much more sensitive to leaving behind a Multi, and
right now the only code path that can remove a Multi that isn't needed
anymore is FreezeMultiXactId().

If xmax was an updater that aborted instead of a multi then we could
rely on hint bits being set by pruning to avoid clog lookups.
Technically nobody has violated their contract here, I think, but it
still seems like it could easily be unacceptable.

I need to come up with my own microbenchmark suite for Multis -- that
was on my TODO list already. I still believe that the fourth patch
addresses Andres' complaint about allocating new Multis during VACUUM.
But it seems like I need to think about the nuances of Multis some
more. In particular, what the performance impact might be of making a
decision on freezing at the page level, in light of the special
performance considerations for Multis.

Maybe it needs to be more granular than that, more often. Or maybe we
can comprehensively solve the problem in some other way entirely.
Maybe pruning should do this instead, in general. Something like that
might put this right, and be independently useful.

--
Peter Geoghegan

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#1)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, 2022-08-25 at 14:21 -0700, Peter Geoghegan wrote:

The main high level goal of this work is to avoid painful, disruptive
antiwraparound autovacuums (and other aggressive VACUUMs) that do way
too much "catch up" freezing, all at once, causing significant
disruption to production workloads.

Sounds like a good goal, and loosely follows the precedent of
checkpoint targets and vacuum cost delays.

A new GUC/reloption called vacuum_freeze_strategy_threshold is added
to control freezing strategy (also influences our choice of skipping
strategy). It defaults to 4GB, so tables smaller than that cutoff
(which are usually the majority of all tables) will continue to freeze
in much the same way as today by default. Our current lazy approach to
freezing makes sense there, and should be preserved for its own sake.

Why is the threshold per-table? Imagine someone who has a bunch of 4GB
partitions that add up to a huge amount of deferred freezing work.

The initial problem you described is a system-level problem, so it
seems we should track the overall debt in the system in order to keep
up.

for this table, at this time: Is it more important to advance
relfrozenxid early (be eager), or to skip all-visible pages instead
(be lazy)? If it's the former, then we must scan every single page
that isn't all-frozen according to the VM snapshot (including every
all-visible page).

This feels too absolute, to me. If the goal is to freeze more
incrementally, well in advance of wraparound limits, then why can't we
just freeze 1000 out of 10000 freezable pages on this run, and then
leave the rest for a later run?

Thoughts?

What if we thought about this more like a "background freezer". It
would keep track of the total number of unfrozen pages in the system,
and freeze them at some kind of controlled/adaptive rate.

Regular autovacuum's job would be to keep advancing relfrozenxid for
all tables and to do other cleanup, and the background freezer's job
would be to keep the absolute number of unfrozen pages under some
limit. Conceptually those two jobs seem different to me.

Also, regarding patch v1-0001-Add-page-level-freezing, do you think
that narrows the conceptual gap between an all-visible page and an all-
frozen page?

Regards,
Jeff Davis

#6 Peter Geoghegan
pg@bowt.ie
In reply to: Jeff Davis (#5)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Aug 29, 2022 at 11:47 AM Jeff Davis <pgsql@j-davis.com> wrote:

Sounds like a good goal, and loosely follows the precedent of
checkpoint targets and vacuum cost delays.

Right.

Why is the threshold per-table? Imagine someone who has a bunch of 4GB
partitions that add up to a huge amount of deferred freezing work.

I think it's possible that our cost model will eventually become very
sophisticated, and weigh all kinds of different factors, and work as
one component of a new framework that dynamically schedules autovacuum
workers. My main goal in posting this v1 was validating the *general
idea* of strategies with cost models, and the related question of how
we might use VM snapshots for that. After all, even the basic concept
is totally novel.

The initial problem you described is a system-level problem, so it
seems we should track the overall debt in the system in order to keep
up.

I agree that the problem is fundamentally a system-level problem. One
reason why vacuum_freeze_strategy_threshold works at the table level
right now is to get the ball rolling. In any case the specifics of how
we trigger each strategy are far from settled. That's not the only
reason why we think about things at the table level in the patch set,
though.

There *are* some fundamental reasons why we need to care about
individual tables, rather than caring about unfrozen pages at the
system level *exclusively*. This is something that
vacuum_freeze_strategy_threshold kind of gets right already, despite
its limitations. There are 2 aspects of the design that seemingly have
to work at the whole table level:

1. Concentration matters when it comes to wraparound risk.

Fundamentally, each VACUUM still targets exactly one heap rel, and
advances relfrozenxid at most once per VACUUM operation. While the
total number of "unfrozen heap pages" across the whole database is the
single most important metric, it's not *everything*.

As a general rule, there is much less risk in having a certain fixed
number of unfrozen heap pages spread fairly evenly among several
larger tables, compared to the case where the same number of unfrozen
pages are all concentrated in one particular table -- right now it'll
often be one particular table that is far larger than any other table.
Right now the pain is generally felt with large tables only.

2. We need to think about things at the table level to manage costs
*over time* holistically. (Closely related to #1.)

The ebb and flow of VACUUM for one particular table is a big part of
the picture here -- and will be significantly affected by table size.
We can probably always afford to risk falling behind on
freezing/relfrozenxid (i.e. we should prefer laziness) if we know that
we'll almost certainly be able to catch up later when things don't
quite work out. That makes small tables much less trouble, even when
there are many more of them (at least up to a point).

As you know, my high level goal is to avoid ever having to make huge
balloon payments to catch up on freezing, which is a much bigger risk
with a large table -- this problem is mostly a per-table problem (both
now and in the future).

A large table will naturally require fewer, larger VACUUM operations
than a small table, no matter what approach is taken with the strategy
stuff. We therefore have fewer VACUUM operations in a given
week/month/year/whatever to spread out the burden -- there will
naturally be fewer opportunities. We want to create the impression
that each autovacuum does approximately the same amount of work (or at
least the same per new heap page for large append-only tables).

It also becomes much more important to only dirty each heap page
during vacuuming ~once with larger tables. With a smaller table, there
is a much higher chance that the pages we modify will already be dirty
from user queries.

for this table, at this time: Is it more important to advance
relfrozenxid early (be eager), or to skip all-visible pages instead
(be lazy)? If it's the former, then we must scan every single page
that isn't all-frozen according to the VM snapshot (including every
all-visible page).

This feels too absolute, to me. If the goal is to freeze more
incrementally, well in advance of wraparound limits, then why can't we
just freeze 1000 out of 10000 freezable pages on this run, and then
leave the rest for a later run?

My remarks here applied only to the question of relfrozenxid
advancement -- not to freezing. Skipping strategy (relfrozenxid
advancement) is a distinct though related concept to freezing
strategy. So I was making a very narrow statement about
invariants/basic correctness rules -- I wasn't arguing against
alternative approaches to freezing beyond the 2 freezing strategies
(not to be confused with skipping strategies) that appear in v1.
That's all I meant -- there is definitely no point in scanning only a
subset of the table's all-visible pages, as far as relfrozenxid
advancement is concerned (and skipping strategy is fundamentally a
choice about relfrozenxid advancement vs work avoidance, eagerness vs
laziness).

Maybe you're right that there is room for additional freezing
strategies, besides the two added by v1-0003-*patch. Definitely seems
possible. The freezing strategy concept should be usable as a
framework for adding additional strategies, including (just for
example) a strategy that decides ahead of time to freeze only so many
pages, though not others (without regard for the fact that the pages
that we are freezing may not be very different to those we won't be
freezing in the current VACUUM).

I'm definitely open to that. It's just a matter of characterizing what
set of workload characteristics this third strategy would solve, how
users might opt in or opt out, etc. Both the eager and the lazy
freezing strategies are based on some notion of what's important for
the table, based on its known characteristics, and based on what seems
likely to happen to the table in the future (the next VACUUM, at least).
I'm not completely sure how many strategies we'll end up needing.
Though it seems like the eager/lazy trade-off is a really important
part of how these strategies will need to work, in general.

(Thinks some more) I guess that such an alternative freezing strategy
would probably have to affect the skipping strategy too. It's tricky
to tease apart because it breaks the idea that skipping strategy and
freezing strategy are basically distinct questions. That is a factor
that makes it a bit more complicated to discuss. In any case, as I
said, I have an open mind about alternative freezing strategies beyond
the 2 basic lazy/eager freezing strategies from the patch.

What if we thought about this more like a "background freezer". It
would keep track of the total number of unfrozen pages in the system,
and freeze them at some kind of controlled/adaptive rate.

I like the idea of storing metadata in shared memory. And scheduling
and deprioritizing running autovacuums. Being able to slow down or
even totally halt a given autovacuum worker without much consequence
is enabled by the VM snapshot concept.

That said, this seems like future work to me. Worth discussing, but
trying to keep out of scope in the first version of this that is
committed.

Regular autovacuum's job would be to keep advancing relfrozenxid for
all tables and to do other cleanup, and the background freezer's job
would be to keep the absolute number of unfrozen pages under some
limit. Conceptually those two jobs seem different to me.

The problem with making it such a sharp distinction is that it can be
very useful to manage costs by making it the job of VACUUM to do both
-- we can avoid dirtying the same page multiple times.

I think that we can accomplish the same thing by giving VACUUM more
freedom to do either more or less work, based on the observed
characteristics of the table, and some sense of how costs will tend to
work over time, across multiple distinct VACUUM operations. In
practice that might end up looking very similar to what you describe.

It seems undesirable for VACUUM to ever be too sure of itself -- the
information that triggers autovacuum may not be particularly reliable,
which can be solved to some degree by making as many decisions as
possible at runtime, dynamically, based on the most authoritative and
recent information. Delaying committing to one particular course of
action isn't always possible, but when it is possible (and not too
expensive) we should do it that way on general principle.

Also, regarding patch v1-0001-Add-page-level-freezing, do you think
that narrows the conceptual gap between an all-visible page and an all-
frozen page?

Yes, definitely. However, I don't think that we can just get rid of
the distinction completely -- though I did think about it for a while.
For one thing we need to be able to handle cases like the one where
heap_lock_tuple() modifies an all-frozen page, and makes it
all-visible without making it completely unskippable to every VACUUM
operation.

--
Peter Geoghegan

#7 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#1)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, 2022-08-25 at 14:21 -0700, Peter Geoghegan wrote:

Attached patch series is a completely overhauled version of earlier
work on freezing. Related work from the Postgres 15 cycle became
commits 0b018fab, f3c15cbe, and 44fa8488.

Recap
=====

The main high level goal of this work is to avoid painful, disruptive
antiwraparound autovacuums (and other aggressive VACUUMs) that do way
too much "catch up" freezing, all at once

I agree with the motivation: that keeping around a lot of deferred work
(unfrozen pages) is risky, and that administrators would want a way to
control that risk.

The solution involves more changes to the philosophy and mechanics of
vacuum than I would expect, though. For instance, VM snapshotting,
page-level-freezing, and a cost model all might make sense, but I don't
see why they are critical for solving the problem above. I think I'm
still missing something. My mental model is closer to the bgwriter and
checkpoint_completion_target.

Allow me to make a naive counter-proposal (not a real proposal, just so
I can better understand the contrast with your proposal):

* introduce a reloption unfrozen_pages_target (default -1, meaning
infinity, which is the current behavior)
* introduce two fields to LVRelState: n_pages_frozen and
delay_skip_count, both initialized to zero
* when marking a page frozen: n_pages_frozen++
* when vacuum begins:

      if (unfrozen_pages_target >= 0 &&
          current_unfrozen_page_count > unfrozen_pages_target)
      {
          vacrel->delay_skip_count = current_unfrozen_page_count -
              unfrozen_pages_target;
          /* ?also use more aggressive freezing thresholds? */
      }

* in lazy_scan_skip(), have a final check:

      if (vacrel->n_pages_frozen < vacrel->delay_skip_count)
      {
          break;
      }
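
For concreteness, the accounting in the sketch above can be written out as a
standalone simulation. Everything here uses the hypothetical names from the
counter-proposal (`unfrozen_pages_target`, `delay_skip_count`,
`n_pages_frozen`) -- none of these are existing PostgreSQL symbols:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct LVRelStateSketch
{
    int64_t n_pages_frozen;   /* pages frozen so far in this VACUUM */
    int64_t delay_skip_count; /* pages to freeze before skipping resumes */
} LVRelStateSketch;

/* At the start of VACUUM: how far over target are we on unfrozen pages? */
static void
vacuum_begin_sketch(LVRelStateSketch *vacrel,
                    int64_t unfrozen_pages_target,
                    int64_t current_unfrozen_page_count)
{
    vacrel->n_pages_frozen = 0;
    vacrel->delay_skip_count = 0;
    if (unfrozen_pages_target >= 0 &&
        current_unfrozen_page_count > unfrozen_pages_target)
        vacrel->delay_skip_count =
            current_unfrozen_page_count - unfrozen_pages_target;
}

/* In lazy_scan_skip(): may we start skipping all-visible pages yet? */
static bool
may_skip_sketch(const LVRelStateSketch *vacrel)
{
    return vacrel->n_pages_frozen >= vacrel->delay_skip_count;
}
```

With a target of 100 pages and 350 currently unfrozen, skipping is delayed
until 250 pages have been frozen; with the default target of -1, skipping is
never delayed.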

I know there would still be some problem cases, but to me it seems like
we solve 80% of the problem in a couple dozen lines of code.

a. Can you clarify some of the problem cases, and why it's worth
spending more code to fix them?

b. How much of your effort is groundwork for related future
improvements? If it's a substantial part, can you explain in that
larger context?

c. Can some of your patches be separated into independent discussions?
For instance, patch 1 has been discussed in other threads and seems
independently useful, and I don't see the current work as dependent on
it. Patch 4 also seems largely independent.

d. Can you help give me a sense of scale of the problems solved by
visibilitymap snapshots and the cost model? Do those need to be in v1?

Regards,
Jeff Davis

#8Peter Geoghegan
pg@bowt.ie
In reply to: Jeff Davis (#7)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Aug 30, 2022 at 11:11 AM Jeff Davis <pgsql@j-davis.com> wrote:

The solution involves more changes to the philosophy and mechanics of
vacuum than I would expect, though. For instance, VM snapshotting,
page-level-freezing, and a cost model all might make sense, but I don't
see why they are critical for solving the problem above.

I certainly wouldn't say that they're critical. I tend to doubt that I
can be perfectly crisp about what the exact relationship is between
each component in isolation and how it contributes towards addressing
the problems we're concerned with.

I think I'm
still missing something. My mental model is closer to the bgwriter and
checkpoint_completion_target.

That's not a bad starting point. The main thing that that mental model
is missing is how the timeframes work with VACUUM, and the fact that
there are multiple timeframes involved (maybe the system's vacuuming
work could be seen as having one timeframe at the highest level, but
it's more of a fractal picture overall). Checkpoints just don't take
that long, and checkpoint duration has a fairly low variance (barring
pathological performance problems).

You only have so many buffers that you can dirty, too -- it's a
self-limiting process. This is even true when (for whatever reason)
the checkpoint_completion_target logic just doesn't do what it's
supposed to do. There is more or less a natural floor on how bad
things can get, so you don't have to invent a synthetic floor at all.
LSM-based DB systems like the MyRocks storage engine for MySQL don't
use checkpoints at all -- the closest analog is compaction, which is
closer to a hybrid of VACUUM and checkpointing than anything else.

The LSM compaction model necessitates adding artificial throttling to
keep the system stable over time [1]. There is a disconnect between
the initial ingest of data, and the compaction process. And so
top-down modelling of costs and benefits with compaction is more
natural with an LSM [2] -- and not a million miles from the strategy
stuff I'm proposing.

Allow me to make a naive counter-proposal (not a real proposal, just so
I can better understand the contrast with your proposal):

I know there would still be some problem cases, but to me it seems like
we solve 80% of the problem in a couple dozen lines of code.

It's not that this statement is wrong, exactly. It's that I believe
that it is all but mandatory for me to ameliorate the downside that
goes with more eager freezing, for example by not doing it at all when
it doesn't seem to make sense. I want to solve the big problem of
freeze debt, without creating any new problems. And if I should also
make things in adjacent areas better too, so much the better.

Why stop at a couple dozen lines of code? Why not just change
the default of vacuum_freeze_min_age and
vacuum_multixact_freeze_min_age to 0?

a. Can you clarify some of the problem cases, and why it's worth
spending more code to fix them?

For one thing if we're going to do a lot of extra freezing, we really
want to "get credit" for it afterwards, by updating relfrozenxid to
reflect the new oldest extant XID, and so avoid getting an
antiwraparound VACUUM early, in the near future.

That isn't strictly true, of course. But I think that we at least
ought to have a strong bias in the direction of updating relfrozenxid,
having decided to do significantly more freezing in some particular
VACUUM operation.

b. How much of your effort is groundwork for related future
improvements? If it's a substantial part, can you explain in that
larger context?

Hard to say. It's true that the idea of VM snapshots is quite general,
and could have been introduced in a number of different ways. But I
don't think that that should count against it. It's also not something
that seems contrived or artificial -- it's at least as good of a
reason to add VM snapshots as any other I can think of.

Does it really matter if this project is the freeze debt project, or
the VM snapshot project? Do we even need to decide which one it is
right now?

c. Can some of your patches be separated into independent discussions?
For instance, patch 1 has been discussed in other threads and seems
independently useful, and I don't see the current work as dependent on
it.

I simply don't know if I can usefully split it up just yet.

Patch 4 also seems largely independent.

Patch 4 directly compensates for a problem created by the earlier
patches. The patch series as a whole isn't supposed to ameliorate the
problem of MultiXacts being allocated in VACUUM. It only needs to
avoid making the situation any worse than it is today IMV (I suspect
that the real fix is to make the VACUUM FREEZE command not tune
vacuum_freeze_min_age).

d. Can you help give me a sense of scale of the problems solved by
visibilitymap snapshots and the cost model? Do those need to be in v1?

I'm not sure. I think that having certainty that we'll be able to scan
only so many pages up-front is very broadly useful, though. Plus it
removes the SKIP_PAGES_THRESHOLD stuff, which was intended to enable
relfrozenxid advancement in non-aggressive VACUUMs, but does so in a
way that results in scanning many more pages needlessly. See commit
bf136cf6, which added the SKIP_PAGES_THRESHOLD stuff back in 2009,
shortly after the visibility map first appeared.

Since relfrozenxid advancement fundamentally works at the table level,
it seems natural to make it a top-down, VACUUM-level thing -- even
within non-aggressive VACUUMs (I guess it already meets that
description in aggressive VACUUMs). And since we really want to
advance relfrozenxid when we do extra freezing (for the reasons I just
went into), it seems natural to me to view it as one problem. I accept
that it's not clear cut, though.

[1]: https://docs.google.com/presentation/d/1WgP-SlKay5AnSoVDSvOIzmu7edMmtYhdywoa0oAR4JQ/edit?usp=sharing
[2]: https://disc-projects.bu.edu/compactionary/research.html
--
Peter Geoghegan

#9Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#8)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Aug 30, 2022 at 1:45 PM Peter Geoghegan <pg@bowt.ie> wrote:

d. Can you help give me a sense of scale of the problems solved by
visibilitymap snapshots and the cost model? Do those need to be in v1?

I'm not sure. I think that having certainty that we'll be able to scan
only so many pages up-front is very broadly useful, though. Plus it
removes the SKIP_PAGES_THRESHOLD stuff, which was intended to enable
relfrozenxid advancement in non-aggressive VACUUMs, but does so in a
way that results in scanning many more pages needlessly. See commit
bf136cf6, which added the SKIP_PAGES_THRESHOLD stuff back in 2009,
shortly after the visibility map first appeared.

Here is a better example:

Right now the second patch adds both VM snapshots and the skipping
strategy stuff. The VM snapshot is used in the second patch, as a
source of reliable information about how we need to process the table,
in terms of the total number of scanned_pages -- which drives our
choice of strategy. Importantly, we can assess the question of which
skipping strategy to take (in non-aggressive VACUUM) based on 100%
accurate information about how many *extra* pages we'll have to scan
in the event of being eager (i.e. in the event that we prioritize
early relfrozenxid advancement over skipping some pages). Importantly,
that cannot change later on, since VM snapshots are immutable --
everything is locked in. That already seems quite valuable to me.
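
That decision can be sketched in a few lines. The struct and function names
here are hypothetical (they are not symbols from the patch series); the point
is only that, because the snapshot is immutable, the "extra pages" figure the
decision rests on is exact rather than an estimate:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct VMSnapCounts
{
    int64_t rel_pages;   /* total heap pages in the relation */
    int64_t all_frozen;  /* skippable under either strategy */
    int64_t all_visible; /* extra pages scanned only when eager */
} VMSnapCounts;

/*
 * Choose between skipping strategies up front.  Concurrent activity
 * cannot unset VM bits out from under us mid-VACUUM: the snapshot is a
 * set of instructions that vacuumlazy.c must follow, so the cost of
 * eagerness is locked in before the first heap page is read.
 */
static bool
choose_eager_scan(const VMSnapCounts *snap, double max_extra_frac)
{
    int64_t extra_pages = snap->all_visible; /* exact cost of eagerness */

    /* Be eager when the added scan cost is a small fraction of the table */
    return extra_pages <= (int64_t) (snap->rel_pages * max_extra_frac);
}
```

For a 1000-page table with 40 all-visible (but not all-frozen) pages, a 5%
budget permits the eager strategy; with 400 such pages it does not.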

This general concept could be pushed a lot further without great
difficulty. Since VM snapshots are immutable, it should be relatively
easy to have the implementation make its final decision on skipping
only *after* lazy_scan_heap() returns. We could allow VACUUM to
"change its mind about skipping" in cases where it initially thought
that skipping was the best strategy, only to discover much later on
that that was the wrong choice after all.

A huge amount of new, reliable information will come to light from
scanning the heap rel. In particular, the current value of
vacrel->NewRelfrozenXid seems like it would be particularly
interesting when the time came to consider if a second scan made sense
-- if NewRelfrozenXid is a recent-ish value already, then that argues
for finishing off the all-visible pages in a second heap pass, with
the aim of setting relfrozenxid to a similarly recent value when it
happens to be cheap to do so.

The actual process of scanning precisely those all-visible pages that
were skipped the first time around during a second call to
lazy_scan_heap() can be implemented in the obvious way: by teaching
the VM snapshot infrastructure/lazy_scan_skip() to treat pages that
were skipped the first time around to get scanned during the second
pass over the heap instead. Also, those pages that were scanned the
first time around can/must be skipped on our second pass (excluding
all-frozen pages, which won't be scanned in either heap pass).

I've used the term "second heap pass" here, but that term is slightly
misleading. The final outcome of this whole process is that every heap
page that the vmsnap says VACUUM will need to scan in order for it to
be able to safely advance relfrozenxid will be scanned, precisely
once. The overall order that the heap pages are scanned in will of
course differ from the simple case, but I don't think that it makes
very much difference. In reality there will have only been one heap
pass, consisting of two distinct phases. No individual heap page will
ever be considered for pruning/freezing more than once, no matter
what. This is just a case of *reordering* work. Immutability makes
reordering work easy in general.
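
The partitioning of pages across the two phases can be sketched as follows
(hypothetical helper, not a function from the patch -- it only illustrates
why each page is scanned exactly once):

```c
#include <stdbool.h>

/* Simplified per-block state, as recorded in the immutable VM snapshot */
typedef enum { PAGE_ALL_FROZEN, PAGE_ALL_VISIBLE, PAGE_OTHER } VMBit;

/*
 * Is this block scanned in the given pass?  With an immutable snapshot
 * the two passes exactly partition the non-frozen pages: pages skipped
 * in pass 1 (all-visible) are scanned in pass 2 if VACUUM "changes its
 * mind", pages scanned in pass 1 are skipped in pass 2, and all-frozen
 * pages are never scanned in either pass.
 */
static bool
scan_in_pass(VMBit bit, int pass)
{
    if (bit == PAGE_ALL_FROZEN)
        return false;       /* skipped in both passes */
    if (bit == PAGE_ALL_VISIBLE)
        return pass == 2;   /* deferred to the second phase */
    return pass == 1;       /* must-scan pages go in the first phase */
}
```

Every non-frozen page satisfies exactly one of the two passes, which is the
"one heap pass, two distinct phases" property described above.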

--
Peter Geoghegan

#10Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#9)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, 2022-08-30 at 18:50 -0700, Peter Geoghegan wrote:

Since VM snapshots are immutable, it should be relatively
easy to have the implementation make its final decision on skipping
only *after* lazy_scan_heap() returns.

I like this idea.

Regards,
Jeff Davis

#11Peter Geoghegan
pg@bowt.ie
In reply to: Jeff Davis (#10)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Aug 30, 2022 at 9:37 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2022-08-30 at 18:50 -0700, Peter Geoghegan wrote:

Since VM snapshots are immutable, it should be relatively
easy to have the implementation make its final decision on skipping
only *after* lazy_scan_heap() returns.

I like this idea.

I was hoping that you would. I imagine that this idea (with minor
variations) could enable an approach that's much closer to what you
were thinking of: one that mostly focuses on controlling the number of
unfrozen pages, and not so much on advancing relfrozenxid early, just
because we can and we might not get another chance for a long time. In
other words your idea of a design that can freeze more during a
non-aggressive VACUUM, while still potentially skipping all-visible
pages.

I said earlier on that we ought to at least have a strong bias in the
direction of advancing relfrozenxid in larger tables, especially when
we decide to freeze whole pages more eagerly -- we only get one chance
to advance relfrozenxid per VACUUM, and those opportunities will
naturally be few and far between. We cannot really justify all this
extra freezing if it doesn't prevent antiwraparound autovacuums. That
was more or less my objection to going in that direction.

But if we can more or less double the number of opportunities to at
least ask the question "is now a good time to advance relfrozenxid?"
without really paying much for keeping this option open (and I think
that we can), my concern about relfrozenxid advancement becomes far
less important. Just being able to ask that question is significantly
less rare and precious. Plus we'll probably be able to make
significantly better decisions about relfrozenxid overall with the
"second phase because I changed my mind about skipping" concept in
place.

--
Peter Geoghegan

#12Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#8)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, 2022-08-30 at 13:45 -0700, Peter Geoghegan wrote:

It's that I believe
that it is all but mandatory for me to ameliorate the downside that
goes with more eager freezing, for example by not doing it at all
when
it doesn't seem to make sense. I want to solve the big problem of
freeze debt, without creating any new problems. And if I should also
make things in adjacent areas better too, so much the better.

That clarifies your point. It's still a challenge for me to reason
about which of these potential new problems really need to be solved in
v1, though.

Why stop at a couple of dozens of lines of code? Why not just change
the default of vacuum_freeze_min_age and
vacuum_multixact_freeze_min_age to 0?

I don't think that would actually solve the unbounded buildup of
unfrozen pages. It would still be possible for pages to be marked all
visible before being frozen, and then end up being skipped until an
aggressive vacuum is forced, right?

Or did you mean vacuum_freeze_table_age?

Regards,
Jeff Davis

#13Peter Geoghegan
pg@bowt.ie
In reply to: Jeff Davis (#12)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Aug 30, 2022 at 11:28 PM Jeff Davis <pgsql@j-davis.com> wrote:

That clarifies your point. It's still a challenge for me to reason
about which of these potential new problems really need to be solved in
v1, though.

I don't claim to understand it that well myself -- not just yet.
I feel like I have the right general idea, but the specifics
aren't all there (which is very often the case for me at this
point in the cycle). That seems like a good basis for further
discussion.

It's going to be quite a few months before some version of this
patchset is committed, at the very earliest. Obviously these are
questions that need answers, but the process of getting to those
answers is a significant part of the work itself IMV.

Why stop at a couple of dozens of lines of code? Why not just change
the default of vacuum_freeze_min_age and
vacuum_multixact_freeze_min_age to 0?

I don't think that would actually solve the unbounded buildup of
unfrozen pages. It would still be possible for pages to be marked all
visible before being frozen, and then end up being skipped until an
aggressive vacuum is forced, right?

With the 15 work in place, and with the insert-driven autovacuum
behavior from 13, it is likely that this would be enough to avoid all
antiwraparound vacuums in an append-only table. There is still one
case where we can throw away the opportunity to advance relfrozenxid
during non-aggressive VACUUMs for no good reason -- I didn't fix them
all just yet. But the remaining case (which is in lazy_scan_skip()) is
very narrow.

With vacuum_freeze_min_age = 0 and vacuum_multixact_freeze_min_age =
0, any page that is eligible to be set all-visible is also eligible to
have its tuples frozen and be set all-frozen instead, immediately.
When it isn't then we'll scan it in the next VACUUM anyway.

Actually I'm also ignoring some subtleties with Multis that could make
this not quite happen, but again, that's only a super obscure corner case.
The idea that just setting vacuum_freeze_min_age = 0 and
vacuum_multixact_freeze_min_age = 0 will be enough is definitely true
in spirit. You don't need to touch vacuum_freeze_table_age (if you did
then you'd get aggressive VACUUMs, and one goal here is to avoid
those whenever possible -- especially aggressive antiwraparound
autovacuums).

--
Peter Geoghegan

#14Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#1)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Aug 25, 2022 at 2:21 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached patch series is a completely overhauled version of earlier
work on freezing. Related work from the Postgres 15 cycle became
commits 0b018fab, f3c15cbe, and 44fa8488.

Attached is v2.

This is just to keep CFTester happy, since v1 now has conflicts when
applied against HEAD. There are no notable changes in this v2 compared
to v1.

--
Peter Geoghegan

Attachments:

v2-0001-Add-page-level-freezing-to-VACUUM.patch (+230, -135)
v2-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch (+369, -126)
v2-0003-Add-eager-freezing-strategy-to-VACUUM.patch (+155, -19)
v2-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch (+35, -15)
#15Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#13)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Aug 31, 2022 at 12:03 AM Peter Geoghegan <pg@bowt.ie> wrote:

Actually I'm also ignoring some subtleties with Multis that could make
this not quite happen, but again, that's only a super obscure corner case.
The idea that just setting vacuum_freeze_min_age = 0 and
vacuum_multixact_freeze_min_age = 0 will be enough is definitely true
in spirit. You don't need to touch vacuum_freeze_table_age (if you did
then you'd get aggressive VACUUMs, and one goal here is to avoid
those whenever possible -- especially aggressive antiwraparound
autovacuums).

Attached is v3. There is a new patch included here -- v3-0004-*patch,
or "Unify aggressive VACUUM with antiwraparound VACUUM". No other
notable changes.

I decided to work on this now because it seems like it might give a
more complete picture of the high level direction that I'm pushing
towards. Perhaps this will make it easier to review the patch series
as a whole, even. The new patch unifies the concept of antiwraparound
VACUUM with the concept of aggressive VACUUM. Now there is only
antiwraparound and regular VACUUM (uh, barring VACUUM FULL). And now
antiwraparound VACUUMs are not limited to antiwraparound autovacuums
-- a manual VACUUM can also be antiwraparound (that's just the new
name for "aggressive").

We will typically only get antiwraparound vacuuming in a regular
VACUUM when the user goes out of their way to get that behavior.
VACUUM FREEZE is the best example. For the most part the
skipping/freezing strategy stuff has a good sense of what matters
already, and shouldn't need to be guided very often.

The patch relegates vacuum_freeze_table_age to a compatibility option,
making its default -1, meaning "just use autovacuum_freeze_max_age". I
always thought that having two table age based GUCs was confusing.
There was a period between 2006 and 2009 when we had
autovacuum_freeze_max_age, but did not yet have
vacuum_freeze_table_age. This change can almost be thought of as a
return to the simpler user interface that existed at that time. Of
course we must not resurrect the problems that vacuum_freeze_table_age
was intended to address (see originating commit 65878185) by mistake.
We need an improved version of the same basic concept, too.

The patch more or less replaces the table-age-aggressive-escalation
concept (previously implemented using vacuum_freeze_table_age) with
new logic that makes lazyvacuum.c's choice of skipping strategy *also*
depend upon table age -- it is now one more factor to be considered.
Both costs and benefits are weighed here. We now give just a little
weight to table age at a relatively early stage (XID-age-wise early),
and escalate from there. As the table's relfrozenxid gets older and
older, we give less and less weight to putting off the cost of
freezing. This general approach is possible because the false
dichotomy that is "aggressive vs non-aggressive" has mostly been
eliminated. This makes things less confusing for users and hackers.
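
One way to picture "one more factor to be considered" is a budget for extra
scanned pages that grows with table age. The curve below is purely an
assumption for illustration -- the actual algorithm in the patch is
unsettled, as noted next -- but it shows the shape of the escalation: a
little weight early, total dominance at autovacuum_freeze_max_age:

```c
#include <stdint.h>

/*
 * Hypothetical weighting (not the patch's actual formula): as
 * age(relfrozenxid) approaches the antiwraparound threshold, tolerate
 * an ever larger fraction of extra scanned pages in order to advance
 * relfrozenxid.  At age 0 we accept only a 5% overhead; at or past
 * autovacuum_freeze_max_age we accept any cost at all.
 */
static double
eager_scan_budget_frac(uint32_t table_age, uint32_t freeze_max_age)
{
    double t;

    if (table_age >= freeze_max_age)
        return 1.0;             /* table age dominates everything else */
    t = (double) table_age / (double) freeze_max_age;
    return 0.05 + t * (1.0 - 0.05);
}
```

Because the budget rises smoothly instead of flipping at a cliff, the old
"aggressive vs non-aggressive" dichotomy becomes a continuum.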

The details of the skipping-strategy-choice algorithm are still
unsettled in v3 (no real change there). ISTM that the important thing
is still the high level concepts. Jeff was slightly puzzled by the
emphasis placed on the cost model/strategy stuff, at least at one
point. Hopefully my intent will be made clearer by the ideas featured
in the new patch. The skipping strategy decision making process isn't
particularly complicated, but it now looks more like an optimization
problem of some kind or other.

It might make sense to go further in the same direction by making
"regular vs aggressive/antiwraparound" into a *strict* continuum. In
other words, it might make sense to get rid of the two remaining cases
where VACUUM conditions its behavior on whether this VACUUM operation
is antiwraparound/aggressive or not. I'm referring to the cleanup lock
skipping behavior around lazy_scan_noprune(), as well as the
PROC_VACUUM_FOR_WRAPAROUND no-auto-cancellation behavior enforced in
autovacuum workers. We will still need to keep roughly the same two
behaviors, but the timelines can be totally different. We must be
reasonably sure that the cure won't be worse than the disease -- I'm
aware of quite a few individual cases that felt that way [1].
Aggressive interventions can make sense, but they need to be
proportionate to the problem that's right in front of us. "Kicking the
can down the road" is often the safest and most responsible approach
-- it all depends on the details.

[1]: https://www.tritondatacenter.com/blog/manta-postmortem-7-27-2015 is
the most high profile example, but I have personally been called in
to deal with similar problems in the past
--
Peter Geoghegan

Attachments:

v3-0001-Add-page-level-freezing-to-VACUUM.patch (+231, -136)
v3-0005-Avoid-allocating-MultiXacts-during-VACUUM.patch (+35, -15)
v3-0003-Add-eager-freezing-strategy-to-VACUUM.patch (+162, -24)
v3-0004-Unify-aggressive-VACUUM-with-antiwraparound-VACUU.patch (+424, -500)
v3-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch (+375, -151)
#16Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#15)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Sep 8, 2022 at 1:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v3. There is a new patch included here -- v3-0004-*patch,
or "Unify aggressive VACUUM with antiwraparound VACUUM". No other
notable changes.

I decided to work on this now because it seems like it might give a
more complete picture of the high level direction that I'm pushing
towards. Perhaps this will make it easier to review the patch series
as a whole, even.

This needed to be rebased over the guc.c work recently pushed to HEAD.

Attached is v4. This isn't just to fix bitrot, though; I'm also
including one new patch -- v4-0006-*.patch. This small patch teaches
VACUUM to size dead_items while capping the allocation at the space
required for "scanned_pages * MaxHeapTuplesPerPage" item pointers. In
other words, we now use scanned_pages instead of rel_pages to cap the
size of dead_items, potentially saving quite a lot of memory. There is
no possible downside to this approach, because we already know exactly
how many pages will be scanned from the VM snapshot -- there is zero
added risk of a second pass over the indexes.
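
The cap computation is simple arithmetic. A hedged sketch (the constants
mirror MaxHeapTuplesPerPage for 8kB pages and sizeof(ItemPointerData), and
the real code also applies a minimum size and other clamps not shown here):

```c
#include <stdint.h>

#define MAX_HEAP_TUPLES_PER_PAGE 291 /* MaxHeapTuplesPerPage, 8kB pages */
#define ITEM_POINTER_SIZE 6          /* sizeof(ItemPointerData) */

/*
 * Cap the dead_items allocation by the number of pages this VACUUM will
 * actually scan -- known exactly and up front from the VM snapshot --
 * rather than by rel_pages.  All-frozen/all-visible pages that will be
 * skipped can never contribute dead items, so there is no added risk
 * of a second pass over the indexes.
 */
static uint64_t
dead_items_max_bytes(uint64_t scanned_pages, uint64_t mem_limit_bytes)
{
    uint64_t needed = scanned_pages * MAX_HEAP_TUPLES_PER_PAGE *
        ITEM_POINTER_SIZE;

    return (needed < mem_limit_bytes) ? needed : mem_limit_bytes;
}
```

A mostly-frozen table that only needs 100 pages scanned reserves well under
a megabyte, however large rel_pages happens to be.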

This is still only scratching the surface of what is possible with
dead_items. The visibility map snapshot concept can enable a far more
sophisticated approach to resource management in vacuumlazy.c. It
could help us to replace a simple array of item pointers (the current
dead_items array) with a faster and more space-efficient data
structure. Masahiko Sawada has done a lot of work on this recently, so
this may interest him.

We don't just have up-front knowledge of the total number of
scanned_pages with VM snapshots -- we also have up-front knowledge of
which specific pages will be scanned. So we have reliable information
about the final distribution of dead_items (which specific heap blocks
might have dead_items) right from the start. While this extra
information/context is not a totally complete picture, it still seems
like it could be very useful as a way of driving how some new
dead_items data structure compresses TIDs. That will depend on the
distribution of TIDs -- the final "heap TID key space".

VM snapshots could also make it practical for the new data structure
to spill to disk to avoid multiple index scans/passes by VACUUM.
Perhaps this will result in behavior that's similar to how hash joins
spill to disk -- having 90% of the memory required to do everything
in-memory *usually* has similar performance characteristics to just
doing everything in memory. Most individual TID lookups from
ambulkdelete() will find that the TID *doesn't* need to be deleted --
a little like a hash join with low join selectivity (the common case
for hash joins). It's not like a merge join + sort, where we must
either spill everything or nothing (a merge join can be better than a
hash join with high join selectivity).

--
Peter Geoghegan

Attachments:

v4-0001-Add-page-level-freezing-to-VACUUM.patchapplication/octet-stream; name=v4-0001-Add-page-level-freezing-to-VACUUM.patchDownload+200-108
v4-0003-Add-eager-freezing-strategy-to-VACUUM.patchapplication/octet-stream; name=v4-0003-Add-eager-freezing-strategy-to-VACUUM.patchDownload+162-24
v4-0006-Size-VACUUM-s-dead_items-space-using-VM-snapshot.patchapplication/octet-stream; name=v4-0006-Size-VACUUM-s-dead_items-space-using-VM-snapshot.patchDownload+12-15
v4-0005-Avoid-allocating-MultiXacts-during-VACUUM.patchapplication/octet-stream; name=v4-0005-Avoid-allocating-MultiXacts-during-VACUUM.patchDownload+35-15
v4-0004-Unify-aggressive-VACUUM-with-antiwraparound-VACUU.patchapplication/octet-stream; name=v4-0004-Unify-aggressive-VACUUM-with-antiwraparound-VACUU.patchDownload+424-500
v4-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patchapplication/octet-stream; name=v4-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patchDownload+375-151
#17John Naylor
john.naylor@enterprisedb.com
In reply to: Peter Geoghegan (#16)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Sep 14, 2022 at 12:53 AM Peter Geoghegan <pg@bowt.ie> wrote:

This is still only scratching the surface of what is possible with
dead_items. The visibility map snapshot concept can enable a far more
sophisticated approach to resource management in vacuumlazy.c. It
could help us to replace a simple array of item pointers (the current
dead_items array) with a faster and more space-efficient data
structure. Masahiko Sawada has done a lot of work on this recently, so
this may interest him.

I don't quite see how it helps "enable" that. It'd be more logical to
me to say the VM snapshot *requires* you to think harder about
resource management, since a palloc'd snapshot should surely be
counted as part of the configured memory cap that admins control.
(Commonly, it'll be less than a few dozen MB, so I'll leave that
aside.) Since Masahiko hasn't (to my knowledge) gone as far as
integrating his ideas into vacuum, I'm not sure if the current state
of affairs has some snag that a snapshot will ease, but if there is,
you haven't described what it is.

I do remember your foreshadowing in the radix tree thread a while
back, and I do think it's an intriguing idea to combine pages-to-scan
and dead TIDs in the same data structure. The devil is in the details,
of course. It's worth looking into.

VM snapshots could also make it practical for the new data structure
to spill to disk to avoid multiple index scans/passed by VACUUM.

I'm not sure spilling to disk is solving the right problem (as opposed
to the hash join case, or to the proposed conveyor belt system which
has a broader aim). I've found several times that a customer will ask
if raising maintenance work mem from 1GB to 10GB will make vacuum
faster. Looking at the count of index scans, it's pretty much always
"1", so even if the current approach could scale above 1GB, "no" it
wouldn't help to raise that limit.

Your mileage may vary, of course.

Continuing my customer example, searching the dead TID list faster
*will* make vacuum faster. The proposed tree structure is more memory
efficient, and IIUC could scale beyond 1GB automatically since each
node is a separate allocation, so the answer will be "yes" in the rare
case the current setting is in fact causing multiple index scans.
Furthermore, it doesn't have to anticipate the maximum size, so there
is no up front calculation assuming max-tuples-per-page, so it
automatically uses less memory for less demanding tables.

(But +1 for changing that calculation for as long as we do have the
single array.)

--
John Naylor
EDB: http://www.enterprisedb.com

#18Peter Geoghegan
pg@bowt.ie
In reply to: John Naylor (#17)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Sep 14, 2022 at 3:18 AM John Naylor
<john.naylor@enterprisedb.com> wrote:

On Wed, Sep 14, 2022 at 12:53 AM Peter Geoghegan <pg@bowt.ie> wrote:

This is still only scratching the surface of what is possible with
dead_items. The visibility map snapshot concept can enable a far more
sophisticated approach to resource management in vacuumlazy.c.

I don't quite see how it helps "enable" that.

I have already written a simple throwaway patch that can use the
current VM snapshot data structure (which is just a local copy of the
VM's pages) to do a cheap precheck ahead of actually doing a binary
search in dead_items -- if a TID's heap page is all-visible or
all-frozen (depending on the type of VACUUM) then we're 100%
guaranteed to not visit it, and so it's 100% guaranteed to not have
any dead_items (actually it could have LP_DEAD items by the time the
index scan happens, but they won't be in our dead_items array in any
case). Since we're working off of an immutable source, this
optimization is simple to implement already. Very simple.
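
A minimal sketch of that precheck (Python, illustrative names only;
not the actual throwaway patch): before binary-searching dead_items
for a TID, consult the local copy of the visibility map. A page whose
all-visible (or all-frozen) bit is set in the snapshot was never
scanned, so it cannot have contributed entries to dead_items.

```python
# Hypothetical sketch of a VM-snapshot precheck ahead of the usual
# binary search of the sorted dead-TID array.

from bisect import bisect_left

def is_dead(tid, dead_items, vm_all_visible):
    """tid is (blockno, offset); dead_items is a sorted list of such pairs."""
    blockno, _ = tid
    # Cheap precheck against the immutable VM snapshot: skipped pages
    # contributed no dead_items entries, so the search can be avoided.
    if vm_all_visible[blockno]:
        return False
    # Fall back to the usual binary search of the sorted TID array.
    i = bisect_left(dead_items, tid)
    return i < len(dead_items) and dead_items[i] == tid

vm = [True, False, True]         # per-page all-visible bits (snapshot)
dead = [(1, 4), (1, 9)]          # dead TIDs, only from scanned pages
assert is_dead((1, 4), dead, vm)
assert not is_dead((0, 2), dead, vm)   # page 0 skipped via the snapshot
```

Because the snapshot is immutable for the duration of the VACUUM, the
precheck can never give a stale answer for pages it says were skipped.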

I haven't even bothered to benchmark this throwaway patch (I literally
wrote it in 5 minutes to show Masahiko what I meant). I can't see why
even that throwaway prototype wouldn't significantly improve
performance, though. After all, the VM snapshot data structure is far
denser than dead_items, and the largest tables often have most heap
pages skipped via the VM.

I'm not really interested in pursuing this simple approach because it
conflicts with Masahiko's work on the data structure, and there are
other good reasons to expect that to help. Plus I'm already very busy
with what I have here.

> It'd be more logical to
> me to say the VM snapshot *requires* you to think harder about
> resource management, since a palloc'd snapshot should surely be
> counted as part of the configured memory cap that admins control.

That's clearly true -- it creates a new problem for resource
management that will need to be solved. But that doesn't mean that it
can't ultimately make resource management better and easier.

Remember, we don't randomly visit some skippable pages for no good
reason in the patch, since the SKIP_PAGES_THRESHOLD stuff is
completely gone. The VM snapshot isn't just a data structure that
vacuumlazy.c uses as it sees fit -- it's actually more like a set of
instructions on which pages to scan, that vacuumlazy.c *must* follow.
There is no way that vacuumlazy.c can accidentally pick up a few extra
dead_items here and there due to concurrent activity that unsets VM
pages. We don't need to leave that to chance -- it is locked in from
the start.

> I do remember your foreshadowing in the radix tree thread a while
> back, and I do think it's an intriguing idea to combine pages-to-scan
> and dead TIDs in the same data structure. The devil is in the details,
> of course. It's worth looking into.

Of course.

> Looking at the count of index scans, it's pretty much always
> "1", so even if the current approach could scale above 1GB, "no" it
> wouldn't help to raise that limit.

I agree that multiple index scans are rare. But I also think that
they're disproportionately involved in really problematic cases for
VACUUM. That said, I agree that simply making lookups to dead_items as
fast as possible is the single most important way to improve VACUUM by
improving dead_items.

> Furthermore, it doesn't have to anticipate the maximum size, so there
> is no up front calculation assuming max-tuples-per-page, so it
> automatically uses less memory for less demanding tables.

The final number of TIDs doesn't seem like the most interesting
information that VM snapshots could provide us when it comes to
building the dead_items TID data structure -- the *distribution* of
TIDs across heap pages seems much more interesting. The "shape" can be
known ahead of time, at least to some degree. It can help with
compression, which will reduce cache misses.
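
One way a known "shape" could help, sketched under purely hypothetical
assumptions (this is not the patch's or Masahiko's data structure):
when the snapshot says a page will be scanned and may hold many dead
items, the store can pick a compact per-page encoding, e.g. an offset
bitmap instead of a 6-byte ItemPointer per TID.

```python
# Illustrative sketch: per-page offset bitmap as a denser encoding
# than one 6-byte ItemPointer per dead TID.

def encode_page(offsets, max_offset=291):
    """Pack one page's dead-item offsets into a fixed-size bitmap."""
    bits = bytearray((max_offset + 7) // 8)
    for off in offsets:
        bits[off // 8] |= 1 << (off % 8)
    return bytes(bits)

def contains(bitmap, off):
    return bool(bitmap[off // 8] & (1 << (off % 8)))

page = encode_page([3, 7, 200])
assert contains(page, 7)
assert not contains(page, 8)
# 37 bytes for the bitmap vs 6 bytes per TID in a flat array: the
# bitmap wins once a page holds more than ~6 dead items.
```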

Andres made remarks about memory usage with sparse dead TID patterns
at this point on the "Improve dead tuple storage for lazy vacuum"
thread:

/messages/by-id/20210710025543.37sizjvgybemkdus@alap3.anarazel.de

I haven't studied the radix tree stuff in great detail, so I am
uncertain of how much the VM snapshot concept could help, and where
exactly it would help. I'm just saying that it seems promising,
especially as a way of addressing concerns like this.

--
Peter Geoghegan

#19John Naylor
john.naylor@enterprisedb.com
In reply to: Peter Geoghegan (#18)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Sep 14, 2022 at 11:33 PM Peter Geoghegan <pg@bowt.ie> wrote:

> On Wed, Sep 14, 2022 at 3:18 AM John Naylor
>
>> Furthermore, it doesn't have to anticipate the maximum size, so there
>> is no up front calculation assuming max-tuples-per-page, so it
>> automatically uses less memory for less demanding tables.
>
> The final number of TIDs doesn't seem like the most interesting
> information that VM snapshots could provide us when it comes to
> building the dead_items TID data structure -- the *distribution* of
> TIDs across heap pages seems much more interesting. The "shape" can be
> known ahead of time, at least to some degree. It can help with
> compression, which will reduce cache misses.

My point here was simply that spilling to disk is an admission of
failure to utilize memory efficiently and thus shouldn't be a selling
point of VM snapshots. Other selling points could still be valid.

--
John Naylor
EDB: http://www.enterprisedb.com

#20Peter Geoghegan
pg@bowt.ie
In reply to: John Naylor (#19)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Sep 15, 2022 at 12:09 AM John Naylor
<john.naylor@enterprisedb.com> wrote:

> On Wed, Sep 14, 2022 at 11:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
>> The final number of TIDs doesn't seem like the most interesting
>> information that VM snapshots could provide us when it comes to
>> building the dead_items TID data structure -- the *distribution* of
>> TIDs across heap pages seems much more interesting. The "shape" can be
>> known ahead of time, at least to some degree. It can help with
>> compression, which will reduce cache misses.
>
> My point here was simply that spilling to disk is an admission of
> failure to utilize memory efficiently and thus shouldn't be a selling
> point of VM snapshots. Other selling points could still be valid.

I was trying to explain the goals of this work in a way that was as
accessible as possible. It's not easy to get the high level ideas
across, let alone all of the details.

It's true that, up until now, I have largely ignored the question of
how VM snapshots will need to spill. There are several reasons for
this, most of which you could probably guess.
difficult to add *some* reasonable spilling behavior very soon; the
underlying access patterns are highly sequential and predictable, in
the obvious way.

--
Peter Geoghegan
