New strategies for freezing, advancing relfrozenxid early

Started by Peter Geoghegan · over 3 years ago · 140 messages · pgsql-hackers

Attached patch series is a completely overhauled version of earlier
work on freezing. Related work from the Postgres 15 cycle became
commits 0b018fab, f3c15cbe, and 44fa8488.

Recap
=====

The main high level goal of this work is to avoid painful, disruptive
antiwraparound autovacuums (and other aggressive VACUUMs) that do way
too much "catch up" freezing, all at once, causing significant
disruption to production workloads. The patches teach VACUUM to care
about how far behind it is on freezing for each table -- the number of
unfrozen all-visible pages that have accumulated so far is directly
and explicitly kept under control over time. Unfrozen pages can be
seen as debt. There isn't necessarily anything wrong with getting into
debt (getting into debt to a small degree is all but inevitable), but
debt can be dangerous when it isn't managed carefully. Accumulating
large amounts of debt doesn't always end badly, but it does seem to
reliably create the *risk* that things will end badly.

Right now, a standard append-only table could easily do *all* freezing
in aggressive/antiwraparound VACUUM, without any earlier
non-aggressive VACUUM operations triggered by
autovacuum_vacuum_insert_threshold doing any freezing at all (unless
the user goes out of their way to tune vacuum_freeze_min_age). There
is currently no natural limit on the number of unfrozen all-visible
pages that can accumulate -- unless you count age(relfrozenxid), the
triggering condition for antiwraparound autovacuum. But relfrozenxid
age predicts almost nothing about how much freezing is required (or
will be required later on). The overall result is that it often takes
far too long for freezing to finally happen, even when the table
receives plenty of autovacuums (they all could freeze something, but
in practice just don't freeze anything). It's very hard to avoid that
through tuning, because what we really care about is something pretty
closely related to (if not exactly) the number of unfrozen heap pages
in the system. XID age is fundamentally "the wrong unit" here -- the
physical cost of freezing is the most important thing, by far.

In short, the goal of the patch series/project is to make autovacuum
scheduling much more predictable over time. Especially with very large
append-only tables. The patches improve the performance stability of
VACUUM by managing costs holistically, over time. What happens in one
single VACUUM operation is much less important than the behavior of
successive VACUUM operations over time.

What's new: freezing/skipping strategies
========================================

This newly overhauled version introduces the concept of
per-VACUUM-operation strategies, which we decide on once per VACUUM,
at the very start. There are 2 choices to be made at this point (right
after we acquire OldestXmin and similar cutoffs):

1) Do we scan all-visible pages, or do we skip instead? (Added by
second patch, involves a trade-off between eagerness and laziness.)
2) How should we freeze -- eagerly or lazily? (Added by third patch)

The strategy-based approach can be thought of as something that blurs
the distinction between aggressive and non-aggressive VACUUM, giving
VACUUM more freedom to do either more or less work, based on known
costs and benefits. This doesn't completely supersede
aggressive/antiwraparound VACUUMs, but should make them much rarer
with larger tables, where controlling freeze debt actually matters.
There is a need to keep laziness and eagerness in balance here. We try
to get the benefit of lazy behaviors/strategies, but will still course
correct when it doesn't work out.

A new GUC/reloption called vacuum_freeze_strategy_threshold is added
to control freezing strategy (also influences our choice of skipping
strategy). It defaults to 4GB, so tables smaller than that cutoff
(which are usually the majority of all tables) will continue to freeze
in much the same way as today by default. Our current lazy approach to
freezing makes sense there, and should be preserved for its own sake.

Compatibility
=============

Structuring the new freezing behavior as an explicit user-configurable
strategy is also useful as a bridge between the old and new freezing
behaviors. It makes it fairly easy to get the old/current behavior
where that's preferred -- which, I must admit, is something that
wasn't well thought through last time around. The
vacuum_freeze_strategy_threshold GUC is effectively (though not
explicitly) a compatibility option. Users that want something close to
the old/current behavior can use the GUC or reloption to more or less
opt-out of the new freezing behavior, and can do so selectively. The
GUC should be easy for users to understand, too -- it's just a table
size cutoff.

Skipping pages using a snapshot of the visibility map
=====================================================

We now take a copy of the visibility map at the point that VACUUM
begins, and work off of that when skipping, instead of working off of
the mutable/authoritative VM -- this is a visibility map snapshot.
This new infrastructure helps us to decide on a skipping strategy.
Every non-aggressive VACUUM operation now has a choice to make: Which
skipping strategy should it use? (This was introduced as
item/question #1 a moment ago.)

The decision on skipping strategy is a decision about our priorities
for this table, at this time: Is it more important to advance
relfrozenxid early (be eager), or to skip all-visible pages instead
(be lazy)? If it's the former, then we must scan every single page
that isn't all-frozen according to the VM snapshot (including every
all-visible page). If it's the latter, we'll scan exactly 0
all-visible pages. Either way, once a decision has been made, we don't
leave much to chance -- we commit. ISTM that this is the only approach
that really makes sense. Fundamentally, we advance relfrozenxid a
table at a time, and at most once per VACUUM operation. And for larger
tables it's just impossible as a practical matter to have frequent
VACUUM operations. We ought to be *somewhat* biased in the direction
of advancing relfrozenxid by *some* amount during each VACUUM, even
when relfrozenxid isn't all that old right now.

A strategy (whether for skipping or for freezing) is a big, up-front
decision -- and there are certain kinds of risks that naturally
accompany that approach. The information driving the decision had
better be fairly reliable! By using a VM snapshot, we can choose our
skipping strategy based on precise information about how many *extra*
pages we will have to scan if we go with eager scanning/relfrozenxid
advancement. Concurrent activity cannot change what we scan and what
we skip, either -- everything is locked in from the start. That seems
important to me. It justifies trying to advance relfrozenxid early,
just because the added cost of scanning any all-visible pages happens
to be low.

This is quite a big shift for VACUUM, at least in some ways. The patch
adds a DETAIL to the "starting vacuuming" INFO message shown by VACUUM
VERBOSE. The VERBOSE output is already supposed to work as a
rudimentary progress indicator (at least when it is run at the
database level), so it now shows the final scanned_pages up-front,
before the physical scan of the heap even begins:

regression=# vacuum verbose tenk1;
INFO: vacuuming "regression.public.tenk1"
DETAIL: total table size is 486 pages, 3 pages (0.62% of total) must be scanned
INFO: finished vacuuming "regression.public.tenk1": index scans: 0
pages: 0 removed, 486 remain, 3 scanned (0.62% of total)
*** SNIP ***
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM

I included this VERBOSE tweak in the second patch because it became
natural with VM snapshots, and not because it felt particularly
compelling -- scanned_pages just works like this now (an assertion
verifies that our initial scanned_pages is always an exact match to
what happened during the physical scan, in fact).

There are many things that VM snapshots might also enable that aren't
particularly related to freeze debt. VM snapshotting has the potential
to enable more flexible behavior by VACUUM. I'm thinking of things
like suspend-and-resume for VACUUM/autovacuum, or even autovacuum
scheduling that coordinates autovacuum workers before and during
processing by vacuumlazy.c. Locking-in scanned_pages up-front avoids
the main downside that comes with throttling VACUUM right now: the
fact that simply taking our time during VACUUM will tend to increase
the number of concurrently modified pages that we end up scanning.
These pages are bound to mostly just contain "recently dead" tuples
that the ongoing VACUUM can't do much about anyway -- we could dirty a
lot more heap pages as a result, for little to no benefit.

New patch to avoid allocating MultiXacts
========================================

The fourth and final patch is also new. It corrects an undesirable
consequence of the work done by the earlier patches: it makes VACUUM
avoid allocating new MultiXactIds (unless it's fundamentally
impossible, like in a VACUUM FREEZE). With just the first 3 patches
applied, VACUUM will naively process xmax using a cutoff XID that
comes from OldestXmin (and not FreezeLimit, which is how it works on
HEAD). But with the fourth patch in place VACUUM applies an XID cutoff
of either OldestXmin or FreezeLimit selectively, based on the costs
and benefits for any given xmax.

Just like in lazy_scan_noprune, the low level xmax-freezing code can
pick and choose as it goes, within certain reasonable constraints. We
must accept an older final relfrozenxid/relminmxid value for the rel's
authoritative pg_class tuple as a consequence of avoiding xmax
processing, of course, but that shouldn't matter at all (it's
definitely better than the alternative).

Reducing the WAL space overhead of freezing
===========================================

Not included in this new v1 are other patches that control the
overhead of added freezing -- my focus since joining AWS has been on
getting these more strategic patches into shape, and on telling the
right story about what I'm trying to do here. I'm going to say a
little on the
patches that I have in the pipeline here, though. Getting the
low-level/mechanical overhead of freezing under control will probably
require a few complementary techniques, not just high-level strategies
(though the strategy stuff is the most important piece).

The really interesting omitted-in-v1 patch adds deduplication of
xl_heap_freeze_page WAL records. This reduces the space overhead of
WAL records used to freeze by ~5x in most cases. It works in the
obvious way: we just store the 12 byte freeze plans that appear in
each xl_heap_freeze_page record only once, and then store an array of
item offset numbers for each entry (rather than naively storing a full
12 bytes per tuple frozen per page-level WAL record). This means that
we only need an "extra" ~2 bytes of WAL space per "extra" tuple frozen
(2 bytes for an OffsetNumber) once we decide to freeze something on
the same page. The *marginal* cost can be much lower than it is today,
which makes page-based batching of freezing much more compelling IMV.

Thoughts?
--
Peter Geoghegan

Attachments:

v1-0001-Add-page-level-freezing-to-VACUUM.patch (+230 -135)
v1-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch (+35 -15)
v1-0003-Add-eager-freezing-strategy-to-VACUUM.patch (+155 -19)
v1-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch (+367 -126)
#2 Jeremy Schneider
schnjere@amazon.com
In reply to: Peter Geoghegan (#1)
Re: New strategies for freezing, advancing relfrozenxid early

On 8/25/22 2:21 PM, Peter Geoghegan wrote:

New patch to avoid allocating MultiXacts
========================================

The fourth and final patch is also new. It corrects an undesirable
consequence of the work done by the earlier patches: it makes VACUUM
avoid allocating new MultiXactIds (unless it's fundamentally
impossible, like in a VACUUM FREEZE). With just the first 3 patches
applied, VACUUM will naively process xmax using a cutoff XID that
comes from OldestXmin (and not FreezeLimit, which is how it works on
HEAD). But with the fourth patch in place VACUUM applies an XID cutoff
of either OldestXmin or FreezeLimit selectively, based on the costs
and benefits for any given xmax.

Just like in lazy_scan_noprune, the low level xmax-freezing code can
pick and choose as it goes, within certain reasonable constraints. We
must accept an older final relfrozenxid/relminmxid value for the rel's
authoritative pg_class tuple as a consequence of avoiding xmax
processing, of course, but that shouldn't matter at all (it's
definitely better than the alternative).

We should be careful here. IIUC, the current autovac behavior helps
bound the "spread" or range of active multixact IDs in the system, which
directly determines the number of distinct pages that contain those
multixacts. If the proposed change herein causes the spread/range of
MXIDs to significantly increase, then it will increase the number of
blocks and increase the probability of thrashing on the SLRUs for these
data structures. There may be another separate thread or two about
issues with SLRUs already?

-Jeremy

PS. see also
/messages/by-id/247e3ce4-ae81-d6ad-f54d-7d3e0409a950@ardentperf.com

--
Jeremy Schneider
Database Engineer
Amazon Web Services

#3 Peter Geoghegan
pg@bowt.ie
In reply to: Jeremy Schneider (#2)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Aug 25, 2022 at 3:35 PM Jeremy Schneider <schnjere@amazon.com> wrote:

We should be careful here. IIUC, the current autovac behavior helps
bound the "spread" or range of active multixact IDs in the system, which
directly determines the number of distinct pages that contain those
multixacts. If the proposed change herein causes the spread/range of
MXIDs to significantly increase, then it will increase the number of
blocks and increase the probability of thrashing on the SLRUs for these
data structures.

As a general rule VACUUM will tend to do more eager freezing with the
patch set compared to HEAD, though it should never do less eager
freezing. Not even in corner cases -- never.

With the patch, VACUUM pretty much uses the most aggressive possible
XID-wise/MXID-wise cutoffs in almost all cases (though only when we
actually decide to freeze a page at all, which is now a separate
question). The fourth patch in the patch series introduces a very
limited exception, where we use the same cutoffs that we'll always use
on HEAD (FreezeLimit + MultiXactCutoff) instead of the aggressive
variants (OldestXmin and OldestMxact). This isn't just *any* xmax
containing a MultiXact: it's a Multi that contains *some* XIDs that
*need* to go away during the ongoing VACUUM, and others that *cannot*
go away. Oh, and there usually has to be a need to keep two or more
XIDs for this to happen -- if there is only one XID then we can
usually swap xmax with that XID without any fuss.

PS. see also
/messages/by-id/247e3ce4-ae81-d6ad-f54d-7d3e0409a950@ardentperf.com

I think that the problem you describe here is very real, though I
suspect that it needs to be addressed by making opportunistic cleanup
of Multis happen more reliably. Running VACUUM more often just isn't
practical once a table reaches a certain size. In general, any kind of
processing that is time sensitive probably shouldn't be happening
solely during VACUUM -- it's just too risky. VACUUM might take a
relatively long time to get to the affected page. It might not even be
that long in wall clock time or whatever -- just too long to reliably
avoid the problem.

--
Peter Geoghegan

#4 Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#3)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Aug 25, 2022 at 4:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

As a general rule VACUUM will tend to do more eager freezing with the
patch set compared to HEAD, though it should never do less eager
freezing. Not even in corner cases -- never.

Come to think of it, I don't think that that's quite true. Though the
fourth patch isn't particularly related to the problem.

It *is* true that VACUUM will do at least as much freezing of XID
based tuple header fields as before. That just leaves MXIDs. It's even
true that we will do just as much freezing as before if you go pure on
MultiXact-age. But I'm the one that likes to point out that age is
altogether the wrong approach for stuff like this -- so that won't cut
it.

More concretely, I think that the patch series will fail to do certain
inexpensive eager processing of tuple xmax, that will happen today,
regardless of what the user has set vacuum_freeze_min_age or
vacuum_multixact_freeze_min_age to. Although we currently only care
about XID age when processing simple XIDs, we already sometimes make
trade-offs similar to the trade-off I propose to make in the fourth
patch for Multis.

In other words, on HEAD, we promise to process any MXID <
MultiXactCutoff inside FreezeMultiXactId(). But we also manage to do
"eager processing of xmax" when it's cheap and easy to do so, without
caring about MultiXactCutoff at all -- this is likely the common case,
even. This preexisting eager processing of Multis is likely important
in many applications.

The problem that I think I've created is that page-level freezing as
implemented in lazy_scan_prune by the third patch doesn't know that it
might be a good idea to execute a subset of freeze plans, in order to
remove a multi from a page right away. It mostly has the right idea by
holding off on freezing until it looks like a good idea at the level
of the whole page, but I think that this is a plausible exception,
just because we're much more sensitive to leaving behind a Multi, and
right now the only code path that can remove a Multi that isn't needed
anymore is FreezeMultiXactId().

If xmax was an updater that aborted instead of a multi then we could
rely on hint bits being set by pruning to avoid clog lookups.
Technically nobody has violated their contract here, I think, but it
still seems like it could easily be unacceptable.

I need to come up with my own microbenchmark suite for Multis -- that
was on my TODO list already. I still believe that the fourth patch
addresses Andres' complaint about allocating new Multis during VACUUM.
But it seems like I need to think about the nuances of Multis some
more. In particular, what the performance impact might be of making a
decision on freezing at the page level, in light of the special
performance considerations for Multis.

Maybe it needs to be more granular than that, more often. Or maybe we
can comprehensively solve the problem in some other way entirely.
Maybe pruning should do this instead, in general. Something like that
might put this right, and be independently useful.

--
Peter Geoghegan

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#1)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, 2022-08-25 at 14:21 -0700, Peter Geoghegan wrote:

The main high level goal of this work is to avoid painful, disruptive
antiwraparound autovacuums (and other aggressive VACUUMs) that do way
too much "catch up" freezing, all at once, causing significant
disruption to production workloads.

Sounds like a good goal, and loosely follows the precedent of
checkpoint targets and vacuum cost delays.

A new GUC/reloption called vacuum_freeze_strategy_threshold is added
to control freezing strategy (also influences our choice of skipping
strategy). It defaults to 4GB, so tables smaller than that cutoff
(which are usually the majority of all tables) will continue to freeze
in much the same way as today by default. Our current lazy approach to
freezing makes sense there, and should be preserved for its own sake.

Why is the threshold per-table? Imagine someone who has a bunch of 4GB
partitions that add up to a huge amount of deferred freezing work.

The initial problem you described is a system-level problem, so it
seems we should track the overall debt in the system in order to keep
up.

for this table, at this time: Is it more important to advance
relfrozenxid early (be eager), or to skip all-visible pages instead
(be lazy)? If it's the former, then we must scan every single page
that isn't all-frozen according to the VM snapshot (including every
all-visible page).

This feels too absolute, to me. If the goal is to freeze more
incrementally, well in advance of wraparound limits, then why can't we
just freeze 1000 out of 10000 freezable pages on this run, and then
leave the rest for a later run?

Thoughts?

What if we thought about this more like a "background freezer". It
would keep track of the total number of unfrozen pages in the system,
and freeze them at some kind of controlled/adaptive rate.

Regular autovacuum's job would be to keep advancing relfrozenxid for
all tables and to do other cleanup, and the background freezer's job
would be to keep the absolute number of unfrozen pages under some
limit. Conceptually those two jobs seem different to me.

Also, regarding patch v1-0001-Add-page-level-freezing, do you think
that narrows the conceptual gap between an all-visible page and an all-
frozen page?

Regards,
Jeff Davis

#6 Peter Geoghegan
pg@bowt.ie
In reply to: Jeff Davis (#5)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Aug 29, 2022 at 11:47 AM Jeff Davis <pgsql@j-davis.com> wrote:

Sounds like a good goal, and loosely follows the precedent of
checkpoint targets and vacuum cost delays.

Right.

Why is the threshold per-table? Imagine someone who has a bunch of 4GB
partitions that add up to a huge amount of deferred freezing work.

I think it's possible that our cost model will eventually become very
sophisticated, and weigh all kinds of different factors, and work as
one component of a new framework that dynamically schedules autovacuum
workers. My main goal in posting this v1 was validating the *general
idea* of strategies with cost models, and the related question of how
we might use VM snapshots for that. After all, even the basic concept
is totally novel.

The initial problem you described is a system-level problem, so it
seems we should track the overall debt in the system in order to keep
up.

I agree that the problem is fundamentally a system-level problem. One
reason why vacuum_freeze_strategy_threshold works at the table level
right now is to get the ball rolling. In any case the specifics of how
we trigger each strategy are far from settled. That's not the only
reason why we think about things at the table level in the patch set,
though.

There *are* some fundamental reasons why we need to care about
individual tables, rather than caring about unfrozen pages at the
system level *exclusively*. This is something that
vacuum_freeze_strategy_threshold kind of gets right already, despite
its limitations. There are 2 aspects of the design that seemingly have
to work at the whole table level:

1. Concentration matters when it comes to wraparound risk.

Fundamentally, each VACUUM still targets exactly one heap rel, and
advances relfrozenxid at most once per VACUUM operation. While the
total number of "unfrozen heap pages" across the whole database is the
single most important metric, it's not *everything*.

As a general rule, there is much less risk in having a certain fixed
number of unfrozen heap pages spread fairly evenly among several
larger tables, compared to the case where the same number of unfrozen
pages are all concentrated in one particular table -- right now it'll
often be one particular table that is far larger than any other table.
Right now the pain is generally felt with large tables only.

2. We need to think about things at the table level to manage costs
*over time* holistically. (Closely related to #1.)

The ebb and flow of VACUUM for one particular table is a big part of
the picture here -- and will be significantly affected by table size.
We can probably always afford to risk falling behind on
freezing/relfrozenxid (i.e. we should prefer laziness) if we know that
we'll almost certainly be able to catch up later when things don't
quite work out. That makes small tables much less trouble, even when
there are many more of them (at least up to a point).

As you know, my high level goal is to avoid ever having to make huge
balloon payments to catch up on freezing, which is a much bigger risk
with a large table -- this problem is mostly a per-table problem (both
now and in the future).

A large table will naturally require fewer, larger VACUUM operations
than a small table, no matter what approach is taken with the strategy
stuff. We therefore have fewer VACUUM operations in a given
week/month/year/whatever to spread out the burden -- there will
naturally be fewer opportunities. We want to create the impression
that each autovacuum does approximately the same amount of work (or at
least the same per new heap page for large append-only tables).

It also becomes much more important to only dirty each heap page
during vacuuming ~once with larger tables. With a smaller table, there
is a much higher chance that the pages we modify will already be dirty
from user queries.

for this table, at this time: Is it more important to advance
relfrozenxid early (be eager), or to skip all-visible pages instead
(be lazy)? If it's the former, then we must scan every single page
that isn't all-frozen according to the VM snapshot (including every
all-visible page).

This feels too absolute, to me. If the goal is to freeze more
incrementally, well in advance of wraparound limits, then why can't we
just freeze 1000 out of 10000 freezable pages on this run, and then
leave the rest for a later run?

My remarks here applied only to the question of relfrozenxid
advancement -- not to freezing. Skipping strategy (relfrozenxid
advancement) is a distinct though related concept to freezing
strategy. So I was making a very narrow statement about
invariants/basic correctness rules -- I wasn't arguing against
alternative approaches to freezing beyond the 2 freezing strategies
(not to be confused with skipping strategies) that appear in v1.
That's all I meant -- there is definitely no point in scanning only a
subset of the table's all-visible pages, as far as relfrozenxid
advancement is concerned (and skipping strategy is fundamentally a
choice about relfrozenxid advancement vs work avoidance, eagerness vs
laziness).

Maybe you're right that there is room for additional freezing
strategies, besides the two added by v1-0003-*patch. Definitely seems
possible. The freezing strategy concept should be usable as a
framework for adding additional strategies, including (just for
example) a strategy that decides ahead of time to freeze only so many
pages, though not others (without regard for the fact that the pages
that we are freezing may not be very different to those we won't be
freezing in the current VACUUM).

I'm definitely open to that. It's just a matter of characterizing what
set of workload characteristics this third strategy would solve, how
users might opt in or opt out, etc. Both the eager and the lazy
freezing strategies are based on some notion of what's important for
the table, based on its known characteristics, and based on what seems
likely to happen to the table in the future (the next VACUUM, at least).
I'm not completely sure how many strategies we'll end up needing.
Though it seems like the eager/lazy trade-off is a really important
part of how these strategies will need to work, in general.

(Thinks some more) I guess that such an alternative freezing strategy
would probably have to affect the skipping strategy too. It's tricky
to tease apart because it breaks the idea that skipping strategy and
freezing strategy are basically distinct questions. That is a factor
that makes it a bit more complicated to discuss. In any case, as I
said, I have an open mind about alternative freezing strategies beyond
the 2 basic lazy/eager freezing strategies from the patch.

What if we thought about this more like a "background freezer". It
would keep track of the total number of unfrozen pages in the system,
and freeze them at some kind of controlled/adaptive rate.

I like the idea of storing metadata in shared memory. And scheduling
and deprioritizing running autovacuums. Being able to slow down or
even totally halt a given autovacuum worker without much consequence
is enabled by the VM snapshot concept.

That said, this seems like future work to me. Worth discussing, but
trying to keep out of scope in the first version of this that is
committed.

Regular autovacuum's job would be to keep advancing relfrozenxid for
all tables and to do other cleanup, and the background freezer's job
would be to keep the absolute number of unfrozen pages under some
limit. Conceptually those two jobs seem different to me.

The problem with making it such a sharp distinction is that it can be
very useful to manage costs by making it the job of VACUUM to do both
-- we can avoid dirtying the same page multiple times.

I think that we can accomplish the same thing by giving VACUUM more
freedom to do either more or less work, based on the observed
characteristics of the table, and some sense of how costs will tend to
work over time, across multiple distinct VACUUM operations. In
practice that might end up looking very similar to what you describe.

It seems undesirable for VACUUM to ever be too sure of itself -- the
information that triggers autovacuum may not be particularly reliable,
which can be solved to some degree by making as many decisions as
possible at runtime, dynamically, based on the most authoritative and
recent information. Delaying committing to one particular course of
action isn't always possible, but when it is possible (and not too
expensive) we should do it that way on general principle.

Also, regarding patch v1-0001-Add-page-level-freezing, do you think
that narrows the conceptual gap between an all-visible page and an all-
frozen page?

Yes, definitely. However, I don't think that we can just get rid of
the distinction completely -- though I did think about it for a while.
For one thing we need to be able to handle cases like the one where
heap_lock_tuple() modifies an all-frozen page, and makes it
all-visible without making it completely unskippable to every VACUUM
operation.

--
Peter Geoghegan

#7 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#1)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, 2022-08-25 at 14:21 -0700, Peter Geoghegan wrote:

Attached patch series is a completely overhauled version of earlier
work on freezing. Related work from the Postgres 15 cycle became
commits 0b018fab, f3c15cbe, and 44fa8488.

Recap
=====

The main high level goal of this work is to avoid painful, disruptive
antiwraparound autovacuums (and other aggressive VACUUMs) that do way
too much "catch up" freezing, all at once

I agree with the motivation: that keeping around a lot of deferred work
(unfrozen pages) is risky, and that administrators would want a way to
control that risk.

The solution involves more changes to the philosophy and mechanics of
vacuum than I would expect, though. For instance, VM snapshotting,
page-level-freezing, and a cost model all might make sense, but I don't
see why they are critical for solving the problem above. I think I'm
still missing something. My mental model is closer to the bgwriter and
checkpoint_completion_target.

Allow me to make a naive counter-proposal (not a real proposal, just so
I can better understand the contrast with your proposal):

* introduce a reloption unfrozen_pages_target (default -1, meaning
infinity, which is the current behavior)
* introduce two fields to LVRelState: n_pages_frozen and
delay_skip_count, both initialized to zero
* when marking a page frozen: n_pages_frozen++
* when vacuum begins:

      if (unfrozen_pages_target >= 0 &&
          current_unfrozen_page_count > unfrozen_pages_target)
      {
          vacrel->delay_skip_count = current_unfrozen_page_count -
              unfrozen_pages_target;
          /* ?also use more aggressive freezing thresholds? */
      }

* in lazy_scan_skip(), have a final check:

      if (vacrel->n_pages_frozen < vacrel->delay_skip_count)
      {
          break;
      }
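
For concreteness, the accounting in the sketch above can be written out as a
standalone simulation. Everything here uses the hypothetical names from the
counter-proposal (`unfrozen_pages_target`, `delay_skip_count`,
`n_pages_frozen`) -- none of these are existing PostgreSQL symbols:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct LVRelStateSketch
{
    int64_t n_pages_frozen;   /* pages frozen so far in this VACUUM */
    int64_t delay_skip_count; /* pages to freeze before skipping resumes */
} LVRelStateSketch;

/* At the start of VACUUM: how far over target are we on unfrozen pages? */
static void
vacuum_begin_sketch(LVRelStateSketch *vacrel,
                    int64_t unfrozen_pages_target,
                    int64_t current_unfrozen_page_count)
{
    vacrel->n_pages_frozen = 0;
    vacrel->delay_skip_count = 0;
    if (unfrozen_pages_target >= 0 &&
        current_unfrozen_page_count > unfrozen_pages_target)
        vacrel->delay_skip_count =
            current_unfrozen_page_count - unfrozen_pages_target;
}

/* In lazy_scan_skip(): may we start skipping all-visible pages yet? */
static bool
may_skip_sketch(const LVRelStateSketch *vacrel)
{
    return vacrel->n_pages_frozen >= vacrel->delay_skip_count;
}
```

With a target of 100 pages and 350 currently unfrozen, skipping is delayed
until 250 pages have been frozen; with the default target of -1, skipping is
never delayed.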

I know there would still be some problem cases, but to me it seems like
we solve 80% of the problem in a couple dozen lines of code.

a. Can you clarify some of the problem cases, and why it's worth
spending more code to fix them?

b. How much of your effort is groundwork for related future
improvements? If it's a substantial part, can you explain in that
larger context?

c. Can some of your patches be separated into independent discussions?
For instance, patch 1 has been discussed in other threads and seems
independently useful, and I don't see the current work as dependent on
it. Patch 4 also seems largely independent.

d. Can you help give me a sense of scale of the problems solved by
visibilitymap snapshots and the cost model? Do those need to be in v1?

Regards,
Jeff Davis

#8Peter Geoghegan
pg@bowt.ie
In reply to: Jeff Davis (#7)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Aug 30, 2022 at 11:11 AM Jeff Davis <pgsql@j-davis.com> wrote:

The solution involves more changes to the philosophy and mechanics of
vacuum than I would expect, though. For instance, VM snapshotting,
page-level-freezing, and a cost model all might make sense, but I don't
see why they are critical for solving the problem above.

I certainly wouldn't say that they're critical. I tend to doubt that I
can be perfectly crisp about what the exact relationship is between
each component in isolation and how it contributes towards addressing
the problems we're concerned with.

I think I'm
still missing something. My mental model is closer to the bgwriter and
checkpoint_completion_target.

That's not a bad starting point. The main thing that that mental model
is missing is how the timeframes work with VACUUM, and the fact that
there are multiple timeframes involved (maybe the system's vacuuming
work could be seen as having one timeframe at the highest level, but
it's more of a fractal picture overall). Checkpoints just don't take
that long, and checkpoint duration has a fairly low variance (barring
pathological performance problems).

You only have so many buffers that you can dirty, too -- it's a
self-limiting process. This is even true when (for whatever reason)
the checkpoint_completion_target logic just doesn't do what it's
supposed to do. There is more or less a natural floor on how bad
things can get, so you don't have to invent a synthetic floor at all.
LSM-based DB systems like the MyRocks storage engine for MySQL don't
use checkpoints at all -- the closest analog is compaction, which is
closer to a hybrid of VACUUM and checkpointing than anything else.

The LSM compaction model necessitates adding artificial throttling to
keep the system stable over time [1]. There is a disconnect between
the initial ingest of data, and the compaction process. And so
top-down modelling of costs and benefits with compaction is more
natural with an LSM [2] -- and not a million miles from the strategy
stuff I'm proposing.

Allow me to make a naive counter-proposal (not a real proposal, just so
I can better understand the contrast with your proposal):

I know there would still be some problem cases, but to me it seems like
we solve 80% of the problem in a couple dozen lines of code.

It's not that this statement is wrong, exactly. It's that I believe
that it is all but mandatory for me to ameliorate the downside that
goes with more eager freezing, for example by not doing it at all when
it doesn't seem to make sense. I want to solve the big problem of
freeze debt, without creating any new problems. And if I should also
make things in adjacent areas better too, so much the better.

Why stop at a couple dozen lines of code? Why not just change
the default of vacuum_freeze_min_age and
vacuum_multixact_freeze_min_age to 0?

a. Can you clarify some of the problem cases, and why it's worth
spending more code to fix them?

For one thing if we're going to do a lot of extra freezing, we really
want to "get credit" for it afterwards, by updating relfrozenxid to
reflect the new oldest extant XID, and so avoid getting an
antiwraparound VACUUM early, in the near future.

That isn't strictly true, of course. But I think that we at least
ought to have a strong bias in the direction of updating relfrozenxid,
having decided to do significantly more freezing in some particular
VACUUM operation.

b. How much of your effort is groundwork for related future
improvements? If it's a substantial part, can you explain in that
larger context?

Hard to say. It's true that the idea of VM snapshots is quite general,
and could have been introduced in a number of different ways. But I
don't think that that should count against it. It's also not something
that seems contrived or artificial -- it's at least as good of a
reason to add VM snapshots as any other I can think of.

Does it really matter if this project is the freeze debt project, or
the VM snapshot project? Do we even need to decide which one it is
right now?

c. Can some of your patches be separated into independent discussions?
For instance, patch 1 has been discussed in other threads and seems
independently useful, and I don't see the current work as dependent on
it.

I simply don't know if I can usefully split it up just yet.

Patch 4 also seems largely independent.

Patch 4 directly compensates for a problem created by the earlier
patches. The patch series as a whole isn't supposed to ameliorate the
problem of MultiXacts being allocated in VACUUM. It only needs to
avoid making the situation any worse than it is today IMV (I suspect
that the real fix is to make the VACUUM FREEZE command not tune
vacuum_freeze_min_age).

d. Can you help give me a sense of scale of the problems solved by
visibilitymap snapshots and the cost model? Do those need to be in v1?

I'm not sure. I think that having certainty that we'll be able to scan
only so many pages up-front is very broadly useful, though. Plus it
removes the SKIP_PAGES_THRESHOLD stuff, which was intended to enable
relfrozenxid advancement in non-aggressive VACUUMs, but does so in a
way that results in scanning many more pages needlessly. See commit
bf136cf6, which added the SKIP_PAGES_THRESHOLD stuff back in 2009,
shortly after the visibility map first appeared.

Since relfrozenxid advancement fundamentally works at the table level,
it seems natural to make it a top-down, VACUUM-level thing -- even
within non-aggressive VACUUMs (I guess it already meets that
description in aggressive VACUUMs). And since we really want to
advance relfrozenxid when we do extra freezing (for the reasons I just
went into), it seems natural to me to view it as one problem. I accept
that it's not clear cut, though.

[1]: https://docs.google.com/presentation/d/1WgP-SlKay5AnSoVDSvOIzmu7edMmtYhdywoa0oAR4JQ/edit?usp=sharing
[2]: https://disc-projects.bu.edu/compactionary/research.html
--
Peter Geoghegan

#9Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#8)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Aug 30, 2022 at 1:45 PM Peter Geoghegan <pg@bowt.ie> wrote:

d. Can you help give me a sense of scale of the problems solved by
visibilitymap snapshots and the cost model? Do those need to be in v1?

I'm not sure. I think that having certainty that we'll be able to scan
only so many pages up-front is very broadly useful, though. Plus it
removes the SKIP_PAGES_THRESHOLD stuff, which was intended to enable
relfrozenxid advancement in non-aggressive VACUUMs, but does so in a
way that results in scanning many more pages needlessly. See commit
bf136cf6, which added the SKIP_PAGES_THRESHOLD stuff back in 2009,
shortly after the visibility map first appeared.

Here is a better example:

Right now the second patch adds both VM snapshots and the skipping
strategy stuff. The VM snapshot is used in the second patch, as a
source of reliable information about how we need to process the table,
in terms of the total number of scanned_pages -- which drives our
choice of strategy. Importantly, we can assess the question of which
skipping strategy to take (in non-aggressive VACUUM) based on 100%
accurate information about how many *extra* pages we'll have to scan
in the event of being eager (i.e. in the event that we prioritize
early relfrozenxid advancement over skipping some pages). Importantly,
that cannot change later on, since VM snapshots are immutable --
everything is locked in. That already seems quite valuable to me.
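
That decision can be sketched in a few lines. The struct and function names
here are hypothetical (they are not symbols from the patch series); the point
is only that, because the snapshot is immutable, the "extra pages" figure the
decision rests on is exact rather than an estimate:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct VMSnapCounts
{
    int64_t rel_pages;   /* total heap pages in the relation */
    int64_t all_frozen;  /* skippable under either strategy */
    int64_t all_visible; /* extra pages scanned only when eager */
} VMSnapCounts;

/*
 * Choose between skipping strategies up front.  Concurrent activity
 * cannot unset VM bits out from under us mid-VACUUM: the snapshot is a
 * set of instructions that vacuumlazy.c must follow, so the cost of
 * eagerness is locked in before the first heap page is read.
 */
static bool
choose_eager_scan(const VMSnapCounts *snap, double max_extra_frac)
{
    int64_t extra_pages = snap->all_visible; /* exact cost of eagerness */

    /* Be eager when the added scan cost is a small fraction of the table */
    return extra_pages <= (int64_t) (snap->rel_pages * max_extra_frac);
}
```

For a 1000-page table with 40 all-visible (but not all-frozen) pages, a 5%
budget permits the eager strategy; with 400 such pages it does not.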

This general concept could be pushed a lot further without great
difficulty. Since VM snapshots are immutable, it should be relatively
easy to have the implementation make its final decision on skipping
only *after* lazy_scan_heap() returns. We could allow VACUUM to
"change its mind about skipping" in cases where it initially thought
that skipping was the best strategy, only to discover much later on
that that was the wrong choice after all.

A huge amount of new, reliable information will come to light from
scanning the heap rel. In particular, the current value of
vacrel->NewRelfrozenXid seems like it would be particularly
interesting when the time came to consider if a second scan made sense
-- if NewRelfrozenXid is a recent-ish value already, then that argues
for finishing off the all-visible pages in a second heap pass, with
the aim of setting relfrozenxid to a similarly recent value when it
happens to be cheap to do so.

The actual process of scanning precisely those all-visible pages that
were skipped the first time around during a second call to
lazy_scan_heap() can be implemented in the obvious way: by teaching
the VM snapshot infrastructure/lazy_scan_skip() to treat pages that
were skipped the first time around to get scanned during the second
pass over the heap instead. Also, those pages that were scanned the
first time around can/must be skipped on our second pass (excluding
all-frozen pages, which won't be scanned in either heap pass).

I've used the term "second heap pass" here, but that term is slightly
misleading. The final outcome of this whole process is that every heap
page that the vmsnap says VACUUM will need to scan in order for it to
be able to safely advance relfrozenxid will be scanned, precisely
once. The overall order that the heap pages are scanned in will of
course differ from the simple case, but I don't think that it makes
very much difference. In reality there will have only been one heap
pass, consisting of two distinct phases. No individual heap page will
ever be considered for pruning/freezing more than once, no matter
what. This is just a case of *reordering* work. Immutability makes
reordering work easy in general.
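
The partitioning of pages across the two phases can be sketched as follows
(hypothetical helper, not a function from the patch -- it only illustrates
why each page is scanned exactly once):

```c
#include <stdbool.h>

/* Simplified per-block state, as recorded in the immutable VM snapshot */
typedef enum { PAGE_ALL_FROZEN, PAGE_ALL_VISIBLE, PAGE_OTHER } VMBit;

/*
 * Is this block scanned in the given pass?  With an immutable snapshot
 * the two passes exactly partition the non-frozen pages: pages skipped
 * in pass 1 (all-visible) are scanned in pass 2 if VACUUM "changes its
 * mind", pages scanned in pass 1 are skipped in pass 2, and all-frozen
 * pages are never scanned in either pass.
 */
static bool
scan_in_pass(VMBit bit, int pass)
{
    if (bit == PAGE_ALL_FROZEN)
        return false;       /* skipped in both passes */
    if (bit == PAGE_ALL_VISIBLE)
        return pass == 2;   /* deferred to the second phase */
    return pass == 1;       /* must-scan pages go in the first phase */
}
```

Every non-frozen page satisfies exactly one of the two passes, which is the
"one heap pass, two distinct phases" property described above.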

--
Peter Geoghegan

#10Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#9)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, 2022-08-30 at 18:50 -0700, Peter Geoghegan wrote:

Since VM snapshots are immutable, it should be relatively
easy to have the implementation make its final decision on skipping
only *after* lazy_scan_heap() returns.

I like this idea.

Regards,
Jeff Davis

#11Peter Geoghegan
pg@bowt.ie
In reply to: Jeff Davis (#10)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Aug 30, 2022 at 9:37 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2022-08-30 at 18:50 -0700, Peter Geoghegan wrote:

Since VM snapshots are immutable, it should be relatively
easy to have the implementation make its final decision on skipping
only *after* lazy_scan_heap() returns.

I like this idea.

I was hoping that you would. I imagine that this idea (with minor
variations) could enable an approach that's much closer to what you
were thinking of: one that mostly focuses on controlling the number of
unfrozen pages, and not so much on advancing relfrozenxid early, just
because we can and we might not get another chance for a long time. In
other words your idea of a design that can freeze more during a
non-aggressive VACUUM, while still potentially skipping all-visible
pages.

I said earlier on that we ought to at least have a strong bias in the
direction of advancing relfrozenxid in larger tables, especially when
we decide to freeze whole pages more eagerly -- we only get one chance
to advance relfrozenxid per VACUUM, and those opportunities will
naturally be few and far between. We cannot really justify all this
extra freezing if it doesn't prevent antiwraparound autovacuums. That
was more or less my objection to going in that direction.

But if we can more or less double the number of opportunities to at
least ask the question "is now a good time to advance relfrozenxid?"
without really paying much for keeping this option open (and I think
that we can), my concern about relfrozenxid advancement becomes far
less important. Just being able to ask that question is significantly
less rare and precious. Plus we'll probably be able to make
significantly better decisions about relfrozenxid overall with the
"second phase because I changed my mind about skipping" concept in
place.

--
Peter Geoghegan

#12Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#8)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, 2022-08-30 at 13:45 -0700, Peter Geoghegan wrote:

It's that I believe
that it is all but mandatory for me to ameliorate the downside that
goes with more eager freezing, for example by not doing it at all
when
it doesn't seem to make sense. I want to solve the big problem of
freeze debt, without creating any new problems. And if I should also
make things in adjacent areas better too, so much the better.

That clarifies your point. It's still a challenge for me to reason
about which of these potential new problems really need to be solved in
v1, though.

Why stop at a couple of dozens of lines of code? Why not just change
the default of vacuum_freeze_min_age and
vacuum_multixact_freeze_min_age to 0?

I don't think that would actually solve the unbounded buildup of
unfrozen pages. It would still be possible for pages to be marked all
visible before being frozen, and then end up being skipped until an
aggressive vacuum is forced, right?

Or did you mean vacuum_freeze_table_age?

Regards,
Jeff Davis

#13Peter Geoghegan
pg@bowt.ie
In reply to: Jeff Davis (#12)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Aug 30, 2022 at 11:28 PM Jeff Davis <pgsql@j-davis.com> wrote:

That clarifies your point. It's still a challenge for me to reason
about which of these potential new problems really need to be solved in
v1, though.

I don't claim to understand it that well myself -- not just yet.
I feel like I have the right general idea, but the specifics
aren't all there (which is very often the case for me at this
point in the cycle). That seems like a good basis for further
discussion.

It's going to be quite a few months before some version of this
patchset is committed, at the very earliest. Obviously these are
questions that need answers, but the process of getting to those
answers is a significant part of the work itself IMV.

Why stop at a couple of dozens of lines of code? Why not just change
the default of vacuum_freeze_min_age and
vacuum_multixact_freeze_min_age to 0?

I don't think that would actually solve the unbounded buildup of
unfrozen pages. It would still be possible for pages to be marked all
visible before being frozen, and then end up being skipped until an
aggressive vacuum is forced, right?

With the 15 work in place, and with the insert-driven autovacuum
behavior from 13, it is likely that this would be enough to avoid all
antiwraparound vacuums in an append-only table. There is still one
case where we can throw away the opportunity to advance relfrozenxid
during non-aggressive VACUUMs for no good reason -- I didn't fix them
all just yet. But the remaining case (which is in lazy_scan_skip()) is
very narrow.

With vacuum_freeze_min_age = 0 and vacuum_multixact_freeze_min_age =
0, any page that is eligible to be set all-visible is also eligible to
have its tuples frozen and be set all-frozen instead, immediately.
When it isn't then we'll scan it in the next VACUUM anyway.

Actually I'm also ignoring some subtleties with Multis that could make
this not quite happen, but again, that's only a super obscure corner case.
The idea that just setting vacuum_freeze_min_age = 0 and
vacuum_multixact_freeze_min_age = 0 will be enough is definitely true
in spirit. You don't need to touch vacuum_freeze_table_age (if you did
then you'd get aggressive VACUUMs, and one goal here is to avoid
those whenever possible -- especially aggressive antiwraparound
autovacuums).

--
Peter Geoghegan

#14Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#1)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Aug 25, 2022 at 2:21 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached patch series is a completely overhauled version of earlier
work on freezing. Related work from the Postgres 15 cycle became
commits 0b018fab, f3c15cbe, and 44fa8488.

Attached is v2.

This is just to keep CFTester happy, since v1 now has conflicts when
applied against HEAD. There are no notable changes in this v2 compared
to v1.

--
Peter Geoghegan

Attachments:

v2-0001-Add-page-level-freezing-to-VACUUM.patch (+230, -135)
v2-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch (+369, -126)
v2-0003-Add-eager-freezing-strategy-to-VACUUM.patch (+155, -19)
v2-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch (+35, -15)
#15Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#13)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Aug 31, 2022 at 12:03 AM Peter Geoghegan <pg@bowt.ie> wrote:

Actually I'm also ignoring some subtleties with Multis that could make
this not quite happen, but again, that's only a super obscure corner case.
The idea that just setting vacuum_freeze_min_age = 0 and
vacuum_multixact_freeze_min_age = 0 will be enough is definitely true
in spirit. You don't need to touch vacuum_freeze_table_age (if you did
then you'd get aggressive VACUUMs, and one goal here is to avoid
those whenever possible -- especially aggressive antiwraparound
autovacuums).

Attached is v3. There is a new patch included here -- v3-0004-*patch,
or "Unify aggressive VACUUM with antiwraparound VACUUM". No other
notable changes.

I decided to work on this now because it seems like it might give a
more complete picture of the high level direction that I'm pushing
towards. Perhaps this will make it easier to review the patch series
as a whole, even. The new patch unifies the concept of antiwraparound
VACUUM with the concept of aggressive VACUUM. Now there is only
antiwraparound and regular VACUUM (uh, barring VACUUM FULL). And now
antiwraparound VACUUMs are not limited to antiwraparound autovacuums
-- a manual VACUUM can also be antiwraparound (that's just the new
name for "aggressive").

We will typically only get antiwraparound vacuuming in a regular
VACUUM when the user goes out of their way to get that behavior.
VACUUM FREEZE is the best example. For the most part the
skipping/freezing strategy stuff has a good sense of what matters
already, and shouldn't need to be guided very often.

The patch relegates vacuum_freeze_table_age to a compatibility option,
making its default -1, meaning "just use autovacuum_freeze_max_age". I
always thought that having two table age based GUCs was confusing.
There was a period between 2006 and 2009 when we had
autovacuum_freeze_max_age, but did not yet have
vacuum_freeze_table_age. This change can almost be thought of as a
return to the simpler user interface that existed at that time. Of
course we must not resurrect the problems that vacuum_freeze_table_age
was intended to address (see originating commit 65878185) by mistake.
We need an improved version of the same basic concept, too.

The patch more or less replaces the table-age-aggressive-escalation
concept (previously implemented using vacuum_freeze_table_age) with
new logic that makes lazyvacuum.c's choice of skipping strategy *also*
depend upon table age -- it is now one more factor to be considered.
Both costs and benefits are weighed here. We now give just a little
weight to table age at a relatively early stage (XID-age-wise early),
and escalate from there. As the table's relfrozenxid gets older and
older, we give less and less weight to putting off the cost of
freezing. This general approach is possible because the false
dichotomy that is "aggressive vs non-aggressive" has mostly been
eliminated. This makes things less confusing for users and hackers.
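
One way to picture "one more factor to be considered" is a budget for extra
scanned pages that grows with table age. The curve below is purely an
assumption for illustration -- the actual algorithm in the patch is
unsettled, as noted next -- but it shows the shape of the escalation: a
little weight early, total dominance at autovacuum_freeze_max_age:

```c
#include <stdint.h>

/*
 * Hypothetical weighting (not the patch's actual formula): as
 * age(relfrozenxid) approaches the antiwraparound threshold, tolerate
 * an ever larger fraction of extra scanned pages in order to advance
 * relfrozenxid.  At age 0 we accept only a 5% overhead; at or past
 * autovacuum_freeze_max_age we accept any cost at all.
 */
static double
eager_scan_budget_frac(uint32_t table_age, uint32_t freeze_max_age)
{
    double t;

    if (table_age >= freeze_max_age)
        return 1.0;             /* table age dominates everything else */
    t = (double) table_age / (double) freeze_max_age;
    return 0.05 + t * (1.0 - 0.05);
}
```

Because the budget rises smoothly instead of flipping at a cliff, the old
"aggressive vs non-aggressive" dichotomy becomes a continuum.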

The details of the skipping-strategy-choice algorithm are still
unsettled in v3 (no real change there). ISTM that the important thing
is still the high level concepts. Jeff was slightly puzzled by the
emphasis placed on the cost model/strategy stuff, at least at one
point. Hopefully my intent will be made clearer by the ideas featured
in the new patch. The skipping strategy decision making process isn't
particularly complicated, but it now looks more like an optimization
problem of some kind or other.

It might make sense to go further in the same direction by making
"regular vs aggressive/antiwraparound" into a *strict* continuum. In
other words, it might make sense to get rid of the two remaining cases
where VACUUM conditions its behavior on whether this VACUUM operation
is antiwraparound/aggressive or not. I'm referring to the cleanup lock
skipping behavior around lazy_scan_noprune(), as well as the
PROC_VACUUM_FOR_WRAPAROUND no-auto-cancellation behavior enforced in
autovacuum workers. We will still need to keep roughly the same two
behaviors, but the timelines can be totally different. We must be
reasonably sure that the cure won't be worse than the disease -- I'm
aware of quite a few individual cases that felt that way [1].
Aggressive interventions can make sense, but they need to be
proportionate to the problem that's right in front of us. "Kicking the
can down the road" is often the safest and most responsible approach
-- it all depends on the details.

[1]: https://www.tritondatacenter.com/blog/manta-postmortem-7-27-2015 is
the most high profile example, but I have personally been called in
to deal with similar problems in the past
--
Peter Geoghegan

Attachments:

v3-0001-Add-page-level-freezing-to-VACUUM.patch (+231, -136)
v3-0005-Avoid-allocating-MultiXacts-during-VACUUM.patch (+35, -15)
v3-0003-Add-eager-freezing-strategy-to-VACUUM.patch (+162, -24)
v3-0004-Unify-aggressive-VACUUM-with-antiwraparound-VACUU.patch (+424, -500)
v3-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch (+375, -151)
#16Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#15)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Sep 8, 2022 at 1:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v3. There is a new patch included here -- v3-0004-*patch,
or "Unify aggressive VACUUM with antiwraparound VACUUM". No other
notable changes.

I decided to work on this now because it seems like it might give a
more complete picture of the high level direction that I'm pushing
towards. Perhaps this will make it easier to review the patch series
as a whole, even.

This needed to be rebased over the guc.c work recently pushed to HEAD.

Attached is v4. This isn't just to fix bitrot, though; I'm also
including one new patch -- v4-0006-*.patch. This small patch teaches
VACUUM to size dead_items while capping the allocation at the space
required for "scanned_pages * MaxHeapTuplesPerPage" item pointers. In
other words, we now use scanned_pages instead of rel_pages to cap the
size of dead_items, potentially saving quite a lot of memory. There is
no possible downside to this approach, because we already know exactly
how many pages will be scanned from the VM snapshot -- there is zero
added risk of a second pass over the indexes.
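
The cap computation is simple arithmetic. A hedged sketch (the constants
mirror MaxHeapTuplesPerPage for 8kB pages and sizeof(ItemPointerData), and
the real code also applies a minimum size and other clamps not shown here):

```c
#include <stdint.h>

#define MAX_HEAP_TUPLES_PER_PAGE 291 /* MaxHeapTuplesPerPage, 8kB pages */
#define ITEM_POINTER_SIZE 6          /* sizeof(ItemPointerData) */

/*
 * Cap the dead_items allocation by the number of pages this VACUUM will
 * actually scan -- known exactly and up front from the VM snapshot --
 * rather than by rel_pages.  All-frozen/all-visible pages that will be
 * skipped can never contribute dead items, so there is no added risk
 * of a second pass over the indexes.
 */
static uint64_t
dead_items_max_bytes(uint64_t scanned_pages, uint64_t mem_limit_bytes)
{
    uint64_t needed = scanned_pages * MAX_HEAP_TUPLES_PER_PAGE *
        ITEM_POINTER_SIZE;

    return (needed < mem_limit_bytes) ? needed : mem_limit_bytes;
}
```

A mostly-frozen table that only needs 100 pages scanned reserves well under
a megabyte, however large rel_pages happens to be.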

This is still only scratching the surface of what is possible with
dead_items. The visibility map snapshot concept can enable a far more
sophisticated approach to resource management in vacuumlazy.c. It
could help us to replace a simple array of item pointers (the current
dead_items array) with a faster and more space-efficient data
structure. Masahiko Sawada has done a lot of work on this recently, so
this may interest him.

We don't just have up-front knowledge of the total number of
scanned_pages with VM snapshots -- we also have up-front knowledge of
which specific pages will be scanned. So we have reliable information
about the final distribution of dead_items (which specific heap blocks
might have dead_items) right from the start. While this extra
information/context is not a totally complete picture, it still seems
like it could be very useful as a way of driving how some new
dead_items data structure compresses TIDs. That will depend on the
distribution of TIDs -- the final "heap TID key space".

VM snapshots could also make it practical for the new data structure
to spill to disk to avoid multiple index scans/passes by VACUUM.
Perhaps this will result in behavior that's similar to how hash joins
spill to disk -- having 90% of the memory required to do everything
in-memory *usually* has similar performance characteristics to just
doing everything in memory. Most individual TID lookups from
ambulkdelete() will find that the TID *doesn't* need to be deleted --
a little like a hash join with low join selectivity (the common case
for hash joins). It's not like a merge join + sort, where we must
either spill everything or nothing (a merge join can be better than a
hash join with high join selectivity).

--
Peter Geoghegan

Attachments:

v4-0001-Add-page-level-freezing-to-VACUUM.patchapplication/octet-stream; name=v4-0001-Add-page-level-freezing-to-VACUUM.patchDownload+200-108
v4-0003-Add-eager-freezing-strategy-to-VACUUM.patchapplication/octet-stream; name=v4-0003-Add-eager-freezing-strategy-to-VACUUM.patchDownload+162-24
v4-0006-Size-VACUUM-s-dead_items-space-using-VM-snapshot.patchapplication/octet-stream; name=v4-0006-Size-VACUUM-s-dead_items-space-using-VM-snapshot.patchDownload+12-15
v4-0005-Avoid-allocating-MultiXacts-during-VACUUM.patchapplication/octet-stream; name=v4-0005-Avoid-allocating-MultiXacts-during-VACUUM.patchDownload+35-15
v4-0004-Unify-aggressive-VACUUM-with-antiwraparound-VACUU.patchapplication/octet-stream; name=v4-0004-Unify-aggressive-VACUUM-with-antiwraparound-VACUU.patchDownload+424-500
v4-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patchapplication/octet-stream; name=v4-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patchDownload+375-151
#17John Naylor
john.naylor@enterprisedb.com
In reply to: Peter Geoghegan (#16)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Sep 14, 2022 at 12:53 AM Peter Geoghegan <pg@bowt.ie> wrote:

This is still only scratching the surface of what is possible with
dead_items. The visibility map snapshot concept can enable a far more
sophisticated approach to resource management in vacuumlazy.c. It
could help us to replace a simple array of item pointers (the current
dead_items array) with a faster and more space-efficient data
structure. Masahiko Sawada has done a lot of work on this recently, so
this may interest him.

I don't quite see how it helps "enable" that. It'd be more logical to
me to say the VM snapshot *requires* you to think harder about
resource management, since a palloc'd snapshot should surely be
counted as part of the configured memory cap that admins control.
(Commonly, it'll be less than a few dozen MB, so I'll leave that
aside.) Since Masahiko hasn't (to my knowledge) gone as far as
integrating his ideas into vacuum, I'm not sure if the current state
of affairs has some snag that a snapshot will ease, but if there is,
you haven't described what it is.

I do remember your foreshadowing in the radix tree thread a while
back, and I do think it's an intriguing idea to combine pages-to-scan
and dead TIDs in the same data structure. The devil is in the details,
of course. It's worth looking into.

VM snapshots could also make it practical for the new data structure
to spill to disk to avoid multiple index scans/passed by VACUUM.

I'm not sure spilling to disk is solving the right problem (as opposed
to the hash join case, or to the proposed conveyor belt system which
has a broader aim). I've found several times that a customer will ask
if raising maintenance work mem from 1GB to 10GB will make vacuum
faster. Looking at the count of index scans, it's pretty much always
"1", so even if the current approach could scale above 1GB, "no" it
wouldn't help to raise that limit.

Your mileage may vary, of course.

Continuing my customer example, searching the dead TID list faster
*will* make vacuum faster. The proposed tree structure is more memory
efficient, and IIUC could scale beyond 1GB automatically since each
node is a separate allocation, so the answer will be "yes" in the rare
case the current setting is in fact causing multiple index scans.
Furthermore, it doesn't have to anticipate the maximum size, so there
is no up front calculation assuming max-tuples-per-page, so it
automatically uses less memory for less demanding tables.

(But +1 for changing that calculation for as long as we do have the
single array.)

--
John Naylor
EDB: http://www.enterprisedb.com

#18Peter Geoghegan
pg@bowt.ie
In reply to: John Naylor (#17)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Sep 14, 2022 at 3:18 AM John Naylor
<john.naylor@enterprisedb.com> wrote:

On Wed, Sep 14, 2022 at 12:53 AM Peter Geoghegan <pg@bowt.ie> wrote:

This is still only scratching the surface of what is possible with
dead_items. The visibility map snapshot concept can enable a far more
sophisticated approach to resource management in vacuumlazy.c.

I don't quite see how it helps "enable" that.

I have already written a simple throwaway patch that can use the
current VM snapshot data structure (which is just a local copy of the
VM's pages) to do a cheap precheck ahead of actually doing a binary
search in dead_items -- if a TID's heap page is all-visible or
all-frozen (depending on the type of VACUUM) then we're 100%
guaranteed to not visit it, and so it's 100% guaranteed to not have
any dead_items (actually it could have LP_DEAD items by the time the
index scan happens, but they won't be in our dead_items array in any
case). Since we're working off of an immutable source, this
optimization is simple to implement already. Very simple.
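
A minimal sketch of that precheck (Python, illustrative names only;
not the actual throwaway patch): before binary-searching dead_items
for a TID, consult the local copy of the visibility map. A page whose
all-visible (or all-frozen) bit is set in the snapshot was never
scanned, so it cannot have contributed entries to dead_items.

```python
# Hypothetical sketch of a VM-snapshot precheck ahead of the usual
# binary search of the sorted dead-TID array.

from bisect import bisect_left

def is_dead(tid, dead_items, vm_all_visible):
    """tid is (blockno, offset); dead_items is a sorted list of such pairs."""
    blockno, _ = tid
    # Cheap precheck against the immutable VM snapshot: skipped pages
    # contributed no dead_items entries, so the search can be avoided.
    if vm_all_visible[blockno]:
        return False
    # Fall back to the usual binary search of the sorted TID array.
    i = bisect_left(dead_items, tid)
    return i < len(dead_items) and dead_items[i] == tid

vm = [True, False, True]         # per-page all-visible bits (snapshot)
dead = [(1, 4), (1, 9)]          # dead TIDs, only from scanned pages
assert is_dead((1, 4), dead, vm)
assert not is_dead((0, 2), dead, vm)   # page 0 skipped via the snapshot
```

Because the snapshot is immutable for the duration of the VACUUM, the
precheck can never give a stale answer for pages it says were skipped.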

I haven't even bothered to benchmark this throwaway patch (I literally
wrote it in 5 minutes to show Masahiko what I meant). I can't see why
even that throwaway prototype wouldn't significantly improve
performance, though. After all, the VM snapshot data structure is far
denser than dead_items, and the largest tables often have most heap
pages skipped via the VM.

I'm not really interested in pursuing this simple approach because it
conflicts with Masahiko's work on the data structure, and there are
other good reasons to expect that to help. Plus I'm already very busy
with what I have here.

> It'd be more logical to
> me to say the VM snapshot *requires* you to think harder about
> resource management, since a palloc'd snapshot should surely be
> counted as part of the configured memory cap that admins control.

That's clearly true -- it creates a new problem for resource
management that will need to be solved. But that doesn't mean that it
can't ultimately make resource management better and easier.

Remember, we don't randomly visit some skippable pages for no good
reason in the patch, since the SKIP_PAGES_THRESHOLD stuff is
completely gone. The VM snapshot isn't just a data structure that
vacuumlazy.c uses as it sees fit -- it's actually more like a set of
instructions on which pages to scan, that vacuumlazy.c *must* follow.
There is no way that vacuumlazy.c can accidentally pick up a few extra
dead_items here and there due to concurrent activity that unsets VM
pages. We don't need to leave that to chance -- it is locked in from
the start.

> I do remember your foreshadowing in the radix tree thread a while
> back, and I do think it's an intriguing idea to combine pages-to-scan
> and dead TIDs in the same data structure. The devil is in the details,
> of course. It's worth looking into.

Of course.

> Looking at the count of index scans, it's pretty much always
> "1", so even if the current approach could scale above 1GB, "no" it
> wouldn't help to raise that limit.

I agree that multiple index scans are rare. But I also think that
they're disproportionately involved in really problematic cases for
VACUUM. That said, I agree that simply making lookups to dead_items as
fast as possible is the single most important way to improve VACUUM by
improving dead_items.

> Furthermore, it doesn't have to anticipate the maximum size, so there
> is no up front calculation assuming max-tuples-per-page, so it
> automatically uses less memory for less demanding tables.

The final number of TIDs doesn't seem like the most interesting
information that VM snapshots could provide us when it comes to
building the dead_items TID data structure -- the *distribution* of
TIDs across heap pages seems much more interesting. The "shape" can be
known ahead of time, at least to some degree. It can help with
compression, which will reduce cache misses.
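
One way a known "shape" could help, sketched under purely hypothetical
assumptions (this is not the patch's or Masahiko's data structure):
when the snapshot says a page will be scanned and may hold many dead
items, the store can pick a compact per-page encoding, e.g. an offset
bitmap instead of a 6-byte ItemPointer per TID.

```python
# Illustrative sketch: per-page offset bitmap as a denser encoding
# than one 6-byte ItemPointer per dead TID.

def encode_page(offsets, max_offset=291):
    """Pack one page's dead-item offsets into a fixed-size bitmap."""
    bits = bytearray((max_offset + 7) // 8)
    for off in offsets:
        bits[off // 8] |= 1 << (off % 8)
    return bytes(bits)

def contains(bitmap, off):
    return bool(bitmap[off // 8] & (1 << (off % 8)))

page = encode_page([3, 7, 200])
assert contains(page, 7)
assert not contains(page, 8)
# 37 bytes for the bitmap vs 6 bytes per TID in a flat array: the
# bitmap wins once a page holds more than ~6 dead items.
```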

Andres made remarks about memory usage with sparse dead TID patterns
at this point on the "Improve dead tuple storage for lazy vacuum"
thread:

/messages/by-id/20210710025543.37sizjvgybemkdus@alap3.anarazel.de

I haven't studied the radix tree stuff in great detail, so I am
uncertain of how much the VM snapshot concept could help, and where
exactly it would help. I'm just saying that it seems promising,
especially as a way of addressing concerns like this.

--
Peter Geoghegan

#19John Naylor
john.naylor@enterprisedb.com
In reply to: Peter Geoghegan (#18)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Sep 14, 2022 at 11:33 PM Peter Geoghegan <pg@bowt.ie> wrote:

> On Wed, Sep 14, 2022 at 3:18 AM John Naylor
>
>> Furthermore, it doesn't have to anticipate the maximum size, so there
>> is no up front calculation assuming max-tuples-per-page, so it
>> automatically uses less memory for less demanding tables.
>
> The final number of TIDs doesn't seem like the most interesting
> information that VM snapshots could provide us when it comes to
> building the dead_items TID data structure -- the *distribution* of
> TIDs across heap pages seems much more interesting. The "shape" can be
> known ahead of time, at least to some degree. It can help with
> compression, which will reduce cache misses.

My point here was simply that spilling to disk is an admission of
failure to utilize memory efficiently and thus shouldn't be a selling
point of VM snapshots. Other selling points could still be valid.

--
John Naylor
EDB: http://www.enterprisedb.com

#20Peter Geoghegan
pg@bowt.ie
In reply to: John Naylor (#19)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Sep 15, 2022 at 12:09 AM John Naylor
<john.naylor@enterprisedb.com> wrote:

> On Wed, Sep 14, 2022 at 11:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
>> The final number of TIDs doesn't seem like the most interesting
>> information that VM snapshots could provide us when it comes to
>> building the dead_items TID data structure -- the *distribution* of
>> TIDs across heap pages seems much more interesting. The "shape" can be
>> known ahead of time, at least to some degree. It can help with
>> compression, which will reduce cache misses.
>
> My point here was simply that spilling to disk is an admission of
> failure to utilize memory efficiently and thus shouldn't be a selling
> point of VM snapshots. Other selling points could still be valid.

I was trying to explain the goals of this work in a way that was as
accessible as possible. It's not easy to get the high level ideas
across, let alone all of the details.

It's true that, up until now, I have largely ignored the question of
how VM snapshots will need to spill. There are several reasons for
this, most of which you could probably guess.
difficult to add *some* reasonable spilling behavior very soon; the
underlying access patterns are highly sequential and predictable, in
the obvious way.

--
Peter Geoghegan
