decoupling table and index vacuum

Started by Robert Haas · almost 5 years ago · 66 messages · pgsql-hackers
#1 Robert Haas
robertmhaas@gmail.com

Hi,

We are used to thinking about table vacuum and index vacuum as parts
of a single, indivisible operation. You vacuum the table -- among
other things by performing HOT pruning and remembering dead TIDs --
and then you vacuum the indexes -- removing the remembered TIDs from
the index -- and then you vacuum the table some more, setting those
dead TIDs unused -- and then you're done. And along the way you do
some other things too like considering truncation that aren't relevant
to the point I want to make here. Now, the problem with this is that
every index has its own needs, which are separate from the needs of
the tables, as I think Peter Geoghegan and Masahiko Sawada were also
discussing recently. Opportunistic index cleanup strategies like
kill_prior_tuple and bottom-up deletion may work much better for some
indexes than others, meaning that you could have some indexes that
badly need to be vacuumed because they are full of garbage, and other
indexes on the same table where the opportunistic cleanup has worked
perfectly and there is no need for vacuuming at all. Separately, the
table may or may not need to get some dead pointers set back to unused
to avoid table bloat.

But, as things stand today, strategy options to deal with such
situations are limited. Leaving aside what the code actually does
right now, let's talk about what options we have in theory with the
technology as it exists now. They basically all boil down to stopping
early and then redoing the work later. We must always start with a
pass over the heap to collect dead TIDs, because otherwise there's
nothing else to do. Now we can choose to stop, but then the next
VACUUM will have to collect all those TIDs again. It may get to skip
more all-visible pages than the current vacuum did, but the pages that
still have dead TIDs will all have to be visited again. If we don't
stop, then we can choose to vacuum all of the indexes or just some of
them, and then afterwards stop. But if we do this, the next VACUUM
will have to scan all indexes again for the same TIDs. Here, we don't
even have the visibility map to allow skipping known-clean pages, so
it's *really* a lot of work we have to redo. Thus what we normally do
is press on to the third step, where we mark dead line pointers unused
after scanning every index in its entirety, and now they're gone and
we don't have to worry about them again. Barring emergency escape
valves, as things stand today, the frequency of table vacuuming is the
same as the frequency of index vacuuming, even though the *required*
frequency of vacuuming is not the same, and also varies from index to
index.

Now, the reason for this is that when we discover dead TIDs, we only
record them in memory, not on disk. So, as soon as VACUUM ends, we
lose all knowledge of what those TIDs were and must rediscover
them. Suppose we didn't do this, and instead had a "dead TID" fork for
each table. Suppose further that this worked like a conveyor belt,
similar to WAL, where every dead TID we store into the fork is
assigned an identifying 64-bit number that is never reused. Then,
suppose that for each index, we store the number of the oldest entry
that might still need to be vacuumed from the index. Every time you
perform what we now call the first heap pass of a VACUUM, you add the
new TIDs you find to the dead TID fork. Every time you vacuum an
index, the TIDs that need to be removed are those between the
oldest-entry pointer for that index and the current end of the TID
fork. You remove all of those and then advance your oldest-entry
pointer accordingly. If that's too many TIDs to fit in
maintenance_work_mem, you can just read as many as will fit and
advance your oldest-entry pointer less far. Every time you perform
what we now call the second heap pass of a VACUUM, you find all the
TIDs that precede every index's oldest-entry pointer and set them
unused. You then throw away the associated storage at the OS level.
This requires a scheme where relations can be efficiently truncated
from the beginning rather than only at the end, which is why I said "a
conveyor belt" and "similar to WAL". Details deliberately vague since
I am just brainstorming here.
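Since the proposal is deliberately vague, here is one way the conveyor-belt bookkeeping could look as a toy Python sketch (not PostgreSQL code; all class and method names are hypothetical). The first heap pass appends dead TIDs with never-reused identifiers, each index vacuum advances its own oldest-entry pointer (possibly only partway, modeling a maintenance_work_mem limit), and the second heap pass reclaims everything behind the slowest pointer, truncating the belt from the front:

```python
from dataclasses import dataclass, field

@dataclass
class DeadTidConveyor:
    """Toy model of the proposed per-table 'dead TID' fork.

    Entries get monotonically increasing identifiers that are never
    reused; per-index pointers record the oldest entry each index might
    still need to have removed.  All names here are hypothetical.
    """
    entries: dict = field(default_factory=dict)       # entry_id -> TID
    next_id: int = 0
    index_oldest: dict = field(default_factory=dict)  # index -> first unprocessed id

    def append(self, tid):
        """First heap pass: remember a newly discovered dead TID on disk."""
        self.entries[self.next_id] = tid
        self.next_id += 1

    def vacuum_index(self, index, budget=None):
        """Remove from `index` every TID between its oldest-entry pointer
        and the current end of the belt.  A `budget` models a memory
        limit: read only that many and advance the pointer less far."""
        start = self.index_oldest.get(index, 0)
        end = self.next_id if budget is None else min(self.next_id, start + budget)
        removed = [self.entries[i] for i in range(start, end) if i in self.entries]
        self.index_oldest[index] = end
        return removed

    def second_heap_pass(self, all_indexes):
        """Set unused every TID preceding *every* index's pointer, then
        discard that storage from the front (the 'conveyor belt')."""
        horizon = min((self.index_oldest.get(ix, 0) for ix in all_indexes), default=0)
        return [self.entries.pop(i) for i in sorted(self.entries) if i < horizon]
```

Note how an index vacuumed with a small budget holds back the reclaim horizon for the whole table, which is exactly the coupling the per-index pointers make explicit.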

This scheme adds a lot of complexity, which is a concern, but it seems
to me that it might have several benefits. One is concurrency. You
could have one process gathering dead TIDs and adding them to the
dead-TID fork while another process is vacuuming previously-gathered
TIDs from some index. In fact, every index could be getting vacuumed
at the same time, and different indexes could be removing different
TID ranges. At the same time, you could have another process setting
dead TIDs that all indexes have previously removed to unused.
Furthermore, all of these operations can start in any order, and any
of them can be repeated any number of times during a single run of any
particular other one, or indeed, without that particular one ever
being run at all. Both heap phases can also now be done in smaller
chunks, if desired. You can gather TIDs from a portion of the table
and remember where you left off, and come back and pick up from that
point later, if you wish. You can likewise pick a subset of
dead-TIDs-retired-from-all-indexes to set unused, and do just that
many, and then at a later time come back and do some more. Also, you
can now make mostly-independent decisions about how to perform each of
these operations, too. It's not completely independent: if you need to
set some dead TIDs in the table to unused, you may have to force index
vacuuming that isn't needed for bloat control. However, you only need
to force it for indexes that haven't been vacuumed recently enough for
some other reason, rather than every index. If you have a target of
reclaiming 30,000 TIDs, you can just pick the indexes where there are
fewer than 30,000 dead TIDs behind their oldest-entry pointers and
force vacuuming only of those. By the time that's done, there will be
at least 30,000 dead line pointers you can mark unused, and maybe
more, minus whatever reclamation someone else did concurrently.
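The index-selection step in that last example is simple enough to sketch directly (a hypothetical helper, assuming the per-index oldest-entry pointers from the scheme above): to make the first N belt entries reclaimable, only indexes whose pointer has not yet passed entry N need forced vacuuming.

```python
def indexes_to_force(index_oldest, target_entry_id):
    """Return the indexes whose oldest-entry pointer has not yet reached
    `target_entry_id`.  Only these stand in the way of marking the first
    `target_entry_id` dead line pointers unused; indexes already past
    that point need no work.  Names are hypothetical."""
    return [ix for ix, ptr in index_oldest.items() if ptr < target_entry_id]
```

So with a target of 30,000 and an index whose pointer already sits at entry 50,000, that index is skipped entirely; only the laggards are dragged along.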

But is this worthwhile? I think it depends a lot on what you think the
comparative required frequencies are for the various operations. If
index A needs to be vacuumed every 40 minutes and index B needs to be
vacuumed every 55 minutes and the table that owns both of them needs
to be vacuumed every 70 minutes, I am not sure there is a whole lot
here. I think you will be pretty much equally well off if you just do
what we do today every 40 minutes and call it good. Also, you will not
benefit very much if the driving force is reclaiming dead line
pointers in the table itself. If that has to happen frequently, then
the indexes have to be scanned frequently, and this whole thing is a
lot of work for not much. But, maybe that's not the case. Suppose
index A needs to be vacuumed every hour to avoid bloat, index B needs
to be vacuumed every 4 hours to avoid bloat, and the table needs dead
line pointers reclaimed every 5.5 hours. Well, now you can gain a lot.
You can vacuum index A frequently while vacuuming index B only as
often as it needs, and you can reclaim dead line pointers on their own
schedule based on whatever index vacuuming was already done for bloat
avoidance. Without this scheme, there's just no way to give everybody
what they need without some of the participants being "dragged along
for the ride" and forced into work that they don't actually need done
simply because "that's how it works."

One thing I don't know is whether the kind of scenario that I describe
above is common, i.e. is the main reason we need to vacuum to control
index bloat, where this kind of approach seems likely to help, or is
it to reclaim dead line pointers in the heap, where it's not? I'd be
interested in hearing from people who have some experience in this
area, or at least better intuition than I do.

I'm interested in this idea partly because I think it would be much
more likely to help in a hypothetical world where we had global
indexes. Imagine a partitioned table where each partition has a local
index, and then there is also a global index which indexes tuples
from every partition. Waving away the difficulty of making such a
thing work, there's a vacuuming problem here, which has been discussed
before. In short, if you tried to handle this in the naive way, you'd
end up having to vacuum the global index every time you vacuumed any
partition. That sucks. Suppose that there are 1000 partitions, each
partition is 1GB, and each local index is 50MB. All things being
equal, the global index should end up being about as big as all of the
local indexes put together, which in this case would be 50GB. Clearly,
we do not want to vacuum each partition by scanning the 1GB partition
+ the 50MB local index + the 50GB global index. That's insane. With
the above system, since everything's decoupled, you can vacuum the
partition tables individually as often as required, and similarly for
their local indexes, but put off vacuuming the global index until
you've vacuumed a bunch, maybe all, of the partitions, so that the
work of cleaning up the global index cleans up dead TIDs from many or
all partitions instead of just one at a time.
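The deferral policy for the global index could be as simple as a threshold check (a hypothetical sketch; the names and the simple accumulate-then-sweep policy are mine, not part of the proposal): vacuum each partition and its local indexes eagerly, but only pay for a pass over the big global index once enough dead TIDs have piled up across partitions to amortize it.

```python
def plan_global_index_vacuum(pending_per_partition, threshold):
    """Decide whether a pass over the (large) global index is worthwhile
    yet.  `pending_per_partition` maps partition name -> dead TIDs
    accumulated but not yet removed from the global index; defer until
    the combined backlog crosses `threshold`, so one scan of the global
    index cleans up after many partitions at once.  Hypothetical names."""
    total = sum(pending_per_partition.values())
    return total >= threshold, total
```

Under this policy, vacuuming one 1GB partition never by itself triggers a 50GB global-index scan; the scan happens once per batch of partitions instead.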

Now, the fly in the ointment here is that this supposes that we don't
get forced into vacuuming the global index too quickly because of dead
line pointer accumulation. But, I think if that does happen, with
careful scheduling, we might not really be worse off than we would
have been without partitioning. If we scan the table for just one
partition and, say, exhaust maintenance_work_mem, we have some
expensive index vacuuming to do immediately, but that would've also
happened in pretty much the same way with an unpartitioned table. If
we don't fill maintenance_work_mem but we do notice that the table for
this partition is full of dead line pointers that we need to reclaim,
we can still choose to scan some other partitions and collect some
more dead TIDs before cleaning the global index. That could delay
actually getting those line pointers reclaimed, but an unpartitioned
table would have suffered from at least as much delay, because it
wouldn't even consider the possibility of stopping before scanning
every available table page, and we could choose to stop after dealing
with only some partitions but not all. It's probably tricky to get the
autovacuum algorithm right here, but there seems to be some room for
optimism.

Even if global indexes never happened, though, I think this could have
other benefits. For example, the wraparound failsafe mechanism
recently added by Masahiko Sawada and Peter Geoghegan bypasses index
vacuuming when wraparound danger is imminent. The only problem is that
making that decision means throwing away the accumulated list of dead
TIDs, which then need to be rediscovered whenever we get around to
vacuuming the indexes. But that's avoidable, if they're stored on disk
rather than in RAM.

One rather serious objection to this whole line of attack is that we'd
ideally like VACUUM to reclaim disk space without using any more, in
case a shortage of disk space is the motivation for running VACUUM in
the first place. A related objection is that if it's sometimes
agreeable to do everything all at
once as we currently do, the I/O overhead could be avoided. I think
we'd probably have to retain a code path that buffers the dead TIDs in
memory to account, at least, for the low-on-disk-space case, and maybe
that can also be used to avoid I/O in some other cases, too. I haven't
thought through all the details here. It seems to me that the actual
I/O avoidance is probably not all that much - each dead TID is much
smaller than the deleted tuple that gave rise to it, and typically you
don't delete all the tuples at once - but it might be material in some
cases, and it's definitely material if you don't have enough disk
space left for it to complete without error.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com

#2 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#1)
Re: decoupling table and index vacuum

Hi,

On 2021-04-21 11:21:31 -0400, Robert Haas wrote:

Opportunistic index cleanup strategies like
kill_prior_tuple and bottom-up deletion may work much better for some
indexes than others, meaning that you could have some indexes that
badly need to be vacuumed because they are full of garbage, and other
indexes on the same table where the opportunistic cleanup has worked
perfectly and there is no need for vacuuming at all.

Partial indexes are another case that can lead to individual indexes
being without bloat, with others severely bloated.

This requires a scheme where relations can be efficiently truncated
from the beginning rather than only at the end, which is why I said "a
conveyor belt" and "similar to WAL". Details deliberately vague since
I am just brainstorming here.

I'm not sure that's the only way to deal with this. While some form of
generic "conveyor belt" infrastructure would be a useful building block,
and it'd be sensible to use it here if it existed, it seems feasible to
store dead tids in a different way here. You could e.g. have per-heap-vacuum
files with a header containing LSNs that indicate the age of the
contents.

This scheme adds a lot of complexity, which is a concern, but it seems
to me that it might have several benefits. One is concurrency. You
could have one process gathering dead TIDs and adding them to the
dead-TID fork while another process is vacuuming previously-gathered
TIDs from some index.

I think it might even open the door to using multiple processes
gathering dead TIDs for the same relation.

In fact, every index could be getting vacuumed at the same time, and
different indexes could be removing different TID ranges.

We kind of have this feature right now, due to parallel vacuum...

It's not completely independent: if you need to set some dead TIDs in
the table to unused, you may have to force index vacuuming that isn't
needed for bloat control. However, you only need to force it for
indexes that haven't been vacuumed recently enough for some other
reason, rather than every index.

Hm - how would we know how recently that TID has been marked dead? We
don't even have xids for dead ItemIds... Maybe you're intending to
answer that in your next paragraph, but it's not obvious to me that'd be
sufficient...

If you have a target of reclaiming 30,000 TIDs, you can just pick the
indexes where there are fewer than 30,000 dead TIDs behind their
oldest-entry pointers and force vacuuming only of those. By the time
that's done, there will be at least 30,000 dead line pointers you can
mark unused, and maybe more, minus whatever reclamation someone else
did concurrently.

One thing that you didn't mention so far is that this'd allow us to add
dead TIDs to the "dead tid" file outside of vacuum too. In some
workloads most of the dead tuple removal happens as part of on-access
HOT pruning. While some indexes are likely to see that via the
killtuples logic, others may not. Being able to have more aggressive
index vacuum for the one or two bloated index, without needing to rescan
the heap, seems like it'd be a significant improvement.

Suppose index A needs to be vacuumed every hour to avoid bloat, index
B needs to be vacuumed every 4 hours to avoid bloat, and the table
needs dead line pointers reclaimed every 5.5 hours. Well, now you can
gain a lot. You can vacuum index A frequently while vacuuming index B
only as often as it needs, and you can reclaim dead line pointers on
their own schedule based on whatever index vacuuming was already done
for bloat avoidance. Without this scheme, there's just no way to give
everybody what they need without some of the participants being
"dragged along for the ride" and forced into work that they don't
actually need done simply because "that's how it works."

Have you thought about how we would do the scheduling of vacuums for the
different indexes? We don't really have useful stats for the number of
dead index entries to be expected in an index. It'd not be hard to track
how many entries are removed in an index via killtuples, but
e.g. estimating how many dead entries there are in a partial index seems
quite hard (at least without introducing significant overhead).

One thing I don't know is whether the kind of scenario that I describe
above is common, i.e. is the main reason we need to vacuum to control
index bloat, where this kind of approach seems likely to help, or is
it to reclaim dead line pointers in the heap, where it's not? I'd be
interested in hearing from people who have some experience in this
area, or at least better intuition than I do.

I think doing something like this has a fair bit of potential. Being
able to perform freezing independently of index scans, without needing
to scan the table again to re-discover dead line item pointers seems
like it'd be a win. More aggressive/targeted index vacuum in cases where
most tuples are removed via HOT pruning seems like a win. Not having to
restart from scratch after a cancelled autovacuum would be a
win. Additional parallelization seems like a win...

One rather serious objection to this whole line of attack is that we'd
ideally like VACUUM to reclaim disk space without using any more, in
case a shortage of disk space is the motivation for running VACUUM in
the first place.

I suspect we'd need a global limit of space used for this data. If above
that limit we'd switch to immediately performing the work required to
remove some of that space.

A related objection is that if it's sometimes agreeable to do
everything all at once as we currently do, the I/O overhead could be
avoided. I think we'd probably have to retain a code path that buffers
the dead TIDs in memory to account, at least, for the
low-on-disk-space case, and maybe that can also be used to avoid I/O
in some other cases, too.

We'd likely want to do some batching of insertions into the "dead tid"
map - which'd probably end up looking similar to a purely in-memory path
anyway.

Greetings,

Andres Freund

#3 Peter Geoghegan
In reply to: Robert Haas (#1)
Re: decoupling table and index vacuum

On Wed, Apr 21, 2021 at 8:21 AM Robert Haas <robertmhaas@gmail.com> wrote:

We are used to thinking about table vacuum and index vacuum as parts
of a single, indivisible operation. You vacuum the table -- among
other things by performing HOT pruning and remembering dead TIDs --
and then you vacuum the indexes -- removing the remembered TIDs from
the index -- and then you vacuum the table some more, setting those
dead TIDs unused -- and then you're done. And along the way you do
some other things too like considering truncation that aren't relevant
to the point I want to make here. Now, the problem with this is that
every index has its own needs, which are separate from the needs of
the tables, as I think Peter Geoghegan and Masahiko Sawada were also
discussing recently.

I'm very happy to see that you've taken an interest in this work! I
believe it's an important area. It's too important to be left to only
two contributors. I welcome your participation as an equal partner in
the broader project to fix problems with VACUUM.

Masahiko and I have had plenty of ideas about where this could go next
-- way too many ideas, in fact. Maybe that kind of partnership sounds
unnecessary or at least seems premature, but it turns out that this
area is extremely broad and far reaching, if you really think it
through -- you end up having to negotiate rather a lot all at once.
Apart from anything else, I simply don't have the authority to commit
a bunch of stuff that implicitly makes Postgres do things a certain
way in a huge number of different subsystems. (Whether or not I'd be
right in each case is beside the point.)

My most ambitious goal is finding a way to remove the need to freeze
or to set hint bits. I think that we can do this by inventing a new
kind of VACUUM just for aborted transactions, which doesn't do index
vacuuming. You'd need something like an ARIES-style dirty page table
to make this cheap -- so it's a little like UNDO, but not very much.
The basic idea is that eagerly cleaning up aborted transactions in an
autovacuum worker allows you to broadly assume that most blocks
contain definitely-committed heap tuples, or else LP_DEAD stubs (which
of course don't contain any XIDs). You'd still have something like
conventional VACUUM, which wouldn't change that much. Freezing is
largely implicit, but maybe you freeze tuples the old way if and only
if a backend dirties a "known-all-committed" block -- that can still be
expensive.

The visibility map still has an all-visible bit, but now it also has
an all-committed bit (or maybe it's a separate data structure). The
combination of all-visible and all-committed is precisely the same as
frozen, so you don't need a separate VM bit for that anymore.
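That equivalence is just a conjunction of the two bits. A minimal sketch, assuming Peter's hypothetical all-committed bit alongside the existing all-visible one (the flag values and names here are illustrative, not PostgreSQL's actual visibility-map layout):

```python
# Hypothetical visibility-map flags: ALL_VISIBLE exists today,
# ALL_COMMITTED is the new bit proposed above.
ALL_VISIBLE = 0b01
ALL_COMMITTED = 0b10

def is_effectively_frozen(vm_bits):
    """A page whose tuples are all visible *and* all known committed
    needs no separate 'frozen' bit under this design."""
    both = ALL_VISIBLE | ALL_COMMITTED
    return vm_bits & both == both
```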

Notice that this design doesn't change much about our basic approach
to transaction management. It just further decouples things.
Conventional VACUUMs are now only about garbage collection, and so can
be further optimized with that goal in mind. It's much easier to do
clever scheduling if VACUUM really only has to do garbage collection.

Opportunistic index cleanup strategies like
kill_prior_tuple and bottom-up deletion may work much better for some
indexes than others, meaning that you could have some indexes that
badly need to be vacuumed because they are full of garbage, and other
indexes on the same table where the opportunistic cleanup has worked
perfectly and there is no need for vacuuming at all.

I know I say this all the time these days, but it seems worth
repeating now: it is a qualitative difference, not a quantitative
difference. Bottom-up index deletion will frequently stop most indexes
on a table from growing by even one single block, while indexes that
cannot use the optimization (indexes that are logically modified by
UPDATE statements) might be hugely bloated. If this is the case during
one VACUUM operation, it's probably going to work like that with all
future VACUUM operations. It's abundantly clear that the current
quantitative approach just cannot be pushed much further.

But, as things stand today, strategy options to deal with such
situations are limited. Leaving aside what the code actually does
right now, let's talk about what options we have in theory with the
technology as it exists now. They basically all boil down to stopping
early and then redoing the work later. We must always start with a
pass over the heap to collect dead TIDs, because otherwise there's
nothing else to do. Now we can choose to stop, but then the next
VACUUM will have to collect all those TIDs again. It may get to skip
more all-visible pages than the current vacuum did, but the pages that
still have dead TIDs will all have to be visited again. If we don't
stop, then we can choose to vacuum all of the indexes or just some of
them, and then afterwards stop. But if we do this, the next VACUUM
will have to scan all indexes again for the same TIDs. Here, we don't
even have the visibility map to allow skipping known-clean pages, so
it's *really* a lot of work we have to redo. Thus what we normally do
is press on to the third step, where we mark dead line pointers unused
after scanning every index in its entirety, and now they're gone and
we don't have to worry about them again. Barring emergency escape
valves, as things stand today, the frequency of table vacuuming is the
same as the frequency of index vacuuming, even though the *required*
frequency of vacuuming is not the same, and also varies from index to
index.

I'm 100% in agreement about all of this.

Now, the reason for this is that when we discover dead TIDs, we only
record them in memory, not on disk. So, as soon as VACUUM ends, we
lose all knowledge of what those TIDs were and must rediscover
them. Suppose we didn't do this, and instead had a "dead TID" fork for
each table.

I had a similar idea myself recently -- clearly remembering the TIDs
that you haven't vacuumed to save work later on makes a lot of sense.
I didn't get very far with it, even in my own head, but I like the
direction you're taking it. Having it work a little like a queue makes
a lot of sense.

Suppose further that this worked like a conveyor belt,
similar to WAL, where every dead TID we store into the fork is
assigned an identifying 64-bit number that is never reused. Then,
suppose that for each index, we store the number of the oldest entry
that might still need to be vacuumed from the index. Every time you
perform what we now call the first heap pass of a VACUUM, you add the
new TIDs you find to the dead TID fork.

Maybe we can combine this known-dead-tid structure with the visibility
map. Index-only scans might be able to reason about blocks that are
mostly all-visible, but still have stub LP_DEAD line pointers that
this other structure knows about -- so you can get index-only scans
without requiring a full round of traditional vacuuming. Maybe there
is some opportunity like that, but not sure how to fit it in to
everything else right now.

Every time you vacuum an
index, the TIDs that need to be removed are those between the
oldest-entry pointer for that index and the current end of the TID
fork. You remove all of those and then advance your oldest-entry
pointer accordingly. If that's too many TIDs to fit in
maintenance_work_mem, you can just read as many as will fit and
advance your oldest-entry pointer less far. Every time you perform
what we now call the second heap pass of a VACUUM, you find all the
TIDs that precede every index's oldest-entry pointer and set them
unused. You then throw away the associated storage at the OS level.
This requires a scheme where relations can be efficiently truncated
from the beginning rather than only at the end, which is why I said "a
conveyor belt" and "similar to WAL". Details deliberately vague since
I am just brainstorming here.

This amounts to adding yet more decoupling -- which seems great to me.
Anything that gives us the option but not the obligation to perform
work either more lazily or more eagerly (whichever makes sense) seems
highly desirable to me. Especially if we can delay our decision until
the last possible point, when we can have relatively high confidence
that we know what we're doing. And especially if holding it open as an
option is pretty cheap (that's the point of remembering dead TIDs).

Furthermore, all of these operations can start in any order, and any
of them can be repeated any number of times during a single run of any
particular other one, or indeed, without that particular one ever
being run at all. Both heap phases can also now be done in smaller
chunks, if desired.

But is this worthwhile? I think it depends a lot on what you think the
comparative required frequencies are for the various operations.

There is a risk that you'll never think that any optimization is worth
it because each optimization seems marginal in isolation. Sometimes a
diversity of strategies is the real strategy. Let's say you have a
bunch of options that you can pick and choose from, with fallbacks and
with ways of course correcting even halfway through the VACUUM. It's
possible that that will work wonderfully well for a given complex user
workload, but if you subtract away *any one* of the strategies
suddenly things get much worse in some obvious high-level way. It's
entirely possible for a single table to have different needs in
different parts of the table.

Certainly works that way with indexes -- that much I can say for sure.

If index A needs to be vacuumed every 40 minutes and index B needs to be
vacuumed every 55 minutes and the table that owns both of them needs
to be vacuumed every 70 minutes, I am not sure there is a whole lot
here. I think you will be pretty much equally well off if you just do
what we do today every 40 minutes and call it good.

That's probably all true, but I think that an excellent heuristic is
to work hard to avoid really terrible outcomes, rather than trying
hard to get good outcomes. The extremes don't just matter -- they may
even be the only thing that matters.

If index A needs to be vacuumed about as frequently as index B anyway,
then the user happens to naturally be in a position where the current
simplistic scheduling works for them. Which is fine, as far as it
goes, but that's not really where we have problems. If you consider
the "qualitative, not quantitative" perspective, things change. It's
now pretty unlikely that all of the indexes on the same table will
have approximately the same needs -- except when there is very little
to do with indexes anyway, which is pretty much not interesting
anyway. Because workloads generally don't logically modify all indexed
columns within each UPDATE. They just don't tend to look like that in
practice.

Also, you will not
benefit very much if the driving force is reclaiming dead line
pointers in the table itself. If that has to happen frequently, then
the indexes have to be scanned frequently, and this whole thing is a
lot of work for not much. But, maybe that's not the case. Suppose
index A needs to be vacuumed every hour to avoid bloat, index B needs
to be vacuumed every 4 hours to avoid bloat, and the table needs dead
line pointers reclaimed every 5.5 hours. Well, now you can gain a lot.
You can vacuum index A frequently while vacuuming index B only as
often as it needs, and you can reclaim dead line pointers on their own
schedule based on whatever index vacuuming was already done for bloat
avoidance. Without this scheme, there's just no way to give everybody
what they need without some of the participants being "dragged along
for the ride" and forced into work that they don't actually need done
simply because "that's how it works."

Right. And, the differences between index A and index B will tend to
be pretty consistent and often much larger than this.

Many indexes would never have to be vacuumed, even with non-HOT
UPDATES due to bottom-up index deletion -- because they literally
won't even have one single page split for hours, while maybe one index
gets 3x larger in the same timeframe. Eventually you'll need to vacuum
the indexes all the same (not just the bloated index), but that's only
required to enable safely performing heap vacuuming. It's not so bad
if the non-bloated indexes won't be dirtied and if it's not so
frequent (dirtying pages is the real cost to keep under control here).

One thing I don't know is whether the kind of scenario that I describe
above is common, i.e. is the main reason we need to vacuum to control
index bloat, where this kind of approach seems likely to help, or is
it to reclaim dead line pointers in the heap, where it's not? I'd be
interested in hearing from people who have some experience in this
area, or at least better intuition than I do.

The paradox here is:

1. Workload characteristics are important and must be exploited to get
optimal performance.

2. Workloads are too complicated and unpredictable to ever truly understand.

Roughly speaking, the solution that I think has the most promise is to
make it okay for your heuristics to be wrong. You do this by keeping
the costs simple, fixed and low, and by doing things that have
multiple benefits (not just one). This is why it's important to give
VACUUM a bunch of strategies that it can choose from and switch back
and forth from, with minimal commitment -- you let VACUUM figure out
what to do about the workload through trial and error. It has to try
and fail on occasion -- you must be willing to pay the cost of
negative feedback (though the cost must be carefully managed). This
approach is perhaps sufficient to cover all of the possible extremes
with all workloads. I think that the extremes are where our problems
all are, or close to it.

The cost shouldn't be terribly noticeable because you have the
flexibility to change your mind at the first sign of an issue. So you
never pay an extreme cost (you pay a pretty low fixed cost
incrementally, at worst), but you do sometimes (and maybe even often)
get an extreme benefit -- the benefit of avoiding current pathological
performance issues. We know that the cost of bloat is very non-linear
in a bunch of ways that can be pretty well understood, so that seems
like the right thing to focus on -- this is perhaps the only thing
that we can expect to understand with a relatively high degree of
confidence. We can live with a lot of uncertainty about what's going
on with the workload by managing it continually, ramping up and down,
etc.

Clearly,
we do not want to vacuum each partition by scanning the 1GB partition
+ the 50MB local index + the 50GB global index. That's insane. With
the above system, since everything's decoupled, you can vacuum the
partition tables individually as often as required, and similarly for
their local indexes, but put off vacuuming the global index until
you've vacuumed a bunch, maybe all, of the partitions, so that the
work of cleaning up the global index cleans up dead TIDs from many or
all partitions instead of just one at a time.

I can't think of any other way of sensibly implementing global indexes.

Now, the fly in the ointment here is that this supposes that we don't
get forced into vacuuming the global index too quickly because of dead
line pointer accumulation. But, I think if that does happen, with
careful scheduling, we might not really be worse off than we would
have been without partitioning. If we scan the table for just one
partition and, say, exhaust maintenance_work_mem, we have some
expensive index vacuuming to do immediately, but that would've also
happened in pretty much the same way with an unpartitioned table.

But you can at least drop the partitions with a global index. It
shouldn't be too hard to make that work without breaking things.

It's probably tricky to get the
autovacuum algorithm right here, but there seems to be some room for
optimism.

I think that it's basically okay if global indexes suck when you do
lots of UPDATEs -- this is a limitation that users can probably live
with. What's not okay is if they suck when you do relatively few
UPDATEs. It's especially not okay to have to scan the global index to
delete one index tuple that points to one LP_DEAD item, since you tend
to get a tiny number of LP_DEAD items even when the DBA bends over
backwards to make all UPDATEs use HOT. Getting that to happen 99%+ of
the time is so much easier than getting it to happen 100% of the time.
There can be enormous asymmetry with this stuff.

Long term, I see VACUUM evolving into something that can only be
configured in an advisory way. It's too hard to tune this stuff
because what we really want here is to structure many things as an
optimization problem, and to have a holistic view that considers how
the table changes over time -- over multiple VACUUM operations. We can
safely be very lazy if we have some basic sense of proportion about
what the risk is. For example, maybe we limit the number of newly
dirtied pages during VACUUM by being lazy about pruning pages that
don't happen to be dirty when encountered within VACUUM. We still have
some sense of how much work we've put off, so as to never get in over
our head with debt. We might also have a sense of how many dirty pages
in total there are in the system currently -- maybe if the DB is not
busy right now we can be extra aggressive. In short, we apply holistic
thinking.
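To make that laziness concrete, here is a toy sketch (plain Python; the names and the threshold are invented for illustration and bear no relation to the real vacuumlazy.c code) of pruning only already-dirty pages while tracking the deferred work as explicit debt:

```python
# Toy model of "lazy pruning": only prune pages that are already dirty,
# unless the deferred-work debt has grown too large. All names and the
# threshold are invented for illustration; this is not Postgres code.

MAX_DEFERRED_PAGES = 100  # ceiling on how much pruning we may put off

def lazy_prune(pages, deferred_debt=0):
    """pages: list of (is_dirty, needs_prune) tuples, one per heap page.
    Returns (page numbers pruned now, updated debt)."""
    pruned_now = []
    for page_no, (is_dirty, needs_prune) in enumerate(pages):
        if not needs_prune:
            continue
        if is_dirty or deferred_debt >= MAX_DEFERRED_PAGES:
            pruned_now.append(page_no)   # no new dirty page, or debt too high
        else:
            deferred_debt += 1           # skip for now: keep the page clean
    return pruned_now, deferred_debt
```

The point of carrying `deferred_debt` forward is exactly the "never get in over our head with debt" idea: laziness stays bounded.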

Even if global indexes never happened, though, I think this could have
other benefits. For example, the wraparound failsafe mechanism
recently added by Masahiko Sawada and Peter Geoghegan bypasses index
vacuuming when wraparound danger is imminent. The only problem is that
making that decision means throwing away the accumulated list of dead
TIDs, which then need to be rediscovered whenever we get around to
vacuuming the indexes. But that's avoidable, if they're stored on disk
rather than in RAM.

Yeah, that's not ideal.

One rather serious objection to this whole line of attack is that we'd
ideally like VACUUM to reclaim disk space without using any more, in
case a shortage of disk space is the motivation for running VACUUM in
the first place. A related objection is that if it's sometimes
agreeable to do everything all at once as we currently do, the I/O
overhead could be avoided.

Of course it's possible that what we currently do will be optimal. But
it's pretty much a question of mostly-independent things all going the
same way. So I expect that it will be rare.

I think
we'd probably have to retain a code path that buffers the dead TIDs in
memory to account, at least, for the low-on-disk-space case, and maybe
that can also be used to avoid I/O in some other cases, too. I haven't
thought through all the details here. It seems to me that the actual
I/O avoidance is probably not all that much - each dead TID is much
smaller than the deleted tuple that gave rise to it, and typically you
don't delete all the tuples at once - but it might be material in some
cases, and it's definitely material if you don't have enough disk
space left for it to complete without error.

All true.

--
Peter Geoghegan

#4Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#1)
Re: decoupling table and index vacuum

On Wed, Apr 21, 2021 at 8:51 PM Robert Haas <robertmhaas@gmail.com> wrote:

Now, the reason for this is that when we discover dead TIDs, we only
record them in memory, not on disk. So, as soon as VACUUM ends, we
lose all knowledge of those TIDs and must rediscover them. Suppose we
didn't do this, and instead had a "dead TID" fork for each table.

Interesting idea.

However, you only need

to force it for indexes that haven't been vacuumed recently enough for
some other reason, rather than every index. If you have a target of
reclaiming 30,000 TIDs, you can just pick the indexes where there are
fewer than 30,000 dead TIDs behind their oldest-entry pointers and
force vacuuming only of those.

How do we decide this target? I mean, at a given point, how do we
decide the limit of dead TIDs at which we want to trigger index
vacuuming?

One rather serious objection to this whole line of attack is that we'd
ideally like VACUUM to reclaim disk space without using any more, in
case a shortage of disk space is the motivation for running VACUUM in
the first place. A related objection is that if it's sometimes
agreeable to do everything all at once as we currently do, the I/O
overhead could be avoided. I think
we'd probably have to retain a code path that buffers the dead TIDs in
memory to account, at least, for the low-on-disk-space case, and maybe
that can also be used to avoid I/O in some other cases, too. I haven't
thought through all the details here. It seems to me that the actual
I/O avoidance is probably not all that much - each dead TID is much
smaller than the deleted tuple that gave rise to it, and typically you
don't delete all the tuples at once - but it might be material in some
cases, and it's definitely material if you don't have enough disk
space left for it to complete without error.

Is it a good idea to always perform I/O after collecting the dead
TIDs, or should there be an option the user can configure so that it
aggressively vacuums all the indexes and this I/O overhead can be
avoided completely?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#5Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#3)
Re: decoupling table and index vacuum

On Thu, Apr 22, 2021 at 8:51 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Apr 21, 2021 at 8:21 AM Robert Haas <robertmhaas@gmail.com> wrote:

We are used to thinking about table vacuum and index vacuum as parts
of a single, indivisible operation. You vacuum the table -- among
other things by performing HOT pruning and remembering dead TIDs --
and then you vacuum the indexes -- removing the remembered TIDs from
the index -- and then you vacuum the table some more, setting those
dead TIDs unused -- and then you're done. And along the way you do
some other things too like considering truncation that aren't relevant
to the point I want to make here. Now, the problem with this is that
every index has its own needs, which are separate from the needs of
the tables, as I think Peter Geoghegan and Masahiko Sawada were also
discussing recently.

I'm very happy to see that you've taken an interest in this work! I
believe it's an important area. It's too important to be left to only
two contributors. I welcome your participation as an equal partner in
the broader project to fix problems with VACUUM.

+many

Now, the reason for this is that when we discover dead TIDs, we only
record them in memory, not on disk. So, as soon as VACUUM ends, we
lose all knowledge of those TIDs and must rediscover them. Suppose we
didn't do this, and instead had a "dead TID" fork for each table.

I had a similar idea myself recently -- clearly remembering the TIDs
that you haven't vacuumed to save work later on makes a lot of sense.
I didn't get very far with it, even in my own head, but I like the
direction you're taking it. Having it work a little like a queue makes
a lot of sense.

Agreed. (I now remembered I gave a talk about a similar idea at PGCon
a couple years ago).

Another good point of this "dead TID fork" design is that IIUC we
don't necessarily need to make it crash-safe. We would not need WAL
logging for remembering dead TIDs. If the server crashes, we can
simply throw it away and assume we haven't done the first heap pass
yet.

Suppose further that this worked like a conveyor belt,
similar to WAL, where every dead TID we store into the fork is
assigned an identifying 64-bit number that is never reused. Then,
suppose that for each index, we store the number of the oldest entry
that might still need to be vacuumed from the index. Every time you
perform what we now call the first heap pass of a VACUUM, you add the
new TIDs you find to the dead TID fork.

Maybe we can combine this known-dead-tid structure with the visibility
map. Index-only scans might be able to reason about blocks that are
mostly all-visible, but still have stub LP_DEAD line pointers that
this other structure knows about -- so you can get index-only scans
without requiring a full round of traditional vacuuming. Maybe there
is some opportunity like that, but not sure how to fit it in to
everything else right now.

Interesting idea.

Every time you vacuum an
index, the TIDs that need to be removed are those between the
oldest-entry pointer for that index and the current end of the TID
fork. You remove all of those and then advance your oldest-entry
pointer accordingly. If that's too many TIDs to fit in
maintenance_work_mem, you can just read as many as will fit and
advance your oldest-entry pointer less far. Every time you perform
what we now call the second heap pass of a VACUUM, you find all the
TIDs that precede every index's oldest-entry pointer and set them
unused. You then throw away the associated storage at the OS level.
This requires a scheme where relations can be efficiently truncated
from the beginning rather than only at the end, which is why I said "a
conveyor belt" and "similar to WAL". Details deliberately vague since
I am just brainstorming here.

The dead TID fork needs to also be efficiently searched. If the heap
scan runs twice, the dead TIDs collected by the two heap passes could
overlap. But we would not be able to merge them if we did index
vacuuming on one of the indexes between those two heap scans. The
second heap scan would need to record only TIDs that were not
collected by the first one.
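That deduplication could be done with a sorted merge if the fork keeps TIDs in order. A toy sketch (plain Python; the function name is invented, and TIDs are modeled as (block, offset) pairs, which sort naturally as tuples):

```python
# Toy sketch of folding the dead TIDs found by a second heap pass into
# the fork without duplicating TIDs the first pass already recorded.
# Invented names; not Postgres code.

def merge_new_tids(existing, new):
    """Both inputs are sorted lists of (block, offset). Returns the
    sorted union, recording a new TID only if not already present."""
    seen = set(existing)
    merged = list(existing)
    for tid in new:
        if tid not in seen:
            merged.append(tid)
            seen.add(tid)
    return sorted(merged)

first_pass = [(1, 2), (1, 5), (4, 1)]
second_pass = [(1, 5), (4, 1), (9, 3)]   # overlaps with the first pass
```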

Clearly,
we do not want to vacuum each partition by scanning the 1GB partition
+ the 50MB local index + the 50GB global index. That's insane. With
the above system, since everything's decoupled, you can vacuum the
partition tables individually as often as required, and similarly for
their local indexes, but put off vacuuming the global index until
you've vacuumed a bunch, maybe all, of the partitions, so that the
work of cleaning up the global index cleans up dead TIDs from many or
all partitions instead of just one at a time.

I can't think of any other way of sensibly implementing global indexes.

Given that we could load all dead TIDs from many or all partitions,
having the dead TIDs in memory in an efficient format would also
become important.

It's probably tricky to get the
autovacuum algorithm right here, but there seems to be some room for
optimism.

I think that it's basically okay if global indexes suck when you do
lots of UPDATEs -- this is a limitation that users can probably live
with. What's not okay is if they suck when you do relatively few
UPDATEs. It's especially not okay to have to scan the global index to
delete one index tuple that points to one LP_DEAD item.

Right. Given decoupled index vacuuming, I think the index's garbage
statistics become important, and they should preferably be fetchable
without accessing the indexes. It would not be hard to estimate how
many index tuples could be deleted by looking at the dead TID fork,
but that doesn't necessarily match the actual number.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#6Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#2)
Re: decoupling table and index vacuum

On Wed, Apr 21, 2021 at 5:38 PM Andres Freund <andres@anarazel.de> wrote:

I'm not sure that's the only way to deal with this. While some form of
generic "conveyor belt" infrastructure would be a useful building block,
and it'd be sensible to use it here if it existed, it seems feasible
to store dead TIDs in a different way here. You could e.g. have
per-heap-vacuum files with a header containing LSNs that indicate the
age of the contents.

That's true, but I have some reservations about being overly reliant on
the filesystem to provide structure here. There are good reasons to be
worried about bloating the number of files in the data directory. Hmm,
but maybe we could mitigate that. First, we could skip this for small
relations. If you can vacuum the table and all of its indexes using
the naive algorithm in <10 seconds, you probably shouldn't do anything
fancy. That would *greatly* reduce the number of additional files
generated. Second, we could forget about treating them as separate
relation forks and make them some other kind of thing entirely, in a
separate directory, especially if we adopted Sawada-san's proposal to
skip WAL logging. I don't know if that proposal is actually a good
idea, because it effectively adds a performance penalty when you crash
or fail over, and that sort of thing can be an unpleasant surprise.
But it's something to think about.

This scheme adds a lot of complexity, which is a concern. It's not
completely independent: if you need to set some dead TIDs in
the table to unused, you may have to force index vacuuming that isn't
needed for bloat control. However, you only need to force it for
indexes that haven't been vacuumed recently enough for some other
reason, rather than every index.

Hm - how would we know how recently that TID has been marked dead? We
don't even have xids for dead ItemIds... Maybe you're intending to
answer that in your next paragraph, but it's not obvious to me that'd be
sufficient...

You wouldn't know anything about when things were added in terms of
wall clock time, but the idea was that TIDs get added in order and
stay in that order. So you know which ones were added first. Imagine a
conceptually infinite array of TIDs:

(17,5) (332,6) (5, 1) (2153,92) ....

Each index keeps a pointer into this array. Initially it points to the
start of the array, here (17,5). If an index vacuum starts after
(17,5) and (332,6) have been added to the array but before (5,1) is
added, then upon completion it updates its pointer to point to (5,1).
If every index is pointing to (5,1) or some later element, then you
know that (17,5) and (332,6) can be set LP_UNUSED. If not, and you
want to get to a state where you CAN set (17,5) and (332,6) to
LP_UNUSED, you just need to force index vac on indexes that are
pointing to something prior to (5,1) -- and keep forcing it until
those pointers reach (5,1) or later.
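A tiny simulation of those rules (plain Python; the names are invented for illustration and have nothing to do with any actual patch):

```python
# Toy model of the conveyor belt: dead TIDs are appended in order, and
# each index keeps an offset pointing at the oldest entry it still has
# to vacuum. Invented names, for illustration only.

dead_tids = [(17, 5), (332, 6), (5, 1), (2153, 92)]

# Index A was vacuumed after (17,5) and (332,6) were added, so it
# points at offset 2, i.e. (5,1); index B has not been vacuumed yet.
index_pointers = {"idx_a": 2, "idx_b": 0}

# A TID may be set LP_UNUSED only once every index has moved past it.
reclaimable = dead_tids[:min(index_pointers.values())]

# To make (17,5) and (332,6) reclaimable, force index vacuuming of any
# index whose pointer is still before offset 2, until it catches up.
needs_forced_vacuum = [n for n, p in index_pointers.items() if p < 2]
```

Here nothing is reclaimable yet, because idx_b still points at the start of the array; forcing a vacuum of idx_b alone is sufficient.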

One thing that you didn't mention so far is that this'd allow us to add
dead TIDs to the "dead tid" file outside of vacuum too. In some
workloads most of the dead tuple removal happens as part of on-access
HOT pruning. While some indexes are likely to see that via the
killtuples logic, others may not. Being able to have more aggressive
index vacuum for the one or two bloated index, without needing to rescan
the heap, seems like it'd be a significant improvement.

Oh, that's a very interesting idea. It does impose some additional
requirements on any such system, though, because it means you have to
be able to efficiently add single TIDs. For example, you mention a
per-heap-VACUUM file above, but you can't get away with creating a new
file per HOT prune no matter how you arrange things at the FS level.
Actually, though, I think the big problem here is deduplication. A
full-blown VACUUM can perhaps read all the already-known-to-be-dead
TIDs into some kind of data structure and avoid re-adding them, but
that's impractical for a HOT prune.

Have you thought about how we would do the scheduling of vacuums for the
different indexes? We don't really have useful stats for the number of
dead index entries to be expected in an index. It'd not be hard to track
how many entries are removed in an index via killtuples, but
e.g. estimating how many dead entries there are in a partial index seems
quite hard (at least without introducing significant overhead).

No, I don't have any good ideas about that, really. Partial indexes
seem like a hard problem, and so do GIN indexes or other kinds of
things where you may have multiple index entries per heap tuple. We
might have to accept some known-to-be-wrong approximations in such
cases.

One rather serious objection to this whole line of attack is that we'd
ideally like VACUUM to reclaim disk space without using any more, in
case a shortage of disk space is the motivation for running VACUUM in
the first place.

I suspect we'd need a global limit of space used for this data. If above
that limit we'd switch to immediately performing the work required to
remove some of that space.

I think that's entirely the wrong approach. On the one hand, it
doesn't prevent you from running out of disk space during emergency
maintenance, because the disk overall can be full even though you're
below your quota of space for this particular purpose. On the other
hand, it does subject you to random breakage when your database gets
big enough that the critical information can't be stored within the
configured quota. I think we'd end up with pathological cases very
much like what used to happen with the fixed-size free space map. What
happened there was that your database got big enough that you couldn't
track all the free space any more and it just started bloating out the
wazoo. What would happen here is that you'd silently lose the
well-optimized version of VACUUM when your database gets too big. That
does not seem like something anybody wants.

--
Robert Haas
EDB: http://www.enterprisedb.com

#7Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#3)
Re: decoupling table and index vacuum

On Wed, Apr 21, 2021 at 7:51 PM Peter Geoghegan <pg@bowt.ie> wrote:

I'm very happy to see that you've taken an interest in this work! I
believe it's an important area. It's too important to be left to only
two contributors. I welcome your participation as an equal partner in
the broader project to fix problems with VACUUM.

Err, thanks. I agree this needs broad discussion and participation.

My most ambitious goal is finding a way to remove the need to freeze
or to set hint bits. I think that we can do this by inventing a new
kind of VACUUM just for aborted transactions, which doesn't do index
vacuuming. You'd need something like an ARIES-style dirty page table
to make this cheap -- so it's a little like UNDO, but not very much.

I don't see how that works. An aborted transaction can have made index
entries, and those index entries can have already been moved by page
splits, and there can be arbitrarily many of them, so that you can't
keep track of them all in RAM. Also, you can crash after making the
index entries and writing them to the disk and before the abort
happens. Anyway, this is probably a topic for a separate thread.

I know I say this all the time these days, but it seems worth
repeating now: it is a qualitative difference, not a quantitative
difference.

For the record, I find your quantitative vs. qualitative distinction
to be mostly unhelpful in understanding what's actually going on here.
I've backed into it by reading the explanatory statements you've made
at various times (including here, in the part I didn't quote). But
that phrase in and of itself means very little to me. Other people's
mileage may vary, of course; I'm just telling you how I feel about it.

Right. And, the differences between index A and index B will tend to
be pretty consistent and often much larger than this.

Many indexes would never have to be vacuumed, even with non-HOT
UPDATES due to bottom-up index deletion -- because they literally
won't even have one single page split for hours, while maybe one index
gets 3x larger in the same timeframe. Eventually you'll need to vacuum
the indexes all the same (not just the bloated index), but that's only
required to enable safely performing heap vacuuming. It's not so bad
if the non-bloated indexes won't be dirtied and if it's not so
frequent (dirtying pages is the real cost to keep under control here).

Interesting.

The cost shouldn't be terribly noticeable because you have the
flexibility to change your mind at the first sign of an issue. So you
never pay an extreme cost (you pay a pretty low fixed cost
incrementally, at worst), but you do sometimes (and maybe even often)
get an extreme benefit -- the benefit of avoiding current pathological
performance issues. We know that the cost of bloat is very non-linear
in a bunch of ways that can be pretty well understood, so that seems
like the right thing to focus on -- this is perhaps the only thing
that we can expect to understand with a relatively high degree of
confidence. We can live with a lot of uncertainty about what's going
on with the workload by managing it continually, ramping up and down,
etc.

I generally agree. You want to design a system in a way that's going
to do a good job avoiding pathological cases. The current system is
kind of naive about that. It does things that work well in
middle-of-the-road cases, but often does stupid things in extreme
cases. There are numerous examples of that; one is the "useless
vacuuming" problem about which I've blogged in
http://rhaas.blogspot.com/2020/02/useless-vacuuming.html where the
system keeps on vacuuming because relfrozenxid is old but doesn't
actually succeed in advancing it, so that it's just spinning to no
purpose. Another thing is when it picks the "wrong" thing to do first,
focusing on a less urgent problem rather than a more urgent one. This
can go either way: we might spend a lot of energy cleaning up bloat
when a wraparound shutdown is imminent, but we also might spend a lot
of energy dealing with a wraparound issue that's not yet urgent while
some table bloats out of control. I think it's important not to let
the present discussion get overbroad; we won't be able to solve
everything at once, and trying to do too many things at the same time
will likely result in instability.

Clearly,
we do not want to vacuum each partition by scanning the 1GB partition
+ the 50MB local index + the 50GB global index. That's insane. With
the above system, since everything's decoupled, you can vacuum the
partition tables individually as often as required, and similarly for
their local indexes, but put off vacuuming the global index until
you've vacuumed a bunch, maybe all, of the partitions, so that the
work of cleaning up the global index cleans up dead TIDs from many or
all partitions instead of just one at a time.

I can't think of any other way of sensibly implementing global indexes.

Awesome.

--
Robert Haas
EDB: http://www.enterprisedb.com

#8Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#6)
Re: decoupling table and index vacuum

On Thu, Apr 22, 2021 at 9:15 AM Robert Haas <robertmhaas@gmail.com> wrote:

Have you thought about how we would do the scheduling of vacuums for the
different indexes? We don't really have useful stats for the number of
dead index entries to be expected in an index. It'd not be hard to track
how many entries are removed in an index via killtuples, but
e.g. estimating how many dead entries there are in a partial index seems
quite hard (at least without introducing significant overhead).

No, I don't have any good ideas about that, really. Partial indexes
seem like a hard problem, and so do GIN indexes or other kinds of
things where you may have multiple index entries per heap tuple. We
might have to accept some known-to-be-wrong approximations in such
cases.

I think that you're both missing very important subtleties here.
Apparently the "quantitative vs qualitative" distinction I like to
make hasn't cleared it up.

You both seem to be assuming that everything would be fine if you
could somehow inexpensively know the total number of undeleted dead
tuples in each index at all times. But I don't think that that's true
at all. I don't mean that it might not be true. What I mean is that
it's usually a meaningless number *on its own*, at least if you assume
that every index is an nbtree index (or uses some other index AM that
has the same index deletion capabilities).

My mental models for index bloat usually involve imagining an
idealized version of a real world bloated index -- I compare the
empirical reality against an imagined idealized version. I then try to
find optimizations that make the reality approximate the idealized
version. Say a version of the same index in a traditional 2PL database
without MVCC, or in real world Postgres with VACUUM that magically
runs infinitely fast.

Bottom-up index deletion usually leaves a huge number of
undeleted-though-dead index tuples untouched for hours, even when it
works perfectly. 10% - 30% of the index tuples might be
undeleted-though-dead at any given point in time (traditional B-Tree
space utilization math generally ensures that there is about that much
free space on each leaf page if we imagine no version churn/bloat --
we *naturally* have a lot of free space to work with). These are
"Schrodinger's dead index tuples". You could count them
mechanistically, but then you'd be counting index tuples that are
"already dead and deleted" in an important theoretical sense, despite
the fact that they are not yet literally deleted. That's why bottom-up
index deletion frequently avoids 100% of all unnecessary page splits.
The asymmetry that was there all along was just crazy. I merely had
the realization that it was there and could be exploited -- I didn't
create or invent the natural asymmetry.

A similar issue exists within vacuumlazy.c (though that might matter a
lot less). Notice that it counts a recently dead heap tuple in its
new_dead_tuples accounting, even though the workload can probably
delete that tuple in a just-in-time fashion opportunistically. Might
we be better off recognizing that such a heap tuple is already morally
dead and gone, even if that isn't literally true? (That's a harder
argument to make, and I'm not making it right now -- it's just an
example.)

--
Peter Geoghegan

#9Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#4)
Re: decoupling table and index vacuum

On Thu, Apr 22, 2021 at 7:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

How do we decide this target? I mean, at a given point, how do we
decide the limit of dead TIDs at which we want to trigger index
vacuuming?

It's tricky. Essentially, it's a cost-benefit analysis. On the "cost"
side, the expense associated with an index vacuum is basically the
number of pages that we're going to visit, and the number of those
that we're going to dirty. We can know the former with certainty but
can only estimate the latter. On the "benefit" side, setting dead TIDs
unused helps us in two ways. First, it lets us mark heap pages
all-visible, which makes index-only scans work better and reduces the
cost of future vacuuming. These benefits are mitigated by future DML
unsetting those bits again; there's no point in marking a page
all-visible if it's about to be modified again. Second, it avoids line
pointer bloat. Dead line pointers still take up space on the page, and
potentially force the line pointer array to be extended, which can
eventually cause tuples that would have fit into the page to spill
into a different page, possibly a newly-allocated one that forces a
table extension.

It's hard to compare the costs to the benefits because we don't know
the frequency of access. Suppose it costs $1000 to vacuum relevant
indexes and set dead line pointers unused. And suppose that if you do
it, you thereafter will save $1 every time someone does an index-only
scan. If there will be >1000 index-only scans in a meaningful time
frame, it's a good trade-off, but if not, it's a bad one, but it's
difficult to predict the future, and we have limited information even
about the past.

My intuition is that two things that we want to consider are the total
number of dead line pointers in the heap, and the number of pages
across which they are spread. It is also my intuition that the latter
is the more important number, possibly to the extent that we could
ignore the former number completely. But exactly what the thresholds
should be is very unclear to me.

Is it a good idea to always perform I/O after collecting the dead
TIDs, or should there be an option the user can configure so that it
aggressively vacuums all the indexes and this I/O overhead can be
avoided completely?

It's my view that there should definitely be such an option.

As I also mentioned on another thread recently, suppose we pick words
for each phase of vacuum. For the sake of argument, suppose we refer
to the first heap phase as PRUNE, the index phase as SANITIZE, and the
second heap phase as RECYCLE. Then you can imagine syntax like this:

VACUUM (PRUNE) my_table;
VACUUM (SANITIZE) my_table; -- all indexes
VACUUM my_index; -- must be sanitize only
VACUUM (PRUNE, SANITIZE, RECYCLE) my_table; -- do everything

Now in the last case it is clearly possible for the system to do
everything in memory since all phases are being performed, but
depending on what we decide, maybe it will choose to use the dead-TID
fork in some cases for some reason or other. If so, we might have
explicit syntax to override that behavior, e.g.

VACUUM (PRUNE, SANITIZE, RECYCLE, TID_STORE 0) my_table;

which might be able to be abbreviated, depending on how we set the
defaults, to just:

VACUUM (TID_STORE 0) my_table;

This is all just hypothetical syntax and probably needs a good deal of
polish and bike-shedding. But it would be really nice to standardize
on some set of terms like prune/sanitize/recycle or whatever we pick,
because then we could document what they mean, use them in autovacuum
log messages, use them internally for function names, use them for
VACUUM option names when we get to that point, etc. and the whole
thing would be a good deal more comprehensible than at present.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#8)
Re: decoupling table and index vacuum

Hi,

On 2021-04-22 11:30:21 -0700, Peter Geoghegan wrote:

I think that you're both missing very important subtleties here.
Apparently the "quantitative vs qualitative" distinction I like to
make hasn't cleared it up.

I'm honestly getting a bit annoyed about this stuff. Yes it's a cool
improvement, but no, it doesn't mean that there aren't still relevant
issues in important cases. It doesn't help that you repeatedly imply
that people that don't see it your way need to have their view "cleared
up".

"Bottom up index deletion" is practically *irrelevant* for a significant
set of workloads.

You both seem to be assuming that everything would be fine if you
could somehow inexpensively know the total number of undeleted dead
tuples in each index at all times.

I don't think we'd need an exact number. Just a reasonable approximation
so we know whether it's worth spending time vacuuming some index.

But I don't think that that's true at all. I don't mean that it might
not be true. What I mean is that it's usually a meaningless number *on
its own*, at least if you assume that every index is either an nbtree
index or an index that uses some other index AM with the same index
deletion capabilities.

You also have to assume that you have roughly evenly distributed index
insertions and deletions. But workloads that insert into some parts of a
value range and delete from another range are common.

I even would say that *precisely* because "Bottom up index deletion" can
be very efficient in some workloads it is useful to have per-index stats
determining whether an index should be vacuumed or not.

My mental models for index bloat usually involve imagining an
idealized version of a real world bloated index -- I compare the
empirical reality against an imagined idealized version. I then try to
find optimizations that make the reality approximate the idealized
version. Say a version of the same index in a traditional 2PL database
without MVCC, or in real world Postgres with VACUUM that magically
runs infinitely fast.

Bottom-up index deletion usually leaves a huge number of
undeleted-though-dead index tuples untouched for hours, even when it
works perfectly. 10% - 30% of the index tuples might be
undeleted-though-dead at any given point in time (traditional B-Tree
space utilization math generally ensures that there is about that much
free space on each leaf page if we imagine no version churn/bloat --
we *naturally* have a lot of free space to work with). These are
"Schrodinger's dead index tuples". You could count them
mechanistically, but then you'd be counting index tuples that are
"already dead and deleted" in an important theoretical sense, despite
the fact that they are not yet literally deleted. That's why bottom-up
index deletion frequently avoids 100% of all unnecessary page splits.
The asymmetry that was there all along was just crazy. I merely had
the realization that it was there and could be exploited -- I didn't
create or invent the natural asymmetry.
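A rough back-of-the-envelope version of that space utilization math
(assuming the classic result that random insertions leave B-Tree leaf
pages about ln(2) = ~69% full; the tuple size below is purely
illustrative):

```python
# Rough arithmetic behind "10% - 30% undeleted-though-dead is fine":
# with random-insertion leaf utilization of ~ln(2), each leaf page
# naturally carries ~30% slack, so a comparable fraction of dead-but-
# not-yet-deleted index tuples need not cause any page splits.
import math

PAGE_SIZE = 8192      # PostgreSQL block size
TUPLE_SIZE = 20       # illustrative average index tuple + line pointer

avg_utilization = math.log(2)                 # ~0.693
tuples_per_full_page = PAGE_SIZE // TUPLE_SIZE
live_per_page = int(tuples_per_full_page * avg_utilization)
slack_per_page = tuples_per_full_page - live_per_page
slack_fraction = slack_per_page / tuples_per_full_page

# slack_fraction comes out around 0.3: the natural buffer that can
# hold "Schrodinger's dead index tuples" without growing the index.
print(f"~{slack_fraction:.0%} of each leaf page is natural slack")
```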

Except that heap bloat not index bloat might be the more pressing
concern. Or that there will be no meaningful amount of bottom-up
deletions. Or ...

Greetings,

Andres Freund

#11Robert Haas
robertmhaas@gmail.com
In reply to: Masahiko Sawada (#5)
Re: decoupling table and index vacuum

On Thu, Apr 22, 2021 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

The dead TID fork needs to also be efficiently searched. If the heap
scan runs twice, the collected dead TIDs on each heap pass could
overlap. But we would not be able to merge them if we did index
vacuuming on one of the indexes between those two heap scans. The
second heap scan would need to record only TIDs that were not
collected by the first heap scan.

I agree that there's a problem here. It seems to me that it's probably
possible to have a dead TID fork that implements "throw away the
oldest stuff" efficiently, and it's probably also possible to have a
TID fork that can be searched efficiently. However, I am not sure that
it's possible to have a dead TID fork that does both of those things
efficiently. Maybe you have an idea. My intuition is that if we have
to pick one, it's MUCH more important to be able to throw away the
oldest stuff efficiently. I think we can work around the lack of
efficient lookup, but I don't see a way to work around the lack of an
efficient operation to discard the oldest stuff.
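A minimal sketch of a structure that makes "throw away the oldest
stuff" cheap (a toy conveyor-belt model; all names are invented, and
the real design would use on-disk segments rather than Python lists):

```python
from collections import deque

class DeadTidConveyor:
    """Toy conveyor belt of dead TIDs: append-only chunks, with O(1)
    truncation of the oldest chunk (like unlinking a segment file).
    Lookup is deliberately naive, reflecting the trade-off discussed
    above: efficient discard of old data, inefficient search."""

    def __init__(self, chunk_capacity=4):
        self.chunk_capacity = chunk_capacity
        self.chunks = deque([[]])        # oldest chunk at the left

    def add(self, tid):
        if len(self.chunks[-1]) >= self.chunk_capacity:
            self.chunks.append([])       # start a new chunk
        self.chunks[-1].append(tid)

    def discard_oldest_chunk(self):
        """Drop the oldest chunk wholesale, never rewriting data."""
        if len(self.chunks) > 1:
            self.chunks.popleft()
        else:
            self.chunks[0].clear()

    def contains(self, tid):             # O(n): the weak side
        return any(tid in chunk for chunk in self.chunks)
```

The point of the sketch is that discarding is a metadata-only
operation on whole chunks, which is why it composes badly with a
search-optimized layout.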

Right. Given decoupled index vacuuming, I think the index's garbage
statistics are important, and they preferably need to be fetchable
without accessing the indexes. It would not be hard to estimate how
many index tuples might be able to be deleted by looking at the dead
TID fork, but that doesn't necessarily match the actual number.

Right, and to appeal (I think) to Peter's quantitative vs. qualitative
principle, it could be way off. Like, we could have a billion dead
TIDs and in one index the number of index entries that need to be
cleaned out could be 1 billion and in another index it could be zero
(0). We know how much data we will need to scan because we can fstat()
the index, but there seems to be no easy way to estimate how many of
those pages we'll need to dirty, because we don't know how successful
previous opportunistic cleanup has been. It is not impossible, as
Peter has pointed out a few times now, that it has worked perfectly
and there will be no modifications required, but it is also possible
that it has done nothing.

--
Robert Haas
EDB: http://www.enterprisedb.com

#12Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#11)
Re: decoupling table and index vacuum

Hi,

On 2021-04-22 14:47:14 -0400, Robert Haas wrote:

On Thu, Apr 22, 2021 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Right. Given decoupled index vacuuming, I think the index's garbage
statistics are important, and they preferably need to be fetchable
without accessing the indexes. It would not be hard to estimate how
many index tuples might be able to be deleted by looking at the dead
TID fork, but that doesn't necessarily match the actual number.

Right, and to appeal (I think) to Peter's quantitative vs. qualitative
principle, it could be way off. Like, we could have a billion dead
TIDs and in one index the number of index entries that need to be
cleaned out could be 1 billion and in another index it could be zero
(0). We know how much data we will need to scan because we can fstat()
the index, but there seems to be no easy way to estimate how many of
those pages we'll need to dirty, because we don't know how successful
previous opportunistic cleanup has been.

That aspect seems reasonably easy to fix: We can start to report the
number of opportunistically deleted index entries to pgstat. At least
nbtree already performs the actual deletion in bulk and we already have
(currently unused) space in the pgstat entries for it, so I don't think
it'd meaningfully increase overhead. And it'd improve insight into how
indexes operate significantly, even leaving autovacuum etc aside.

Greetings,

Andres Freund

#13Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#10)
Re: decoupling table and index vacuum

On Thu, Apr 22, 2021 at 11:44 AM Andres Freund <andres@anarazel.de> wrote:

I'm honestly getting a bit annoyed about this stuff.

You're easily annoyed.

Yes it's a cool
improvement, but no, it doesn't mean that there aren't still relevant
issues in important cases. It doesn't help that you repeatedly imply
that people that don't see it your way need to have their view "cleared
up".

I don't think that anything that I've said about it contradicts
anything that you or Robert said. What I said was that you're missing a
couple of important subtleties (or that you seem to be). It's not
really about the optimization in particular -- it's about the
subtleties that it exploits. I think that they're generalizable. Even
if there was only a 1% chance of that being true, it would still be
worth exploring in depth.

I think that everybody's beliefs about VACUUM tend to be correct. It
almost doesn't matter if scenario A is the problem in 90% of cases
versus 10% of cases for scenario B (or vice-versa). What actually
matters is that we have good handling for both. (It's probably some
weird combination of scenario A and scenario B in any case.)

"Bottom up index deletion" is practically *irrelevant* for a significant
set of workloads.

You're missing the broader point. Which is that we don't know how much
it helps in each case, just as we don't know how much some other
complementary optimization helps. It's important to develop
complementary techniques precisely because (say) bottom-up index
deletion only solves one class of problem. And because it's so hard to
predict.

I actually went on at length about the cases that the optimization
*doesn't* help. Because that'll be a disproportionate source of
problems now. And you really need to avoid all of the big sources of
trouble to get a really good outcome. Avoiding each and every source
of trouble might be much much more useful than avoiding all but one.

You both seem to be assuming that everything would be fine if you
could somehow inexpensively know the total number of undeleted dead
tuples in each index at all times.

I don't think we'd need an exact number. Just a reasonable approximation
so we know whether it's worth spending time vacuuming some index.

I agree.

You also have to assume that you have roughly evenly distributed index
insertions and deletions. But workloads that insert into some parts of a
value range and delete from another range are common.

I even would say that *precisely* because "Bottom up index deletion" can
be very efficient in some workloads it is useful to have per-index stats
determining whether an index should be vacuumed or not.

Exactly!

Except that heap bloat not index bloat might be the more pressing
concern. Or that there will be no meaningful amount of bottom-up
deletions. Or ...

Exactly!

--
Peter Geoghegan

#14Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#13)
Re: decoupling table and index vacuum

On Thu, Apr 22, 2021 at 3:11 PM Peter Geoghegan <pg@bowt.ie> wrote:

I think that everybody's beliefs about VACUUM tend to be correct. It
almost doesn't matter if scenario A is the problem in 90% of cases
versus 10% of cases for scenario B (or vice-versa). What actually
matters is that we have good handling for both. (It's probably some
weird combination of scenario A and scenario B in any case.)

I agree strongly with this. In fact, I seem to remember saying similar
things to you in the past. If something wins $1 in 90% of cases and
loses $5 in 10% of cases, is it a good idea? Well, it depends on how
the losses are distributed. If every user can be expected to hit both
winning and losing cases with approximately those frequencies, then
yes, it's a good idea, because everyone will come out ahead on
average. But if 90% of users will see only wins and 10% of users will
see only losses, it sucks.
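Spelled out as a toy calculation (the dollar figures are from the
example above, not from any actual vacuum cost model):

```python
# Same expected value, very different distributions:
# +$1 in 90% of cases, -$5 in 10% of cases.
win, lose = 1.0, -5.0
p_win, p_lose = 0.9, 0.1

# If every user sees the mix, everyone nets +$0.40 per case.
expected_per_case = win * p_win + lose * p_lose

# If 90% of users only ever win and 10% only ever lose, the unlucky
# 10% lose $5 on every single case; the average is no comfort to them.
lucky_user_per_case = win
unlucky_user_per_case = lose
```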

That being said, I don't know what this really has to do with the
proposal on the table, except in the most general sense. If you're
just saying that decoupling stuff is good because different indexes
have different needs, I am in agreement, as I said in my OP. It sort
of sounded like you were saying that it's not important to try to
estimate the number of undeleted dead tuples in each index, which
puzzled me, because while knowing doesn't mean everything is
wonderful, not knowing it sure seems worse. But I guess maybe that's
not what you were saying, so I don't know. I feel like we're in danger
of drifting into discussions about whether we're disagreeing with each
other rather than, as I would like, focusing on how to design a system
for $SUBJECT.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#6)
Re: decoupling table and index vacuum

Hi,

On 2021-04-22 12:15:27 -0400, Robert Haas wrote:

On Wed, Apr 21, 2021 at 5:38 PM Andres Freund <andres@anarazel.de> wrote:

I'm not sure that's the only way to deal with this. While some form of
generic "conveyor belt" infrastructure would be a useful building block,
and it'd be sensible to use it here if it existed, it seems feasible to
store dead tids in a different way here. You could e.g. have per-heap-vacuum
files with a header containing LSNs that indicate the age of the
contents.

That's true, but I have some reservations about being overly reliant on
the filesystem to provide structure here. There are good reasons to be
worried about bloating the number of files in the data directory. Hmm,
but maybe we could mitigate that. First, we could skip this for small
relations. If you can vacuum the table and all of its indexes using
the naive algorithm in <10 seconds, you probably shouldn't do anything
fancy. That would *greatly* reduce the number of additional files
generated. Second, we could forget about treating them as separate
relation forks and make them some other kind of thing entirely, in a
separate directory

I'm not *too* worried about this issue. IMO the big difference to the
cost of additional relation forks is that such files would only exist
when the table is modified to a somewhat meaningful degree. IME the
practical issues with the number of files due to forks arise in cases
where a huge number of tables exist that are practically never modified.

That's not to say that I am sure that some form of "conveyor belt"
storage *wouldn't* be the right thing. How were you thinking of dealing
with the per-relation aspects of this? One conveyor belt per relation?

especially if we adopted Sawada-san's proposal to skip WAL logging. I
don't know if that proposal is actually a good idea, because it
effectively adds a performance penalty when you crash or fail over,
and that sort of thing can be an unpleasant surprise. But it's
something to think about.

I'm doubtful about skipping WAL logging entirely - I'd have to think
harder about it, but I think that'd mean we'd restart from scratch after
crashes / immediate restarts as well, because we couldn't rely on the
contents of the "dead tid" files to be accurate. In addition to the
replication issues you mention.

One thing that you didn't mention so far is that this'd allow us to add
dead TIDs to the "dead tid" file outside of vacuum too. In some
workloads most of the dead tuple removal happens as part of on-access
HOT pruning. While some indexes are likely to see that via the
killtuples logic, others may not. Being able to have more aggressive
index vacuum for the one or two bloated index, without needing to rescan
the heap, seems like it'd be a significant improvement.

Oh, that's a very interesting idea. It does impose some additional
requirements on any such system, though, because it means you have to
be able to efficiently add single TIDs. For example, you mention a
per-heap-VACUUM file above, but you can't get away with creating a new
file per HOT prune no matter how you arrange things at the FS level.

I agree that it'd be an issue, even though I think it's not too common
that only one tuple gets pruned. It might be possible to have a
per-relation file per backend or such... But yes, we'd definitely have
to think about it.

I've previously pondered adding some cross-page batching and deferring
of hot pruning in the read case, which I guess might be more
advantageous with this.

The main reason for thinking about batching & deferring of HOT pruning
is that I found during the AIO work that there's speed gains to be had
if we pad xlog pages instead of partially filling them - obviously
risking increasing WAL usage. One idea to reduce the cost of that was to
fill the padded space with actually useful things, like FPIs or hot
pruning records. A related speedup opportunity with AIO is to perform
useful work while waiting for WAL flushes during commit (i.e. after
initiating IO to flush the commit record, but before that IO has
completed).

Actually, though, I think the big problem here is deduplication. A
full-blown VACUUM can perhaps read all the already-known-to-be-dead
TIDs into some kind of data structure and avoid re-adding them, but
that's impractical for a HOT prune.

What is there to deduplicate during HOT pruning? It seems that hot
pruning would need to log all items that it marks dead, but nothing
else? And VACUUM can't yet have put those items onto the dead tuple
map, because they weren't dead yet?

This actually brings to the fore a question I vaguely had: How are
you assuming indexes would access the list of dead tids? As far as I can
see the on-disk data would not be fully sorted even without adding
things during HOT pruning - the dead tids from a single heap pass will
be, but there'll be tids from multiple passes, right?

Are you assuming that we'd read the data into memory and then merge-sort
between each of the pre-sorted "runs"? Or that we'd read and cache parts
of the on-disk data during index checks?
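The merge-sort reading of that question can be sketched with a k-way
merge (a toy version in Python; the real on-disk format and access
pattern are exactly what is still unsettled here):

```python
import heapq

def merge_tid_runs(runs):
    """k-way merge of per-heap-pass TID runs. Each run is sorted by
    (block, offset); the merged stream is what an index vacuum pass
    would consume, deduplicating TIDs that appear in more than one
    run (e.g. because two heap passes overlapped)."""
    merged = []
    for tid in heapq.merge(*runs):
        if not merged or merged[-1] != tid:
            merged.append(tid)
    return merged
```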

Have you thought about how we would do the scheduling of vacuums for the
different indexes? We don't really have useful stats for the number of
dead index entries to be expected in an index. It'd not be hard to track
how many entries are removed in an index via killtuples, but
e.g. estimating how many dead entries there are in a partial index seems
quite hard (at least without introducing significant overhead).

No, I don't have any good ideas about that, really. Partial indexes
seem like a hard problem, and so do GIN indexes or other kinds of
things where you may have multiple index entries per heap tuple. We
might have to accept some known-to-be-wrong approximations in such
cases.

The gin case seems a bit easier than the partial index case. Keeping
stats about the number of new entries in a GIN index doesn't seem too
hard, nor does tracking the number of cleaned up index entries. But
knowing which indexes are affected when a heap tuple becomes dead seems
harder. I guess we could just start doing a stats-only version of
ExecInsertIndexTuples() for deletes, but obviously the cost of that is
not enticing. Perhaps it'd not be too bad if we only did it when there's
an index with predicates?

One rather serious objection to this whole line of attack is that we'd
ideally like VACUUM to reclaim disk space without using any more, in
case a disk-space shortage is the motivation for running VACUUM in the
first place.

I suspect we'd need a global limit of space used for this data. If above
that limit we'd switch to immediately performing the work required to
remove some of that space.

I think that's entirely the wrong approach. On the one hand, it
doesn't prevent you from running out of disk space during emergency
maintenance, because the disk overall can be full even though you're
below your quota of space for this particular purpose. On the other
hand, it does subject you to random breakage when your database gets
big enough that the critical information can't be stored within the
configured quota.

What random breakage are you thinking of? I'm not thinking of a hard
limit that may not be crossed at any cost, by even a single byte, but
that [auto]VACUUMs would start to be more aggressive about performing
index vacuums once the limit is reached.

I think we'd end up with pathological cases very much like what used
to happen with the fixed-size free space map. What happened there was
that your database got big enough that you couldn't track all the free
space any more and it just started bloating out the wazoo. What would
happen here is that you'd silently lose the well-optimized version of
VACUUM when your database gets too big. That does not seem like
something anybody wants.

I don't think the consequences would really be that comparable. Once the
FSM size was reached in the bad old days, we'd just lose track of
free space. Whereas here we'd start to be more aggressive about cleaning
up once the "dead tids" data reaches a certain size. Of course that
would have efficiency impacts, but I think "global free space wasted" is
a valid input in deciding when to perform index vacuums.

I think max_wal_size has worked out pretty well, even if not perfect.

Greetings,

Andres Freund

#16Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#14)
Re: decoupling table and index vacuum

On Thu, Apr 22, 2021 at 12:27 PM Robert Haas <robertmhaas@gmail.com> wrote:

I agree strongly with this. In fact, I seem to remember saying similar
things to you in the past. If something wins $1 in 90% of cases and
loses $5 in 10% of cases, is it a good idea? Well, it depends on how
the losses are distributed. If every user can be expected to hit both
winning and losing cases with approximately those frequencies, then
yes, it's a good idea, because everyone will come out ahead on
average. But if 90% of users will see only wins and 10% of users will
see only losses, it sucks.

Right. It's essential that we not disadvantage any workload by more
than a small fixed amount (and only with a huge upside elsewhere).

The even more general version is this: the average probably doesn't
even exist in any meaningful sense.

Bottom-up index deletion tends to be effective either 100% of the time
or 0% of the time, which varies on an index by index basis. Does that
mean we should split the difference, and assume that it's effective
50% of the time? Clearly not. Clearly that particular framing is just
wrong. And clearly it basically doesn't matter if it's half of all
indexes, or a quarter, or none, whatever. Because it's all of those
proportions, and also because who cares.

That being said, I don't know what this really has to do with the
proposal on the table, except in the most general sense. If you're
just saying that decoupling stuff is good because different indexes
have different needs, I am in agreement, as I said in my OP.

Mostly what I'm saying is that I would like to put together a rough
list of things that we could do to improve VACUUM along the lines
we've discussed -- all of which stem from $SUBJECT. There are
literally dozens of goals (some of which are quite disparate) that we
could conceivably set out to pursue under the banner of $SUBJECT.
Ideally there would be soft agreement about which ideas were more
promising. Ideally we'd avoid painting ourselves into a corner with
respect to one of these goals, in pursuit of any other goal.

I suspect that we'll need somewhat more of a top-down approach to this
work, which is something that we as a community don't have much
experience with. It might be useful to set the parameters of the
discussion up-front, which seems weird to me too, but might actually
help. (A lot of the current problems with VACUUM seem like they might
be consequences of pgsql-hackers not usually working like this.)

It sort
of sounded like you were saying that it's not important to try to
estimate the number of undeleted dead tuples in each index, which
puzzled me, because while knowing doesn't mean everything is
wonderful, not knowing it sure seems worse. But I guess maybe that's
not what you were saying, so I don't know.

I agree that it matters that we are able to characterize how bloated a
partial index is, because an improved VACUUM implementation will need
to know that. My main point about that was that it's complicated in
surprising ways that actually matter. An approximate solution seems
quite possible to me, but I think that that will probably have to
involve the index AM directly.

Sometimes 10% - 30% of the extant physical index tuples will be dead
and it'll be totally fine in every practical sense -- the index won't
have grown by even one page since the last VACUUM! Other times it
might be as few as 2% - 5% that are now dead when VACUUM is
considered, which will in fact be a serious problem (e.g., it's
concentrated in one part of the keyspace, say). I would say that
having some rough idea of which case we have on our hands is extremely
important here. Even if the distinction only arises in rare cases
(though FWIW I don't think that these differences will be rare at
all).

(I also tried to clarify what I mean about qualitative bloat in
passing in my response about the case of a bloated partial index.)

I feel like we're in danger
of drifting into discussions about whether we're disagreeing with each
other rather than, as I would like, focusing on how to design a system
for $SUBJECT.

While I am certainly guilty of being kind of hand-wavy and talking
about lots of stuff all at once here, it's still kind of unclear what
practical benefits you hope to attain through $SUBJECT. Apart from the
thing about global indexes, which matters but is hardly the
overwhelming reason to do all this. I myself don't expect your goals
to be super crisp just yet. As I said, I'm happy to talk about it in
very general terms at first -- isn't that what you were doing
yourself?

Or did I misunderstand -- are global indexes mostly all that you're
thinking about here? (Even if they are all you care about, it still
seems like you're still somewhat obligated to generalize the dead TID
fork/map thing to help with a bunch of other things, just to justify
the complexity of adding a dead TID relfork.)

--
Peter Geoghegan

#17Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#7)
Re: decoupling table and index vacuum

On Thu, Apr 22, 2021 at 11:16 AM Robert Haas <robertmhaas@gmail.com> wrote:

My most ambitious goal is finding a way to remove the need to freeze
or to set hint bits. I think that we can do this by inventing a new
kind of VACUUM just for aborted transactions, which doesn't do index
vacuuming. You'd need something like an ARIES-style dirty page table
to make this cheap -- so it's a little like UNDO, but not very much.

I don't see how that works. An aborted transaction can have made index
entries, and those index entries can have already been moved by page
splits, and there can be arbitrarily many of them, so that you can't
keep track of them all in RAM. Also, you can crash after making the
index entries and writing them to the disk and before the abort
happens. Anyway, this is probably a topic for a separate thread.

This is a topic for a separate thread, but I will briefly address your question.

Under the scheme I've sketched, we never do index vacuuming when
invoking an autovacuum worker (or something like it) to clean-up after
an aborted transaction. We track the pages that all transactions have
modified. If a transaction commits then we quickly discard the
relevant dirty page table metadata. If a transaction aborts
(presumably a much rarer event), then we launch an autovacuum worker
that visits precisely those heap blocks that were modified by the
aborted transaction, and just prune each page, one by one. We have a
cutoff that works a little like relfrozenxid, except that it tracks
the point in the XID space before which we know any XIDs (any XIDs
that we can read from extant tuple headers) must be committed.

The idea of a "Dirty page table" is standard ARIES. It'd be tricky to
get it working, but still quite possible.

The overall goal of this design is for the system to be able to reason
about committed-ness inexpensively (to obviate the need for hint bits
and per-tuple freezing). We want to be able to say for sure that
almost all heap blocks in the database only contain heap tuples whose
headers contain only committed XIDs, or LP_DEAD items that are simply
dead (the exact provenance of these LP_DEAD items is not a concern,
just like today). The XID cutoff for committed-ness could be kept
quite recent due to the fact that aborted transactions are naturally
rare. And because we can do relatively little work to "logically roll
back" aborted transactions.

Note that a heap tuple whose xmin and xmax are committed might also be
dead under this scheme, since of course it might have been updated or
deleted by an xact that committed. We've effectively decoupled things
by making aborted transactions special, and subject to very eager
cleanup.
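A toy model of that bookkeeping (all names invented; a real ARIES-style
dirty page table would live in shared memory and be rebuilt from WAL
after a crash):

```python
class DirtyPageTable:
    """Toy per-transaction dirty page tracking for the scheme sketched
    above: commits discard their entry cheaply; aborts hand back
    exactly the heap blocks an eager prune-only worker would visit."""

    def __init__(self):
        self.pages_by_xid = {}

    def record_modification(self, xid, block_number):
        self.pages_by_xid.setdefault(xid, set()).add(block_number)

    def on_commit(self, xid):
        # Common case: just forget the metadata.
        self.pages_by_xid.pop(xid, None)

    def on_abort(self, xid):
        # Rare case: return the blocks to prune, one by one.
        return sorted(self.pages_by_xid.pop(xid, set()))
```

The asymmetry is the point: committed transactions (the common case)
cost one dictionary deletion, while aborted ones pay for targeted
pruning of only the pages they touched.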

I'm sure that there are significant challenges with making something
like this work. But to me this design seems roughly the right
combination of radical and conservative.

--
Peter Geoghegan

#18Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#17)
Getting rid of freezing and hint bits by eagerly vacuuming aborted xacts (was: decoupling table and index vacuum)

On Thu, Apr 22, 2021 at 3:52 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Apr 22, 2021 at 11:16 AM Robert Haas <robertmhaas@gmail.com> wrote:

My most ambitious goal is finding a way to remove the need to freeze
or to set hint bits. I think that we can do this by inventing a new
kind of VACUUM just for aborted transactions, which doesn't do index
vacuuming. You'd need something like an ARIES-style dirty page table
to make this cheap -- so it's a little like UNDO, but not very much.

I don't see how that works. An aborted transaction can have made index
entries, and those index entries can have already been moved by page
splits, and there can be arbitrarily many of them, so that you can't
keep track of them all in RAM. Also, you can crash after making the
index entries and writing them to the disk and before the abort
happens. Anyway, this is probably a topic for a separate thread.

This is a topic for a separate thread, but I will briefly address your question.

Under the scheme I've sketched, we never do index vacuuming when
invoking an autovacuum worker (or something like it) to clean-up after
an aborted transaction. We track the pages that all transactions have
modified. If a transaction commits then we quickly discard the
relevant dirty page table metadata. If a transaction aborts
(presumably a much rarer event), then we launch an autovacuum worker
that visits precisely those heap blocks that were modified by the
aborted transaction, and just prune each page, one by one. We have a
cutoff that works a little like relfrozenxid, except that it tracks
the point in the XID space before which we know any XIDs (any XIDs
that we can read from extant tuple headers) must be committed.

The idea of a "Dirty page table" is standard ARIES. It'd be tricky to
get it working, but still quite possible.

The overall goal of this design is for the system to be able to reason
about committed-ness inexpensively (to obviate the need for hint bits
and per-tuple freezing). We want to be able to say for sure that
almost all heap blocks in the database only contain heap tuples whose
headers contain only committed XIDs, or LP_DEAD items that are simply
dead (the exact provenance of these LP_DEAD items is not a concern,
just like today). The XID cutoff for committed-ness could be kept
quite recent due to the fact that aborted transactions are naturally
rare. And because we can do relatively little work to "logically roll
back" aborted transactions.

Note that a heap tuple whose xmin and xmax are committed might also be
dead under this scheme, since of course it might have been updated or
deleted by an xact that committed. We've effectively decoupled things
by making aborted transactions special, and subject to very eager
cleanup.

I'm sure that there are significant challenges with making something
like this work. But to me this design seems roughly the right
combination of radical and conservative.

I'll start a new thread now, as a placeholder for further discussion.

This would be an incredibly ambitious project, and I'm sure that this
thread will be very hand-wavy at first. But you've got to start
somewhere.

--
Peter Geoghegan

#19Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Robert Haas (#11)
Re: decoupling table and index vacuum

On Fri, Apr 23, 2021 at 3:47 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 22, 2021 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

The dead TID fork needs to also be efficiently searched. If the heap
scan runs twice, the collected dead TIDs on each heap pass could
overlap. But we would not be able to merge them if we did index
vacuuming on one of the indexes between those two heap scans. The
second heap scan would need to record only TIDs that were not
collected by the first heap scan.

I agree that there's a problem here. It seems to me that it's probably
possible to have a dead TID fork that implements "throw away the
oldest stuff" efficiently, and it's probably also possible to have a
TID fork that can be searched efficiently. However, I am not sure that
it's possible to have a dead TID fork that does both of those things
efficiently. Maybe you have an idea. My intuition is that if we have
to pick one, it's MUCH more important to be able to throw away the
oldest stuff efficiently. I think we can work around the lack of
efficient lookup, but I don't see a way to work around the lack of an
efficient operation to discard the oldest stuff.

Agreed.

I think we can divide the TID fork into 16MB or 32MB chunks, like WAL
segment files, so that we can easily remove old chunks. Regarding the
efficient-search part, I think we need to consider the case where the
TID fork gets bigger than maintenance_work_mem. In that case, during
the heap scan, we need to check whether each discovered TID already
exists in a chunk of the TID fork that could be on disk. Even if all
known dead TIDs are loaded into an in-memory array, doing a bsearch
over that array for each dead TID discovered during the heap scan
could be much slower than the current heap scan. So it would be better
to have a way to skip searching for already-recorded TIDs. For
example, during the heap scan or HOT pruning, when marking TIDs dead
and recording them in the dead TID fork, we could mark them "dead and
recorded" instead of just "dead" so that future heap scans can skip
those TIDs without an existence check.
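The chunked layout described above can be sketched as follows. This is
illustrative only, not PostgreSQL code: the store, its chunk capacity,
and all names are hypothetical stand-ins for a real on-disk TID fork
made of 16MB/32MB segment files. It shows the two operations under
discussion: O(1) discard of the oldest chunk, and a per-chunk bsearch
whose cost during a heap scan is what the "dead and recorded" marking
would let us avoid.

```python
# Illustrative sketch (not PostgreSQL code): a dead-TID store divided
# into fixed-size chunks, so the oldest chunk can be dropped cheaply,
# plus a bsearch-based membership test over all chunks.
from bisect import bisect_left

CHUNK_CAPACITY = 4  # tiny for demonstration; think "one 16MB segment"

class DeadTidStore:
    def __init__(self):
        self.chunks = []   # full chunks, oldest first
        self.current = []  # chunk being filled

    def record(self, tid):
        # TIDs arrive in heap order during a scan, so each chunk stays sorted.
        self.current.append(tid)
        if len(self.current) == CHUNK_CAPACITY:
            self.chunks.append(self.current)
            self.current = []

    def contains(self, tid):
        # O(#chunks * lg chunk_size) per probe: this per-TID cost during
        # a heap scan is what a "dead and recorded" hint would skip.
        for chunk in self.chunks + [self.current]:
            i = bisect_left(chunk, tid)
            if i < len(chunk) and chunk[i] == tid:
                return True
        return False

    def discard_oldest_chunk(self):
        # O(1): analogous to unlinking the oldest segment file.
        if self.chunks:
            self.chunks.pop(0)

store = DeadTidStore()
for tid in [(1, 1), (1, 3), (2, 2), (2, 5), (3, 1)]:
    store.record(tid)
```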

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#20Robert Haas
robertmhaas@gmail.com
In reply to: Masahiko Sawada (#19)
Re: decoupling table and index vacuum

On Fri, Apr 23, 2021 at 7:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think we can divide the TID fork into 16MB or 32MB chunks, like WAL
segment files, so that we can easily remove old chunks. Regarding the
efficient-search part, I think we need to consider the case where the
TID fork gets bigger than maintenance_work_mem. In that case, during
the heap scan, we need to check whether each discovered TID already
exists in a chunk of the TID fork that could be on disk. Even if all
known dead TIDs are loaded into an in-memory array, doing a bsearch
over that array for each dead TID discovered during the heap scan
could be much slower than the current heap scan. So it would be better
to have a way to skip searching for already-recorded TIDs. For
example, during the heap scan or HOT pruning, when marking TIDs dead
and recording them in the dead TID fork, we could mark them "dead and
recorded" instead of just "dead" so that future heap scans can skip
those TIDs without an existence check.

I'm not very excited about this. If we did this, and if we ever
generated dead-but-not-recorded TIDs, then we would potentially dirty
those blocks again later just to mark them recorded.

Also, if bsearch() is a bottleneck, how about just using an O(1)
algorithm instead of an O(lg n) algorithm, rather than changing the
on-disk format?
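The O(1) alternative Robert alludes to could be, for instance, a hash
table keyed by block number holding a bitmap of dead line-pointer
offsets. This is a hypothetical sketch, not PostgreSQL's actual data
structure; it only demonstrates that constant-time probes are possible
without changing what is stored on disk.

```python
# Hypothetical sketch of an O(1) in-memory lookup structure: a dict
# keyed by block number, whose value is an integer used as a bitmap of
# dead line-pointer offsets. Each probe is a hash lookup plus a bit
# test, replacing a per-TID bsearch.

class DeadTidBitmap:
    def __init__(self):
        self.blocks = {}  # block number -> int used as an offset bitmap

    def add(self, block, offset):
        self.blocks[block] = self.blocks.get(block, 0) | (1 << offset)

    def contains(self, block, offset):
        return bool(self.blocks.get(block, 0) & (1 << offset))

bm = DeadTidBitmap()
bm.add(10, 3)
bm.add(10, 7)
bm.add(42, 1)
```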

Also, can you clarify exactly what you think the problem case is here?
It seems to me that:

1. If we're pruning the heap to collect dead TIDs, we should stop when
the number of TIDs we've accumulated reaches maintenance_work_mem. It
is possible that we could find when starting to prune that there are
*already* more dead TIDs than will fit, because maintenance_work_mem
might have been reduced since they were gathered. But that's OK: we
can figure out that there are more than will fit without loading them
all, and since we shouldn't do additional pruning in this case,
there's no issue.

2. If we're sanitizing indexes, we should normally discover that there
are few enough TIDs that we can still fit them all in memory. But if
that proves not to be the case, again because for example
maintenance_work_mem has been reduced, then we can handle that with
multiple index passes just as we do today.

3. If we're going back to the heap to permit TIDs to be recycled by
setting dead line pointers to unused, we can load in as many of those
as will fit in maintenance_work_mem, sort them by block number, and go
through block by block and DTRT. Then, we can release all that memory
and, if necessary, do the whole thing again. This isn't even
particularly inefficient.
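Point 3 can be sketched as follows. Everything here is an illustrative
stand-in: the batch limit models maintenance_work_mem, and the heap is
a plain dict of line-pointer states rather than real pages. The point
is that sorting each batch by block number means every heap block is
visited at most once per batch, and memory is released between
batches.

```python
# Sketch of point 3 above: consume dead TIDs in batches bounded by a
# (tiny, illustrative) memory budget, sort each batch by block number,
# and visit each block once per batch to set dead line pointers unused.

BATCH_LIMIT = 3  # stand-in for maintenance_work_mem

def apply_batch(batch, heap):
    # Sorting by block number means each heap block is read and
    # dirtied at most once per batch.
    for block, offset in sorted(batch):
        heap[block][offset] = "LP_UNUSED"

def vacuum_heap_in_batches(dead_tids, heap):
    """dead_tids: iterable of (block, offset); heap: {block: {offset: state}}."""
    batch = []
    for tid in dead_tids:
        batch.append(tid)
        if len(batch) == BATCH_LIMIT:
            apply_batch(batch, heap)
            batch = []  # release the memory and, if necessary, go again
    if batch:
        apply_batch(batch, heap)

heap = {1: {1: "LP_DEAD", 2: "LP_DEAD"}, 2: {1: "LP_DEAD"}, 3: {5: "LP_DEAD"}}
vacuum_heap_in_batches([(2, 1), (1, 2), (3, 5), (1, 1)], heap)
```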

--
Robert Haas
EDB: http://www.enterprisedb.com
