should vacuum's first heap pass be read-only?

Started by Robert Haas · about 4 years ago · 30 messages · hackers
#1Robert Haas
robertmhaas@gmail.com

VACUUM's first pass over the heap is implemented by a function called
lazy_scan_heap(), while the second pass is implemented by a function
called lazy_vacuum_heap_rel(). This seems to imply that the first pass
is primarily an examination of what is present, while the second pass
does the real work. This used to be more true than it now is. In
PostgreSQL 7.2, the first release that implemented concurrent vacuum,
the first heap pass could set hint bits as a side effect of calling
HeapTupleSatisfiesVacuum(), and it could freeze old xmins. However,
neither of those things wrote WAL, and you had a reasonable chance of
escaping without dirtying the page at all. By the time PostgreSQL 8.2
was released, it had been understood that making critical changes to
pages without writing WAL was not a good plan, and so freezing now
wrote WAL, but no big deal: most vacuums wouldn't freeze anything
anyway. Things really changed a lot in PostgreSQL 8.3. With the
addition of HOT, lazy_scan_heap() was made to prune the page, meaning
that the first heap pass would likely dirty a large fraction of the
pages that it touched, truncating dead tuples to line pointers and
defragmenting the page. The second heap pass would then have to dirty
the page again to mark dead line pointers unused. In the absolute
worst case, that's a very large increase in WAL generation. VACUUM
could write full page images for all of those pages while HOT-pruning
them, and then a checkpoint could happen, and then VACUUM could write
full-page images of all of them again while marking the dead line
pointers unused. I don't know whether anyone spent time and energy
worrying about this problem, but considering how much HOT improves
performance overall, it would be entirely understandable if this
didn't seem like a terribly important thing to worry about.

But maybe we should reconsider. What benefit do we get out of dirtying
the page twice like this, writing WAL each time? What if we went back
to the idea of having the first heap pass be read-only? In fact, I'm
thinking we might want to go even further and try to prevent even hint
bit changes to the page during the first pass, especially because now
we have checksums and wal_log_hints. If our vacuum cost settings are
to be believed (and I am not sure that they are), dirtying a page is 10
times as expensive as reading one from the disk. So on a large table,
we're paying 44 vacuum cost units per heap page vacuumed twice, when
we could be paying only 24 such cost units. What a bargain! The
downside is that we would be postponing, perhaps substantially, the
work that can be done immediately, namely freeing up space in the page
and updating the free space map. The former doesn't seem like a big
loss, because it can be done by anyone who visits the page anyway, and
skipped if nobody does. The latter might be a loss, because getting
the page into the freespace map sooner could prevent bloat by allowing
space to be recycled sooner. I'm not sure how serious a problem this
is. I'm curious what other people think. Would it be worth the delay
in getting pages into the FSM if it means we dirty the pages only
once? Could we have our cake and eat it too by updating the FSM with
the amount of free space that the page WOULD have if we pruned it, but
not actually do so?
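To make that arithmetic concrete, here is a sketch of the cost accounting in Python. The constants mirror the default vacuum_cost_page_miss = 2 and vacuum_cost_page_dirty = 20 settings; this is purely illustrative arithmetic, not vacuum code:

```python
# Illustrative only: constants mirror PostgreSQL's default vacuum cost
# settings, where dirtying a page costs 10x reading one in from disk.
VACUUM_COST_PAGE_MISS = 2    # read a page in from disk
VACUUM_COST_PAGE_DIRTY = 20  # dirty a previously clean page

def cost_today():
    # Today: both heap passes read the page and dirty it.
    return 2 * (VACUUM_COST_PAGE_MISS + VACUUM_COST_PAGE_DIRTY)

def cost_read_only_first_pass():
    # Proposed: the first pass only reads; the second reads and dirties.
    return 2 * VACUUM_COST_PAGE_MISS + VACUUM_COST_PAGE_DIRTY

print(cost_today())                 # 44
print(cost_read_only_first_pass())  # 24
```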

I'm thinking about this because of the "decoupling table and index
vacuuming" thread, which I was discussing with Dilip this morning. In
a world where table vacuuming and index vacuuming are decoupled, it
feels like we want to have only one kind of heap vacuum. It pushes us
in the direction of unifying the first and second pass, and doing all
the cleanup work at once. However, I don't know that we want to use
the approach described there in all cases. For a small table that is,
let's just say, not part of any partitioning hierarchy, I'm not sure
that using the conveyor belt approach makes a lot of sense, because
the total amount of work we want to do is so small that we should just
get it over with and not clutter up the disk with more conveyor belt
forks -- especially for people who have large numbers of small tables,
the inode consumption could be a real issue. And we won't really save
anything either. The value of decoupling operations has to do with
improving concurrency and error recovery and allowing global indexes
and a bunch of stuff that, for a small table, simply doesn't matter.
So it would be nice to fall back to an approach more like what we do
now. But then you end up with two fairly distinct code paths, one
where you want the heap phases combined and another where you want
them separated. If the first pass were a strictly read-only pass, you
could do that if there's no conveyor belt, or else read from the
conveyor belt if there is one, and then the phase where you dirty the
heap looks about the same either way.

Aside from the question of whether this is a good idea at all, I'm
also wondering what kinds of experiments we could run to try to find
out. What would be a best case workload for the current strategy vs.
this? What would be a worst case for the current strategy vs. this?
I'm not sure. If you have ideas, I'd love to hear them.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

#2Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#1)
Re: should vacuum's first heap pass be read-only?

On Thu, Feb 3, 2022 at 12:20 PM Robert Haas <robertmhaas@gmail.com> wrote:

But maybe we should reconsider. What benefit do we get out of dirtying
the page twice like this, writing WAL each time? What if we went back
to the idea of having the first heap pass be read-only?

What about recovery conflicts? Index vacuuming WAL records don't
require their own latestRemovedXid field, since they can rely on
earlier XLOG_HEAP2_PRUNE records instead. Since the TIDs that index
vacuuming removes always point to LP_DEAD items in the heap, it's safe
to lean on that.

In fact, I'm
thinking we might want to go even further and try to prevent even hint
bit changes to the page during the first pass, especially because now
we have checksums and wal_log_hints. If our vacuum cost settings are
to be believed (and I am not sure that they are), dirtying a page is 10
times as expensive as reading one from the disk. So on a large table,
we're paying 44 vacuum cost units per heap page vacuumed twice, when
we could be paying only 24 such cost units. What a bargain!

In practice HOT generally works well enough that the number of heap
pages that prune significantly exceeds the subset that are also
vacuumed during the second pass over the heap -- at least when heap
fill factor has been tuned (which might be rare). The latter category
of pages is not reported on by the enhanced autovacuum logging added
to Postgres 14, so you might be able to get some sense of how this
works by looking at that.

Could we have our cake and eat it too by updating the FSM with
the amount of free space that the page WOULD have if we pruned it, but
not actually do so?

Did you ever notice that VACUUM records free space after *it* prunes,
using its own horizon? With a long running VACUUM operation, where
unremoved "recently dead" tuples are common, it's possible that the
amount of free space that's effectively available (available to every
other backend) is significantly higher. And so we already record
"subjective amounts of free space" -- though not necessarily by
design.
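A toy model (invented for illustration, not Postgres code) of why recorded free space is "subjective": whether a dead tuple's space counts as free depends on which XID horizon the observer prunes with.

```python
def free_space(tuples, page_size, oldest_xmin):
    """Free space a backend pruning with oldest_xmin would compute.
    Each tuple is (size, xmax); xmax == 0 means the tuple is live."""
    used = sum(size for size, xmax in tuples
               if xmax == 0 or xmax >= oldest_xmin)  # not yet removable
    return page_size - used

tuples = [(100, 0), (100, 50), (100, 90)]  # one live tuple, two deleted
# A long-running VACUUM with an old horizon sees less free space than a
# backend pruning with a fresher horizon would:
print(free_space(tuples, 8192, oldest_xmin=60))   # only xmax=50 is removable
print(free_space(tuples, 8192, oldest_xmin=100))  # both are removable
```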

I'm thinking about this because of the "decoupling table and index
vacuuming" thread, which I was discussing with Dilip this morning. In
a world where table vacuuming and index vacuuming are decoupled, it
feels like we want to have only one kind of heap vacuum. It pushes us
in the direction of unifying the first and second pass, and doing all
the cleanup work at once. However, I don't know that we want to use
the approach described there in all cases. For a small table that is,
let's just say, not part of any partitioning hierarchy, I'm not sure
that using the conveyor belt approach makes a lot of sense, because
the total amount of work we want to do is so small that we should just
get it over with and not clutter up the disk with more conveyor belt
forks -- especially for people who have large numbers of small tables,
the inode consumption could be a real issue.

I'm not sure that what you're proposing here is the best way to go
about it, but let's assume for a moment that it is. Can't you just
simulate the conveyor belt approach, without needing a relation fork?
Just store the same information in memory, accessed using the same
interface, with a spillover path?

Ideally VACUUM will be able to use the conveyor belt for any table.
Whether or not it actually happens should be decided at the latest
possible point during VACUUM, based on considerations about the actual
number of dead items that we now need to remove from indexes, as well
as metadata from any preexisting conveyor belt.

--
Peter Geoghegan

#3Greg Stark
stark@mit.edu
In reply to: Robert Haas (#1)
Re: should vacuum's first heap pass be read-only?

On Thu, 3 Feb 2022 at 12:21, Robert Haas <robertmhaas@gmail.com> wrote:

VACUUM's first pass over the heap is implemented by a function called
lazy_scan_heap(), while the second pass is implemented by a function
called lazy_vacuum_heap_rel(). This seems to imply that the first pass
is primarily an examination of what is present, while the second pass
does the real work. This used to be more true than it now is.

I've been out of touch for a while but I'm trying to catch up with the
progress of the past few years.

Whatever happened to the idea to "rotate" the work of vacuum. So all
the work of the second pass would actually be deferred until the first
pass of the next vacuum cycle.

That would also have the effect of eliminating the duplicate work,
both the writes with the wal generation as well as the actual scan.
The only heap scan would be "remove line pointers previously cleaned
from indexes and prune dead tuples recording them to clean from
indexes in future". The index scan would remove line pointers and
record them to be removed from the heap in a future heap scan.

The downside would mainly be in the latency before the actual tuples
get cleaned up from the table. That is not so much of an issue as far
as space these days with tuple pruning but is more and more of an
issue with xid wraparound. Also, having to record the line pointers
that have been cleaned from indexes somewhere on disk for the
subsequent vacuum would be extra state on disk and we've learned that
means extra complexity.

--
greg

#4Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#2)
Re: should vacuum's first heap pass be read-only?

On Fri, Feb 4, 2022 at 2:16 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Feb 3, 2022 at 12:20 PM Robert Haas <robertmhaas@gmail.com> wrote:

But maybe we should reconsider. What benefit do we get out of dirtying
the page twice like this, writing WAL each time? What if we went back
to the idea of having the first heap pass be read-only?

What about recovery conflicts? Index vacuuming WAL records don't
require their own latestRemovedXid field, since they can rely on
earlier XLOG_HEAP2_PRUNE records instead. Since the TIDs that index
vacuuming removes always point to LP_DEAD items in the heap, it's safe
to lean on that.

Oh, that's an interesting consideration.

In fact, I'm
thinking we might want to go even further and try to prevent even hint
bit changes to the page during the first pass, especially because now
we have checksums and wal_log_hints. If our vacuum cost settings are
to be believed (and I am not sure that they are), dirtying a page is 10
times as expensive as reading one from the disk. So on a large table,
we're paying 44 vacuum cost units per heap page vacuumed twice, when
we could be paying only 24 such cost units. What a bargain!

In practice HOT generally works well enough that the number of heap
pages that prune significantly exceeds the subset that are also
vacuumed during the second pass over the heap -- at least when heap
fill factor has been tuned (which might be rare). The latter category
of pages is not reported on by the enhanced autovacuum logging added
to Postgres 14, so you might be able to get some sense of how this
works by looking at that.

Is there an extra "not" in this sentence? Because otherwise it seems
like you're saying that I should look at the information that isn't
reported, which seems hard.

In any case, I think this might be a death knell for the whole idea.
It might be good to cut down the number of page writes by avoiding
writing them twice -- but not at the expense of having the second pass
have to visit a large number of pages it could otherwise skip. I
suppose we could write only those pages in the first pass that we
aren't going to need to write again later, but at that point I can't
really see that we're winning anything.

Could we have our cake and eat it too by updating the FSM with
the amount of free space that the page WOULD have if we pruned it, but
not actually do so?

Did you ever notice that VACUUM records free space after *it* prunes,
using its own horizon? With a long running VACUUM operation, where
unremoved "recently dead" tuples are common, it's possible that the
amount of free space that's effectively available (available to every
other backend) is significantly higher. And so we already record
"subjective amounts of free space" -- though not necessarily by
design.

Yes, I wondered about that. It seems like maybe a running VACUUM
should periodically refresh its notion of what cutoff to use.

I'm not sure that what you're proposing here is the best way to go
about it, but let's assume for a moment that it is. Can't you just
simulate the conveyor belt approach, without needing a relation fork?
Just store the same information in memory, accessed using the same
interface, with a spillover path?

(I'm not sure it's best either.)

I think my concern here is about not having too many different code
paths from heap vacuuming. I agree that if we're going to vacuum
without an on-disk conveyor belt we can use an in-memory substitute.
However, to Greg's point, if we're using the conveyor belt, it seems
like we want to merge the second pass of one VACUUM into the first
pass of the next one. That is, if we start up a heap vacuum already
having a list of TIDs that can be marked unused, we want to do that
during the same pass of the heap that we prune and search for
newly-discovered dead TIDs. But we can't do that in the case where the
conveyor belt is only simulated, because our in-memory data structure
can't contain leftovers from a previous vacuum the way the on-disk
conveyor belt can. So it seems like the whole algorithm has to be
different. I'd like to find a way to avoid that.

If this isn't entirely making sense, it may well be because I'm a
little fuzzy on all of it myself. But I hope it's clear enough that
you can figure out what it is that I'm worrying about. If not, I'll
keep trying to explain until we both reach a sufficiently non-fuzzy
state.

--
Robert Haas
EDB: http://www.enterprisedb.com

#5Robert Haas
robertmhaas@gmail.com
In reply to: Greg Stark (#3)
Re: should vacuum's first heap pass be read-only?

On Fri, Feb 4, 2022 at 3:05 PM Greg Stark <stark@mit.edu> wrote:

Whatever happened to the idea to "rotate" the work of vacuum. So all
the work of the second pass would actually be deferred until the first
pass of the next vacuum cycle.

That would also have the effect of eliminating the duplicate work,
both the writes with the wal generation as well as the actual scan.
The only heap scan would be "remove line pointers previously cleaned
from indexes and prune dead tuples recording them to clean from
indexes in future". The index scan would remove line pointers and
record them to be removed from the heap in a future heap scan.

I vaguely remember previous discussions of this, but only vaguely, so
if there are threads on list feel free to send pointers. It seems to
me that in order to do this, we'd need some kind of way of storing the
TIDs that were found to be dead in one VACUUM so that they can be
marked unused in the next VACUUM - and the conveyor belt patches on
which Dilip's work is based provide exactly that machinery, which his
patches then leverage to do exactly that thing. But it feels like a
big, sudden change from the way things work now, and I'm trying to
think of ways to make it more incremental, and thus hopefully less
risky.

The downside would mainly be in the latency before the actual tuples
get cleaned up from the table. That is not so much of an issue as far
as space these days with tuple pruning but is more and more of an
issue with xid wraparound. Also, having to record the line pointers
that have been cleaned from indexes somewhere on disk for the
subsequent vacuum would be extra state on disk and we've learned that
means extra complexity.

I don't think there's any XID wraparound issue here. Phase 1 does a
HOT prune, after which only dead line pointers remain, not dead
tuples. And those contain no XIDs. Phase 2 is only setting those dead
line pointers back to unused.
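A toy state machine (illustrative only; the real states live in heap page item pointers) makes the point concrete: after HOT pruning, a dead tuple is just an LP_DEAD stub with no xmin or xmax, so deferring phase 2 cannot create a wraparound hazard.

```python
# Toy sketch, not Postgres code: item-pointer states across the two phases.
LP_NORMAL, LP_DEAD, LP_UNUSED = "LP_NORMAL", "LP_DEAD", "LP_UNUSED"

def hot_prune(item):
    """Phase 1: truncate a dead tuple to a bare line pointer (no XIDs)."""
    if item["lp"] == LP_NORMAL and item["dead"]:
        return {"lp": LP_DEAD, "xmin": None, "xmax": None, "dead": True}
    return item

def vacuum_phase2(item):
    """Phase 2: mark a dead line pointer reusable (after index vacuuming)."""
    if item["lp"] == LP_DEAD:
        return {"lp": LP_UNUSED, "xmin": None, "xmax": None, "dead": False}
    return item

item = {"lp": LP_NORMAL, "xmin": 1000, "xmax": 1005, "dead": True}
item = hot_prune(item)
assert item["lp"] == LP_DEAD and item["xmin"] is None  # no XIDs remain
item = vacuum_phase2(item)
assert item["lp"] == LP_UNUSED
```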

As for the other part, that's pretty much exactly the complexity that
I'm worrying about.

--
Robert Haas
EDB: http://www.enterprisedb.com

#6Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#4)
Re: should vacuum's first heap pass be read-only?

On Fri, Feb 4, 2022 at 3:18 PM Robert Haas <robertmhaas@gmail.com> wrote:

What about recovery conflicts? Index vacuuming WAL records don't
require their own latestRemovedXid field, since they can rely on
earlier XLOG_HEAP2_PRUNE records instead. Since the TIDs that index
vacuuming removes always point to LP_DEAD items in the heap, it's safe
to lean on that.

Oh, that's an interesting consideration.

You'd pretty much have to do "fake pruning", performing the same
computation as pruning without actually pruning.

In practice HOT generally works well enough that the number of heap
pages that prune significantly exceeds the subset that are also
vacuumed during the second pass over the heap -- at least when heap
fill factor has been tuned (which might be rare). The latter category
of pages is not reported on by the enhanced autovacuum logging added
to Postgres 14, so you might be able to get some sense of how this
works by looking at that.

Is there an extra "not" in this sentence? Because otherwise it seems
like you're saying that I should look at the information that isn't
reported, which seems hard.

Sorry, yes. I meant "now" (as in, as of Postgres 14).

In any case, I think this might be a death knell for the whole idea.
It might be good to cut down the number of page writes by avoiding
writing them twice -- but not at the expense of having the second pass
have to visit a large number of pages it could otherwise skip. I
suppose we could write only those pages in the first pass that we
aren't going to need to write again later, but at that point I can't
really see that we're winning anything.

Right. I think that we *can* be more aggressive about deferring heap
page vacuuming until another VACUUM operation with the conveyor belt
stuff. You may well end up getting almost the same benefit that way.

Yes, I wondered about that. It seems like maybe a running VACUUM
should periodically refresh its notion of what cutoff to use.

Yeah, Andres said something about this a few months ago. Shouldn't be
very difficult.

I think my concern here is about not having too many different code
paths from heap vacuuming. I agree that if we're going to vacuum
without an on-disk conveyor belt we can use an in-memory substitute.

Avoiding special cases in vacuumlazy.c seems really important to me.

However, to Greg's point, if we're using the conveyor belt, it seems
like we want to merge the second pass of one VACUUM into the first
pass of the next one.

But it's only going to be safe to do that with those dead TIDs (or
distinct generations of dead TIDs) that are known to already be
removed from all indexes, including indexes that have the least need
for vacuuming (often no direct need at all). I had imagined that we'd
want to do heap vacuuming in the same way as today with the dead TID
conveyor belt stuff -- it just might take several VACUUM operations
until we are ready to do a round of heap vacuuming.

For those indexes that use bottom-up index deletion effectively, the
index structure itself never really needs to be vacuumed to avoid
index bloat. We must nevertheless vacuum these indexes at some point,
just to be able to vacuum heap pages with LP_DEAD items safely.

Overall, I think that there will typically be stark differences among
indexes on the table, in terms of how much vacuuming each index
requires. And so the thing that drives us to perform heap vacuuming
will probably be heap vacuuming itself, and not the fact that each and
every index has become "sufficiently bloated".

If this isn't entirely making sense, it may well be because I'm a
little fuzzy on all of it myself.

I'm in no position to judge. :-)

--
Peter Geoghegan

#7Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#6)
Re: should vacuum's first heap pass be read-only?

On Fri, Feb 4, 2022 at 4:12 PM Peter Geoghegan <pg@bowt.ie> wrote:

I had imagined that we'd
want to do heap vacuuming in the same way as today with the dead TID
conveyor belt stuff -- it just might take several VACUUM operations
until we are ready to do a round of heap vacuuming.

I am trying to understand exactly what you are imagining here. Do you
mean we'd continue to lazy_scan_heap() at the start of every vacuum,
and lazy_vacuum_heap_rel() at the end? I had assumed that we didn't
want to do that, because we might already know from the conveyor belt
that there are some dead TIDs that could be marked unused, and it
seems strange to just ignore that knowledge at a time when we're
scanning the heap anyway. However, on reflection, that approach has
something to recommend it, because it would be somewhat simpler to
understand what's actually being changed. We could just:

1. Teach lazy_scan_heap() that it should add TIDs to the conveyor
belt, if we're using one, unless they're already there, but otherwise
work as today.

2. Teach lazy_vacuum_heap_rel() that, if there is a conveyor belt,
it should try to clear from the indexes all of the dead TIDs that are
eligible.

3. If there is a conveyor belt, use some kind of magic to decide when
to skip vacuuming some or all indexes. When we skip one or more
indexes, the subsequent lazy_vacuum_heap_rel() can't possibly mark as
unused any of the dead TIDs we found this time, so we should just skip
it, unless somehow there are TIDs on the conveyor belt that were
already ready to be marked unused at the start of this VACUUM, in
which case we can still handle those.
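Very roughly, and with invented names ('belt', 'do_index_vacuum', and so on) standing in for the conveyor-belt patch's actual API, the three steps might be sketched as:

```python
# Hedged sketch of the three steps above, over a toy heap; not real code.
def vacuum_round(heap_dead_tids, indexes, belt, do_index_vacuum):
    """One VACUUM. 'indexes' is a list of sets of TIDs; 'belt' tracks dead
    TIDs pending index vacuuming and those cleared from every index."""
    # Step 1: the first heap pass adds newly discovered dead TIDs to the
    # belt, unless they are already there.
    belt["pending"].update(set(heap_dead_tids) - belt["cleared"])

    if do_index_vacuum:
        # Step 2: clear all eligible dead TIDs from every index.
        for ix in indexes:
            ix.difference_update(belt["pending"])
        belt["cleared"].update(belt["pending"])
        belt["pending"].clear()

    # Step 3: only TIDs already cleared from all indexes can be marked
    # LP_UNUSED; when index vacuuming is skipped, that is just whatever
    # was left over from earlier VACUUMs.
    reusable = set(belt["cleared"])
    belt["cleared"].clear()  # they get marked unused now
    return reusable

belt = {"pending": set(), "cleared": set()}
index = {1, 2, 3}
# First VACUUM skips index vacuuming: nothing can be marked unused yet.
assert vacuum_round([1, 2], [index], belt, do_index_vacuum=False) == set()
# Second VACUUM vacuums indexes: all accumulated TIDs become reusable.
assert vacuum_round([3], [index], belt, do_index_vacuum=True) == {1, 2, 3}
assert index == set()
```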

Is this the kind of thing you had in mind?

--
Robert Haas
EDB: http://www.enterprisedb.com

#8Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#7)
Re: should vacuum's first heap pass be read-only?

Yes, that's what I meant. That's always how I thought that it would work,
for over a year now. I might have jumped to the conclusion that that's what
you had in mind all along. Oops.

Although this design is simpler, which is an advantage, that's not really
the point. The point is that it makes sense, and that extra heap
vacuuming concurrent with pruning doesn't seem useful at all.
--
Peter Geoghegan

#9Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#7)
Re: should vacuum's first heap pass be read-only?

On Mon, Feb 7, 2022 at 10:06 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Feb 4, 2022 at 4:12 PM Peter Geoghegan <pg@bowt.ie> wrote:

I had imagined that we'd
want to do heap vacuuming in the same way as today with the dead TID
conveyor belt stuff -- it just might take several VACUUM operations
until we are ready to do a round of heap vacuuming.

I am trying to understand exactly what you are imagining here. Do you
mean we'd continue to lazy_scan_heap() at the start of every vacuum,
and lazy_vacuum_heap_rel() at the end? I had assumed that we didn't
want to do that, because we might already know from the conveyor belt
that there are some dead TIDs that could be marked unused, and it
seems strange to just ignore that knowledge at a time when we're
scanning the heap anyway. However, on reflection, that approach has
something to recommend it, because it would be somewhat simpler to
understand what's actually being changed. We could just:

1. Teach lazy_scan_heap() that it should add TIDs to the conveyor
belt, if we're using one, unless they're already there, but otherwise
work as today.

2. Teach lazy_vacuum_heap_rel() that, if there is a conveyor belt,
it should try to clear from the indexes all of the dead TIDs that are
eligible.

3. If there is a conveyor belt, use some kind of magic to decide when
to skip vacuuming some or all indexes. When we skip one or more
indexes, the subsequent lazy_vacuum_heap_rel() can't possibly mark as
unused any of the dead TIDs we found this time, so we should just skip
it, unless somehow there are TIDs on the conveyor belt that were
already ready to be marked unused at the start of this VACUUM, in
which case we can still handle those.

Based on this discussion, IIUC, we are saying that we will still do
lazy_scan_heap every time, as we do now. And we will conditionally skip
the index vacuum for all or some of the indexes, and then, based on how
much index vacuuming was done, we will conditionally do
lazy_vacuum_heap_rel(). Is my understanding correct?

IMHO, if we are doing the heap scan every time, then we are going to get
the same dead items again that we had previously collected in the
conveyor belt. I agree that we will not add them again to the conveyor
belt, but why do we want to store them in the conveyor belt when we are
going to redo the whole scan anyway?

I think (without global indexes) the main advantage of using the
conveyor belt is that, if we skip the index scan for some of the
indexes, we can save the dead items somewhere, so that we have those
dead items available for a future index vacuum without scanning the
heap again. But if you are going to rescan the heap again next time,
before doing any index vacuuming, then why would we want to store them
at all?

IMHO, what we should do is: if there are not many new dead tuples in
the heap (total dead tuples based on the statistics, minus the existing
items in the conveyor belt), then we should conditionally skip the heap
scan (first pass) and jump directly to index vacuuming for some or all
of the indexes, based on how bloated each index is.
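A sketch of that decision rule, with invented names and an arbitrary threshold standing in for the real cost model Dilip alludes to:

```python
# Illustrative sketch only; the names and the threshold are invented.
def plan_vacuum(dead_tuples_from_stats, tids_on_conveyor_belt,
                bloated_indexes, new_dead_threshold=1000):
    # New dead tuples = what the statistics report, minus the items we
    # already collected on the conveyor belt in earlier passes.
    new_dead = dead_tuples_from_stats - tids_on_conveyor_belt
    if new_dead < new_dead_threshold and bloated_indexes:
        # Few new dead tuples: skip the first (heap) pass and vacuum only
        # the indexes that need it, using TIDs already on the belt.
        return "index_vacuum_only"
    return "full_first_pass"

print(plan_vacuum(1200, 1000, ["idx_uuid"]))   # index_vacuum_only
print(plan_vacuum(50000, 1000, ["idx_uuid"]))  # full_first_pass
```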

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#10Peter Geoghegan
pg@bowt.ie
In reply to: Dilip Kumar (#9)
Re: should vacuum's first heap pass be read-only?

On Fri, Feb 25, 2022 at 5:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Based on this discussion, IIUC, we are saying that we will still do
lazy_scan_heap every time, as we do now. And we will conditionally skip
the index vacuum for all or some of the indexes, and then, based on how
much index vacuuming was done, we will conditionally do
lazy_vacuum_heap_rel(). Is my understanding correct?

I can only speak for myself, but that sounds correct to me. IMO what
we really want here is to create lots of options for VACUUM: to do the
work of index vacuuming when it is most convenient, based on very
recent information about what's going on in each index. There are some
specific, obvious ways that it might help. For example, it would be
nice if the failsafe didn't really skip index vacuuming -- it could
just put it off until later, after relfrozenxid has been advanced to a
safe value.

Bear in mind that the cost of lazy_scan_heap is often vastly less than
the cost of vacuuming all indexes -- and so doing a bit more work
there than theoretically necessary is not necessarily a problem.
Especially if you have something like UUID indexes, where there is no
natural locality. Many tables have 10+ indexes. Even large tables.

IMHO, if we are doing the heap scan every time, then we are going to get
the same dead items again that we had previously collected in the
conveyor belt. I agree that we will not add them again to the conveyor
belt, but why do we want to store them in the conveyor belt when we are
going to redo the whole scan anyway?

I don't think we want to, exactly. Maybe it's easier to store
redundant TIDs than to avoid storing them in the first place (we can
probably just accept some redundancy). There is a trade-off to be made
there. I'm not at all sure of what the best trade-off is, though.

I think (without global indexes) the main advantage of using the
conveyor belt is that, if we skip the index scan for some of the
indexes, we can save the dead items somewhere, so that we have those
dead items available for a future index vacuum without scanning the
heap again.

Global indexes are important in their own right, but ISTM that they
have similar needs to other things anyway. Having this flexibility is
even more important with global indexes, but the concepts themselves
are similar. We want options and maximum flexibility, everywhere.

But if you are going to rescan the heap again next time, before doing
any index vacuuming, then why would we want to store them at all?

It all depends, of course. The decision needs to be made using a cost
model. I suspect it will be necessary to try it out, and see.

--
Peter Geoghegan

#11Dilip Kumar
dilipbalaut@gmail.com
In reply to: Peter Geoghegan (#10)
Re: should vacuum's first heap pass be read-only?

On Fri, Feb 25, 2022 at 10:45 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Fri, Feb 25, 2022 at 5:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Based on this discussion, IIUC, we are saying that we will still do
lazy_scan_heap every time, as we do now. And we will conditionally skip
the index vacuum for all or some of the indexes, and then, based on how
much index vacuuming was done, we will conditionally do
lazy_vacuum_heap_rel(). Is my understanding correct?

Bear in mind that the cost of lazy_scan_heap is often vastly less than
the cost of vacuuming all indexes -- and so doing a bit more work
there than theoretically necessary is not necessarily a problem.
Especially if you have something like UUID indexes, where there is no
natural locality. Many tables have 10+ indexes. Even large tables.

Completely agree with that.

IMHO, if we are doing the heap scan every time, then we are going to get
the same dead items again that we had previously collected in the
conveyor belt. I agree that we will not add them again to the conveyor
belt, but why do we want to store them in the conveyor belt when we are
going to redo the whole scan anyway?

I don't think we want to, exactly. Maybe it's easier to store
redundant TIDs than to avoid storing them in the first place (we can
probably just accept some redundancy). There is a trade-off to be made
there. I'm not at all sure of what the best trade-off is, though.

Yeah we can think of that.

I think (without global indexes) the main advantage of using the
conveyor belt is that, if we skip the index scan for some of the
indexes, we can save the dead items somewhere, so that we have those
dead items available for a future index vacuum without scanning the
heap again.

Global indexes are important in their own right, but ISTM that they
have similar needs to other things anyway. Having this flexibility is
even more important with global indexes, but the concepts themselves
are similar. We want options and maximum flexibility, everywhere.

+1

But if you are going to rescan the heap again next time, before doing
any index vacuuming, then why would we want to store them at all?

It all depends, of course. The decision needs to be made using a cost
model. I suspect it will be necessary to try it out, and see.

Yeah, right. But I still think that we should also be thinking about
conditionally skipping the first vacuum pass. I mean, if there are not
many new dead tuples, which we can tell even before starting the heap
scan, then why not jump directly to index vacuuming if some of the
indexes need it? But I agree that, based on some testing and a cost
model, we can decide on the best way forward.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#12Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#10)
Re: should vacuum's first heap pass be read-only?

[ returning to this thread after a bit of a hiatus ]

On Fri, Feb 25, 2022 at 12:15 PM Peter Geoghegan <pg@bowt.ie> wrote:

I can only speak for myself, but that sounds correct to me. IMO what
we really want here is to create lots of options for VACUUM.

I agree. That seems like a good way of thinking about it.

I don't think we want to, exactly. Maybe it's easier to store
redundant TIDs than to avoid storing them in the first place (we can
probably just accept some redundancy). There is a trade-off to be made
there. I'm not at all sure of what the best trade-off is, though.

My intuition is that storing redundant TIDs will turn out to be a very
bad idea. I think that if we do that, the system will become prone to
a new kind of vicious cycle (to try to put this in words similar to
the ones you've been using to write about similar effects). Imagine
that we vacuum a table which contains N dead tuples a total of K times
in succession, but each time we either deliberately decide against
index vacuuming or get killed before we can complete it. If we don't
have logic to prevent duplicate entries from being added to the
conveyor belt, we will have N*K TIDs in the conveyor belt rather than
only N, and I think it's not hard to imagine situations where K could
be big enough for that to be hugely painful. Moreover, at some point,
we're going to need to deduplicate anyway, because each of those dead
TIDs can only be marked unused once. Leaving the data on the conveyor
belt in the hope of sorting it out later seems almost certain to be a
losing proposition. The more data we collect on the conveyor belt, the
harder it's going to be when we eventually need to deduplicate.

Also, I think it's possible to deduplicate at a very reasonable cost
so long as (1) we enter each batch of TIDs into the conveyor belt as a
distinguishable "run" and (2) we never accumulate so many of these
runs at the same time that we can't do a single merge pass to turn
them into a sorted output stream. We're always going to discover dead
TIDs in sorted order, so as we make our pass over the heap, we can
simultaneously be doing an on-the-fly merge pass over the existing
runs that are on the conveyor belt. That means we never need to store
all the dead TIDs in memory at once. We just fetch enough data from
each "run" to cover the block numbers for which we're performing the
heap scan, and when the heap scan advances we throw away data for
blocks that we've passed and fetch data for the blocks we're now
reaching.
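
To make that concrete, here's a toy sketch in Python (all names are
invented; this is just the merge-and-deduplicate idea, nothing resembling
the real code):

```python
import heapq

# Each "run" on the conveyor belt is an already-sorted batch of dead TIDs,
# modeled here as (block, offset) pairs. Because every run is sorted, one
# streaming merge pass produces a globally sorted, de-duplicated stream
# without ever holding all dead TIDs in memory at once.
def merged_dead_tids(runs):
    prev = None
    for tid in heapq.merge(*runs):
        if tid != prev:  # duplicates are adjacent after merging, so drop them
            yield tid
        prev = tid

run_a = [(1, 2), (1, 5), (4, 1)]          # older run
run_b = [(1, 5), (2, 3), (4, 1), (9, 7)]  # newer run repeats two TIDs

print(list(merged_dead_tids([run_a, run_b])))
# -> [(1, 2), (1, 5), (2, 3), (4, 1), (9, 7)]
```

In the real thing you'd fetch each run incrementally, keyed by the block
range the heap scan is currently covering, rather than materializing whole
runs in memory.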

but if you are going to rescan the heap
again next time before doing any index vacuuming then why we want to
store them anyway.

It all depends, of course. The decision needs to be made using a cost
model. I suspect it will be necessary to try it out, and see.

But having said that, coming back to this with fresh eyes, I think
Dilip has a really good point here. If the first thing we do at the
start of every VACUUM is scan the heap in a way that is guaranteed to
rediscover all of the dead TIDs that we've previously added to the
conveyor belt plus maybe also new ones, we may as well just forget the
whole idea of having a conveyor belt at all. At that point we're just
talking about a system for deciding when to skip index vacuuming, and
the conveyor belt is a big complicated piece of machinery that stores
data we don't really need for anything because if we threw it out the
next vacuum would reconstruct it anyhow and we'd get exactly the same
results with less code. The only way the conveyor belt system has any
value is if we think that there is some set of circumstances where the
heap scan is separated in time from the index vacuum, such that we
might sometimes do an index vacuum without having done a heap scan
just before.

--
Robert Haas
EDB: http://www.enterprisedb.com

#13Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#12)
Re: should vacuum's first heap pass be read-only?

On Thu, Mar 31, 2022 at 1:25 PM Robert Haas <robertmhaas@gmail.com> wrote:

I don't think we want to, exactly. Maybe it's easier to store
redundant TIDs than to avoid storing them in the first place (we can
probably just accept some redundancy). There is a trade-off to be made
there. I'm not at all sure of what the best trade-off is, though.

My intuition is that storing redundant TIDs will turn out to be a very
bad idea. I think that if we do that, the system will become prone to
a new kind of vicious cycle (to try to put this in words similar to
the ones you've been using to write about similar effects).

I don't feel very strongly about it either way. I definitely think
that there are workloads for which that will be true, and in general
I am no fan of putting off work that cannot possibly turn out to be
unnecessary in the end. I am in favor of "good laziness", not "bad
laziness".

The more data we collect on the conveyor belt, the
harder it's going to be when we eventually need to deduplicate.

Also, I think it's possible to deduplicate at a very reasonable cost
so long as (1) we enter each batch of TIDs into the conveyor belt as a
distinguishable "run"

I definitely think that's the way to go, in general (regardless of
what we do about deduplicating TIDs).

and (2) we never accumulate so many of these
runs at the same time that we can't do a single merge pass to turn
them into a sorted output stream. We're always going to discover dead
TIDs in sorted order, so as we make our pass over the heap, we can
simultaneously be doing an on-the-fly merge pass over the existing
runs that are on the conveyor belt. That means we never need to store
all the dead TIDs in memory at once.

That's a good idea IMV. I am vaguely reminded of an LSM tree.

We just fetch enough data from
each "run" to cover the block numbers for which we're performing the
heap scan, and when the heap scan advances we throw away data for
blocks that we've passed and fetch data for the blocks we're now
reaching.

I wonder if you've thought about exploiting the new approach to
skipping pages using the visibility map from my patch series (which
you reviewed recently). I think that putting that in scope here could
be very helpful. As a way of making the stuff we already want to do
with [global] indexes easier, but also as independently useful work
based on the conveyor belt. The visibility map is very underexploited
as a source of accurate information about what VACUUM should be doing
IMV (in autovacuum.c's scheduling logic, but also in vacuumlazy.c
itself).

Imagine a world in which we decide up-front what pages we're going to
scan (i.e. our final vacrel->scanned_pages), by first scanning the
visibility map, and serializing it in local memory, or sometimes in
disk using the conveyor belt. Now you'll have a pretty good idea how
much TID deduplication will be necessary when you're done (to give one
example). In general "locking in" a plan of action for VACUUM seems
like it would be very useful. It will tend to cut down on the number
of "dead but not yet removable" tuples that VACUUM encounters right
now -- you at least avoid visiting concurrently modified heap pages
that were all-visible back when OldestXmin was first established.
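
As a toy illustration of the "decide up-front" idea (all names invented),
the planning step might look like:

```python
# vm models the visibility map: one (all_visible, all_frozen) bit pair per
# heap page. A non-aggressive VACUUM may skip all-visible pages; an
# aggressive one can only skip all-frozen pages. Serializing this decision
# up-front "locks in" scanned_pages before the heap scan begins.
def plan_scanned_pages(vm, aggressive=False):
    skippable = (lambda bits: bits[1]) if aggressive else (lambda bits: bits[0])
    return [blk for blk, bits in enumerate(vm) if not skippable(bits)]

vm = [(1, 1), (0, 0), (1, 0)]  # page 0 frozen, page 1 neither, page 2 visible
print(plan_scanned_pages(vm))                   # -> [1]
print(plan_scanned_pages(vm, aggressive=True))  # -> [1, 2]
```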

When all skippable ranges are known in advance, we can reorder things
in many different ways -- since the work of vacuuming can be
decomposed on the lazy_scan_heap side, too. The only ordering
dependency among heap pages (that I can think of offhand) is FSM
vacuuming, which seems like it could be addressed without great
difficulty.

Ideally everything can be processed in whatever order is convenient as
of right now, based on costs and benefits. We could totally decompose
the largest VACUUM operations into individually processable units of
work (heap and index units), so that individual autovacuum workers no
longer own particular VACUUM operations at all. Autovacuum workers
would instead multiplex all of the units of work from all pending
VACUUMs, that are scheduled centrally, based on the current load of
the system. We can probably afford to be much more sensitive to the
number of pages we'll dirty relative to what the system can take right
now, and so on.

Cancelling an autovacuum worker may just make the system temporarily
suspend the VACUUM operation for later with this design (in fact
cancelling an autovacuum may not really be meaningful at all). A
design like this could also enable things like changing our mind about
advancing relfrozenxid: if we decided to skip all-visible pages in
this VACUUM operation, and come to regret that decision by the end
(because lots of XIDs are consumed in the interim), maybe we can go
back and get the all-visible pages we missed first time around.

I don't think that all of these ideas will turn out to be winners, but
I'm just trying to stimulate discussion. Breaking almost all
dependencies on the order that we process things in seems like it has
real promise.

It all depends, of course. The decision needs to be made using a cost
model. I suspect it will be necessary to try it out, and see.

But having said that, coming back to this with fresh eyes, I think
Dilip has a really good point here. If the first thing we do at the
start of every VACUUM is scan the heap in a way that is guaranteed to
rediscover all of the dead TIDs that we've previously added to the
conveyor belt plus maybe also new ones, we may as well just forget the
whole idea of having a conveyor belt at all.

I definitely agree that that's bad, and would be the inevitable result
of being lazy about deduplicating consistently.

The only way the conveyor belt system has any
value is if we think that there is some set of circumstances where the
heap scan is separated in time from the index vacuum, such that we
might sometimes do an index vacuum without having done a heap scan
just before.

I agree.

--
Peter Geoghegan

#14Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#13)
Re: should vacuum's first heap pass be read-only?

On Thu, Mar 31, 2022 at 6:43 PM Peter Geoghegan <pg@bowt.ie> wrote:

The only way the conveyor belt system has any
value is if we think that there is some set of circumstances where the
heap scan is separated in time from the index vacuum, such that we
might sometimes do an index vacuum without having done a heap scan
just before.

I agree.

But in /messages/by-id/CA+Tgmoa6kVEeurtyeOi3a+rA2XuynwQmJ_s-h4kUn6-bKMMDRw@mail.gmail.com
(and the messages just before and just after it) we seemed to be
agreeing on a design where that's exactly what happens. It seemed like
a good idea to me at the time, but now it seems like it's a bad idea,
because it involves using the conveyor belt in a way that adds no
value.

Am I confused here?

--
Robert Haas
EDB: http://www.enterprisedb.com

#15Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#14)
Re: should vacuum's first heap pass be read-only?

On Thu, Mar 31, 2022 at 5:31 PM Robert Haas <robertmhaas@gmail.com> wrote:

I agree.

But in /messages/by-id/CA+Tgmoa6kVEeurtyeOi3a+rA2XuynwQmJ_s-h4kUn6-bKMMDRw@mail.gmail.com
(and the messages just before and just after it) we seemed to be
agreeing on a design where that's exactly what happens. It seemed like
a good idea to me at the time, but now it seems like it's a bad idea,
because it involves using the conveyor belt in a way that adds no
value.

There are two types of heap scans (in my mind, at least): those that
prune, and those that VACUUM. While there has traditionally been a 1:1
correspondence between these two scans (barring cases with no LP_DEAD
items whatsoever), that's no longer true in Postgres 14, which added the
"bypass index scan in the event of few LP_DEAD items left by pruning"
optimization (or Postgres 12 if you count INDEX_CLEANUP=off).
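
(For reference, that bypass heuristic amounts to something like the
following sketch; the real vacuumlazy.c code also caps the total number of
LP_DEAD items, which I'm omitting here.)

```python
BYPASS_THRESHOLD_PAGES = 0.02  # the "2% of rel_pages" threshold

def consider_bypass_optimization(rel_pages, lpdead_item_pages):
    # Skip index vacuuming when pruning left LP_DEAD items on only a tiny
    # fraction of the table's pages (simplified; illustration only).
    return lpdead_item_pages < rel_pages * BYPASS_THRESHOLD_PAGES

print(consider_bypass_optimization(100_000, 500))    # 0.5% of pages -> True
print(consider_bypass_optimization(100_000, 5_000))  # 5% of pages -> False
```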

When I said "I agree" earlier today, I imagined that I was pretty much
affirming everything else that I'd said up until that point of the
email. Which is that the conveyor belt is interesting as a way of
breaking (or even just loosening) dependencies on the *order* in which
we perform work within a given "VACUUM cycle". Things can be much
looser than they are today, with indexes (which we've discussed a lot
already), and even with heap pruning (which I brought up for the first
time just today).

However, I don't see any way that it will be possible to break one
particular ordering dependency, even with the conveyor belt stuff: The
"basic invariant" described in comments above lazy_scan_heap(), which
describes rules about TID recycling -- we can only recycle TIDs when a
full "VACUUM cycle" completes, just like today.

That was a point I was making in the email from back in February:
obviously it's unsafe to do lazy_vacuum_heap_page() processing of a
page until we're already 100% sure that the LP_DEAD items are not
referenced by any indexes, even indexes that have very little bloat
(that don't really need to be vacuumed for their own sake). However,
the conveyor belt can add value by doing much more frequent processing
in lazy_scan_prune() (of different pages each time, or perhaps even
repeat processing of the same heap pages), and much more frequent
index vacuuming for those indexes that seem to need it.

So the lazy_scan_prune() work (pruning and freezing) can and probably
should be separated in time from the index vacuuming (compared to the
current design). Maybe not for all of the indexes -- typically for the
majority, maybe 8 out of 10. We can do much less index vacuuming in
those indexes that don't really need it, in order to be able to do
much more in those that do. At some point we must "complete a whole
cycle of heap vacuuming" by processing all the heap pages using
lazy_vacuum_heap_page() that need it.

Separately, the conveyor belt seems to have promise as a way of
breaking up work for multiplexing, or parallel processing.

--
Peter Geoghegan

#16Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#12)
Re: should vacuum's first heap pass be read-only?

On Fri, Apr 1, 2022 at 1:55 AM Robert Haas <robertmhaas@gmail.com> wrote:

But having said that, coming back to this with fresh eyes, I think
Dilip has a really good point here. If the first thing we do at the
start of every VACUUM is scan the heap in a way that is guaranteed to
rediscover all of the dead TIDs that we've previously added to the
conveyor belt plus maybe also new ones, we may as well just forget the
whole idea of having a conveyor belt at all. At that point we're just
talking about a system for deciding when to skip index vacuuming, and
the conveyor belt is a big complicated piece of machinery that stores
data we don't really need for anything because if we threw it out the
next vacuum would reconstruct it anyhow and we'd get exactly the same
results with less code.

After thinking more about this, I see there is some value in
remembering the dead TIDs in the conveyor belt. Basically, the point
is that if there are multiple indexes, we might do the index vacuum
for some of the indexes and skip it for others. Now, when we again do
a complete vacuum cycle, we will again get all the old dead TIDs plus
the new dead TIDs, and without the conveyor belt we might need to
perform multiple cycles of index vacuuming even for the indexes we
had already vacuumed in the previous vacuum cycle (if all the TIDs do
not fit in maintenance_work_mem). But with the conveyor belt we
remember the conveyor belt pageno up to which we have done the index
vacuum, and then we only need to vacuum the remaining TIDs, which
will definitely reduce the number of index vacuuming passes, right?
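
To sketch what I mean (invented names, nothing like the actual patch):
each index keeps a bookmark into the belt, and a later index vacuum only
replays the suffix of the belt it hasn't seen yet.

```python
# Toy model of the conveyor belt: an append-only sequence of dead-TID
# batches, plus a per-index bookmark recording how far into the belt that
# index has already been vacuumed.
class ConveyorBelt:
    def __init__(self):
        self.pages = []            # each entry is one batch of dead TIDs
        self.vacuumed_upto = {}    # index name -> first unprocessed batch

    def append(self, dead_tids):
        self.pages.append(list(dead_tids))

    def tids_pending_for(self, index):
        start = self.vacuumed_upto.get(index, 0)
        return [tid for page in self.pages[start:] for tid in page]

    def mark_vacuumed(self, index):
        self.vacuumed_upto[index] = len(self.pages)

belt = ConveyorBelt()
belt.append([(1, 1), (1, 2)])
belt.mark_vacuumed("idx_a")    # idx_a was vacuumed, idx_b was skipped
belt.append([(2, 7)])
print(belt.tids_pending_for("idx_a"))  # -> [(2, 7)]
print(belt.tids_pending_for("idx_b"))  # -> [(1, 1), (1, 2), (2, 7)]
```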

So my position is: a) for global indexes we definitely need the
conveyor belt for remembering the partition-wise dead TIDs (because
after vacuuming certain partitions, when we go to do global index
vacuuming, we don't want to rescan all the partitions to get the same
dead items); b) even without global indexes there are advantages to
storing dead items in the conveyor belt, as explained in my previous
paragraph. So I think it is worth adding the conveyor belt
infrastructure.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#17Peter Geoghegan
pg@bowt.ie
In reply to: Dilip Kumar (#16)
Re: should vacuum's first heap pass be read-only?

On Thu, Mar 31, 2022 at 9:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

But with the conveyor belt we remember the
conveyor belt pageno up to which we have done the index vacuum, and
then we only need to vacuum the remaining TIDs, which will definitely
reduce the number of index vacuuming passes, right?

Right, exactly -- the index or two that really need to be vacuumed a
lot can have relatively small dead_items arrays.

Other indexes (when eventually vacuumed) will need a larger dead_items
array, with everything we need to get rid of from the index in one big
array. Hopefully this won't matter much. Vacuuming these indexes
should be required infrequently (compared to the bloat-prone indexes).

As I said upthread, when we finally have to perform heap vacuuming
(not heap pruning), it'll probably happen because the heap itself
needs it. We could probably get away with *never* vacuuming
certain indexes on tables prone to non-HOT updates, without that ever
causing index bloat. But heap line pointer bloat is eventually going
to become a real problem with non-HOT updates, no matter what.

--
Peter Geoghegan

#18Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#16)
Re: should vacuum's first heap pass be read-only?

On Fri, Apr 1, 2022 at 12:08 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

After thinking more about this, I see there is some value in
remembering the dead TIDs in the conveyor belt. Basically, the point
is that if there are multiple indexes, we might do the index vacuum
for some of the indexes and skip it for others. Now, when we again do
a complete vacuum cycle, we will again get all the old dead TIDs plus
the new dead TIDs, and without the conveyor belt we might need to
perform multiple cycles of index vacuuming even for the indexes we
had already vacuumed in the previous vacuum cycle (if all the TIDs do
not fit in maintenance_work_mem). But with the conveyor belt we
remember the conveyor belt pageno up to which we have done the index
vacuum, and then we only need to vacuum the remaining TIDs, which
will definitely reduce the number of index vacuuming passes, right?

I guess you're right, and it's actually a little bit better than that,
because even if the data does fit into shared memory, we'll have to
pass fewer TIDs to the worker to be removed from the heap, which might
save a few CPU cycles. But I think both of those are very small
benefits. If that's all we're going to do with the conveyor belt
infrastructure, I don't think it's worth the effort. I am completely
in agreement with Peter's comments to the effect that the goal here is
to create flexibility, but it feels to me like the particular
development plan we discussed back in late February isn't going to
create enough flexibility to make it worth the effort it takes. It
seems to me we need to find a way to do better than that.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#18)
Re: should vacuum's first heap pass be read-only?

On Fri, Apr 1, 2022 at 11:04 AM Robert Haas <robertmhaas@gmail.com> wrote:

I guess you're right, and it's actually a little bit better than that,
because even if the data does fit into shared memory, we'll have to
pass fewer TIDs to the worker to be removed from the heap, which might
save a few CPU cycles. But I think both of those are very small
benefits.

I'm not following. It seems like you're saying that the ability to
vacuum indexes on their own schedule (based on their own needs) is not
sufficiently compelling. I think it's very compelling, with enough
indexes (and maybe not very many).

The conveyor belt doesn't just save I/O from repeated scanning of the
heap. It may also save on repeated pruning (or just dirtying) of the
same heap pages again and again, for very little benefit.

Imagine an append-only table where 1% of transactions that insert are
aborts. You really want to be able to constantly VACUUM such a table,
so that its pages are proactively frozen and set all-visible in the
visibility map -- it's not that different to a perfectly append-only
table, without any garbage tuples. And so it would be very useful if
we could delay index vacuuming for much longer than the current 2% of
rel_pages heuristic seems to allow.

That heuristic has to conservatively assume that it might be some time
before the next vacuum is launched, and has the opportunity to
reconsider index vacuuming. What if it was a more or less independent
question instead? To put it another way, it would be great if the
scheduling code for autovacuum could make inferences about what
general strategy works best for a given table over time. In order to
be able to do that sensibly, the algorithm needs more context, so that
it can course correct without paying much of a cost for being wrong.

--
Peter Geoghegan

#20Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#19)
Re: should vacuum's first heap pass be read-only?

On Fri, Apr 1, 2022 at 2:27 PM Peter Geoghegan <pg@bowt.ie> wrote:

I'm not following. It seems like you're saying that the ability to
vacuum indexes on their own schedule (based on their own needs) is not
sufficiently compelling. I think it's very compelling, with enough
indexes (and maybe not very many).

The conveyor belt doesn't just save I/O from repeated scanning of the
heap. It may also save on repeated pruning (or just dirtying) of the
same heap pages again and again, for very little benefit.

I'm also not following. In order to get that benefit, we would have to
sometimes decide not to perform lazy_scan_heap() at the start of a
vacuum. And in this email I asked you whether it was your idea that we
should always start a vacuum operation with lazy_scan_heap(), and you
said "yes":

/messages/by-id/CA+Tgmoa6kVEeurtyeOi3a+rA2XuynwQmJ_s-h4kUn6-bKMMDRw@mail.gmail.com

So I'm completely confused here. If we always start a vacuum with
lazy_scan_heap(), as you said you wanted, then we will not save any
heap scanning.

What am I missing?

--
Robert Haas
EDB: http://www.enterprisedb.com
