should vacuum's first heap pass be read-only?

Started by Robert Haasabout 4 years ago30 messageshackers

robertmhaas@gmail.com

about 4 years ago

VACUUM's first pass over the heap is implemented by a function called
lazy_scan_heap(), while the second pass is implemented by a function
called lazy_vacuum_heap_rel(). This seems to imply that the first pass
is primarily an examination of what is present, while the second pass
does the real work. This used to be more true than it now is. In
PostgreSQL 7.2, the first release that implemented concurrent vacuum,
the first heap pass could set hint bits as a side effect of calling
HeapTupleSatisfiesVacuum(), and it could freeze old xmins. However,
neither of those things wrote WAL, and you had a reasonable chance of
escaping without dirtying the page at all. By the time PostgreSQL 8.2
was released, it had been understood that making critical changes to
pages without writing WAL was not a good plan, and so freezing now
wrote WAL, but no big deal: most vacuums wouldn't freeze anything
anyway. Things really changed a lot in PostgreSQL 8.3. With the
addition of HOT, lazy_scan_heap() was made to prune the page, meaning
that the first heap pass would likely dirty a large fraction of the
pages that it touched, truncating dead tuples to line pointers and
defragmenting the page. The second heap pass would then have to dirty
the page again to mark dead line pointers unused. In the absolute
worst case, that's a very large increase in WAL generation. VACUUM
could write full page images for all of those pages while HOT-pruning
them, and then a checkpoint could happen, and then VACUUM could write
full-page images of all of them again while marking the dead line
pointers unused. I don't know whether anyone spent time and energy
worrying about this problem, but considering how much HOT improves
performance overall, it would be entirely understandable if this
didn't seem like a terribly important thing to worry about.

But maybe we should reconsider. What benefit do we get out of dirtying
the page twice like this, writing WAL each time? What if we went back
to the idea of having the first heap pass be read-only? In fact, I'm
thinking we might want to go even further and try to prevent even hint
bit changes to the page during the first pass, especially because now
we have checksums and wal_log_hints. If our vacuum cost settings are
to believed (and I am not sure that they are) dirtying a page is 10
times as expensive as reading one from the disk. So on a large table,
we're paying 44 vacuum cost units per heap page vacuumed twice, when
we could be paying only 24 such cost units. What a bargain! The
downside is that we would be postponing, perhaps substantially, the
work that can be done immediately, namely freeing up space in the page
and updating the free space map. The former doesn't seem like a big
loss, because it can be done by anyone who visits the page anyway, and
skipped if nobody does. The latter might be a loss, because getting
the page into the freespace map sooner could prevent bloat by allowing
space to be recycled sooner. I'm not sure how serious a problem this
is. I'm curious what other people think. Would it be worth the delay
in getting pages into the FSM if it means we dirty the pages only
once? Could we have our cake and eat it too by updating the FSM with
the amount of free space that the page WOULD have if we pruned it, but
not actually do so?

I'm thinking about this because of the "decoupling table and index
vacuuming" thread, which I was discussing with Dilip this morning. In
a world where table vacuuming and index vacuuming are decoupled, it
feels like we want to have only one kind of heap vacuum. It pushes us
in the direction of unifying the first and second pass, and doing all
the cleanup work at once. However, I don't know that we want to use
the approach described there in all cases. For a small table that is,
let's just say, not part of any partitioning hierarchy, I'm not sure
that using the conveyor belt approach makes a lot of sense, because
the total amount of work we want to do is so small that we should just
get it over with and not clutter up the disk with more conveyor belt
forks -- especially for people who have large numbers of small tables,
the inode consumption could be a real issue. And we won't really save
anything either. The value of decoupling operations has to do with
improving concurrency and error recovery and allowing global indexes
and a bunch of stuff that, for a small table, simply doesn't matter.
So it would be nice to fall back to an approach more like what we do
now. But then you end up with two fairly distinct code paths, one
where you want the heap phases combined and another where you want
them separated. If the first pass were a strictly read-only pass, you
could do that if there's no conveyor belt, or else read from the
conveyor belt if there is one, and then the phase where you dirty the
heap looks about the same either way.

Aside from the question of whether this is a good idea at all, I'm
also wondering what kinds of experiments we could run to try to find
out. What would be a best case workload for the current strategy vs.
this? What would be a worst case for the current strategy vs. this?
I'm not sure. If you have ideas, I'd love to hear them.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com