BRIN FSM vacuuming questions
I noticed that several places in brin_pageops.c do this sort of thing:
if (extended)
{
Assert(BlockNumberIsValid(newblk));
RecordPageWithFreeSpace(idxrel, newblk, freespace);
FreeSpaceMapVacuum(idxrel);
}
ie they put *one* page into the index's free space map and then invoke a
scan of the index's entire FSM to ensure that that info is bubbled up to
the top immediately. I wonder how carefully this was thought through.
We don't generally think that it's worth doing an FSM vacuum for a single
page worth of free space.
We could make these operations somewhat cheaper, in the wake of 851a26e26,
by doing
- FreeSpaceMapVacuum(idxrel);
+ FreeSpaceMapVacuumRange(idxrel, newblk, newblk + 1);
so that only the relevant subtree of the FSM gets scanned. But it seems
at least as plausible to just drop the FreeSpaceMapVacuum calls entirely
and let the next VACUUM fix up the map, like the entire rest of the
system does. This is particularly true because the code is inconsistent:
some places that do RecordPageWithFreeSpace follow it up with an immediate
FreeSpaceMapVacuum, and some don't.
Or, perhaps, there's actually some method behind this madness and there's
good reason to think that FreeSpaceMapVacuum calls in these specific
places are worth the cost? If so, a comment documenting the reasoning
would be appropriate.
I'm also finding this code in brin_page_cleanup a bit implausible:
/* Measure free space and record it */
freespace = br_page_get_freespace(page);
if (freespace > GetRecordedFreeSpace(idxrel, BufferGetBlockNumber(buf)))
{
RecordPageWithFreeSpace(idxrel, BufferGetBlockNumber(buf), freespace);
return true;
}
because of the fact that GetRecordedFreeSpace will report a rounded-down
result owing to the limited resolution of the FSM's numbers. If we
assume that the result of br_page_get_freespace is always maxaligned,
then by my math there's a 75% chance (on maxalign-8 machines; more if
maxalign is 4) that this will think there's a discrepancy even when
there is none. It seems questionable to me that it's worth checking
GetRecordedFreeSpace at all. If it really is, we'd need to get the
recorded free space, invoke RecordPageWithFreeSpace, and then see if
the result of GetRecordedFreeSpace has changed in order to decide whether
anything really changed. But that seems overly complicated, and again
unlike anything done elsewhere. Also, the result value is used to
decide whether FreeSpaceMapVacuum is needed, but I'm unconvinced that
it's a good idea to skip said call just because the leaf FSM values did
not change; that loses the opportunity to clean up any upper-level FSM
errors that may exist.
In short, I'd suggest making this RecordPageWithFreeSpace call
unconditional, dropping the result value of brin_page_cleanup,
and doing FreeSpaceMapVacuum unconditionally too, back in
brin_vacuum_scan. I don't see a reason for brin to be unlike
the rest of the system here.
Lastly, I notice that brin_vacuum_scan's loop is coded like
for (blkno = 0; blkno < RelationGetNumberOfBlocks(idxrel); blkno++)
{
...
}
which means we're incurring an lseek kernel call for *each iteration*
of the loop. Surely that's pretty inefficient. Is there an intention
here to be sure we examine every page even in the face of concurrent
extensions? If so, we could code it like this:
nblocks = RelationGetNumberOfBlocks(idxrel);
for (blkno = 0; blkno < nblocks; blkno++)
{
...
if (blkno == nblocks-1)
nblocks = RelationGetNumberOfBlocks(idxrel);
}
but I'd just as soon skip the recheck unless there's a really strong
reason for it. Even with that, there's no guarantee somebody hasn't added
another page before we ever get out of the function, so I'm unclear
on what the point is.
regards, tom lane
Tom Lane wrote:
I noticed that several places in brin_pageops.c do this sort of thing:
if (extended)
{
Assert(BlockNumberIsValid(newblk));
RecordPageWithFreeSpace(idxrel, newblk, freespace);
FreeSpaceMapVacuum(idxrel);
}
ie they put *one* page into the index's free space map and then invoke a
scan of the index's entire FSM to ensure that that info is bubbled up to
the top immediately. I wonder how carefully this was thought through.
We don't generally think that it's worth doing an FSM vacuum for a single
page worth of free space.
Yeah, there was no precedent for doing that, but it was actually
intentional: there were enough cases where we would extend the relation,
only to have the operation fail for unrelated reasons (things like a
concurrent heap insertion), so it was possible to lose free space very
fast, and since BRIN indexes are slow to grow it would become
noticeable. We could have waited for FreeSpaceMapVacuum to occur
naturally, but this would cause the index to bloat until next vacuum,
which could happen a long time later. Also, the idea behind BRIN is
that the indexed tables are very large, so vacuuming is infrequent.
I also considered the fact that we can save the newly acquired page in
relcache's "target block", but I was scared of relying just on that
because AFAIR it's easily lost on relcache invalidations.
We could make these operations somewhat cheaper, in the wake of 851a26e26,
by doing
- FreeSpaceMapVacuum(idxrel);
+ FreeSpaceMapVacuumRange(idxrel, newblk, newblk + 1);
so that only the relevant subtree of the FSM gets scanned. But it seems
at least as plausible to just drop the FreeSpaceMapVacuum calls entirely
and let the next VACUUM fix up the map, like the entire rest of the
system does. This is particularly true because the code is inconsistent:
some places that do RecordPageWithFreeSpace follow it up with an immediate
FreeSpaceMapVacuum, and some don't.
Hmm I analyzed the callsites carefully to ensure it wasn't possible to
bail out without vacuuming in the normal corner cases. Maybe you
spotted some gap in my analysis -- or worse, maybe I had to change
nearby code for unrelated reasons and broke it afterwards inadvertently.
Or, perhaps, there's actually some method behind this madness and there's
good reason to think that FreeSpaceMapVacuum calls in these specific
places are worth the cost? If so, a comment documenting the reasoning
would be appropriate.
Range-vacuuming the FSM seems a good idea for these places -- no need to
trawl the rest of the tree. I'm not too excited about dropping the FSM
vacuuming completely and waiting till regular VACUUM. (Why do we call
this FSM operation "vacuum"? It seems pretty confusing.)
I agree that brin_getinsertbuffer should have a better comment to
explain the intention (and that routine seems to carry the most guilt in
failing to do the FSM vacuuming).
I'm also finding this code in brin_page_cleanup a bit implausible:
/* Measure free space and record it */
freespace = br_page_get_freespace(page);
if (freespace > GetRecordedFreeSpace(idxrel, BufferGetBlockNumber(buf)))
{
RecordPageWithFreeSpace(idxrel, BufferGetBlockNumber(buf), freespace);
return true;
}
because of the fact that GetRecordedFreeSpace will report a rounded-down
result owing to the limited resolution of the FSM's numbers. If we
assume that the result of br_page_get_freespace is always maxaligned,
then by my math there's a 75% chance (on maxalign-8 machines; more if
maxalign is 4) that this will think there's a discrepancy even when
there is none. It seems questionable to me that it's worth checking
GetRecordedFreeSpace at all. If it really is, we'd need to get the
recorded free space, invoke RecordPageWithFreeSpace, and then see if
the result of GetRecordedFreeSpace has changed in order to decide whether
anything really changed. But that seems overly complicated, and again
unlike anything done elsewhere. Also, the result value is used to
decide whether FreeSpaceMapVacuum is needed, but I'm unconvinced that
it's a good idea to skip said call just because the leaf FSM values did
not change; that loses the opportunity to clean up any upper-level FSM
errors that may exist.
In short, I'd suggest making this RecordPageWithFreeSpace call
unconditional, dropping the result value of brin_page_cleanup,
and doing FreeSpaceMapVacuum unconditionally too, back in
brin_vacuum_scan. I don't see a reason for brin to be unlike
the rest of the system here.
As I recall, this was just trying to save unnecessary FSM updates, and
failed to make the connection with lossiness in GetRecordedFreeSpace.
But since FSM updates are quite cheap anyway (since they're not WAL
logged), I agree with your proposed fix.
Lastly, I notice that brin_vacuum_scan's loop is coded like
for (blkno = 0; blkno < RelationGetNumberOfBlocks(idxrel); blkno++)
{
...
}
which means we're incurring an lseek kernel call for *each iteration*
of the loop. Surely that's pretty inefficient.
Oops, this is just a silly oversight.
Is there an intention here to be sure we examine every page even in
the face of concurrent extensions? If so, we could code it like this:
nblocks = RelationGetNumberOfBlocks(idxrel);
for (blkno = 0; blkno < nblocks; blkno++)
{
...
if (blkno == nblocks-1)
nblocks = RelationGetNumberOfBlocks(idxrel);
}
but I'd just as soon skip the recheck unless there's a really strong
reason for it. Even with that, there's no guarantee somebody hasn't added
another page before we ever get out of the function, so I'm unclear
on what the point is.
Yeah, as I recall, this code is only there to cope with the server
crashing beforehand; I don't think we need the recheck.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Tom Lane wrote:
I noticed that several places in brin_pageops.c do this sort of thing:
...
ie they put *one* page into the index's free space map and then invoke a
scan of the index's entire FSM to ensure that that info is bubbled up to
the top immediately. I wonder how carefully this was thought through.
We don't generally think that it's worth doing an FSM vacuum for a single
page worth of free space.
Yeah, there was no precedent for doing that, but it was actually
intentional: there were enough cases where we would extend the relation,
only to have the operation fail for unrelated reasons (things like a
concurrent heap insertion), so it was possible to lose free space very
fast, and since BRIN indexes are slow to grow it would become
noticeable. We could have waited for FreeSpaceMapVacuum to occur
naturally, but this would cause the index to bloat until next vacuum,
which could happen a long time later. Also, the idea behind BRIN is
that the indexed tables are very large, so vacuuming is infrequent.
Fair enough. It seemed like an odd thing, but as long as it was
intentional I'm good with it.
some places that do RecordPageWithFreeSpace follow it up with an immediate
FreeSpaceMapVacuum, and some don't.
Hmm I analyzed the callsites carefully to ensure it wasn't possible to
bail out without vacuuming in the normal corner cases. Maybe you
spotted some gap in my analysis -- or worse, maybe I had to change
nearby code for unrelated reasons and broke it afterwards inadvertently.
Well, I did not trace the logic fully, but there are places that record
free space and don't have an obviously connected FreeSpaceMapVacuum
call. Maybe those all expect the caller to do it.
I agree that brin_getinsertbuffer should have a better comment to
explain the intention (and that routine seems to carry the most guilt in
failing to do the FSM vacuuming).
Got a proposal?
I can take care of the other stuff, unless you wanted to.
regards, tom lane
I wrote:
[ assorted complaining about BRIN FSM management ]
Here's a proposed patch for this. Changes:
* Do RecordPageWithFreeSpace unconditionally in brin_page_cleanup,
and do FreeSpaceMapVacuum unconditionally in brin_vacuum_scan.
Because of the FSM's roundoff behavior, the old complications here
weren't really buying any savings.
* Elsewhere, we are trying to update the FSM for just a single-page
status update, so use FreeSpaceMapVacuumRange() to limit how much
of the upper FSM gets traversed.
* Fix a couple of places that neglected to do the upper-page
vacuuming at all after recording new free space. If the policy
is to be that BRIN should do that, it should do it everywhere.
* Remove premature RecordPageWithFreeSpace call in brin_getinsertbuffer
where it's about to return an extended page to the caller. The caller
should do that, instead, after it's inserted its new tuple. Fix the
one caller that forgot to do so.
* Simplify logic in brin_doupdate's same-page-update case by
postponing brin_initialize_empty_new_buffer to after the critical
section; I see little point in doing it before.
* Avoid repeat calls of RelationGetNumberOfBlocks in brin_vacuum_scan.
* Avoid duplicate BufferGetBlockNumber and BufferGetPage calls in
a couple of places where we already had the right values.
* Move a BRIN_elog call out of a critical section; that's pretty
unsafe and I don't think it buys us anything to not wait till
after the critical section.
* I moved the "*extended = false" step in brin_getinsertbuffer into
the routine's main loop. There's no actual bug there, since the loop
can't iterate with *extended still true, but it doesn't seem very
future-proof as coded; and it's certainly not documented as a loop
invariant.
* Assorted comment improvements.
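For concreteness, the brin_vacuum_scan portion of the above (hoisted block count, unconditional cleanup and FSM vacuum) might come out looking roughly like this sketch, reconstructed from the description rather than taken from the patch itself:

```c
/* Sketch only -- illustrates the described changes, not the actual patch. */
static void
brin_vacuum_scan(Relation idxrel, BufferAccessStrategy strategy)
{
	BlockNumber	nblocks;
	BlockNumber	blkno;

	/*
	 * Scan the index in physical order, cleaning up and recording free
	 * space on each page unconditionally.  Fetch the relation size just
	 * once, rather than incurring an lseek per loop iteration.
	 */
	nblocks = RelationGetNumberOfBlocks(idxrel);
	for (blkno = 0; blkno < nblocks; blkno++)
	{
		Buffer		buf;

		CHECK_FOR_INTERRUPTS();

		buf = ReadBufferExtended(idxrel, MAIN_FORKNUM, blkno,
								 RBM_NORMAL, strategy);
		brin_page_cleanup(idxrel, buf);
		ReleaseBuffer(buf);
	}

	/*
	 * Update all upper pages of the FSM unconditionally: this propagates
	 * the leaf entries we just recorded, and also repairs any
	 * pre-existing upper-level errors.
	 */
	FreeSpaceMapVacuum(idxrel);
}
```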
The use of FreeSpaceMapVacuumRange makes this a HEAD-only patch.
I'm not sure if any of the other changes are worth back-patching.
regards, tom lane