The Free Space Map: Problems and Opportunities

Started by Peter Geoghegan, over 4 years ago, 48 messages, hackers

I have suspected that there are serious design issues with the FSM
(and related heapam code like hio.c) for several years now [1]. This
has a lot to do with the experience of using Jan Wieck's BenchmarkSQL
implementation of TPC-C [2][3][4]. It clearly makes Postgres exhibit
pathological performance issues, especially heap fragmentation.

There is a clear way in which free space in the two largest tables
(orders and order lines) ought to be managed, given the fixed pattern
of the workload, but that isn't what we see. I had the opportunity to
work with Jan Wieck and Greg Smith directly on this, shortly before I
left Crunchy Data several weeks ago. Jan was able to fix some issues
on the BenchmarkSQL side, which helped a lot. But the real problems
remain. It is generally understood that the FSM leads to Postgres
doing badly with TPC-C. I personally believe that this is
representative of real practical problems, and not just a problem with
this one benchmark.

This email is an attempt to impose some order on a disorderly problem
space. I'd like to compare notes with other people that are interested
in the problem. I suspect that some experienced hackers have yet to be
convinced of the importance of the FSM when it comes to bloat.
Hopefully I'll manage to make that case convincingly on this thread.

If any of my more concrete claims about problems in the FSM seem like
they need to be justified, please let me know. I am omitting details
in this initial overview email for the sake of brevity. It's already
too long.

Problems
--------

The FSM gives out space without considering the passage of time, or
the likelihood that a particular transaction or client will consume
its requested free space in a more or less predictable and steady way
over time. It has no memory. The FSM therefore fails to capture
naturally occurring locality, or at least doesn't specifically care
about it. This is the central problem, that all other problems with
the FSM seem to stem from in one way or another.

The FSM should group related rows (e.g. rows from the same transaction
or backend) together, so that they reliably go on the same heap page
-- careless mixing of unrelated rows should be avoided. When I say
"related", I mean related in whatever sense the application or end
user thinks of them as related. As a general rule we should expect
groups of rows that were inserted by the same transaction to also be
read, updated, deleted, or removed by VACUUM together, as a group.
While this isn't guaranteed, it's a good working assumption for the
FSM. It avoids unnecessary fragmentation. The current FSM design
causes significant fragmentation with workloads where users *expect*
no fragmentation at all. I see this with TPC-C. The way that the FSM
*ought* to behave for the workload in question is intuitively obvious,
once you lay it all out. But that's not what we see in practice -- far
from it. (Happy to go into more detail on this.)

You don't even need temporal locality to see problems. Even random
inserts show up certain FSM implementation problems. The FSM lacks
even a basic understanding of the *aggregate* result of backends
applying various FSM heuristics over time, and with concurrent access.
Inserting backends currently behave like an amateur soccer team where
every individual soccer player chases the ball, without any
coordination among team members. The players in this analogy somehow
fail to consider where the ball *is about to be* -- there is no
temporal awareness. They also miss the importance of cooperating with
each other as a team -- there is also no spatial awareness, and no
thought given to second order effects. This leads to increased buffer
lock contention and fragmentation.

(Other known FSM problems have been omitted for the sake of brevity.)

Background
----------

To recap, the FSM tracks how much free space is in every heap page.
Most FSM maintenance takes place during VACUUM, though ordinary
connection backends occasionally help to keep the FSM current, via
calls to RecordAndGetPageWithFreeSpace() made from hio.c.
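To make the interface concrete, here is a toy model of what the FSM
tracks and hands out. This is purely illustrative (the real structure
in fsm.c is a binary tree over per-page categories, and all the names
below are invented); it only mimics the shape of the lookups the text
describes:

```python
# Toy stand-in for the FSM interface: per-page free space, updated by
# VACUUM and occasionally by inserting backends, with lookups that
# prefer lower-numbered blocks (like the real FSM's low-block bias).

class ToyFSM:
    def __init__(self):
        self.free_space = {}  # block number -> bytes of free space

    def record(self, block, free_bytes):
        # Loosely analogous to recording free space during VACUUM
        self.free_space[block] = free_bytes

    def page_with_space(self, needed_bytes):
        # Loosely analogous to asking for a page with enough room
        for block in sorted(self.free_space):
            if self.free_space[block] >= needed_bytes:
                return block
        return None  # caller falls back to extending the relation
```

Note what is missing from even this faithful-in-spirit model: there is
no record of *who* asked for space, or *when* -- exactly the
memorylessness described above.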

There is also a bulk extension mechanism added by commit 719c84c1be,
which is used when we detect multiple concurrent inserters. This bulk
extension mechanism does change things a little (it gives some thought
to systematic effects), though it still has the same basic issues: it
doesn't do nearly enough to group likely-related rows together. I
suspect that improving the bulk extension mechanism will be an
important part of improving the overall picture. That mechanism needs
to be more explicit, and more careful about who gets what free space
when. I'll go into the significance of this mechanism to the FSM
below, under "Opportunities". But first, more background information.

Papers
------

I've taken time to review the database research literature, which
doesn't have all that much to say here. But there are a couple of
relevant classic papers that I know of [6][7].

A lot of the considerations for free space management in heap-based
systems like DB2 and Oracle are naturally intertwined with concurrency
control, recovery, and even heavyweight locking. These systems must
treat TIDs as stable identifiers of a logical row, and not identifiers
of a physical tuple -- another important difference. "Free space" is
only truly available to reuse in these systems some time after a
deleter commits and releases its heavyweight locks. This is why their
FSM equivalents must be WAL-logged. There is even a need for logical
UNDO to cover these FSM-equivalent data structures, which have their
own tricky rollback path issues that align with those in the heap
structure. Everything is tied together to avoid rollback safety
hazards. Physical rollback must never find that the tuple needed to
restore from UNDO no longer physically fits on the original heap
page. Plus the line pointer can't just be recycled. It's all quite
delicate.

At first I thought that this UNDO rollback safety business meant that
I had nothing to learn from these old papers. Postgres naturally
doesn't have to care about a transaction reusing space before another
transaction that deleted the space commits (or after an inserter
aborts) -- in Postgres, space is just that: space. We always have the
*option* to create a new HOT chain on another page. However, I
eventually changed my mind -- these old papers are relevant. In a way
Postgres actually does have rollback. It doesn't involve delicate
physical handling, in large part because TID stability isn't
guaranteed in Postgres. But Postgres style "logical rollback" is more
similar than it is different.

Logical database
----------------

While it certainly is good that Postgres has the freedom to allow the
physical representation of the logical database to diverge in all
kinds of ways, that doesn't mean that we should be totally unconcerned
about just how far the divergence goes. It's good that we have the
option to retain a few extra versions without considering rollback
hazards. But an option isn't an obligation. That's what makes an
option useful -- it gives us the ability to have our cake, and eat it
too.

Clearly we should try to avoid the need for a new HOT chain on another
heap block, for example. OTOH we shouldn't bend over backwards to make
sure of it -- that can't be worth it, since it isn't a matter of
fundamental correctness with Postgres style concurrency control. As I
said, we should try to be a little bit like ARIES with full UNDO
rollback, purely to avoid bloat and improve temporal locality --
concurrent inserters should not clobber the same heap pages. I believe
that something a little like FSM transaction rollback is just what we
need here. A design that is inspired by other existing designs,
including even the UNDO + rollback parts, which are pretty essential.
Transaction boundaries and ownership semantics for free space really
do seem to matter when it comes to avoiding bloat.

Roughly the same high level approach led to my developing bottom-up
index deletion for Postgres 14 [5] -- the same "logical vs physical"
tension exists in Postgres B-Tree indexes. Using "logical database
surrogates" in the physical database is actually a standard family of
techniques used in DB systems design. This family of techniques has
been underexploited within Postgres, probably because it isn't so
natural in a world without UNDO style rollback. This family of
techniques is something to keep in your "bag of tricks" as a Postgres
hacker (as Hellerstein once put it).

I'll return to how we might actually do something a bit like UNDO
style rollback in the FSM later on. It is counterintuitive, but stay
with me.

Open and closed pages
---------------------

Note that other designs for FSM style structures (for a heap table
access method) all seem to have fewer than 10 possible "page has this
much free space" increments [6] -- the granularity is far coarser than
our own FSM_CATEGORIES granularity (which has a massive 255 distinct
increments of free space for each heap page). One reason for this is
that ignoring tiny differences avoids contention and fragmentation
from chasing insignificant differences between heap blocks, which I
don't need to go into again (my earlier amateur soccer team analogy is
enough for now). But it's also related to the logical
database/rollback stuff that I just went into. It sometimes makes
sense to think of a physical heap page as more than just a data
structure. Even things that we tend to naturally think of as strictly
physical may need to be rethought just a little.
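To make the granularity difference concrete, here is a hedged sketch.
The fine-grained arithmetic roughly mirrors how the FSM quantizes
free space today; the coarse variant and its bucket count are purely
illustrative, not taken from any actual system:

```python
BLCKSZ = 8192          # standard Postgres heap page size
FSM_CATEGORIES = 256   # fine-grained: roughly 32-byte increments

def fine_category(free_bytes):
    # Approximately how the FSM quantizes free space today
    return free_bytes // (BLCKSZ // FSM_CATEGORIES)

def coarse_category(free_bytes, nbuckets=8):
    # An FSIP-style alternative with fewer than 10 increments
    return free_bytes // (BLCKSZ // nbuckets)

# Two pages whose free space differs by only 60 bytes land in
# different fine categories, but in the same coarse bucket -- the
# coarse scheme simply refuses to chase that noise-level difference.
```

The point is not the exact arithmetic; it's that the coarse scheme
makes noise-level differences between heap blocks invisible, so
backends stop competing over them.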

Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
ever end up inserting *any* new logical rows on a heap page after the
page first fills -- it is initially "open" when first allocated, and
then quickly becomes "closed" to inserts of new logical rows, once it
crosses a certain almost-empty space threshold. The heap page usually
stays "closed" forever. While the design does not *completely* ignore
the possibility that the page will eventually get so empty that reuse
really does look like a good idea, it makes that rare. A heap page needs
to have rather a lot of the original logical rows deleted before that
becomes possible again. It's rare for it to go backwards because the
threshold that makes a closed page become open again is much lower
(i.e. involves much more free space) than the threshold that initially
made a (newly allocated) heap page closed. This one-way stickiness
quality is important. As with simulated annealing algorithms, heap
pages naturally settle over time, and the design makes it hard to
change things once they become settled -- it has to be really
worth it before we'll flip a page back to "open" again. There is a
general bias against disturbing that which has become the settled
state. (I think that the terms "open" and "closed" come from the world
of online bin packing problems, where the same tension and
uncertainties exist.)

This stickiness concept is called "hysteresis" by some DB researchers,
often when discussing UNDO stuff [8]. Having *far less* granularity
than FSM_CATEGORIES/255 seems essential to make that work as intended.
Pages need to be able to settle without being disturbed by noise-level
differences. That's all that super fine grained values buy you: more
noise, more confusion.
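As a concrete illustration of that hysteresis, here is a toy
open/closed state machine. The two thresholds are invented for
illustration (they are not DB2's actual numbers); what matters is only
their asymmetry:

```python
# Hysteresis sketch: a page closes quickly as it fills, but reopens
# only after losing far more space than it took to close it.

CLOSE_THRESHOLD = 0.10   # closes once free space drops below 10%
REOPEN_THRESHOLD = 0.50  # reopens only once free space exceeds 50%

class PageState:
    def __init__(self):
        self.open = True           # newly allocated pages start open
        self.free_fraction = 1.0

    def record_free_space(self, free_fraction):
        self.free_fraction = free_fraction
        if self.open and free_fraction < CLOSE_THRESHOLD:
            self.open = False      # sticky: closes as the page fills
        elif not self.open and free_fraction > REOPEN_THRESHOLD:
            self.open = True       # only heavy deletion reopens it
```

Because the two thresholds are far apart, a closed page that drifts to
30% free stays closed: noise-level churn cannot disturb a settled
page.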

Visibility map
--------------

If the logical database and natural locality are important to the FSM,
then what about the visibility map? And what about the relationship
between the FSM and the visibility map, in light of all this?

Currently VACUUM doesn't care about how its FSM behavior interacts
with how it sets all-frozen/all-visible bits for the same heap page.
To me this seems completely unreasonable -- they're *obviously*
related! By ignoring the VM/FSM relationship, we're probably only
gaining a minimal amount of free space on one occasion, while paying
a high cost for it. Worst of all we're *perpetuating the cycle* of dirtying and
redirtying the same pages over time. Maybe we should go as far as
merging the FSM and VM, even -- that seems like a natural consequence
of logical-ish/qualitative definitions of "page is full".

Opportunities
-------------

Now more about what I think a new FSM design for Postgres *should* do.
This is almost a description of an actual data structure. A new FSM.
But not quite.

As we've seen, free space management in a true ARIES-style system
(like the DB2 FSIP design) is forced to think of free space in terms
of "leases", or bulk-allocated free lists of usually-contiguous pages.
Under a true ARIES design, it is strictly necessary to track which
deleter XID freed-up which individual lease of free space lest some
concurrent inserter reuse the space before it is truly safe to do so
-- the deleter xact must commit before that reuse can be safely
allowed. While delete + commit does seem mostly irrelevant to the
Postgres FSM, insert + abort case handling seems to be something that
we can learn from. Maybe the same principles can be applied in
Postgres, where we need leases to avoid leaks in the presence of
errors (not necessarily rollback per se). We must be able to tolerate
incorrect speculations about where free space will be needed next.
Transactions must be given the opportunity to make use of the heap
free space lease that they asked for -- but not forever. It's a
balancing act.

Imagine if the FSM tracked recent free space requests that have been
satisfied already. Requesting backends will maintain this same
metadata when they burn through their bulk allocation of reserved
space/heap pages. This new structure would contain metadata like the
requesting XID/backend for a given lease of heap pages, the number of
leases during the last bulk extension operation, and maybe the number
of concurrently waiting backends at that time. Now we have some sense
of recent history, and start to recognize trends over time. The FSM is
now able to *take back* some of the space that it gives out, based on
new information. Now the FSM can afford to make mistakes because the
mistakes can always be corrected a little later. The FSM can be
generous based on speculation, heuristics, whatever it might be. It
should be possible for the FSM to have it both ways, more or less.
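A hypothetical sketch of what such lease metadata might look like.
Every name and field here is invented for illustration; nothing like
this exists in Postgres today:

```python
from dataclasses import dataclass

@dataclass
class FreeSpaceLease:
    owner_xid: int     # requesting transaction
    start_block: int   # first heap block in the lease
    nblocks: int       # contiguous blocks granted
    blocks_used: int = 0

class LeaseTracker:
    def __init__(self):
        self.leases = {}  # xid -> FreeSpaceLease

    def grant(self, xid, start_block, nblocks):
        lease = FreeSpaceLease(xid, start_block, nblocks)
        self.leases[xid] = lease
        return lease

    def take_back(self, xid):
        # At commit/abort, the FSM reclaims whatever went unused --
        # this is how mistakes get "corrected a little later"
        lease = self.leases.pop(xid)
        unused_start = lease.start_block + lease.blocks_used
        return unused_start, lease.nblocks - lease.blocks_used
```

The `take_back` path is the whole point: once the FSM can reliably
reclaim unused space, it can afford to be generous up front.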

Consider the bulk extension mechanism for concurrent inserters that
was added by commit 719c84c1be once again.

The mechanism's leader backend process performs the actual bulk heap
relation extension, allocating up to 512 new heap pages all at once.
Even when the leader backend ends up extending a heap relation by
quite a few pages (hundreds), the leader is still unable to *directly*
hand off free space to particular backends that drove the leader to
extend aggressively in the first place. It would be quite natural for
the leader backend to say to each individual follower backend: thank
you for waiting for me, here is a big contiguous range of heap pages
(e.g., as many as 50 heap pages or more per xact when things truly
ramp up), which should be enough to allow you to avoid going to the
FSM again for quite a long time.

But the leader backend doesn't do that at all -- it just puts all of
the pages (up to 512 total) in the FSM. All of the waiting follower
backends are left to senselessly fight it out for free space by simply
going back to the FSM when they wake up again. Why can't they
cooperate intelligently? What stops that today?
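What the leader *could* do instead, sketched naively (this function
and its even-split policy are assumptions for illustration, not
existing Postgres code):

```python
def deal_out_ranges(first_new_block, npages, nwaiters):
    """Split npages newly extended, contiguous blocks into one
    contiguous range per waiting backend, instead of dumping them
    all into the shared FSM."""
    per_waiter = npages // nwaiters
    ranges, block = [], first_new_block
    for i in range(nwaiters):
        # the last waiter absorbs any remainder
        n = per_waiter if i < nwaiters - 1 else npages - per_waiter * (nwaiters - 1)
        ranges.append((block, n))
        block += n
    return ranges
```

Each follower would then work through its own contiguous range
privately, keeping its related rows physically together and never
contending with the other followers for the same blocks.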

FSM "rollback"
--------------

(This is where my argument comes together, finally.)

FSM rollback in the FSIP paper is really just about reliable ownership
semantics. Ownership of free space in heap pages that is scoped at the
level of transactions.

It seems to me that ownership of free space in heap pages by
particular transactions is really all that the bulk extension FSM
logic is missing -- and getting that right seems like a big part of
fixing the FSM comprehensively. The bulk extension mechanism cannot
*trust* the backends to use more than a little space at a time today.
While on average each backend's use of free space probably *is*
predictable over time, it only takes one or two straggler backends per
bulk operation to mess everything up -- that's enough for the system
to end up with very poor space utilization. Since we don't have any
sense of ownership of space in heap pages by transactions today, the
code must hedge against stragglers/space leaks by making the space
*generally* available to all. Of course, this hedging comes at a high
cost: it prevents the whole system (heapam, FSM, and backends) from
capturing naturally occurring locality in a variety of important
cases. Including the important TPC-C case I mentioned.

Explicit scoped ownership of free space in heap pages removes the need
to hedge. It addresses the problem of would-be space leaks directly,
while *also* fixing the FSM locality problems implicitly. Metadata
about recent requests gives us the *context* needed to do all this
well. Sure, we don't need it for transaction rollback using ARIES
style UNDO, but we can still use it for this. Perhaps this can be
pushed much further. Long running transactions ought to have as much
time as they need to use the large lease of free space that they were
provided. But when the transaction commits or aborts, we might then
make the FSM code much more aggressive about stealing space back from
the backend that inherits the lease from the now-committed
transaction. It all depends.

Maybe we can teach VACUUM to eagerly clean up aborted transactions
that are known to have consumed a lot of free space that turns out to
not be needed for the inserted rows. It might also be feasible to make
our new slimmed-down FSM crash-safe. The way that REDO routines for
VACUUM handle FSM maintenance is pretty sloppy right now.

Bean counting
-------------

In general I find the idea that an algorithm can always make better
decisions with more information dubious. Accounting for every tiny
little fragment of free space ("bean counting") is definitely not an
intrinsic good. Maybe keeping that information is not just wasteful --
maybe it's actually harmful. There is a real risk of overinterpreting
noise-level differences [9]. I'd prefer to err in the direction of
underfitting.

I now wonder if the FSM is fundamentally doing the wrong thing by
keeping track of all "free space" in every page over time. Why
wouldn't we just have a free space management strategy that is
generally only concerned with recent events? If we find a way to make
almost all newly allocated heap pages become "closed" quickly (maybe
they're marked all-frozen quickly too), and make sure that that
condition is sticky, then this can work out well. We may not even need
to store explicit freespace information for most heap pages in the
database -- a page being "closed" can be made implicit by the FSM
and/or VM. Making heap pages age-out like this (and mostly stay that
way over time) has obvious benefits in many different areas.

This won't help workloads like stock pgbench, which is pretty
unrealistic anyway. But it needn't hurt them either.

[1]: /messages/by-id/CAH2-WzkEP6wz2Eg7mgKbF-qTPi21+BWrJgNm0qfU5kf0iJBV2g@mail.gmail.com
[2]: /messages/by-id/0265f9e2-3e32-e67d-f106-8abde596c0e4@commandprompt.com
[3]: https://github.com/wieck/benchmarksql/commit/6ac326ad23d55de2edc18bfbffb10b21f1b39b48
[4]: https://github.com/wieck/benchmarksql/commit/0830c5f526061529eb2b45831c544539caa65435
[5]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf
[6]: https://dl.acm.org/doi/abs/10.1145/235968.233355
[7]: https://www.semanticscholar.org/paper/Algorithms-for-Flexible-Space-Management-in-Systems-Mohan-Haderle/9db1a8104daade31949b6399cac9169751fa1605
[8]: https://queue.acm.org/detail.cfm?id=3099561
[9]: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff#Approaches
--
Peter Geoghegan

#2 Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#1)
Re: The Free Space Map: Problems and Opportunities

On Mon, Aug 16, 2021 at 10:35:45AM -0700, Peter Geoghegan wrote:

I have suspected that there are serious design issues with the FSM
(and related heapam code like hio.c) for several years now [1]. This
has a lot to do with the experience of using Jan Wieck's BenchmarkSQL
implementation of TPC-C [2][3][4]. It clearly makes Postgres exhibit
pathological performance issues, especially heap fragmentation.

OK, here is my feedback. First, I understand the space
reservation/soccer idea, but I am also concerned that adding space
reservation could improve some things and make others worse.

Second, let's look at a concrete example, and see how it can be improved
more simply. As I understand it, there are three operations that need
free space:

1. INSERT/COPY
2. UPDATE where there is insufficient space on the page of the
old row
3. UPDATE where there is sufficient page space

The third option already can use the existing page for a HOT update, so
the FSM doesn't matter. For 1 & 2, suppose you have a table with 10 8k
pages, all 80% full. As I understand it, the first page will go from
80% to 81%, then the next row addition will take the second page from
80% to 81%, until all pages are 81%, and then it starts over again. Is
that accurate? Is that the non-temporal memory issue you are concerned
about? Doesn't spreading the new rows out increase the ability to do
HOT updates?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#3 Peter Geoghegan
pg@bowt.ie
In reply to: Bruce Momjian (#2)
Re: The Free Space Map: Problems and Opportunities

On Mon, Aug 16, 2021 at 3:49 PM Bruce Momjian <bruce@momjian.us> wrote:

OK, here is my feedback. First, I understand the space
reservation/soccer idea, but I am also concerned that adding space
reservation could improve some things and make others worse.

That is definitely a legitimate concern.

There is a trade-off here: if we do too much preallocation, there may
be cases where the preallocated space that we expected to be used
quickly is never used by anybody. But that doesn't seem like a
problem, provided we don't lose track of the fact that it happened.
Then the worst that can happen is that we have empty pages that nobody
will ever use, because nobody inserts into the table ever again (not
the backend with the leaked space lease, not anybody else). That seems
manageable. We can add some kind of ramp-up behavior that gets more
and more aggressive as demand for new space from backends is steady or
increasing.
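One hedged sketch of such a ramp-up policy (the doubling/halving and
the 512-page cap are arbitrary choices for illustration, not a
proposal for exact constants):

```python
def next_extension_size(prev_size, unused_from_last_batch,
                        min_size=1, max_size=512):
    # Grow aggressively while demand consumes everything we give
    # out; back off as soon as preallocated pages go unused.
    if unused_from_last_batch == 0:
        return min(prev_size * 2, max_size)
    return max(prev_size // 2, min_size)
```

The asymmetry matters more than the constants: sustained demand earns
larger preallocations, while any leftover space immediately argues for
caution on the next round.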

Second, let's look at a concrete example, and see how it can be improved
more simply. As I understand it, there are three operations that need
free space:

1. INSERT/COPY
2. UPDATE where there is insufficient space on the page of the
old row
3. UPDATE where there is sufficient page space

Right.

The third option already can use the existing page for a HOT update, so
the FSM doesn't matter.

I agree.

All good questions. Thank you for diving in on this.

For 1 & 2, suppose you have a table with 10 8k
pages, all 80% full. As I understand it, the first page will go from
80% to 81%, then the next row addition will take the second page from
80% to 81%, until all pages are 81%, and then it starts over again. Is
that accurate?

No. Generally backends have their own target block (a simple
BlockNumber) that they store in the relcache, one per recently
modified table. Backends/heapam use RelationGetTargetBlock() to access
this local cache. So there is "backend affinity" for particular
individual heap pages, even today. However, that still has many
problems.

It doesn't make sense to have a local cache for a shared resource --
that's the problem. You actually need some kind of basic locking or
lease system, so that 10 backends don't all decide at the same time
that one particular heap block is fully empty, and therefore a good
target block for that backend alone. It's as if the backends believe
that they're special little snowflakes, and that no other backend
could possibly be thinking the same thing at the same time about the
same heap page. And so when TPC-C does its initial bulk insert,
distinct orders are already shuffled together in a hodge-podge, just
because concurrent bulk inserters all insert on the same heap pages.
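The effect is easy to reproduce with a toy simulation. All the
mechanics here are assumptions made for illustration: every backend
greedily picks the first page with room, and strict round-robin
interleaving stands in for concurrency:

```python
def simulate_inserts(nbackends, rows_per_backend, page_capacity):
    pages = []  # each page is a list of inserting-backend ids
    for _ in range(rows_per_backend):
        for backend in range(nbackends):
            # every backend makes the same greedy choice, with no
            # coordination -- the "special little snowflake" problem
            target = next((p for p in pages if len(p) < page_capacity), None)
            if target is None:
                target = []
                pages.append(target)
            target.append(backend)
    return pages

# With 4 concurrent inserters and 8-row pages, every single page ends
# up with rows from all 4 backends shuffled together -- no page is
# "owned" by any one inserter.
```

Rows that the application thinks of as related (one backend's
transaction) end up smeared across every page, which is exactly the
hodge-podge described above.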

If we could safely give out 10 or 50 pages directly to each backend,
then clearly the related data would be stored together in this
scenario -- each backend would be able to work through its lease of
contiguous heap pages for quite a while before revisiting the
FSM/relation extension. The trick is to teach the implementation to do
that without allowing the backend to permanently leak whole entire
leases with maybe dozens of free pages.

Systems like DB2 and Oracle probably can't have this problem. Even the
bulk lease case is just an extension of something they had to do
anyway. You see, some kind of FSM lease system that knows about
transaction lifetime for any given piece of free space is strictly
necessary with UNDO based rollback. Without that, transaction rollback
breaks in certain edge cases involving concurrent inserts into the
same page, right before a rollback that needs to put an updated
version back in place. If the updated version from undo happens to be
physically larger than the now-aborted successor version, and if there
is little or no space left to cover that, what can rollback do about
it? Nothing. It's a nasty bug. They use heavyweight locks to avoid
this.

Is that the non-temporal memory issue you are concerned
about?

When I talk about memory or time, what I'm usually referring to is the
ability to manage larger allocations of multiple blocks for a while
after they're originally requested. So that a requesting
transaction/backend is given the opportunity to use the space that
they asked for. They shouldn't be completely trusted. We should trust
but verify.

Doesn't spreading the new rows out increase the ability to do
HOT updates?

It's true that the problem scenario with TPC-C does involve updates,
and it's true that that's when things really go badly. But I
deliberately haven't gone into the role that the updates play there
too much. Not just yet.

It's important to note that the problems really do start when we
insert, even if that's not obvious -- that's when the locality is
lost, right when the original order transaction comes in. We lose
locality just because we have concurrent inserters into the same
table, where each transaction inserts multiple related rows. We fail
to group related rows together right from the start, setting us up for
failure later on.

This is actually quite simple -- the chaos that follows is where it
gets truly complicated (and not necessarily worth getting into just
yet). The update of a given order and its order lines entries takes
place hours later, within a delivery transaction.

Another concern is knock on effects *after* any initial problematic
updates -- that's certainly not where it ends. Every action has
consequences, and these consequences themselves have consequences. By
migrating a row from an earlier block into a later block (both
physically and temporally early/late), we're not just changing things
for that original order row or rows (the order non-HOT-updated by the
delivery transaction). Since the old/original version left behind by a
delivery transaction will itself get vacuumed, that space eventually
goes back into the FSM. And so this same block will get some entirely
unrelated version (could be a new insert or update). It's systematic.

We dare not waste a tiny amount of free space. Surely we should cut
our losses at some point and accept that we're "wasting space", but
there is never any natural backpressure. Nothing to break the cycle of
fragmentation. (This is another thing involving time, not quite
related to my central point about free space leases.)

--
Peter Geoghegan

#4 Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#3)
Re: The Free Space Map: Problems and Opportunities

On Mon, Aug 16, 2021 at 5:15 PM Peter Geoghegan <pg@bowt.ie> wrote:

Another concern is knock on effects *after* any initial problematic
updates -- that's certainly not where it ends. Every action has
consequences, and these consequences themselves have consequences. By
migrating a row from an earlier block into a later block (both
physically and temporally early/late), we're not just changing things
for that original order row or rows (the order non-HOT-updated by the
delivery transaction).

To be clear, TPC-C looks like this: each order row and each order line
row will be inserted just once, and then later updated just once. But
that's it, forever -- no more modifications. Both tables grow and grow
for as long as the benchmark runs. Users with workloads like this will
naturally expect performance to remain steady over time, even over
days or weeks. But that's not what we see.

What we actually see is that the FSM can never quite resist the
temptation to dirty older pages that should be aging out. And so
performance degrades over days and weeks -- that's how long Jan has
said that it can take.

The FSM does have a bias in favor of using earlier pages over later
pages, in order to make relation truncation by VACUUM more
likely in general. That bias certainly isn't helping us here, and
might be another thing that hurts us. I think that the fundamental
problem is that the FSM just doesn't have any way of recognizing that
its behavior is penny wise, pound foolish. I don't believe that there
is any fundamental reason why Postgres can't have *stable* long term
performance for this workload in the way that it's really expected to.
That seems possible within the confines of the existing design for HOT
and VACUUM, which already work very well for the first few hours.

--
Peter Geoghegan

#5 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#1)
Re: The Free Space Map: Problems and Opportunities

Hi,

On 2021-08-16 10:35:45 -0700, Peter Geoghegan wrote:

Problems
--------

The FSM gives out space without considering the passage of time, or
the likelihood that a particular transaction or client will consume
its requested free space in a more or less predictable and steady way
over time. It has no memory. The FSM therefore fails to capture
naturally occurring locality, or at least doesn't specifically care
about it. This is the central problem, that all other problems with
the FSM seem to stem from in one way or another.

I suspect that one other fairly fundamental issue is that we are very
reticent about updating the FSM with information about free space. As
long as that's the case, a smarter placement logic would often not
have a better place to choose from.

Background
----------

To recap, the FSM tracks how much free space is in every heap page.
Most FSM maintenance takes place during VACUUM, though ordinary
connection backends occasionally help to keep the FSM current, via
calls to RecordAndGetPageWithFreeSpace() made from hio.c.

There is also a bulk extension mechanism added by commit 719c84c1be,
which is used when we detect multiple concurrent inserters. This bulk
extension mechanism does change things a little (it gives some thought
to systematic effects), though it still has the same basic issues: it
doesn't do nearly enough to group likely-related rows together. I
suspect that improving the bulk extension mechanism will be an
important part of improving the overall picture.

I think the design of the current bulk extension mechanism is quite backward
(it only takes contention into account not other needs for bulk work, it does
expensive stuff like finding victim buffers under lock, it doesn't release
locks until all blocks are extended, ...). But: I don't think a bulk
extension mechanism should be tasked with doing grouping or anything like
that.

Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
ever end up inserting *any* new logical rows on a heap page after the
page first fills -- it is initially "open" when first allocated, and
then quickly becomes "closed" to inserts of new logical rows, once it
crosses a certain almost-empty space threshold. The heap page usually
stays "closed" forever. While the design does not *completely* ignore
the possibility that the page will eventually get so empty that reuse
really does look like a good idea, it makes it rare. A heap page needs
to have rather a lot of the original logical rows deleted before that
becomes possible again. It's rare for it to go backwards because the
threshold that makes a closed page become open again is much lower
(i.e. involves much more free space) than the threshold that initially
made a (newly allocated) heap page closed. This one-way stickiness
quality is important.

I suspect that we'd get a *LOT* of complaints if we introduced that degree of
stickiness.

This stickiness concept is called "hysteresis" by some DB researchers,
often when discussing UNDO stuff [8]. Having *far less* granularity
than FSM_CATEGORIES/255 seems essential to make that work as intended.
Pages need to be able to settle without being disturbed by noise-level
differences. That's all that super fine grained values buy you: more
noise, more confusion.

Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit
from a fairly granular FSM_CATEGORIES, while still achieving better
locality. You could e.g. search for free space for an out-of-page update in
nearby pages with a lower free-percentage threshold, while having a different
threshold and selection criteria for placement of inserts.
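The two-threshold idea Andres describes could look something like the sketch below. Every name and threshold value here is invented for illustration; nothing corresponds to actual Postgres code. The point is only that an out-of-page UPDATE can tolerate much less free space if the candidate page is near the old tuple (preserving locality), while fresh INSERTs use a stricter bar:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative thresholds, not taken from any real implementation. */
#define UPDATE_NEARBY_MIN_FREE  0.10    /* accept 10% free if page is nearby */
#define INSERT_MIN_FREE         0.30    /* inserts want emptier pages */
#define NEARBY_DISTANCE         16      /* blocks considered "nearby" */

static bool
page_acceptable(uint32_t candidate_blk, uint32_t old_blk,
                double free_frac, bool is_update)
{
    if (is_update)
    {
        uint32_t dist = candidate_blk > old_blk ?
            candidate_blk - old_blk : old_blk - candidate_blk;

        /* keep update locality: a nearby page gets a lower bar */
        if (dist <= NEARBY_DISTANCE)
            return free_frac >= UPDATE_NEARBY_MIN_FREE;
    }
    /* inserts (and far-away updates) use the stricter threshold */
    return free_frac >= INSERT_MIN_FREE;
}
```

Under this scheme a fairly granular FSM_CATEGORIES is still useful, since the two code paths read the same stored value against different cutoffs.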

Visibility map
--------------

If the logical database and natural locality are important to the FSM,
then what about the visibility map? And what about the relationship
between the FSM and the visibility map, in light of all this?

Currently VACUUM doesn't care about how its FSM behavior interacts
with how it sets all-frozen/all-visible bits for the same heap page.
To me this seems completely unreasonable -- they're *obviously*
related! We're probably only gaining a minimal amount of free space on
one occasion by ignoring the VM/FSM relationship, for which we pay a
high cost. Worst of all we're *perpetuating the cycle* of dirtying and
redirtying the same pages over time. Maybe we should go as far as
merging the FSM and VM, even -- that seems like a natural consequence
of logical-ish/qualitative definitions of "page is full".

The density of the VM is fairly crucial for efficient IOS, so I'm doubtful
merging FSM and VM is a good direction to go to. Additionally we need to make
sure that the VM is accurately durable, which isn't the case for the FSM
(which I think we should use to maintain it more accurately).

But perhaps I'm just missing something here. What is the strong interlink
between FSM and all-frozen/all-visible you see on the side of VACUUM? ISTM
that the main linkage is on the side of inserts/update that choose a target
page from the FSM. Isn't it better to make that side smarter?

Imagine if the FSM tracked recent free space requests that have been
satisfied already. Requesting backends will maintain this same
metadata when they burn through their bulk allocation of reserved
space/heap pages. This new structure would contain metadata like the
requesting XID/backend for a given lease of heap pages, the number of
leases during the last bulk extension operation, and maybe the number
of concurrently waiting backends at that time. Now we have some sense
of recent history, and start to recognize trends over time. The FSM is
now able to *take back* some of the space that it gives out, based on
new information. Now the FSM can afford to make mistakes because the
mistakes can always be corrected a little later. The FSM can be
generous based on speculation, heuristics, whatever it might be. It
should be possible for the FSM to have it both ways, more or less.

Hm. To me this seems like a somewhat separate layer from the FSM. Having that
kind of data go through shared buffers, with a need to worry about on-disk
consistency/cleanup etc, the inability to add granular locks on a sub-page
level, the need to go through buffer mapping / pinning all seem like things
you wouldn't actually want.

To me this seems like it'd be better addressed by a shared, per-relfilenode,
in-memory datastructure. Thomas Munro has been working on keeping accurate
per-relfilenode relation size information. ISTM that that'd be a better
place to hook in for this.

would be quite natural for the leader backend to say to each individual
follower backend: thank you for waiting for me, here is a big contiguous
range of heap pages (e.g., as many as 50 heap pages or more per xact when
things truly ramp up), which should be enough to allow you to avoid going to
the FSM again for quite a long time.

That'd end up badly if we did for a relation that's hotly inserted into by a
lot of connections, but where each connection only inserts a few rows. I
suspect we'd either need information from the inserting backends about the
number of pages they're likely going to need (e.g. a smarter "extension queue"
where each entry lists the number of pages requested) and/or stats about the
number of consecutive insertions a relation typically gets.

But the leader backend doesn't do that at all -- it just puts all of
the pages (up to 512 total) in the FSM. All of the waiting follower
backends are left to senselessly fight it out for free space by simply
going back to the FSM when they wake up again. Why can't they
cooperate intelligently? What stops that today?

The lack of suitable in-memory coordination facilities.

It turns out that I looked a bunch at this area because of AIO + DIO: With DIO
the typical linux filesystems (ext4, xfs) do not perform any delayed
allocation, which means that the non-bulk extension leads to *horrible*
filesystem fragmentation. And even the bulk paths don't allocate enough blocks
at once to avoid fragmentation with common amounts of concurrency. But just
increasing the degree of bulk extension isn't helpful - it just exacerbates
already bad thundering herd issues: all backends will decide to try to extend
at the same time, all but one will wait, one backend will extend and then wake
the rest, and they all then contend for the same locks, etc.

I addressed (or rather sidestepped) this to a small degree in the AIO+DIO
patchset, by making extension faster and doing a few improvements around
locking. But it's not good enough.

If the relation extension infrastructure instead had information about the
number of blocks each waiter wants and we had Thomas' "accurate shared
relation size" infrastructure, we could do a lot better:

1) Whenever a backend needs new space and most of the previously bulk-extended
space is used, opportunistically try to bulk-extend. In the happy path this
avoids the thundering herd issue because other backends can continue
inserting while extension is in progress.

2) Whenever bulk extending, release pages to individual backends by looking in
the queue of backends needing extension and wake up backends individually
whenever we've bulk extended sufficiently for that backend.

This provides two main benefits: First, it avoids all backends trying to
acquire exactly the same resources just after waking up, without being able
to benefit (because the first backend will often just fill the
page). Second, it improves average latency during extension substantially,
because individual backends can start continuing as soon as bulk extension
has progressed enough for some backend, rather than needing to wait for the
entire extension.
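Point 2 above could be sketched roughly as follows. The struct and function names are hypothetical (the real extension queue, locking, and wakeup machinery are far more involved); the sketch only shows the core idea of recording per-waiter block counts and satisfying waiters individually, in order, as extension progresses:

```c
/*
 * Hypothetical sketch of an extension queue where each waiter records how
 * many blocks it wants.  The extending backend hands out block ranges and
 * (in real code) would wake each waiter as soon as its range is ready,
 * instead of waking everyone at once to fight over the same pages.
 */
typedef struct
{
    int nwanted;            /* blocks this backend asked for */
    int first_block;        /* filled in when the grant happens */
    int granted;            /* has this waiter been satisfied yet? */
} ExtensionWaiter;

/*
 * Distribute [first_new_block, first_new_block + nblocks) among waiters
 * in queue order.  Returns the number of waiters fully satisfied.
 */
static int
grant_extension(ExtensionWaiter *queue, int nwaiters,
                int first_new_block, int nblocks)
{
    int next = first_new_block;
    int satisfied = 0;

    for (int i = 0; i < nwaiters; i++)
    {
        if (queue[i].nwanted > nblocks)
            break;              /* not enough extended yet; extend more */
        queue[i].first_block = next;
        queue[i].granted = 1;   /* real code: wake just this backend */
        next += queue[i].nwanted;
        nblocks -= queue[i].nwanted;
        satisfied++;
    }
    return satisfied;
}
```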

FSM "rollback"
--------------

(This is where my argument comes together, finally.)

FSM rollback in the FSIP paper is really just about reliable ownership
semantics. Ownership of free space in heap pages that is scoped at the
level of transactions.

It seems to me that ownership of free space in heap pages by
particular transactions is really all that the bulk extension FSM
logic is missing -- and getting that right seems like a big part of
fixing the FSM comprehensively. The bulk extension mechanism cannot
*trust* the backends to use more than a little space at a time today.
While on average each backend's use of free space probably *is*
predictable over time, it only takes one or two straggler backends per
bulk operation to mess everything up -- that's enough for the system
to end up with very poor space utilization. Since we don't have any
sense of ownership of space in heap pages by transactions today, the
code must hedge against stragglers/space leaks by making the space
*generally* available to all. Of course, this hedging comes at a high
cost: it prevents the whole system (heapam, FSM, and backends) from
capturing naturally occurring locality in a variety of important
cases. Including the important TPC-C case I mentioned.

I think the fundamental issue with bulk extension is more that bulk
extension shouldn't actually get pages from the FSM at all. That's just an
ugly implementation detail of the hack we have currently, rather than
something that should live on. We may want to decide to continue inserting
bulk extended pages into the FSM to handle edge cases of backends erroring out
before space is used, but the happy path should not touch the FSM at all.

Explicit scoped ownership of free space in heap pages removes the need
to hedge. It addresses the problem of would-be space leaks directly,
while *also* fixing the FSM locality problems implicitly. Metadata
about recent requests gives us the *context* needed to do all this
well.

It sounds like you're thinking of using some global history, or at least a
backend local history? I think it might be better to have backend local,
per-command state. E.g. for a COPY aggressively increase the amount of space
reserved, but not for a single-row insert.
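The "scoped ownership" notion being debated here could take a shape like the following. All names are hypothetical, invented purely to make the ownership semantics concrete; nothing like this exists in Postgres today. The FSIP-style "rollback" falls out naturally: leased blocks become reusable by others once the owning transaction is no longer running, so stragglers can't leak space forever:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Hypothetical "free space lease": a contiguous block range handed to one
 * transaction, which other backends must not touch until the owner
 * commits, aborts, or explicitly releases it.
 */
typedef struct
{
    TransactionId owner_xid;    /* transaction holding the lease    */
    uint32_t      first_block;  /* start of the leased block range  */
    uint32_t      nblocks;      /* length of the range              */
    bool          released;     /* returned to the free pool?       */
} FreeSpaceLease;

/*
 * A block becomes reusable by others once the lease is released or the
 * owning transaction is gone ("rollback" in the FSIP sense: leaked space
 * is reclaimed by noticing the owner is no longer running).
 */
static bool
lease_block_reusable(const FreeSpaceLease *l, uint32_t block,
                     bool owner_still_running)
{
    if (block < l->first_block || block >= l->first_block + l->nblocks)
        return true;            /* outside the lease entirely */
    return l->released || !owner_still_running;
}
```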

I now wonder if the FSM is fundamentally doing the wrong thing by
keeping track of all "free space" in every page over time. Why
wouldn't we just have a free space management strategy that is
generally only concerned with recent events?

I guess it depends on what you define as recent, but I don't see a way to
bound the timeframe usefully. We need to deal with cases where table sizes go
up and down over time, e.g. because there's periodic batch activity and
ongoing activity. What's the time bound for unused space somewhere in a
relation? I don't see any, unless we add compaction logic to vacuum.

Greetings,

Andres Freund

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#5)
Re: The Free Space Map: Problems and Opportunities

On Tue, Aug 17, 2021 at 9:18 AM Andres Freund <andres@anarazel.de> wrote:

Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
ever end up inserting *any* new logical rows on a heap page after the
page first fills -- it is initially "open" when first allocated, and
then quickly becomes "closed" to inserts of new logical rows, once it
crosses a certain almost-empty space threshold. The heap page usually
stays "closed" forever. While the design does not *completely* ignore
the possibility that the page will eventually get so empty that reuse
really does look like a good idea, it makes it rare. A heap page needs
to have rather a lot of the original logical rows deleted before that
becomes possible again. It's rare for it to go backwards because the
threshold that makes a closed page become open again is much lower
(i.e. involves much more free space) than the threshold that initially
made a (newly allocated) heap page closed. This one-way stickiness
quality is important.

I suspect that we'd get a *LOT* of complaints if we introduced that degree of
stickiness.

I don't know what the right degree of stickiness is, but I can easily
believe that we want to have more than none. Part of Peter's point, at
least as I understand it, is that if a page has 100 tuples and you
delete 1 or 2 or 3 of them, it is not smart to fill up the space thus
created with 1 or 2 or 3 new tuples. You should instead hope that
space will optimize future HOT updates, or else wait for the page to
lose some larger number of tuples and then fill up all the space at
the same time with tuples that are, hopefully, related to each other
in some useful way. Now what's the threshold? 20 out of 100? 50? 80?
That's not really clear to me. But it's probably not 1 out of 100.
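Whatever the right numbers turn out to be, the one-way stickiness itself is easy to model. Below is a toy version of the open/closed hysteresis under discussion; the two thresholds are invented for illustration (close at ~90% full, reopen only below ~50% full), and the gap between them is exactly what stops noise-level changes from flipping a page's state:

```c
#include <stdbool.h>

typedef struct
{
    bool   closed;
    double used_frac;      /* fraction of page space in use, 0.0 .. 1.0 */
} PageState;

#define CLOSE_AT   0.90    /* open -> closed once this full */
#define REOPEN_AT  0.50    /* closed -> open only below this */

static void
page_update(PageState *p, double used_frac)
{
    p->used_frac = used_frac;
    if (!p->closed && used_frac >= CLOSE_AT)
        p->closed = true;
    else if (p->closed && used_frac < REOPEN_AT)
        p->closed = false;
    /* between the thresholds the state is sticky: nothing changes */
}
```

Deleting 1 or 2 tuples from a full page leaves it between the thresholds, so it stays closed; only a large wave of deletes reopens it.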

This stickiness concept is called "hysteresis" by some DB researchers,
often when discussing UNDO stuff [8]. Having *far less* granularity
than FSM_CATEGORIES/255 seems essential to make that work as intended.
Pages need to be able to settle without being disturbed by noise-level
differences. That's all that super fine grained values buy you: more
noise, more confusion.

Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit
from a fairly granular FSM_CATEGORIES, while still achieving better
locality. You could e.g. search for free space for an out-of-page update in
nearby pages with a lower free-percentage threshold, while having a different
threshold and selection criteria for placement of inserts.

Yeah, I don't think that reducing FSM_CATEGORIES is likely to have
much value for its own sake. But it might be useful as a way of
accomplishing some other goal. For example if we decided that we need
a bit per page to track "open" vs. "closed" status or something of
that sort, I don't think we'd lose much by having fewer categories.

The density of the VM is fairly crucial for efficient IOS, so I'm doubtful
merging FSM and VM is a good direction to go to. Additionally we need to make
sure that the VM is accurately durable, which isn't the case for the FSM
(which I think we should use to maintain it more accurately).

But perhaps I'm just missing something here. What is the strong interlink
between FSM and all-frozen/all-visible you see on the side of VACUUM? ISTM
that the main linkage is on the side of inserts/update that choose a target
page from the FSM. Isn't it better to make that side smarter?

I don't have a well-formed opinion on this point yet. I am also
skeptical of the idea of merging the FSM and VM, because it seems to
me that which pages are all-visible and which pages are full are two
different things. However, there's something to Peter's point too. An
empty page is all-visible but still valid as an insertion target, but
this is not as much of a contradiction to the idea of merging them as
it might seem, because marking the empty page as all-visible is not
nearly as useful as marking a full page all-visible. An index-only
scan can't care about the all-visible status of a page that doesn't
contain any tuples, and we're also likely to make the page non-empty
in the near future, in which case the work we did to set the
all-visible bit and log the change was wasted. The only thing we're
accomplishing by making the page all-visible is letting VACUUM skip
it, which will work out to a win if it stays empty for multiple VACUUM
cycles, but not otherwise.

Conceptually, you might think of the merged data structure as a
"quiescence map." Pages that aren't being actively updated are
probably all-visible and are probably not great insertion targets.
Those that are being actively updated are probably not all-visible and
have better chances of being good insertion targets. However, there
are a lot of gremlins buried in the word "probably." A page could get
full but still not be all-visible for a long time, due to long-running
transactions, and it doesn't seem likely to me that we can just ignore
that possibility. Nor on the other hand does the difference between an
old-but-mostly-full page and an old-but-mostly-empty page seem like
something we just want to ignore. So I don't know how a merged data
structure would actually end up being an improvement, exactly.

To me this seems like it'd be better addressed by a shared, per-relfilenode,
in-memory datastructure. Thomas Munro has been working on keeping accurate
per-relfilenode relation size information. ISTM that that'd be a better
place to hook in for this.

+1. I had this same thought reading Peter's email. I'm not sure if it
makes sense to hook that into Thomas's work, but I think it makes a
ton of sense to use shared memory to coordinate transient state like
"hey, right now I'm inserting into this block, you guys leave it
alone" while using the disk for durable state like "here's how much
space this block has left."

--
Robert Haas
EDB: http://www.enterprisedb.com

#7 Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#3)
Re: The Free Space Map: Problems and Opportunities

On Mon, Aug 16, 2021 at 05:15:36PM -0700, Peter Geoghegan wrote:

It doesn't make sense to have a local cache for a shared resource --
that's the problem. You actually need some kind of basic locking or
lease system, so that 10 backends don't all decide at the same time
that one particular heap block is fully empty, and therefore a good
target block for that backend alone. It's as if the backends believe
that they're special little snowflakes, and that no other backend
could possibly be thinking the same thing at the same time about the
same heap page. And so when TPC-C does its initial bulk insert,
distinct orders are already shuffled together in a hodge-podge, just
because concurrent bulk inserters all insert on the same heap pages.

OK, I am trying to think of something simple we could test to see the
benefit, with few downsides. I assume the case you are considering is
that you have a 10 8k-page table, and one page is 80% full and the
others are 81% full, and if several backends start adding rows at the
same time, they will all choose the 80%-full page.

What if we change how we select pages with this:

1. find the page with the most free space
2. find all pages with up to 10% less free space than page #1
3. count the number of pages in #2
4. compute proc_id modulo the count from step #3, and use the result as
   an offset into the list of pages from step #2

For example:

1. page with most freespace is 95% free
2. pages 2,4,6,8,10 have between 86%-95% free
3. five pages
4. proc id 14293 % 5 = 3, so use the page at offset 3 from step #2, page 8

This should spread out page usage to be more even, but still favor pages
with more freespace. Yes, this is simplistic, but it would seem to have
few downsides and I would be interested to see how much it helps.
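The four steps above transcribe fairly directly into code. This is only a sketch of the proposal (a real version would work against the FSM's stored categories rather than exact fractions, and `choose_page` is a made-up name):

```c
/*
 * Bruce's heuristic: among the pages within 10 percentage points of the
 * emptiest page's free space, use the backend's proc id to spread
 * concurrent inserters across the candidate set.
 */
static int
choose_page(const double *free_frac, int npages, int proc_id)
{
    double best = -1.0;
    int    ncand = 0;

    /* step 1: find the page with the most free space */
    for (int i = 0; i < npages; i++)
        if (free_frac[i] > best)
            best = free_frac[i];

    /* steps 2-3: count pages within 10 percentage points of the best */
    for (int i = 0; i < npages; i++)
        if (free_frac[i] >= best - 0.10)
            ncand++;

    /* step 4: proc id modulus spreads backends across the candidates */
    int target = proc_id % ncand;
    for (int i = 0; i < npages; i++)
    {
        if (free_frac[i] >= best - 0.10)
        {
            if (target == 0)
                return i;
            target--;
        }
    }
    return -1;                  /* unreachable: ncand >= 1 */
}
```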

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#8 John Naylor
john.naylor@enterprisedb.com
In reply to: Peter Geoghegan (#1)
Re: The Free Space Map: Problems and Opportunities

On Mon, Aug 16, 2021 at 1:36 PM Peter Geoghegan <pg@bowt.ie> wrote:

Open and closed pages
---------------------

This stickiness concept is called "hysteresis" by some DB researchers,
often when discussing UNDO stuff [8]. Having *far less* granularity
than FSM_CATEGORIES/255 seems essential to make that work as intended.
Pages need to be able to settle without being disturbed by noise-level
differences. That's all that super fine grained values buy you: more
noise, more confusion.

I'm not sure it's essential to have "far" fewer categories, if the
closed-to-open transition is made less granular through some other
mechanism. We can certainly get by with fewer categories, freeing up bits
-- it seems we'd need at least one bit to track a block's open-close state.

Visibility map
--------------

If the logical database and natural locality are important to the FSM,
then what about the visibility map? And what about the relationship
between the FSM and the visibility map, in light of all this?

Currently VACUUM doesn't care about how its FSM behavior interacts
with how it sets all-frozen/all-visible bits for the same heap page.
To me this seems completely unreasonable -- they're *obviously*
related! We're probably only gaining a minimal amount of free space on
one occasion by ignoring the VM/FSM relationship, for which we pay a
high cost. Worst of all we're *perpetuating the cycle* of dirtying and
redirtying the same pages over time. Maybe we should go as far as
merging the FSM and VM, even -- that seems like a natural consequence
of logical-ish/qualitative definitions of "page is full".

[...]

I now wonder if the FSM is fundamentally doing the wrong thing by
keeping track of all "free space" in every page over time. Why
wouldn't we just have a free space management strategy that is
generally only concerned with recent events? If we find a way to make
almost all newly allocated heap pages become "closed" quickly (maybe
they're marked all-frozen quickly too), and make sure that that
condition is sticky, then this can work out well. We may not even need
to store explicit freespace information for most heap pages in the
database -- a page being "closed" can be made implicit by the FSM
and/or VM. Making heap pages age-out like this (and mostly stay that
way over time) has obvious benefits in many different areas.

The second paragraph here is an interesting idea and makes a great deal of
sense. It would lead to smaller FSMs that are navigated more quickly and
locked for shorter durations.

Implicit "closure" seems riskier in my view if you want to bring VM
qualities into it, however. Currently, setting an all-visible or all-frozen
flag must be correct and crash-safe, but clearing those is just a lost
optimization. If either of those qualities are implicit by lack of
reference, it seems more vulnerable to bugs.

On Tue, Aug 17, 2021 at 12:48 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Aug 17, 2021 at 9:18 AM Andres Freund <andres@anarazel.de> wrote:

To me this seems like it'd be better addressed by a shared, per-relfilenode,
in-memory datastructure. Thomas Munro has been working on keeping accurate
per-relfilenode relation size information. ISTM that that'd be a better
place to hook in for this.

+1. I had this same thought reading Peter's email. I'm not sure if it
makes sense to hook that into Thomas's work, but I think it makes a
ton of sense to use shared memory to coordinate transient state like
"hey, right now I'm inserting into this block, you guys leave it
alone" while using the disk for durable state like "here's how much
space this block has left."

This makes sense as well. Shared memory for more recent / highly contended
state, and disk for less recent / less contended / stickier state. This
also might have the advantage of smaller, more focused projects from a
coding standpoint.

--
John Naylor
EDB: http://www.enterprisedb.com

#9 Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#5)
Re: The Free Space Map: Problems and Opportunities

On Tue, Aug 17, 2021 at 6:18 AM Andres Freund <andres@anarazel.de> wrote:

I suspect that one other fairly fundamental issue is that we are very
reticent about updating the FSM with information about free space. As long as
that's the case a smarter placement logic would often not have a better
place to choose from.

I think so too. But that is made a lot harder by the supposed need to
represent the amount of freespace in FSM_CATEGORIES (255) distinct
increments. As opposed to maybe 4 distinct increments per page, like
in other systems. You only have to update the externally stored
metadata when a threshold is crossed.

(This isn't the main reason why I don't like that; more on that later.)
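Peter's "4 distinct increments" point can be made concrete with a small sketch. The category names and boundaries below are invented for illustration (and the real FSM's categories work differently); the key property is that stored metadata only needs rewriting when a boundary is crossed, so byte-level noise never triggers an update:

```c
#include <stdbool.h>

/* Four coarse categories; boundaries are purely illustrative. */
typedef enum
{
    SPC_FULL = 0,       /* page is closed to new rows */
    SPC_LOW,
    SPC_MEDIUM,
    SPC_HIGH            /* page is (nearly) empty */
} SpaceCategory;

static SpaceCategory
categorize(int free_bytes, int page_size)
{
    double frac = (double) free_bytes / page_size;

    if (frac < 0.10)
        return SPC_FULL;
    if (frac < 0.35)
        return SPC_LOW;
    if (frac < 0.70)
        return SPC_MEDIUM;
    return SPC_HIGH;
}

/*
 * Externally stored metadata only needs updating when a category
 * boundary is crossed; noise-level changes are ignored.
 */
static bool
fsm_update_needed(int old_free, int new_free, int page_size)
{
    return categorize(old_free, page_size) !=
           categorize(new_free, page_size);
}
```

With 255 categories, by contrast, almost every tuple insert or delete moves the page to a different category, so the metadata is in principle always stale or always being rewritten.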

I think the design of the current bulk extension mechanism is quite backward
(it only takes contention into account not other needs for bulk work, it does
expensive stuff like finding victim buffers under lock, it doesn't release
locks until all blocks are extended, ...). But: I don't think a bulk
extension mechanism should be tasked with doing grouping or anything like
that.

Maybe I'm just drawing the boundaries differently in my head, which
doesn't seem like a real difference of opinion.

Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
ever end up inserting *any* new logical rows on a heap page after the
page first fills -- it is initially "open" when first allocated, and
then quickly becomes "closed" to inserts of new logical rows, once it
crosses a certain almost-empty space threshold.

I suspect that we'd get a *LOT* of complaints if we introduced that degree of
stickiness.

Maybe. It would be very different to our current behavior, which would
undoubtedly have many consequences. Maybe this will be a big problem,
maybe it barely matters. I make no specific claims about it, either
way. Not just yet.

Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit
from a fairly granular FSM_CATEGORIES, while still achieving better
locality. You could e.g. search for free space for an out-of-page update in
nearby pages with a lower free-percentage threshold, while having a different
threshold and selection criteria for placement of inserts.

Again, it's all about *systemic* effects. A FSM request is basically a
choice *among* blocks that could in principle satisfy the request --
often the number of basically-satisfactory blocks is huge (way too
large, even). You have to bear in mind that concurrent requests are in
competition with each other in much of the time. They are usually all
looking for exactly the same thing (e.g., same free space
requirement), or close to the same thing -- tuples within a given
table tend to all be about as wide as each other.

I can clearly sometimes see very high contention cases, where backends
do way too much spinning inside the RelationGetBufferForTuple() loop.
They're all more or less chasing the same scraps of free space, in a
highly inefficient way. Even though there is probably an abundance of
free space. Right now the heap pages that have the most free space are
also the ones that are least likely to be used.

Though I think that these backends should cooperate more, some amount
of competition is probably necessary. If FSM_CATEGORIES used 3 bits
instead of 8, then the backends could fall back on caring about some
other thing that distinguished heap pages from each other that
actually matters. Leading to less contention, and maybe better space
utilization.

I have presented this FSM_CATEGORIES contention issue as a distinct
problem to the first one (the locality problem), though honestly that
was a bit arbitrary -- it's just blurry. Thank you for putting up with
that.

The density of the VM is fairly crucial for efficient IOS, so I'm doubtful
merging FSM and VM is a good direction to go to. Additionally we need to make
sure that the VM is accurately durable, which isn't the case for the FSM
(which I think we should use to maintain it more accurately).

I explained this part badly. I don't think that we should literally
unite the VM and FSM data structures. I was really just pointing out
that a logical-ish notion of a page being full (to use the term of art
I've seen, a "closed" page) is almost the same thing as a page that is
marked all-visible/all-frozen. Presumably VACUUM does that with the
optimistic assumption that it will probably never again need to touch
the same page. And while that can never be guaranteed, it certainly
seems like we might want to weigh things in favor of that happening.

I'm pretty sure that carelessly remembering 200 bytes of free space in
the FSM for an all-frozen page is practically guaranteed to be a bad
idea. While I don't have a great idea of where the break-even point is
just yet, I am prepared to say that I'm sure that it's not 0 bytes,
which is how it works today. A squishy logical definition of "page is
full" seems pretty natural to me.

But perhaps I'm just missing something here. What is the strong interlink
between FSM and all-frozen/all-visible you see on the side of VACUUM? ISTM
that the main linkage is on the side of inserts/update that choose a target
page from the FSM. Isn't it better to make that side smarter?

I'll respond to Robert about this separately.

To me this seems like it'd be better addressed by a shared, per-relfilenode,
in-memory datastructure. Thomas Munro has been working on keeping accurate
per-relfilenode relation size information. ISTM that that'd be a better
place to hook in for this.

Actually, that's the approach that I'm taking in my current
prototyping. I would like to have a WAL-logged component to this as
well, but I haven't got that far.

I think that it's possible that I'll eventually conclude that the
whole idea of a FSM is just the wrong idea. Remembering anything at
all about most pages may just be unnecessary in a world where we fix
everything. Maybe the end result is that the common case is that most
individual heap pages are "closed". Perhaps that can just be implicit
-- there is no information to remember.

would be quite natural for the leader backend to say to each individual
follower backend: thank you for waiting for me, here is a big contiguous
range of heap pages (e.g., as many as 50 heap pages or more per xact when
things truly ramp up), which should be enough to allow you to avoid going to
the FSM again for quite a long time.

That'd end up badly if we did for a relation that's hotly inserted into by a
lot of connections, but where each connection only inserts a few rows. I
suspect we'd either need information from the inserting backends about the
number of pages they're likely going to need (e.g. a smarter "extension queue"
where each entry lists the number of pages requested) and/or stats about the
number of consecutive insertions a relation typically gets.

That's a part of it, certainly. The heuristics would mostly be driven
by recent information. In particular, we would probably only ramp
things up based on a clear observed failure to keep up with demand. We
could be very aggressive indeed this way, without too much extra risk.
We'd only ramp up because we *knew* that the most recent batch/set of
free lists for the last bulk request was inadequate.

For that you need the context, the recent history.

But the leader backend doesn't do that at all -- it just puts all of
the pages (up to 512 total) in the FSM. All of the waiting follower
backends are left to senselessly fight it out for free space by simply
going back to the FSM when they wake up again. Why can't they
cooperate intelligently? What stops that today?

The lack of suitable in-memory coordination facilities.

It turns out that I looked a bunch at this area because of AIO + DIO: With DIO
the typical linux filesystems (ext4, xfs) do not perform any delayed
allocation, which means that the non-bulk extension leads to *horrible*
filesystem fragmentation.

It's the coordination facilities -- agreed. The only way that I seem
to be going further than what you've said here is the logical database
stuff. That is, I intend to specifically involve metadata like XIDs,
as well as a general understanding of things like commit/abort
boundaries for said XIDs. Ownership of a free space lease shouldn't be
strictly tied to an XID, but that seems like a good starting point to
me. This is the kind of thing that an OS engineer would never think
of, that other database systems seem to do a lot of.

This design involving waiting on XIDs could also be reused for nbtree
page deletion, where page recycling is deferred (same with GiST page
deletion). We could model this problem as: the
physically-deleted-by-VACUUM page's safexid value should be treated as
"the maybe-still-ongoing transaction that owns this page". This is
kind of a lie, but close enough to the truth. That way we can put
every deleted page in the FSM/whatever immediately, while reusing the
same XID-wait stuff to solve the recycle safety problem.

We push the recycle safety stuff on to the consumer side this way.
This is currently a problem that mostly belongs to VACUUM itself (the
producer side) -- which makes zero sense. I think that the only reason
we do it that way is because the current FSM doesn't care at all about
XID boundaries. It's almost the same problem, maybe even exactly the
same.
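A minimal sketch of consumer-side recycle safety, with the deleted page's safexid treated as "the transaction that owns this page" (all names invented for illustration):

```python
# Hypothetical sketch: VACUUM registers deleted pages immediately,
# and the recycle-safety check moves to the consumer side, at
# allocation time, by waiting out the page's safexid.
from collections import deque


class FreePagePool:
    def __init__(self):
        self.pages = deque()  # (block_number, safexid) pairs

    def add_deleted_page(self, blkno, safexid):
        # Producer side (VACUUM): no deferred recycling needed here.
        self.pages.append((blkno, safexid))

    def get_recyclable_page(self, oldest_running_xid):
        # Consumer side: a page is safe to reuse only once no
        # running transaction could still need it.
        for _ in range(len(self.pages)):
            blkno, safexid = self.pages.popleft()
            if safexid < oldest_running_xid:
                return blkno
            self.pages.append((blkno, safexid))  # not yet safe
        return None
```

Usage: the producer never blocks, and unsafe pages simply stay in the pool until the horizon advances:

```python
pool = FreePagePool()
pool.add_deleted_page(7, safexid=100)
pool.get_recyclable_page(oldest_running_xid=90)   # None: not yet safe
pool.get_recyclable_page(oldest_running_xid=101)  # 7: safe to recycle
```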

I addressed (or rather sidestepped) this to a small degree in the AIO+DIO
patchset, by making extension faster and doing a few improvements around
locking. But it's not good enough.

If the relation extension infrastructure instead had information about the
number of blocks each waiter wants and we had Thomas' "accurate shared
relation size" infrastructure, we could do a lot better:

Sounds great to me!

1) Whenever a backend needs new space and most of the previously bulk-extended
space is used opportunistically try to bulk-extend. In the happy path this
avoids the thundering herd issue because other backends can continue
inserting while extension is in progress.

Right -- my thoughts exactly.

2) Whenever bulk extending, release pages to individual backends by looking in
the queue of backends needing extension and wake up backends individually
whenever we've bulk extended sufficiently for that backend.

This is also exactly what I was thinking of, albeit for slightly
different reasons.

I want to more or less say directly to each backend at the point it
wakes up: here you go, here are those pages you asked for. Somebody
might check in about this free space lease later on, but in general
you can think of these pages as your exclusive property.

Making it backend/XID specific is probably helpful in many ways. Like
maybe there is a natural variation in demand among backends, even
though they're all bulk inserting into the same table. Who knows?
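The hand-off described above might look something like this (a toy sketch under invented names, not a real implementation):

```python
# Hypothetical sketch: instead of dumping all newly extended pages
# into the FSM, the extending backend grants each queued waiter the
# block range it asked for, then wakes it with its "free space lease".

def grant_extended_pages(first_new_block, waiters):
    """waiters: list of (backend_id, pages_requested) in queue order.
    Returns a per-backend list of block numbers that the backend can
    treat as its (soft) exclusive property on wakeup."""
    grants = {}
    next_block = first_new_block
    for backend_id, npages in waiters:
        grants[backend_id] = list(range(next_block, next_block + npages))
        next_block += npages
    return grants
```

Each follower wakes up already holding its pages, so there is no senseless fight over the same free space when the herd is released.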

I think the fundamental issue is more that bulk extension shouldn't
actually get pages from the FSM at all. That's just an
ugly implementation detail of the hack we have currently, rather than
something that should live on.

Actually, I agree. I was just trying to use our existing terms to
introduce my ideas.

We may want to decide to continue inserting
bulk extended pages into the FSM to handle edge cases of backends erroring out
before space is used, but the happy path should not touch the FSM at all.

You don't have to touch the FSM if you don't even have one in the
first place. :-)

It sounds like you're thinking of using some global history, or at least a
backend local history? I think it might be better to have backend local,
per-command state. E.g. for a COPY aggressively increase the amount of space
reserved, but not for a single-row insert.

That seems quite possible. Not sure just yet.

This feels like the right general idea to me, even though many of the
details are not worked out.

I guess it depends on what you define as recent, but I don't see a way to
bound the timeframe usefully. We need to deal with cases where table sizes go
up and down over time, e.g. because there's periodic batch activity and
ongoing activity. What's the time bound for unused space somewhere in a
relation? I don't see any, unless we add compaction logic to vacuum.

I don't want to bound the timeframe, exactly. I want to look at what
happened during the last bulk operation, and compare that to what I
see for the current/ongoing operation -- which could involve a fair
amount of details. The idea is to make decisions based on inferences
about what is *changing*. What's working, what's not working -- the
*direction* of things is at least as important as the state of things
at this exact point in time.

This does mean that you have to accept that some number of heap pages
can be "leaked" if inserts against a table that was being
bulk-extended suddenly go away completely. Of course it should be
possible to get this so-called leaked space back, just as soon as the
table receives more inserts (any inserts should do that). My current
best guess is that this will work fine in practice.

--
Peter Geoghegan

In reply to: Robert Haas (#6)
Re: The Free Space Map: Problems and Opportunities

On Tue, Aug 17, 2021 at 9:48 AM Robert Haas <robertmhaas@gmail.com> wrote:

I don't know what the right degree of stickiness is, but I can easily
believe that we want to have more than none. Part of Peter's point, at
least as I understand it, is that if a page has 100 tuples and you
delete 1 or 2 or 3 of them, it is not smart to fill up the space thus
created with 1 or 2 or 3 new tuples. You should instead hope that
space will optimize future HOT updates, or else wait for the page to
lose some larger number of tuples and then fill up all the space at
the same time with tuples that are, hopefully, related to each other
in some useful way.

That is a big part of it. But it gets worse than that. It is truly
insidious, once you follow the chain of problems over time -- maybe
over hours and hours. BenchmarkSQL makes this a lot easier.

Right now when we set heap fill factor to 85 (which is what
BenchmarkSQL does), we effectively have this chunk of special reserved
free space left behind on each page. The remaining 15% of tuple space
is of course reserved for successor versions of the same logical rows
-- those rows from the time that the page first fills (presumably they
all come from straight inserts). That's our starting point -- simple
enough.

The approach we take effectively gives up on the *whole entire page*
at literally the first sign of trouble -- the first time we fail to
hold a successor version on that same page. Even though it may just as
well have been a once-off perturbation. What we might call bad luck, as
opposed to an inevitability. For example, maybe ANALYZE ran that
second, which made the crucial difference for that one heap page that
one time. That's kind of enough to make the implementation assume all
bets are off, for the entire page -- it's as if we pessimistically
assume that there is no chance of keeping anything like the original
rows on the heap page, so we might as well let it all get mixed up
even more. This is way too pessimistic -- it's black and white
thinking.

While in general that level of pessimism might turn out to be
justified (anything is possible), we should hold the line and find out
for sure how bad it is. I happen to know that our problematic
BenchmarkSQL/TPC-C tables just don't have any skew, so this pessimism
is particularly wrong-headed there.

Bear in mind that the new unrelated tuples that ultimately replace our
original couldn't-fit tuples (probably after the next VACUUM) are
probably not "average tuples". Maybe they're newly inserted tuples,
which is bad in the BenchmarkSQL case, because they're all
*guaranteed* to be updated in the future (owing to the details of the
workload). With some other typical workload they might often be
successor versions from updates instead -- most likely for rows that
are disproportionately likely to be updated many times. It's truly
pernicious.

If we only accepted the original row migration, and didn't throw good
money after bad, then the cycle would be broken. To me this is a bit
like certain heuristic algorithms, where you technically accept a
suboptimal outcome to avoid getting stuck in a local maximum.

Now what's the threshold? 20 out of 100? 50? 80?

I'm not going to pretend to know the answer. But I will point out that
one DB system whose heap fill factor defaults to 90 seems to have a
symmetric setting for the "open up page again" point -- the default
for that is 40. Not sure that that really matters to us, but that does
seem pretty low to me. It's very sticky indeed.

Yeah, I don't think that reducing FSM_CATEGORIES is likely to have
much value for its own sake. But it might be useful as a way of
accomplishing some other goal. For example if we decided that we need
a bit per page to track "open" vs. "closed" status or something of
that sort, I don't think we'd lose much by having fewer categories.

This is probably one of my less important points. I addressed this
point in my remarks on this to Andres from earlier.

I don't have a well-formed opinion on this point yet. I am also
skeptical of the idea of merging the FSM and VM, because it seems to
me that which pages are all-visible and which pages are full are two
different things.

As I clarified in my email to Andres, this was just a bad choice of
words. I meant that they're at least conceptually much closer,
provided you're willing to accept my general view of things.

However, there's something to Peter's point too. An
empty page is all-visible but still valid as an insertion target, but
this is not as much of a contradiction to the idea of merging them as
it might seem, because marking the empty page as all-visible is not
nearly as useful as marking a full page all-visible. An index-only
scan can't care about the all-visible status of a page that doesn't
contain any tuples, and we're also likely to make the page non-empty
in the near future, in which case the work we did to set the
all-visible bit and log the change was wasted. The only thing we're
accomplishing by making the page all-visible is letting VACUUM skip
it, which will work out to a win if it stays empty for multiple VACUUM
cycles, but not otherwise.

That's a big part of it. The other thing is systemic effects, which is
related to the part of this email above about chain of events. I'll
talk about that some more now.

Let's say that VACUUM puts a non-zero usable amount of free space in
the FSM for a page that it marks all-visible/all-frozen at the same
time -- this is possible today, of course. To me this seems
contradictory, at least when there isn't much space -- which I believe
is common. It's at least a minor waste to set the VM bit.

Now let's say we change VACUUM to not waste effort on setting the heap
page -- we're now avoiding minor waste, which is an okay goal. But
we're also doing something else, that I find very objectionable: we're
effectively betting that the situation will improve for the same page
at some point in the future, during a future VACUUM. It is possible
that we could get just one or two inserts on the same page, and then
see it again, and then we're done. If that happens, we win our bet.
Otherwise, we lose our bet.

Here's the problem with making that bet: it's typically an awful bet.
The potential upside is clearly pretty low. And more importantly,
there is a decent chance that we'll actually get some row that is
going to mess everything up -- not an "average row" (whatever that may
mean). Again, very frequently updated rows are often what'll end up on
the page, which are enough to make the page never have its all-visible
bit set.

Here is why I find this *extremely* objectionable: we *actually* risk
experiencing a downside for the heap page by taking this awful bet
*many many times* in the future -- it's not just a once-off downside,
I believe. This is not in fact a choice about one heap page on one
occasion -- it is actually a *policy* affecting the page again and
again (and every other heap page too). It is only by making a
not-quite-physically-full page "closed" in the way that I advocate
that we avoid taking this awful bet -- that is how it's possible to
artificially *make*
this a question about one page on one occasion. We're bounding the
downside, lessening our exposure to the worst case. Intuitively, I
think that the possible downside is essentially unlimited, if we
assume systemic destabilization risks (which seems wise). Whereas the
possible upside of taking that awful bet is clearly, measurably small.
A once-off upside. Peanuts, really.

(Of course my argument won't work if you assume that even with the
"closed" page choice we naturally get more updates to one of the aged
frozen rows anyway. But we're counting on that not happening in either
scenario/with *either* strategy, so I don't think that it invalidates
anything I've said.)

Conceptually, you might think of the merged data structure as a
"quiescence map." Pages that aren't being actively updated are
probably all-visible and are probably not great insertion targets.
Those that are being actively updated are probably not all-visible and
have better chances of being good insertion targets. However, there
are a lot of gremlins buried in the word "probably." A page could get
full but still not be all-visible for a long time, due to long-running
transactions, and it doesn't seem likely to me that we can just ignore
that possibility.

I think that we should opportunistically be setting the all-frozen bit
in heap pages (and freezing tuples earlier than required) anyway. For
example, we could notice that a whole heap page is eligible to be
marked all-frozen, and freeze the tuples early -- we could restart
processing of the page once we realized that freezing early is a good
idea.

Why even have a distinct all-visible VM bit? The per-tuple overhead of
the WAL records for freezing is kind of bulky, but that could be
fixed. It's not a good reason. (Of course I'm not arguing that we
should break pg_upgrade, just that there is no reason not to freeze
early when it's cheap to do so in passing.)

+1. I had this same thought reading Peter's email. I'm not sure if it
makes sense to hook that into Thomas's work, but I think it makes a
ton of sense to use shared memory to coordinate transient state like
"hey, right now I'm inserting into this block, you guys leave it
alone" while using the disk for durable state like "here's how much
space this block has left."

I agree that that's the best starting point for this.

--
Peter Geoghegan

In reply to: Bruce Momjian (#7)
Re: The Free Space Map: Problems and Opportunities

On Tue, Aug 17, 2021 at 11:24 AM Bruce Momjian <bruce@momjian.us> wrote:

OK, I am trying to think of something simple we could test to see the
benefit, with few downsides. I assume the case you are considering is
that you have a 10 8k-page table, and one page is 80% full and the
others are 81% full, and if several backends start adding rows at the
same time, they will all choose the 80%-full page.

That's one problem (there are so many). They start with that same page,
and then fight each other over other pages again and again. This
emergent behavior is a consequence of the heuristic that has the FSM
look for the heap page with the least space that still satisfies a
given request. I mentioned this problem earlier.

For example:

1. page with most freespace is 95% free
2. pages 2,4,6,8,10 have between 86%-95% free
3. five pages
4. proc id 14293 % 5 = 3 so use the third page from #2, page 6

Right now my main concern is giving backends/XIDs their own space, in
bulk. Mostly for concurrent bulk inserts, just so we can say that
we're getting the easy cases right -- that is a good starting point
for my prototype to target, I think. But I am also thinking about
stuff like this, as one way to address contention.
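Bruce's heuristic could be sketched like this (a toy illustration; the function and its inputs are invented for the example):

```python
# Hypothetical sketch: once we have a set of heap pages whose free
# space is all "good enough" (e.g. within some band of the maximum),
# spread concurrent inserters across them by PID instead of letting
# every backend converge on the single "best" page.

def pick_insertion_page(candidate_blocks, backend_pid):
    """candidate_blocks: block numbers of pages that all satisfy the
    request, ordered by free space. Any of them is acceptable, so use
    the PID to break the tie differently in each backend."""
    return candidate_blocks[backend_pid % len(candidate_blocks)]
```

With Bruce's numbers (pages 2, 4, 6, 8, 10 and proc id 14293), the remainder is 3, so this particular 0-indexed variant lands on page 8; the exact indexing convention matters less than the spreading effect.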

This should spread out page usage to be more even, but still favor pages
with more freespace. Yes, this is simplistic, but it would seem to have
few downsides and I would be interested to see how much it helps.

I thought about having whole free lists that were owned by XIDs, as
well as shared free lists that are determined by hashing MyProcPid --
or something like that. So I agree that it might very well make
sense.

--
Peter Geoghegan

In reply to: John Naylor (#8)
Re: The Free Space Map: Problems and Opportunities

On Tue, Aug 17, 2021 at 12:11 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

I'm not sure it's essential to have "far" fewer categories, if the closed-to-open transition is made less granular through some other mechanism.

See my remarks to Andres about FSM_CATEGORIES from earlier. (It may
not matter if you don't believe that it helps us to have such high
granularity -- that means it won't hurt to get rid of it, which might
be a good enough reason on its own.)

I now wonder if the FSM is fundamentally doing the wrong thing by
keeping track of all "free space" in every page over time. Why
wouldn't we just have a free space management strategy that is
generally only concerned with recent events?

The second paragraph here is an interesting idea and makes a great deal of sense. It would lead to smaller FSMs that are navigated more quickly and locked for shorter durations.

Most importantly, the FSM/whatever doesn't have to be locked at all if
you don't go to it. And so going to the FSM much less often with space
leases or whatever seems like it might be a big win for that reason
(not just the locality stuff that I care about).

Implicit "closure" seems riskier in my view if you want to bring VM qualities into it, however.

I didn't mean to imply that we'd be changing that data structure. As I
clarified earlier, "merging the FSM and the VM" was meant in a
conceptual way. Not the best choice of words on my part.

This makes sense as well. Shared memory for more recent / highly contended state, and disk for less recent / less contended / stickier state.

I would like to also make it WAL-logged, which is easier this way.
That isn't my main goal, but why not do it? The way that the FSM works
right now within VACUUM REDO routines is a problem.

--
Peter Geoghegan

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#9)
Re: The Free Space Map: Problems and Opportunities

On Tue, Aug 17, 2021 at 6:11 PM Peter Geoghegan <pg@bowt.ie> wrote:

Again, it's all about *systemic* effects. An FSM request is basically a
choice *among* blocks that could in principle satisfy the request --
often the number of basically-satisfactory blocks is huge (way too
large, even). You have to bear in mind that concurrent requests are in
competition with each other much of the time. They are usually all
looking for exactly the same thing (e.g., same free space
requirement), or close to the same thing -- tuples within a given
table tend to all be about as wide as each other.

I can clearly sometimes see very high contention cases, where backends
do way too much spinning inside the RelationGetBufferForTuple() loop.
They're all more or less chasing the same scraps of free space, in a
highly inefficient way. Even though there is probably an abundance of
free space. Right now the heap pages that have the most free space are
also the ones that are least likely to be used.

Though I think that these backends should cooperate more, some amount
of competition is probably necessary. If FSM_CATEGORIES used 3 bits
instead of 8, then the backends could fall back on caring about some
other thing that distinguished heap pages from each other that
actually matters. Leading to less contention, and maybe better space
utilization.

To me, it feels like the real issue here is that the free space map is
completely deterministic. As it's currently designed, you can reduce
the number of bits as much as you want, and it doesn't change
anything. Concurrent requests get the same answer, which is not what
we want. Just as a thought experiment, imagine that backends
preferentially choose pages whose block numbers are congruent to their
PID modulo some integer. Now you have a reason - perhaps not the best
possible reason, but *a* reason - for every single backend not to
pounce on the same target page.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#10)
Re: The Free Space Map: Problems and Opportunities

On Tue, Aug 17, 2021 at 8:31 PM Peter Geoghegan <pg@bowt.ie> wrote:

Let's say that VACUUM puts a non-zero usable amount of free space in
the FSM for a page that it marks all-visible/all-frozen at the same
time -- this is possible today, of course. To me this seems
contradictory, at least when there isn't much space -- which I believe
is common. It's at least a minor waste to set the VM bit.

It seems to me that you are talking about the VM bit but making an
argument that is more applicable to the FSM. If we think that the
small amount of freespace in the page is going to result in the page
being dirtied repeatedly as it cycles through a bunch of different
tuples none of which stick around long term, then we might not want to
advertise the freespace in the FSM, or we might want the FSM to
disregard the fact that a small amount of freespace is present there.
However, if we do that, then we certainly want to also set the VM bit.
Otherwise, we've got a page that we've effectively quiesced, by
refusing to reuse the freespace, but the VM doesn't know that, so
future index-only scans and VACUUM operations are penalized.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Robert Haas (#13)
Re: The Free Space Map: Problems and Opportunities

On Wed, Aug 18, 2021 at 7:36 AM Robert Haas <robertmhaas@gmail.com> wrote:

Though I think that these backends should cooperate more, some amount
of competition is probably necessary. If FSM_CATEGORIES used 3 bits
instead of 8, then the backends could fall back on caring about some
other thing that distinguished heap pages from each other that
actually matters. Leading to less contention, and maybe better space
utilization.

To me, it feels like the real issue here is that the free space map is
completely deterministic.

I agree. Though note the separate problem with FSM_CATEGORIES that I
mentioned to Andres: if you want to maintain the FSM value for pages
eagerly within backends, then practically speaking a far lower
FSM_CATEGORIES-like constant seems necessary.

I'm really just saying that the fine granularity seems to me to be
basically the wrong approach. In general the single most important
thing is the amount of free space available -- we agree on that.
However, the only reason to remember very small *differences* in free
space between heap pages is because you intend to do *something* with
that extra information. But caring about noise-level differences seems
inherently bad -- there is bound to be a better fall-back behavior
that reduces contention, and actually *increases* space utilization in
practice (even though theoretically coarser-grained information might
hurt utilization a little).

As it's currently designed, you can reduce
the number of bits as much as you want, and it doesn't change
anything. Concurrent requests get the same answer, which is not what
we want.

I agree that reducing FSM_CATEGORIES alone will not buy us anything.

--
Peter Geoghegan

In reply to: Robert Haas (#14)
Re: The Free Space Map: Problems and Opportunities

On Wed, Aug 18, 2021 at 7:45 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Aug 17, 2021 at 8:31 PM Peter Geoghegan <pg@bowt.ie> wrote:

Let's say that VACUUM puts a non-zero usable amount of free space in
the FSM for a page that it marks all-visible/all-frozen at the same
time -- this is possible today, of course. To me this seems
contradictory, at least when there isn't much space -- which I believe
is common. It's at least a minor waste to set the VM bit.

It seems to me that you are talking about the VM bit but making an
argument that is more applicable to the FSM. If we think that the
small amount of freespace in the page is going to result in the page
being dirtied repeatedly as it cycles through a bunch of different
tuples none of which stick around long term, then we might not want to
advertise the freespace in the FSM, or we might want the FSM to
disregard the fact that a small amount of freespace is present there.

I'm not quite sure if my argument applies to the VM or the FSM. It's a
bit tricky to talk about because I'm imagining a world in which the
concepts have shifted. And because the exact details of the data
structures are less clear (even if they were clear to me, the lack of
a shared vocabulary might confuse our discussion).

Maybe there isn't even a conventional FSM in this new world. Maybe
free space management works by focusing on recent events, and on
outcomes. This means that we store much less information about the
entire database, and much more information about very recent events.
In particular, I hope that free space management won't have to care
about how much free space most individual heap pages have. Perhaps
most heap pages are simply "closed", unavailable for reuse.

However, if we do that, then we certainly want to also set the VM bit.
Otherwise, we've got a page that we've effectively quiesced, by
refusing to reuse the freespace, but the VM doesn't know that, so
future index-only scans and VACUUM operations are penalized.

I agree that if we do that we'd want to also set the VM bit.

I'm imagining a world in which a "closed page" is synonymous with an
all-visible/all-frozen page. That doesn't mean that they're exactly
the same thing. For example, maybe a closed page is a page that is at
least *expected* to have its all-visible/all-frozen bit set by VACUUM the
next time it runs -- though more often it'll have been set already.

If you're looking at the system in real time with this improved
design, you'll see that pages that become closed will then also have
their VM bit set on a predictable timeline. Maybe we even set the VM
bit when some of the closed page's closed rows happen to be updated
(Freeze cutoff permitting) -- there isn't that much downside to VACUUM
betting that these "closed row updates" will be the last such updates
forever, or at least for a long time. This should at least be the case
once we add an optimized approach to freezing whole pages (pages whose
tuples start out unfrozen).

--
Peter Geoghegan

In reply to: Peter Geoghegan (#10)
Re: The Free Space Map: Problems and Opportunities

On Tue, Aug 17, 2021 at 5:31 PM Peter Geoghegan <pg@bowt.ie> wrote:

Now what's the threshold? 20 out of 100? 50? 80?

I'm not going to pretend to know the answer. But I will point out that
one DB system whose heap fill factor defaults to 90 seems to have a
symmetric setting for the "open up page again" point -- the default
for that is 40. Not sure that that really matters to us, but that does
seem pretty low to me. It's very sticky indeed.

Correction: it's actually 60, not 40.

It's true that the actual default is 40, but it works the other way
around relative to Postgres (as does the related fill-factor-like
setting, which defaults to 90 instead of 100). And so we would think
of this other "open up closed page once again" setting as having a
default of 60. (Or perhaps we'd think of it as having a default that
is 2/3 of the closely related fill factor setting's default.)

--
Peter Geoghegan

#18 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#17)
Re: The Free Space Map: Problems and Opportunities

On Wed, Aug 18, 2021 at 3:58 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Aug 17, 2021 at 5:31 PM Peter Geoghegan <pg@bowt.ie> wrote:

Now what's the threshold? 20 out of 100? 50? 80?

I'm not going to pretend to know the answer. But I will point out that
one DB system whose heap fill factor defaults to 90 seems to have a
symmetric setting for the "open up page again" point -- the default
for that is 40. Not sure that that really matters to us, but that does
seem pretty low to me. It's very sticky indeed.

Correction: it's actually 60, not 40.

It's true that the actual default is 40, but it works the other way
around relative to Postgres (as does the related fill-factor-like
setting, which defaults to 90 instead of 100). And so we would think
of this other "open up closed page once again" setting as having a
default of 60. (Or perhaps we'd think of it as having a default that
is 2/3 of the closely related fill factor setting's default.)

I don't know whether 60 is optimal or not, but it doesn't seem crazy.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#16)
Re: The Free Space Map: Problems and Opportunities

On Wed, Aug 18, 2021 at 1:05 PM Peter Geoghegan <pg@bowt.ie> wrote:

Maybe there isn't even a conventional FSM in this new world. Maybe
free space management works by focusing on recent events, and on
outcomes. This means that we store much less information about the
entire database, and much more information about very recent events.
In particular, I hope that free space management won't have to care
about how much free space most individual heap pages have. Perhaps
most heap pages are simply "closed", unavailable for reuse.

I very much doubt that you can get away without some sort of free
space map. Even if in most cases most pages are closed to insertions,
there will be important corner cases where lots of pages are open for
insertions, like when you just deleted a ton of rows and then ran
VACUUM. And we cannot lose track of even one of those open pages or,
if the pre-8.4 state of the world is any indication, we will be super
sad. I suppose technically it doesn't need to be a map; it could be
structured some other way. But it is going to have to be able to keep
track of an arbitrary number of open pages without losing any, and a
map is one good way of doing that.

Regarding the FSM_CATEGORIES question, I think the issue is big tuples
vs. small tuples. You don't want to keep on finding pages with some
free space and then find out that they don't actually have enough for
the tuple you want to insert. Now that does become less of an issue
with this open/closed system because, at least initially, when a page
is re-opened to inserts, more than a quarter of the page will be free
(I presume) and so the details don't matter. But, as it gets closer to
filling up, it might matter again. Perhaps we don't want to waste time
looking through a bunch of pages with 1kB free when we're trying to
insert a 2kB tuple. We could close all those pages as we try them, to
prevent repeating the same process over and over again, but maybe the
2kB tuple we're now trying to insert is unusual, and most tuples are
40 bytes. Then we look silly.

One idea might be to have a CLOSED state in the FSM and then a variety
of OPEN states that are distinguished by amount of available free
space. For example, imagine something like CLOSED, OPEN (>2kB), OPEN
(1.5-2kB), OPEN (1-1.5kB), OPEN (768-1023 bytes), OPEN (512-767 bytes),
OPEN (256-511 bytes), OPEN (128-255 bytes), OPEN (64-127 bytes), OPEN
(<64 bytes). I think that would give us enough information to pack
open pages tightly full before having them become closed. I don't know
if those are exactly the right boundaries, and 10 categories might be
worse than 8 or 16, but I think it's likely correct to suppose that
(a) we don't really care at all how much space is present in closed
pages, and (b) for open pages, exactitude is most important when the
amount of available space is small.
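As a sketch of that category scheme (the boundaries are the ones given above; the function and constants are invented for illustration):

```python
# Hypothetical sketch: map a page's available free space to one of the
# CLOSED/OPEN categories described above. Buckets are finest where
# free space is small, since that is where exactitude matters most.

CLOSED = 0
# Upper bounds (exclusive) for the OPEN categories, smallest first.
OPEN_BOUNDS = [64, 128, 256, 512, 768, 1024, 1536, 2048]


def fsm_category(free_bytes, is_open):
    """Return 0 for CLOSED, else 1..9 where 1 = OPEN (<64 bytes)
    and 9 = OPEN (>2kB)."""
    if not is_open:
        return CLOSED  # we don't care how much space a closed page has
    for i, bound in enumerate(OPEN_BOUNDS):
        if free_bytes < bound:
            return i + 1
    return len(OPEN_BOUNDS) + 1  # OPEN (>2kB)
```

Note that a closed page maps to CLOSED regardless of its free space, reflecting point (a): we don't really care how much space is present in closed pages.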

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Robert Haas (#18)
Re: The Free Space Map: Problems and Opportunities

On Fri, Aug 20, 2021 at 7:32 AM Robert Haas <robertmhaas@gmail.com> wrote:

I don't know whether 60 is optimal or not, but it doesn't seem crazy.

Uh, I had it right the first time. Only the fill factor setting is
"inverted relative to Postgres". This other setting really does
default to 40. So it's very low.

--
Peter Geoghegan
