Unhappy about API changes in the no-fsm-for-small-rels patch
Hi,
I'm somewhat unhappy about how much the no-fsm-for-small-rels patch
exposed complexity to callers that looks like it should live purely in
freespace.c.
extern Size GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk);
-extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded);
+extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded,
+ bool check_fsm_only);
So now freespace.c has an argument that says we should only check the
fsm. That's confusing. And it's not explained to callers what that
argument means, and when it should be set.
@@ -176,20 +269,44 @@ RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
* Note that if the new spaceAvail value is higher than the old value stored
* in the FSM, the space might not become visible to searchers until the next
* FreeSpaceMapVacuum call, which updates the upper level pages.
+ *
+ * Callers have no need for a local map.
*/
void
-RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
+RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk,
+ Size spaceAvail, BlockNumber nblocks)
There's no explanation as to what that "nblocks" argument is. One
basically has to search other callers to figure it out. It's not even
clear which fork it relates to, nor that one can set it to
InvalidBlockNumber if one doesn't have the relation size conveniently
reachable. But it's not exposed to RecordAndGetPageWithFreeSpace(), for
a basically unexplained reason. There's a comment above
fsm_allow_writes() - but that's a file-local function that external
callers shouldn't need to know about.
I can't figure out what "Callers have no need for a local map." is
supposed to mean.
+/*
+ * Clear the local map. We must call this when we have found a block with
+ * enough free space, when we extend the relation, or on transaction abort.
+ */
+void
+FSMClearLocalMap(void)
+{
+ if (FSM_LOCAL_MAP_EXISTS)
+ {
+ fsm_local_map.nblocks = 0;
+ memset(&fsm_local_map.map, FSM_LOCAL_NOT_AVAIL,
+ sizeof(fsm_local_map.map));
+ }
+}
+
So now there's a new function one needs to call after successfully using
the block returned by [RecordAnd]GetPageWithFreeSpace(). But it's not
referenced from those functions, so basically one has to just know that.
+/* Only create the FSM if the heap has greater than this many blocks */
+#define HEAP_FSM_CREATION_THRESHOLD 4
Hm, this seems to be tying freespace.c closer to heap than I think is
great - think of new AMs like zheap, that also want to use it.
I think this is mostly fallout from the prime issue I'm unhappy
about. There's now some global variable in freespace.c that code
using freespace.c has to know about and maintain.
+static void
+fsm_local_set(Relation rel, BlockNumber cur_nblocks)
+{
+ BlockNumber blkno,
+ cached_target_block;
+
+ /* The local map must not be set already. */
+ Assert(!FSM_LOCAL_MAP_EXISTS);
+
+ /*
+ * Starting at the current last block in the relation and working
+ * backwards, mark alternating blocks as available.
+ */
+ blkno = cur_nblocks - 1;
That comment explains very little about why this is done, and why it's a
good idea.
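For context, the quoted function goes on to mark every other block, newest
first. Here is a self-contained sketch of that alternating marking; the
constants and names mirror the quoted patch, but this is an illustration,
not the actual PostgreSQL code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HEAP_FSM_CREATION_THRESHOLD 4
#define FSM_LOCAL_NOT_AVAIL 0x00
#define FSM_LOCAL_AVAIL     0x01

typedef struct
{
    uint32_t nblocks;
    uint8_t  map[HEAP_FSM_CREATION_THRESHOLD];
} FSMLocalMapSketch;

static FSMLocalMapSketch fsm_local_map_sketch;

/*
 * Starting at the current last block and working backwards, mark
 * alternating blocks as available, so at most half the pages are
 * probed before giving up and extending the relation.
 */
static void
fsm_local_set_sketch(uint32_t cur_nblocks)
{
    uint32_t blkno;

    memset(fsm_local_map_sketch.map, FSM_LOCAL_NOT_AVAIL,
           sizeof(fsm_local_map_sketch.map));
    fsm_local_map_sketch.nblocks = cur_nblocks;

    if (cur_nblocks == 0)
        return;

    /* visit blkno = cur_nblocks-1, cur_nblocks-3, ... */
    for (blkno = cur_nblocks - 1;; blkno -= 2)
    {
        fsm_local_map_sketch.map[blkno] = FSM_LOCAL_AVAIL;
        if (blkno < 2)
            break;
    }
}
```

With four blocks this marks blocks 3 and 1 as available and skips 2 and 0,
which is the "check only every second page" behavior discussed below.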
+/* Status codes for the local map. */
+
+/* Either already tried, or beyond the end of the relation */
+#define FSM_LOCAL_NOT_AVAIL 0x00
+
+/* Available to try */
+#define FSM_LOCAL_AVAIL 0x01
+/* Local map of block numbers for small heaps with no FSM. */
+typedef struct
+{
+ BlockNumber nblocks;
+ uint8 map[HEAP_FSM_CREATION_THRESHOLD];
+} FSMLocalMap;
+
Hm, given realistic HEAP_FSM_CREATION_THRESHOLD, and the fact that we
really only need one bit per page, it seems like map should really
be just a uint32 with one bit per page.
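That suggestion could look roughly like this minimal sketch (hypothetical
names, not code from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the suggested representation: a single uint32 bitmask with
 * one bit per page, instead of a uint8 array sized by the heap-specific
 * HEAP_FSM_CREATION_THRESHOLD.  Supports thresholds up to 32 blocks
 * without extra memory.
 */
typedef struct
{
    uint32_t nblocks;   /* pages covered by the map */
    uint32_t map;       /* bit N set => block N is still worth trying */
} FSMLocalBitmap;

static inline void
local_map_set_avail(FSMLocalBitmap *m, uint32_t blkno)
{
    m->map |= ((uint32_t) 1) << blkno;
}

static inline void
local_map_clear_avail(FSMLocalBitmap *m, uint32_t blkno)
{
    m->map &= ~(((uint32_t) 1) << blkno);
}

static inline bool
local_map_is_avail(const FSMLocalBitmap *m, uint32_t blkno)
{
    return (m->map >> blkno) & 1;
}
```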
+static bool
+fsm_allow_writes(Relation rel, BlockNumber heapblk,
+ BlockNumber nblocks, BlockNumber *get_nblocks)
+ if (rel->rd_rel->relpages != InvalidBlockNumber &&
+ rel->rd_rel->relpages > HEAP_FSM_CREATION_THRESHOLD)
+ return true;
+ else
+ skip_get_nblocks = false;
+ }
This badly needs a comment explaining that these values can be
basically arbitrarily out of date, and why it's correct to rely on them
anyway (presumably because creating an fsm unnecessarily is ok; it just
avoids using this optimization).
+static bool
+fsm_allow_writes(Relation rel, BlockNumber heapblk,
+ BlockNumber nblocks, BlockNumber *get_nblocks)
+ RelationOpenSmgr(rel);
+ if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ return true;
Isn't this really expensive? mdexists() closes the relation and
reopens it from scratch. Shouldn't we at the very least check
smgr_fsm_nblocks beforehand, so this is only done once?
I'm kinda thinking that this is the wrong architecture.
1) Unless I miss something, this will trigger a
RelationGetNumberOfBlocks(), which in turn ends up doing an lseek(),
once for each page we add to the relation. That strikes me as pretty
suboptimal. I think it's even worse if multiple backends do the
insertion, because the RelationGetTargetBlock(relation) logic will
succeed less often.
2) We'll repeatedly re-encounter the first few pages, because we clear
the local map after each successful RelationGetBufferForTuple().
3) The global variable based approach means that we cannot easily do
better. Even if we had a cache that lives across
RelationGetBufferForTuple() calls.
4) fsm_allow_writes() does a smgrexists() for the FSM in some
cases. That's pretty darn expensive if it's already open.
I think if we want to keep something like this feature, we'd need to
cache the relation size in a variable similar to how we cache the FSM
size (like SMgrRelationData.smgr_fsm_nblocks) *and* stash the bitmap of
pages that we think are used/free as a bitmap somewhere below the
relcache. If we cleared that variable at truncations, I think we should
be able to make that work reasonably well?
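A minimal sketch of that proposed caching scheme, with hypothetical names
(the real relcache/smgr integration and invalidation plumbing would of
course differ):

```c
#include <assert.h>
#include <stdint.h>

#define InvalidBlockNumber ((uint32_t) 0xFFFFFFFF)

/*
 * Sketch of a per-relation cache: remember the last known relation size
 * (like SMgrRelationData.smgr_fsm_nblocks caches the FSM fork size),
 * plus a bitmap of pages believed to have free space.
 */
typedef struct
{
    uint32_t cached_nblocks;   /* InvalidBlockNumber if unknown */
    uint32_t free_pages;       /* bit N set => page N may have space */
} RelSizeCacheSketch;

static void
relsize_cache_invalidate(RelSizeCacheSketch *c)
{
    /* on truncation (or any invalidation), forget everything */
    c->cached_nblocks = InvalidBlockNumber;
    c->free_pages = 0;
}

static uint32_t
relsize_cache_get_nblocks(RelSizeCacheSketch *c, uint32_t actual_nblocks)
{
    /*
     * Only consult the (expensive) lseek-based size, passed in here as
     * actual_nblocks, when the cache is cold; otherwise reuse the
     * cached value and avoid the syscall.
     */
    if (c->cached_nblocks == InvalidBlockNumber)
        c->cached_nblocks = actual_nblocks;
    return c->cached_nblocks;
}
```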
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
I'm kinda thinking that this is the wrong architecture.
The bits of that patch that I've looked at seemed like a mess
to me too. AFAICT, it's trying to use a single global "map"
for all relations (strike 1) without any clear tracking of
which relation the map currently describes (strike 2).
This can only work at all if an inaccurate map is very fail-soft,
which I'm not convinced it is, and in any case it seems pretty
inefficient for workloads that insert into multiple tables.
I'd have expected any such map to be per-table and be stored in
the relcache.
regards, tom lane
Hi,
On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I'm kinda thinking that this is the wrong architecture.
The bits of that patch that I've looked at seemed like a mess
to me too. AFAICT, it's trying to use a single global "map"
for all relations (strike 1) without any clear tracking of
which relation the map currently describes (strike 2).
Well, strike 2 basically is not a problem right now, because the map is
cleared whenever a search for a target buffer succeeded. But that has
pretty obvious efficiency issues...
This can only work at all if an inaccurate map is very fail-soft,
which I'm not convinced it is
I think it needs to be fail-soft independent of the no-fsm patch.
Because the fsm is not WAL logged etc., it's pretty easy to end up with
a pretty corrupted version. And we'd better deal with that.
and in any case it seems pretty inefficient for workloads that insert
into multiple tables.
As is, it's inefficient for insertions into a *single* relation. The
RelationGetTargetBlock() makes it not crazily expensive, but it's still
plenty expensive.
I'd have expected any such map to be per-table and be stored in
the relcache.
Same.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
This can only work at all if an inaccurate map is very fail-soft,
which I'm not convinced it is
I think it needs to be fail-soft independent of the no-fsm patch.
Because the fsm is not WAL logged etc., it's pretty easy to end up with
a pretty corrupted version. And we'd better deal with that.
Yes, FSM has to be fail-soft from a *correctness* viewpoint; but it's
not fail-soft from a *performance* viewpoint. It can take a while for
us to self-heal a busted map. And this fake map spends almost all its
time busted and in need of (expensive) corrections. I think this may
actually be the same performance complaint you're making, in different
words.
regards, tom lane
On Wed, Apr 17, 2019 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
I'm somewhat unhappy about how much the no-fsm-for-small-rels patch
exposed complexity to callers that looks like it should live purely in
freespace.c.

extern Size GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk);
-extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded);
+extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded,
+                                        bool check_fsm_only);

So now freespace.c has an argument that says we should only check the
fsm. That's confusing. And it's not explained to callers what that
argument means, and when it should be set.
When first looking for free space, it's "false": Within
GetPageWithFreeSpace(), we call RelationGetNumberOfBlocks() if the FSM
returns invalid.
If we have to extend, after acquiring the lock to extend the relation,
we call GetPageWithFreeSpace() again to see if another backend already
extended while waiting on the lock. If there's no FSM, the thinking
is, it's not worth it to get the number of blocks again.
@@ -176,20 +269,44 @@ RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
  * Note that if the new spaceAvail value is higher than the old value stored
  * in the FSM, the space might not become visible to searchers until the next
  * FreeSpaceMapVacuum call, which updates the upper level pages.
+ *
+ * Callers have no need for a local map.
  */
 void
-RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
+RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk,
+                        Size spaceAvail, BlockNumber nblocks)

There's no explanation as to what that "nblocks" argument is. One
basically has to search other callers to figure it out. It's not even
clear which fork it relates to, nor that one can set it to
InvalidBlockNumber if one doesn't have the relation size conveniently
reachable. But it's not exposed to RecordAndGetPageWithFreeSpace(), for
a basically unexplained reason. There's a comment above
fsm_allow_writes() - but that's a file-local function that external
callers shouldn't need to know about.
Okay.
I can't figure out what "Callers have no need for a local map." is
supposed to mean.
It was meant to contrast with [RecordAnd]GetPageWithFreeSpace(), but I
see how it's confusing.
+/*
+ * Clear the local map. We must call this when we have found a block with
+ * enough free space, when we extend the relation, or on transaction abort.
+ */
+void
+FSMClearLocalMap(void)
+{
+    if (FSM_LOCAL_MAP_EXISTS)
+    {
+        fsm_local_map.nblocks = 0;
+        memset(&fsm_local_map.map, FSM_LOCAL_NOT_AVAIL,
+               sizeof(fsm_local_map.map));
+    }
+}

So now there's a new function one needs to call after successfully using
the block returned by [RecordAnd]GetPageWithFreeSpace(). But it's not
referenced from those functions, so basically one has to just know that.
Right.
+/* Only create the FSM if the heap has greater than this many blocks */
+#define HEAP_FSM_CREATION_THRESHOLD 4

Hm, this seems to be tying freespace.c closer to heap than I think is
great - think of new AMs like zheap, that also want to use it.
Amit and I kept zheap in mind when working on the patch. You'd have to
work around the metapage, but everything else should work the same.
I think this is mostly fallout from the prime issue I'm unhappy
about. There's now some global variable in freespace.c that code
using freespace.c has to know about and maintain.

+static void
+fsm_local_set(Relation rel, BlockNumber cur_nblocks)
+{
+    BlockNumber blkno,
+                cached_target_block;
+
+    /* The local map must not be set already. */
+    Assert(!FSM_LOCAL_MAP_EXISTS);
+
+    /*
+     * Starting at the current last block in the relation and working
+     * backwards, mark alternating blocks as available.
+     */
+    blkno = cur_nblocks - 1;

That comment explains very little about why this is done, and why it's a
good idea.
Short answer: performance -- it's too expensive to try every block.
The explanation is in storage/freespace/README -- maybe that should be
referenced here?
+/* Status codes for the local map. */
+
+/* Either already tried, or beyond the end of the relation */
+#define FSM_LOCAL_NOT_AVAIL 0x00
+
+/* Available to try */
+#define FSM_LOCAL_AVAIL 0x01

+/* Local map of block numbers for small heaps with no FSM. */
+typedef struct
+{
+    BlockNumber nblocks;
+    uint8       map[HEAP_FSM_CREATION_THRESHOLD];
+} FSMLocalMap;

Hm, given realistic HEAP_FSM_CREATION_THRESHOLD, and the fact that we
really only need one bit per page, it seems like map should really
be just a uint32 with one bit per page.
I fail to see the advantage of that.
+static bool
+fsm_allow_writes(Relation rel, BlockNumber heapblk,
+                 BlockNumber nblocks, BlockNumber *get_nblocks)

+    if (rel->rd_rel->relpages != InvalidBlockNumber &&
+        rel->rd_rel->relpages > HEAP_FSM_CREATION_THRESHOLD)
+        return true;
+    else
+        skip_get_nblocks = false;
+    }

This badly needs a comment explaining that these values can be
basically arbitrarily out of date, and why it's correct to rely on them
anyway (presumably because creating an fsm unnecessarily is ok; it just
avoids using this optimization).
Agreed, and yes, your presumption is what I had in mind.
I'm kinda thinking that this is the wrong architecture.
1) Unless I miss something, this will trigger a
RelationGetNumberOfBlocks(), which in turn ends up doing an lseek(),
once for each page we add to the relation.

That was true previously anyway if the FSM returned InvalidBlockNumber.
That strikes me as pretty
suboptimal. I think it's even worse if multiple backends do the
insertion, because the RelationGetTargetBlock(relation) logic will
succeed less often.
Could you explain why it would succeed less often?
2) We'll repeatedly re-encounter the first few pages, because we clear
the local map after each successful RelationGetBufferForTuple().
Not exactly sure what you mean? We only set the map if
RelationGetTargetBlock() returns InvalidBlockNumber, or if it returned
a valid block, and inserting there already failed. So, not terribly
often, I imagine.
3) The global variable based approach means that we cannot easily do
better. Even if we had a cache that lives across
RelationGetBufferForTuple() calls.

4) fsm_allow_writes() does a smgrexists() for the FSM in some
cases. That's pretty darn expensive if it's already open.
(from earlier)
Isn't this like really expensive? mdexists() closes the relations and
reopens it from scratch. Shouldn't we at the very least check
smgr_fsm_nblocks beforehand, so this is only done once?
Hmm, I can look into that.
On Wed, Apr 17, 2019 at 3:16 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
<snip>
and in any case it seems pretty inefficient for workloads that insert
into multiple tables.

As is, it's inefficient for insertions into a *single* relation. The
RelationGetTargetBlock() makes it not crazily expensive, but it's still
plenty expensive.
Performance testing didn't reveal any performance regression. If you
have a realistic benchmark in mind that stresses this logic more
heavily, I'd be happy to be convinced otherwise.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 17, 2019 at 10:39 AM John Naylor
<john.naylor@2ndquadrant.com> wrote:
On Wed, Apr 17, 2019 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:
+/* Only create the FSM if the heap has greater than this many blocks */
+#define HEAP_FSM_CREATION_THRESHOLD 4

Hm, this seems to be tying freespace.c closer to heap than I think is
great - think of new AMs like zheap, that also want to use it.

Amit and I kept zheap in mind when working on the patch. You'd have to
work around the metapage, but everything else should work the same.
I think we also need to take care of TPD pages along with the meta
page. This might be less effective if we encounter TPD pages in a
small relation, which shouldn't be a common scenario, but it won't
hurt otherwise. Those pages are anyway temporary and will be
removed.
BTW there is one other thing which is tied to heap in FSM for which we
might want some handling.
#define MaxFSMRequestSize MaxHeapTupleSize
In general, it would be good to find some pluggable way to handle both
of these defines, but even otherwise it shouldn't cause a big problem.
I'm kinda thinking that this is the wrong architecture.

1) Unless I miss something, this will trigger a
RelationGetNumberOfBlocks(), which in turn ends up doing an lseek(),
once for each page we add to the relation.

That was true previously anyway if the FSM returned InvalidBlockNumber.

That strikes me as pretty
suboptimal. I think it's even worse if multiple backends do the
insertion, because the RelationGetTargetBlock(relation) logic will
succeed less often.

Could you explain why it would succeed less often?

2) We'll repeatedly re-encounter the first few pages, because we clear
the local map after each successful RelationGetBufferForTuple().

Not exactly sure what you mean? We only set the map if
RelationGetTargetBlock() returns InvalidBlockNumber, or if it returned
a valid block, and inserting there already failed. So, not terribly
often, I imagine.

3) The global variable based approach means that we cannot easily do
better. Even if we had a cache that lives across
RelationGetBufferForTuple() calls.

4) fsm_allow_writes() does a smgrexists() for the FSM in some
cases. That's pretty darn expensive if it's already open.

(from earlier)
Isn't this like really expensive? mdexists() closes the relations and
reopens it from scratch. Shouldn't we at the very least check
smgr_fsm_nblocks beforehand, so this is only done once?

Hmm, I can look into that.
I think if we want to keep something like this feature, we'd need to
cache the relation size in a variable similar to how we cache the FSM
size (like SMgrRelationData.smgr_fsm_nblocks)
Makes sense. I think we should do this unless we face any problem with it.
*and* stash the bitmap of
pages that we think are used/free as a bitmap somewhere below the
relcache.
I think maintaining at relcache level will be tricky when there are
insertions and deletions happening in the small relation. We have
considered such a case during development wherein we don't want the
FSM to be created if there are insertions and deletions in small
relation. The current mechanism addresses both this and normal case
where there are not many deletions. Sure, there is some overhead of
building the map again and rechecking each page. The first is a
memory operation and takes a few cycles; for the second, we
optimized by checking alternating pages, which means we won't
check more than two pages at a time. This cost is paid for by not
checking the FSM, and it could be somewhat better in some cases [1].
If we cleared that variable at truncations, I think we should
be able to make that work reasonably well?
Not only that, I think it needs to be cleared whenever we create the
FSM as well, which could be tricky as it can be created by vacuum.
OTOH, if we want to extend it later for whatever reason to a relation
level cache, it shouldn't be that difficult as the implementation is
mostly contained in freespace.c (fsm* functions) and I think the
relation is accessible in most places. We might need to rip out some
calls to FSMClearLocalMap().
On Wed, Apr 17, 2019 at 3:16 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
<snip>
and in any case it seems pretty inefficient for workloads that insert
into multiple tables.

As is, it's inefficient for insertions into a *single* relation. The
RelationGetTargetBlock() makes it not crazily expensive, but it's still
plenty expensive.
During development, we were also worried about the performance
regression that could happen due to this patch, and we did many
rounds of performance tests where such a cache could be accessed
pretty frequently. In the original version, we did see a small
regression, as a result of which we came up with an alternate strategy
of not checking every page. If you want, I can share links to the
performance-testing emails.
Performance testing didn't reveal any performance regression. If you
have a realistic benchmark in mind that stresses this logic more
heavily, I'd be happy to be convinced otherwise.
In fact, we have seen some wins. See the performance testing done [1]
with various approaches during development.
Added this as an open item.
[1]: /messages/by-id/CAD__Oui5+qiVxJSJqiXq2jA60QV8PKxrZA8_W+cCxROGAFJMWA@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,
On 2019-04-17 13:09:05 +0800, John Naylor wrote:
On Wed, Apr 17, 2019 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
I'm somewhat unhappy about how much the no-fsm-for-small-rels patch
exposed complexity to callers that looks like it should live purely in
freespace.c.

extern Size GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk);
-extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded);
+extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded,
+                                        bool check_fsm_only);

So now freespace.c has an argument that says we should only check the
fsm. That's confusing. And it's not explained to callers what that
argument means, and when it should be set.

When first looking for free space, it's "false": Within
GetPageWithFreeSpace(), we call RelationGetNumberOfBlocks() if the FSM
returns invalid.

If we have to extend, after acquiring the lock to extend the relation,
we call GetPageWithFreeSpace() again to see if another backend already
extended while waiting on the lock. If there's no FSM, the thinking
is, it's not worth it to get the number of blocks again.
I can get that (after reading through the code, grepping through all
callers, etc), but it means that every callsite needs to understand
that. That's making the API more complicated than needed, especially
when we're going to grow more callers.
+/* Only create the FSM if the heap has greater than this many blocks */
+#define HEAP_FSM_CREATION_THRESHOLD 4

Hm, this seems to be tying freespace.c closer to heap than I think is
great - think of new AMs like zheap, that also want to use it.

Amit and I kept zheap in mind when working on the patch. You'd have to
work around the metapage, but everything else should work the same.
My complaint is basically that it's apparently AM specific (we don't use
the logic for e.g. indexes), and that the name suggests it's specific to
heap. And it's not controllable from the outside, which means it can't be
tuned for the specific usecase.
I think this is mostly fallout from the prime issue I'm unhappy
about. There's now some global variable in freespace.c that code
using freespace.c has to know about and maintain.

+static void
+fsm_local_set(Relation rel, BlockNumber cur_nblocks)
+{
+    BlockNumber blkno,
+                cached_target_block;
+
+    /* The local map must not be set already. */
+    Assert(!FSM_LOCAL_MAP_EXISTS);
+
+    /*
+     * Starting at the current last block in the relation and working
+     * backwards, mark alternating blocks as available.
+     */
+    blkno = cur_nblocks - 1;

That comment explains very little about why this is done, and why it's a
good idea.

Short answer: performance -- it's too expensive to try every block.
The explanation is in storage/freespace/README -- maybe that should be
referenced here?
Yes. Even just adding "for performance reasons, only try every second
block. See also the README" would be good.

But I'll note that the need to do this - and potentially waste space -
counters your claim that there are no performance considerations with
this patch.
+/* Status codes for the local map. */
+
+/* Either already tried, or beyond the end of the relation */
+#define FSM_LOCAL_NOT_AVAIL 0x00
+
+/* Available to try */
+#define FSM_LOCAL_AVAIL 0x01

+/* Local map of block numbers for small heaps with no FSM. */
+typedef struct
+{
+    BlockNumber nblocks;
+    uint8       map[HEAP_FSM_CREATION_THRESHOLD];
+} FSMLocalMap;

Hm, given realistic HEAP_FSM_CREATION_THRESHOLD, and the fact that we
really only need one bit per page, it seems like map should really
be just a uint32 with one bit per page.

I fail to see the advantage of that.
It'd allow different AMs to have different numbers of dont-create-fsm
thresholds without needing additional memory (up to 32 blocks).
I'm kinda thinking that this is the wrong architecture.

1) Unless I miss something, this will trigger a
RelationGetNumberOfBlocks(), which in turn ends up doing an lseek(),
once for each page we add to the relation.

That was true previously anyway if the FSM returned InvalidBlockNumber.
True. That was already pretty annoying though.
That strikes me as pretty
suboptimal. I think it's even worse if multiple backends do the
insertion, because the RelationGetTargetBlock(relation) logic will
succeed less often.

Could you explain why it would succeed less often?
Two aspects: 1) If more backends access a table, there'll be a higher
chance the target page is full; 2) there are more backends that don't
have a target page.
2) We'll repeatedly re-encounter the first few pages, because we clear
the local map after each successful RelationGetBufferForTuple().

Not exactly sure what you mean? We only set the map if
RelationGetTargetBlock() returns InvalidBlockNumber, or if it returned
a valid block, and inserting there already failed. So, not terribly
often, I imagine.
It's pretty common to have small tables that are modified by a number of
backends. A typical case is tables that implement locks for external
processes and such.
On Wed, Apr 17, 2019 at 3:16 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
<snip>
and in any case it seems pretty inefficient for workloads that insert
into multiple tables.

As is, it's inefficient for insertions into a *single* relation. The
RelationGetTargetBlock() makes it not crazily expensive, but it's still
plenty expensive.

Performance testing didn't reveal any performance regression. If you
have a realistic benchmark in mind that stresses this logic more
heavily, I'd be happy to be convinced otherwise.
Well, try a few hundred relations on nfs (where stat is much more
expensive). Or just pgbench a concurrent workload with a few tables with
one live row each, updated by backends (to simulate lock tables and
such).
But also, my concern here is to a significant degree architectural,
rather than already measurable performance regressions. We ought to work
towards eliminating unnecessary syscalls, not the opposite.
Greetings,
Andres Freund
Hi,
On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
*and* stash the bitmap of
pages that we think are used/free as a bitmap somewhere below the
relcache.

I think maintaining at relcache level will be tricky when there are
insertions and deletions happening in the small relation. We have
considered such a case during development wherein we don't want the
FSM to be created if there are insertions and deletions in small
relation. The current mechanism addresses both this and normal case
where there are not many deletions. Sure there is some overhead of
building the map again and rechecking each page. The first one is a
memory operation and takes a few cycles
Yea, I think creating / resetting the map is basically free.
I'm not sure I buy the concurrency issue - ISTM it'd be perfectly
reasonable to cache the local map (in the relcache) and use it for local
FSM queries, and rebuild it from scratch once no space is found. That'd
avoid a lot of repeated checking of relation size for small, but
commonly changed relations. Add a pre-check of smgr_fsm_nblocks (if >
0, there has to be an fsm), and there should be fewer syscalls.
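That pre-check could be sketched like this; the stub stands in for the
real smgrexists()/mdexists() probe, and all names here are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define InvalidBlockNumber ((uint32_t) 0xFFFFFFFF)

/* stand-in for SMgrRelationData, keeping only the cached FSM size */
typedef struct
{
    uint32_t smgr_fsm_nblocks;  /* InvalidBlockNumber if not yet known */
} SMgrRelStub;

static int probe_count = 0;

/* stand-in for the expensive smgrexists()/mdexists() probe */
static bool
expensive_fsm_exists(void)
{
    probe_count++;
    return true;
}

static bool
fsm_exists_with_precheck(SMgrRelStub *smgr)
{
    /*
     * A cached positive FSM size proves the fork exists, so we can skip
     * the expensive existence probe entirely.
     */
    if (smgr->smgr_fsm_nblocks != InvalidBlockNumber &&
        smgr->smgr_fsm_nblocks > 0)
        return true;
    return expensive_fsm_exists();
}
```

Once the cached size is known, repeated calls avoid the probe, which is
the "fewer syscalls" point being made above.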
and for the second we optimized by checking the pages alternatively
which means we won't check more than two pages at-a-time. This cost
is paid by not checking FSM and it could be somewhat better in some
cases [1].
Well, it's also paid by potentially higher bloat, because the
intermediate pages aren't tested.
If we cleared that variable at truncations, I think we should
be able to make that work reasonably well?

Not only that, I think it needs to be cleared whenever we create the
FSM as well, which could be tricky as it can be created by vacuum.

ISTM normal invalidation logic should just take care of that kind of thing.
OTOH, if we want to extend it later for whatever reason to a relation
level cache, it shouldn't be that difficult as the implementation is
mostly contained in freespace.c (fsm* functions) and I think the
relation is accessible in most places. We might need to rip out some
calls to FSMClearLocalMap().
But it really isn't contained to freespace.c. That's my primary
concern. You added new parameters (undocumented ones!),
FSMClearLocalMap() needs to be called by callers and xlog, etc.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
OTOH, if we want to extend it later for whatever reason to a relation
level cache, it shouldn't be that difficult as the implementation is
mostly contained in freespace.c (fsm* functions) and I think the
relation is accessible in most places. We might need to rip out some
calls to clearlocalmap.
But it really isn't contained to freespace.c. That's my primary
concern. You added new parameters (undocumented ones!),
FSMClearLocalMap() needs to be called by callers and xlog, etc.
Given where we are in the release cycle, and the major architectural
concerns that have been raised about this patch, should we just
revert it and try again in v13, rather than trying to fix it under
time pressure? It's not like there's nothing else on our
plates to fix before beta.
regards, tom lane
On Wed, Apr 17, 2019 at 9:46 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
*and* stash the bitmap of
pages that we think are used/free as a bitmap somewhere below the
relcache.

I think maintaining at relcache level will be tricky when there are
insertions and deletions happening in the small relation. We have
considered such a case during development wherein we don't want the
FSM to be created if there are insertions and deletions in small
relation. The current mechanism addresses both this and normal case
where there are not many deletions. Sure there is some overhead of
building the map again and rechecking each page. The first one is a
memory operation and takes a few cycles

Yea, I think creating / resetting the map is basically free.
I'm not sure I buy the concurrency issue - ISTM it'd be perfectly
reasonable to cache the local map (in the relcache) and use it for local
FSM queries, and rebuild it from scratch once no space is found. That'd
avoid a lot of repeated checking of relation size for small, but
commonly changed relations.
Okay, so you mean to say that we need to perform the additional system
call (to get the number of blocks) only when no space is found in the
existing set of pages? I think that is a fair point, but can't we
achieve that by updating relpages in the relation after a call to
RelationGetNumberOfBlocks?
Add a pre-check of smgr_fsm_nblocks (if >
0, there has to be an fsm), and there should be fewer syscalls.
Yes, that check is a good one and I see that we already do this check
in fsm code before calling smgrexists.
and for the second we optimized by checking the pages alternatively
which means we won't check more than two pages at-a-time. This cost
is paid by not checking FSM and it could be somewhat better in some
cases [1].

Well, it's also paid by potentially higher bloat, because the
intermediate pages aren't tested.

If we cleared that variable at truncations, I think we should
be able to make that work reasonably well?

Not only that, I think it needs to be cleared whenever we create the
FSM as well, which could be tricky as it can be created by vacuum.

ISTM normal invalidation logic should just take care of that kind of thing.
Do you mean to say that we don't need to add any new invalidation call
and the existing invalidation calls will automatically take care of
the same?
OTOH, if we want to extend it later for whatever reason to a relation
level cache, it shouldn't be that difficult as the implementation is
mostly contained in freespace.c (fsm* functions) and I think the
relation is accessible in most places. We might need to rip out some
calls to clearlocalmap.

But it really isn't contained to freespace.c. That's my primary
concern.
Okay, I get that point. I think among those, the need to call
FSMClearLocalMap seems to be your main worry, which is fair; but OTOH,
the places where it needs to be called shouldn't be many.
You added new parameters (undocumented ones!),
I think this is mostly for compatibility with the old code. I agree
that it is a wart, but without much input during development, it
didn't seem advisable to change the old behavior; that is why we have
added a new parameter to GetPageWithFreeSpace. However, if we want, we
can remove that parameter or document it in a better way.
FSMClearLocalMap() needs to be called by callers and xlog, etc.
Agreed that this is an additional requirement, but we have documented
the cases atop this function where it needs to be called. We might
have missed something, but we tried to cover all the cases that we are
aware of. Can we make it clearer by adding comments atop the
freespace.c API where this map is used?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Apr 17, 2019 at 9:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
OTOH, if we want to extend it later for whatever reason to a relation
level cache, it shouldn't be that difficult as the implementation is
mostly contained in freespace.c (fsm* functions) and I think the
relation is accessible in most places. We might need to rip out some
calls to clearlocalmap.

But it really isn't contained to freespace.c. That's my primary
concern. You added new parameters (undocumented ones!),
FSMClearLocalMap() needs to be called by callers and xlog, etc.

Given where we are in the release cycle, and the major architectural
concerns that have been raised about this patch, should we just
revert it and try again in v13, rather than trying to fix it under
time pressure?
I respect and will follow whatever the consensus is after the
discussion. However, I request you to wait for some time to let the
discussion conclude. If we can't reach an agreement, or neither John
nor I can implement what is decided, then we can revert it anyway.
On Thu, Apr 18, 2019 at 2:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I respect and will follow whatever the consensus is after the
discussion. However, I request you to wait for some time to let the
discussion conclude. If we can't reach an agreement, or neither John
nor I can implement what is decided, then we can revert it anyway.
Agreed. I suspect the most realistic way to address most of the
objections in a short amount of time would be to:
1. rip out the local map
2. restore hio.c to only checking the last block in the relation if
there is no FSM (and lower the threshold to reduce wasted space)
3. reduce calls to smgrexists()
Thoughts?
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2019-04-17 12:20:29 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
OTOH, if we want to extend it later for whatever reason to a relation
level cache, it shouldn't be that difficult as the implementation is
mostly contained in freespace.c (fsm* functions) and I think the
relation is accessible in most places. We might need to rip out some
calls to clearlocalmap.

But it really isn't contained to freespace.c. That's my primary
concern. You added new parameters (undocumented ones!),
FSMClearLocalMap() needs to be called by callers and xlog, etc.

Given where we are in the release cycle, and the major architectural
concerns that have been raised about this patch, should we just
revert it and try again in v13, rather than trying to fix it under
time pressure? It's not like there's not anything else on our
plates to fix before beta.
Hm. I'm of split mind here:
It's a nice improvement, and the fixes probably wouldn't be that
hard. And we could have piped up a bit earlier about these concerns (I
only noticed this when rebasing zheap onto the newest version of
postgres).
But as you say, it's also late, and there's other stuff to do. Although I
think neither Amit nor John is heavily involved in any...
My compromise suggestion would be to try to give John and Amit ~2 weeks
to come up with a cleanup proposal, and then decide whether to 1) revert,
2) apply the new patch, or 3) decide to live with the warts for 12 and
apply the patch in 13. As we would already have a patch, 3) seems like
it'd be more tenable than without.
Regards,
Andres
Andres Freund <andres@anarazel.de> writes:
My compromise suggestion would be to try to give John and Amit ~2 weeks
to come up with a cleanup proposal, and then decide whether to 1) revert,
2) apply the new patch, or 3) decide to live with the warts for 12 and
apply the patch in 13. As we would already have a patch, 3) seems like
it'd be more tenable than without.
Seems reasonable. I think we should shoot to have this resolved before
the end of the month, but it doesn't have to be done immediately.
regards, tom lane
On Thu, Apr 18, 2019 at 2:10 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
On Thu, Apr 18, 2019 at 2:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I respect and will follow whatever the consensus is after the
discussion. However, I request you to wait for some time to let the
discussion conclude. If we can't reach an agreement, or neither John
nor I can implement what is decided, then we can revert it anyway.

Agreed. I suspect the most realistic way to address most of the
objections in a short amount of time would be to:

1. rip out the local map
2. restore hio.c to only checking the last block in the relation if
there is no FSM (and lower the threshold to reduce wasted space)
3. reduce calls to smgrexists()
Won't you need an extra call to RelationGetNumberOfBlocks to find the
last block? Also, won't it be less efficient in terms of dealing with
bloat as compared to the current patch? I think if we go this route, then
we might need to revisit it in the future to optimize it, but maybe
that is the best alternative as of now.
I am thinking that we should at least give it a try to move the map to
rel cache level to see how easy or difficult it is and also let's wait
for a day or two to see if Andres/Tom has to say anything about this
or on the response by me above to improve the current patch.
Amit Kapila <amit.kapila16@gmail.com> writes:
I am thinking that we should at least give it a try to move the map to
rel cache level to see how easy or difficult it is and also let's wait
for a day or two to see if Andres/Tom has to say anything about this
or on the response by me above to improve the current patch.
FWIW, it's hard for me to see how moving the map to the relcache isn't
the right thing to do. You will lose state during a relcache flush,
but that's still miles better than how often the state gets lost now.
regards, tom lane
On April 18, 2019 7:53:58 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
I am thinking that we should at least give it a try to move the map to
rel cache level to see how easy or difficult it is and also let's wait
for a day or two to see if Andres/Tom has to say anything about this
or on the response by me above to improve the current patch.

FWIW, it's hard for me to see how moving the map to the relcache isn't
the right thing to do. You will lose state during a relcache flush,
but that's still miles better than how often the state gets lost now.
+1
On Fri, Apr 19, 2019 at 10:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Apr 18, 2019 at 2:10 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
Agreed. I suspect the most realistic way to address most of the
objections in a short amount of time would be to:

1. rip out the local map
2. restore hio.c to only checking the last block in the relation if
there is no FSM (and lower the threshold to reduce wasted space)
3. reduce calls to smgrexists()

Won't you need an extra call to RelationGetNumberOfBlocks to find the
last block?
If I understand you correctly, no, the call now in
GetPageWithFreeSpace() just moves it back to where it was in v11. In
the corner case where we just measured the table size and the last
block is full, we can pass nblocks to RecordAndGetPageWithFreeSpace().
There might be further optimizations available if we're not creating a
local map.
Also, won't it be less efficient in terms of dealing with
bloat as compared to the current patch?
Yes. The threshold would have to be 2 or 3 blocks, and it would stay
bloated until it passed the threshold. Not great, but perhaps not bad
either.
I think if we go this route, then
we might need to revisit it in the future to optimize it, but maybe
that is the best alternative as of now.
It's a much lighter-weight API, which has that much going for it.
I have a draft implementation, which I can share if it comes to that
-- it needs some more thought and polish first.
I am thinking that we should at least give it a try to move the map to
rel cache level to see how easy or difficult it is and also let's wait
for a day or two to see if Andres/Tom has to say anything about this
or on the response by me above to improve the current patch.
Since we have a definite timeline, I'm okay with that, although I'm
afraid I'm not quite knowledgeable enough to help much with the
relcache piece.
On Fri, Apr 19, 2019 at 1:17 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
On Fri, Apr 19, 2019 at 10:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
I think if we go this route, then
we might need to revisit it in the future to optimize it, but maybe
that is the best alternative as of now.

It's a much lighter-weight API, which has that much going for it.
I have a draft implementation, which I can share if it comes to that
-- it needs some more thought and polish first.
I understand that it is a lighter-weight API, but OTOH, it will be less
efficient as well. Also, the consensus seems to be that we should
move this to the relcache.
I am thinking that we should at least give it a try to move the map to
rel cache level to see how easy or difficult it is and also let's wait
for a day or two to see if Andres/Tom has to say anything about this
or on the response by me above to improve the current patch.

Since we have a definite timeline, I'm okay with that, although I'm
afraid I'm not quite knowledgeable enough to help much with the
relcache piece.
Okay, I can try to help. I think you can start by looking at the
RelationData members (for example, see how we cache an index's metapage
in rd_amcache) and study a bit the routines in relcache.h.
On Fri, Apr 19, 2019 at 2:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I am thinking that we should at least give it a try to move the map to
rel cache level to see how easy or difficult it is and also let's wait
for a day or two to see if Andres/Tom has to say anything about this
or on the response by me above to improve the current patch.

Since we have a definite timeline, I'm okay with that, although I'm
afraid I'm not quite knowledgeable enough to help much with the
relcache piece.

Okay, I can try to help. I think you can start by looking at the
RelationData members (for example, see how we cache an index's metapage
in rd_amcache) and study a bit the routines in relcache.h.
Attached is a hacky, work-in-progress patch to move the fsm map to the
relcache. This will give you some idea. I think we need to see if
there is a need to invalidate the relcache due to this patch. I think
some of Andres's other comments also need to be addressed; see if you
can attempt to fix some of them.