WIP: Avoid creation of the free space map for small tables

Started by John Naylorover 7 years ago181 messageshackers
Jump to latest
#1John Naylor
john.naylor@enterprisedb.com

Hi all,
A while back, Robert Haas noticed that the space taken up by very
small tables is dominated by the FSM [1]/messages/by-id/CA+Tgmoac+6qTNp2U+wedY8-PU6kK_b6hbdhR5xYGBG3GtdFcww@mail.gmail.com. Tom suggested that we could
prevent creation of the FSM until the heap has reached a certain
threshold size [2]/messages/by-id/11360.1345502641@sss.pgh.pa.us. Attached is a WIP patch to implement that. I've
also attached a SQL script to demonstrate the change in behavior for
various scenarios.

The behavior that allows the simplest implementation I thought of is as follows:

-The FSM isn't created if the heap has fewer than 10 blocks (or
whatever). If the last known good block has insufficient space, try
every block before extending the heap.

-If a heap with a FSM is truncated back to below the threshold, the
FSM stays around and can be used as usual.

-If the heap tuples are all deleted, the FSM stays but has no leaf
blocks (same as on master). Although it exists, it won't be
re-extended until the heap re-passes the threshold.

--
Some notes:

-For normal mode, I taught fsm_set_and_search() to switch to a
non-extending buffer call, but the biggest missing piece is WAL
replay. I couldn't find a non-extending equivalent of
XLogReadBufferExtended(), so I might have to create one.

-There'll need to be some performance testing to make sure there's no
regression, and to choose a good value for the threshold. I'll look
into that, but if anyone has any ideas for tests, that'll help this
effort along.

-A possible TODO item is to teach pg_upgrade not to link FSMs for
small heaps. I haven't look into the feasibility of that, however.

-RelationGetBufferForTuple() now has two boolean variables that mean
"don't use the FSM", but with different behaviors. To avoid confusion,
I've renamed use_fsm to always_extend and revised the commentary
accordingly.

-I've only implemented this for heaps, because indexes (at least
B-tree) don't seem to be as eager to create a FSM. I haven't looked at
the code, however.

--
[1]: /messages/by-id/CA+Tgmoac+6qTNp2U+wedY8-PU6kK_b6hbdhR5xYGBG3GtdFcww@mail.gmail.com
[2]: /messages/by-id/11360.1345502641@sss.pgh.pa.us

--
I'll add this to the November commitfest.

-John Naylor

Attachments:

fsmtest.sqlapplication/sql; name=fsmtest.sqlDownload
v1-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchDownload+79-18
#2Thomas Munro
thomas.munro@gmail.com
In reply to: John Naylor (#1)
Re: WIP: Avoid creation of the free space map for small tables

On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote:

A while back, Robert Haas noticed that the space taken up by very
small tables is dominated by the FSM [1]. Tom suggested that we could
prevent creation of the FSM until the heap has reached a certain
threshold size [2]. Attached is a WIP patch to implement that. I've
also attached a SQL script to demonstrate the change in behavior for
various scenarios.

Hi John,

You'll need to tweak the test in contrib/pageinspect/sql/page.sql,
because it's currently asserting that there is an FSM on a small table
so make check-world fails.

--
Thomas Munro
http://www.enterprisedb.com

#3John Naylor
john.naylor@enterprisedb.com
In reply to: Thomas Munro (#2)
Re: WIP: Avoid creation of the free space map for small tables

On 10/6/18, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote:

A while back, Robert Haas noticed that the space taken up by very
small tables is dominated by the FSM [1]. Tom suggested that we could
prevent creation of the FSM until the heap has reached a certain
threshold size [2]. Attached is a WIP patch to implement that. I've
also attached a SQL script to demonstrate the change in behavior for
various scenarios.

Hi John,

You'll need to tweak the test in contrib/pageinspect/sql/page.sql,
because it's currently asserting that there is an FSM on a small table
so make check-world fails.

Whoops, sorry about that; the attached patch passes make check-world.
While looking into that, I also found a regression: If the cached
target block is the last block in the relation and there is no free
space, that block will be tried twice. That's been fixed as well.

Thanks,
-John Naylor

Attachments:

v2-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchDownload+169-75
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Naylor (#3)
Re: WIP: Avoid creation of the free space map for small tables

John Naylor <jcnaylor@gmail.com> writes:

On 10/6/18, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote:

A while back, Robert Haas noticed that the space taken up by very
small tables is dominated by the FSM [1]. Tom suggested that we could
prevent creation of the FSM until the heap has reached a certain
threshold size [2]. Attached is a WIP patch to implement that.

BTW, don't we need a similar hack for visibility maps?

regards, tom lane

#5John Naylor
john.naylor@enterprisedb.com
In reply to: Tom Lane (#4)
Re: WIP: Avoid creation of the free space map for small tables

On 10/7/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:

John Naylor <jcnaylor@gmail.com> writes:

On 10/6/18, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote:

A while back, Robert Haas noticed that the space taken up by very
small tables is dominated by the FSM [1]. Tom suggested that we could
prevent creation of the FSM until the heap has reached a certain
threshold size [2]. Attached is a WIP patch to implement that.

BTW, don't we need a similar hack for visibility maps?

The FSM is the bigger bang for the buck, and fairly simple to do, but
it would be nice to do something about VMs as well. I'm not sure if
simply lacking a VM would be as simple (or as free of downsides) as
for the FSM. I haven't studied the VM code in detail, however.

-John Naylor

#6Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#1)
Re: WIP: Avoid creation of the free space map for small tables

On Sat, Oct 6, 2018 at 12:17 AM John Naylor <jcnaylor@gmail.com> wrote:

-There'll need to be some performance testing to make sure there's no
regression, and to choose a good value for the threshold. I'll look
into that, but if anyone has any ideas for tests, that'll help this
effort along.

Can you try with a Copy command which copies just enough tuples to
fill the pages equivalent to HEAP_FSM_EXTENSION_THRESHOLD? It seems
to me in such a case patch will try each of the blocks multiple times.
It looks quite lame that we have to try again and again the blocks
which we have just filled by ourselves but may be that doesn't matter
much as the threshold value is small.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#7Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#1)
Re: WIP: Avoid creation of the free space map for small tables

On Sat, Oct 6, 2018 at 12:17 AM John Naylor <jcnaylor@gmail.com> wrote:

Hi all,
A while back, Robert Haas noticed that the space taken up by very
small tables is dominated by the FSM [1]. Tom suggested that we could
prevent creation of the FSM until the heap has reached a certain
threshold size [2]. Attached is a WIP patch to implement that. I've
also attached a SQL script to demonstrate the change in behavior for
various scenarios.

The behavior that allows the simplest implementation I thought of is as follows:

-The FSM isn't created if the heap has fewer than 10 blocks (or
whatever). If the last known good block has insufficient space, try
every block before extending the heap.

-If a heap with a FSM is truncated back to below the threshold, the
FSM stays around and can be used as usual.

-If the heap tuples are all deleted, the FSM stays but has no leaf
blocks (same as on master). Although it exists, it won't be
re-extended until the heap re-passes the threshold.

--
Some notes:

-For normal mode, I taught fsm_set_and_search() to switch to a
non-extending buffer call, but the biggest missing piece is WAL
replay.

fsm_set_and_search()
{
..
+ /*
+ * For heaps we prevent extension of the FSM unless the number of pages
+ * exceeds
HEAP_FSM_EXTENSION_THRESHOLD. For tables that don't already
+ * have a FSM, this will save an inode and a few kB
of space.
+ * For sane threshold values, the FSM address will be zero, so we
+ * don't bother dealing with
anything else.
+ */
+ if (rel->rd_rel->relkind == RELKIND_RELATION
+ && addr.logpageno == 0)

I am not sure if this is a solid way to avoid creating FSM. What if
fsm_set_and_search gets called for the level other than 0? Also,
when the relation has blocks more than HEAP_FSM_EXTENSION_THRESHOLD,
then first time when vacuum will try to record the free space in the
page, won't it skip recording free space for first
HEAP_FSM_EXTENSION_THRESHOLD pages?

I think you have found a good way to avoid creating FSM, but can't we
use some simpler technique like if the FSM fork for a relation doesn't
exist, then check the heapblk number for which we try to update the
FSM and if it is lesser than HEAP_FSM_EXTENSION_THRESHOLD, then avoid
creating the FSM.

I couldn't find a non-extending equivalent of
XLogReadBufferExtended(), so I might have to create one.

I think it would be better if we can find a common way to avoid
creating FSM both during DO and REDO time. It might be possible if
somethin like what I have said above is feasible.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#8John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#7)
Re: WIP: Avoid creation of the free space map for small tables

On 10/13/18, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Oct 6, 2018 at 12:17 AM John Naylor <jcnaylor@gmail.com> wrote:

-For normal mode, I taught fsm_set_and_search() to switch to a
non-extending buffer call, but the biggest missing piece is WAL
replay.

fsm_set_and_search()
{
..
+ /*
+ * For heaps we prevent extension of the FSM unless the number of pages
+ * exceeds
HEAP_FSM_EXTENSION_THRESHOLD. For tables that don't already
+ * have a FSM, this will save an inode and a few kB
of space.
+ * For sane threshold values, the FSM address will be zero, so we
+ * don't bother dealing with
anything else.
+ */
+ if (rel->rd_rel->relkind == RELKIND_RELATION
+ && addr.logpageno == 0)

I am not sure if this is a solid way to avoid creating FSM. What if
fsm_set_and_search gets called for the level other than 0?

Thanks for taking a look. As for levels other than 0, I think that
only happens when fsm_set_and_search() is called by fsm_search(),
which will not cause extension.

Also,
when the relation has blocks more than HEAP_FSM_EXTENSION_THRESHOLD,
then first time when vacuum will try to record the free space in the
page, won't it skip recording free space for first
HEAP_FSM_EXTENSION_THRESHOLD pages?

Hmm, that's a good point.

I think you have found a good way to avoid creating FSM, but can't we
use some simpler technique like if the FSM fork for a relation doesn't
exist, then check the heapblk number for which we try to update the
FSM and if it is lesser than HEAP_FSM_EXTENSION_THRESHOLD, then avoid
creating the FSM.

I think I see what you mean, but to avoid the vacuum problem you just
mentioned, we'd need to check the relation size, too. I've attached an
unpolished revision to do this. It seems to work, but I haven't tested
the vacuum issue yet. I'll do that and some COPY performance testing
in the next day or so. There's a bit more repetition than I would
like, so I'm not sure it's simpler - perhaps RecordPageWithFreeSpace()
could be turned into a wrapper around RecordAndGetPageWithFreeSpace().

Also new in this version, some non-functional improvements to hio.c:
-debugging calls that are #ifdef'd out.
-move some code out into a function instead of adding another goto.

I couldn't find a non-extending equivalent of
XLogReadBufferExtended(), so I might have to create one.

I think it would be better if we can find a common way to avoid
creating FSM both during DO and REDO time. It might be possible if
somethin like what I have said above is feasible.

That would be ideal.

-John Naylor

Attachments:

v3-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchDownload+244-101
#9John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#8)
Re: WIP: Avoid creation of the free space map for small tables

On 10/13/18, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think you have found a good way to avoid creating FSM, but can't we
use some simpler technique like if the FSM fork for a relation doesn't
exist, then check the heapblk number for which we try to update the
FSM and if it is lesser than HEAP_FSM_EXTENSION_THRESHOLD, then avoid
creating the FSM.

I think it would be better if we can find a common way to avoid
creating FSM both during DO and REDO time. It might be possible if
somethin like what I have said above is feasible.

I've attached v4, which implements the REDO case, and as closely as
possible to the DO case. I've created a new function to guard against
creation of the FSM, which is called by RecordPageWithFreeSpace() and
RecordAndGetPageWithFreeSpace(). Since XLogRecordPageWithFreeSpace()
takes a relfilenode and not a relation, I had to reimplement that
separately, but the logic is basically the same. It works under
streaming replication.

I've also attached a couple SQL scripts which, when the aforementioned
DEBUG1 calls are enabled, show what the heap insert code is doing for
different scenarios. Make check-world passes.

-John Naylor

Attachments:

v4-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchDownload+229-82
test-fsm-first-10-blocks.sqlapplication/sql; name=test-fsm-first-10-blocks.sqlDownload
test-nofsm-first-block.sqlapplication/sql; name=test-nofsm-first-block.sqlDownload
#10Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#8)
Re: WIP: Avoid creation of the free space map for small tables

On Sun, Oct 14, 2018 at 1:09 AM John Naylor <jcnaylor@gmail.com> wrote:

On 10/13/18, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think you have found a good way to avoid creating FSM, but can't we
use some simpler technique like if the FSM fork for a relation doesn't
exist, then check the heapblk number for which we try to update the
FSM and if it is lesser than HEAP_FSM_EXTENSION_THRESHOLD, then avoid
creating the FSM.

I think I see what you mean, but to avoid the vacuum problem you just
mentioned, we'd need to check the relation size, too.

Sure, but vacuum already has relation size. In general, I think we
should try to avoid adding more system calls in the common code path.
It can impact the performance.

Few comments on your latest patch:
-
+static bool
+allow_write_to_fsm(Relation rel, BlockNumber heapBlk)
+{
+ BlockNumber heap_nblocks;
+
+ if (heapBlk > HEAP_FSM_EXTENSION_THRESHOLD ||
+ rel->rd_rel->relkind != RELKIND_RELATION)
+ return true;
+
+ /* XXX is this value cached? */
+ heap_nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (heap_nblocks > HEAP_FSM_EXTENSION_THRESHOLD)
+ return true;
+ else
+ {
+ RelationOpenSmgr(rel);
+ return smgrexists(rel->rd_smgr, FSM_FORKNUM);
+ }
+}

I think you can avoid calling RelationGetNumberOfBlocks, if you call
smgrexists before and for the purpose of vacuum, we can get that as an
input parameter. I think one can argue for not changing the interface
functions like RecordPageWithFreeSpace to avoid calling
RelationGetNumberOfBlocks, but to me, it appears worth to save the
additional system call.

-
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
-
- /*
- * If the FSM knows nothing of the rel, try the last page before we
- * give up and extend.  This avoids one-tuple-per-page syndrome during
- * bootstrapping or in a recently-started system.
- */
  if (targetBlock == InvalidBlockNumber)
- {
- BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
-
- if (nblocks > 0)
- targetBlock = nblocks - 1;
- }
+ targetBlock = get_page_no_fsm(relation, InvalidBlockNumber,
+   &try_every_page);

Is it possible to hide the magic of trying each block within
GetPageWithFreeSpace? It will simplify the code and in future, if
another storage API has a different function for
RelationGetBufferForTuple, it will work seamlessly, provided they are
using same FSM. One such user is zheap.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#11John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#10)
Re: WIP: Avoid creation of the free space map for small tables

On 10/15/18, Amit Kapila <amit.kapila16@gmail.com> wrote:

Few comments on your latest patch:
-
+static bool
+allow_write_to_fsm(Relation rel, BlockNumber heapBlk)
+{
+ BlockNumber heap_nblocks;
+
+ if (heapBlk > HEAP_FSM_EXTENSION_THRESHOLD ||
+ rel->rd_rel->relkind != RELKIND_RELATION)
+ return true;
+
+ /* XXX is this value cached? */
+ heap_nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (heap_nblocks > HEAP_FSM_EXTENSION_THRESHOLD)
+ return true;
+ else
+ {
+ RelationOpenSmgr(rel);
+ return smgrexists(rel->rd_smgr, FSM_FORKNUM);
+ }
+}

I think you can avoid calling RelationGetNumberOfBlocks, if you call
smgrexists before

Okay, I didn't know which was cheaper, but I'll check smgrexists
first. Thanks for the info.

and for the purpose of vacuum, we can get that as an
input parameter. I think one can argue for not changing the interface
functions like RecordPageWithFreeSpace to avoid calling
RelationGetNumberOfBlocks, but to me, it appears worth to save the
additional system call.

I agree, and that should be fairly straightforward. I'll include that
in the next patch.

-
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
-
- /*
- * If the FSM knows nothing of the rel, try the last page before we
- * give up and extend.  This avoids one-tuple-per-page syndrome during
- * bootstrapping or in a recently-started system.
- */
if (targetBlock == InvalidBlockNumber)
- {
- BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
-
- if (nblocks > 0)
- targetBlock = nblocks - 1;
- }
+ targetBlock = get_page_no_fsm(relation, InvalidBlockNumber,
+   &try_every_page);

Is it possible to hide the magic of trying each block within
GetPageWithFreeSpace? It will simplify the code and in future, if
another storage API has a different function for
RelationGetBufferForTuple, it will work seamlessly, provided they are
using same FSM. One such user is zheap.

Hmm, here I'm a bit more skeptical about the trade offs. That would
mean, in effect, to put a function called get_page_no_fsm() in the FSM
code. ;-) I'm willing to be convinced otherwise, of course, but
here's my reasoning:

For one, we'd have to pass prevBlockNumber and &try_every_block to
GetPageWithFreeSpace() (and RecordAndGetPageWithFreeSpace() by
extension), which are irrelevant to some callers. In addition, in
hio.c, there is a call where we don't want to try any blocks that we
have already, much less all of them:

/*
* Check if some other backend has extended a block for us while
* we were waiting on the lock.
*/
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);

By the time we get to this call, we likely wouldn't trigger the logic
to try every block, but I don't think we can guarantee that. We could
add a boolean parameter that means "consider trying every block", but
I don't think the FSM code should have so much state passed to it.

Thanks for reviewing,
-John Naylor

#12Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#11)
Re: WIP: Avoid creation of the free space map for small tables

On Mon, Oct 15, 2018 at 4:09 PM John Naylor <jcnaylor@gmail.com> wrote:

On 10/15/18, Amit Kapila <amit.kapila16@gmail.com> wrote:

Few comments on your latest patch:
-
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
-
- /*
- * If the FSM knows nothing of the rel, try the last page before we
- * give up and extend.  This avoids one-tuple-per-page syndrome during
- * bootstrapping or in a recently-started system.
- */
if (targetBlock == InvalidBlockNumber)
- {
- BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
-
- if (nblocks > 0)
- targetBlock = nblocks - 1;
- }
+ targetBlock = get_page_no_fsm(relation, InvalidBlockNumber,
+   &try_every_page);

Is it possible to hide the magic of trying each block within
GetPageWithFreeSpace? It will simplify the code and in future, if
another storage API has a different function for
RelationGetBufferForTuple, it will work seamlessly, provided they are
using same FSM. One such user is zheap.

Hmm, here I'm a bit more skeptical about the trade offs. That would
mean, in effect, to put a function called get_page_no_fsm() in the FSM
code. ;-) I'm willing to be convinced otherwise, of course, but
here's my reasoning:

For one, we'd have to pass prevBlockNumber and &try_every_block to
GetPageWithFreeSpace() (and RecordAndGetPageWithFreeSpace() by
extension), which are irrelevant to some callers.

I think we can avoid using prevBlockNumber and try_every_block, if we
maintain a small cache which tells whether the particular block is
tried or not. What I am envisioning is that while finding the block
with free space, if we came to know that the relation in question is
small enough that it doesn't have FSM, we can perform a local_search.
In local_seach, we can enquire the cache for any block that we can try
and if we find any block, we can try inserting in that block,
otherwise, we need to extend the relation. One simple way to imagine
such a cache would be an array of structure and structure has blkno
and status fields. After we get the usable block, we need to clear
the cache, if exists.

In addition, in
hio.c, there is a call where we don't want to try any blocks that we
have already, much less all of them:

/*
* Check if some other backend has extended a block for us while
* we were waiting on the lock.
*/
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);

By the time we get to this call, we likely wouldn't trigger the logic
to try every block, but I don't think we can guarantee that.

I think the current code as proposed has that limitation, but if we
have a cache, then we can check if the relation has actually extended
after taking the lock and if so we can try only newly added block/'s.

I am not completely sure if the idea described above is certainly
better, but it seems that it will be clean and can handle some of the
cases with ease.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#13John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#10)
Re: WIP: Avoid creation of the free space map for small tables

On 10/15/18, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think you can avoid calling RelationGetNumberOfBlocks, if you call
smgrexists before

This is done in the attached v5, 0001.

and for the purpose of vacuum, we can get that as an
input parameter. I think one can argue for not changing the interface
functions like RecordPageWithFreeSpace to avoid calling
RelationGetNumberOfBlocks, but to me, it appears worth to save the
additional system call.

This is done in 0002. I also added a check for the cached value of
pg_class.relpages, since it's cheap and may help non-VACUUM callers.

[proposal for a cache of blocks to try]

That's interesting. I'll have to do some reading elsewhere in the
codebase, and then I'll follow up.

Thanks,
-John Naylor

Attachments:

v5-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchDownload+236-82
v5-0002-Add-parameter-nblocks-to-RecordPageWithFreeSpace.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Add-parameter-nblocks-to-RecordPageWithFreeSpace.patchDownload+44-23
#14Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#13)
Re: WIP: Avoid creation of the free space map for small tables

On Tue, Oct 16, 2018 at 4:27 PM John Naylor <jcnaylor@gmail.com> wrote:

[proposal for a cache of blocks to try]

That's interesting. I'll have to do some reading elsewhere in the
codebase, and then I'll follow up.

Thanks, I have changed the status of this patch as "Waiting on
Author". Feel free to change it once you have a new patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#15John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#12)
Re: WIP: Avoid creation of the free space map for small tables

On 10/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we can avoid using prevBlockNumber and try_every_block, if we
maintain a small cache which tells whether the particular block is
tried or not. What I am envisioning is that while finding the block
with free space, if we came to know that the relation in question is
small enough that it doesn't have FSM, we can perform a local_search.
In local_seach, we can enquire the cache for any block that we can try
and if we find any block, we can try inserting in that block,
otherwise, we need to extend the relation. One simple way to imagine
such a cache would be an array of structure and structure has blkno
and status fields. After we get the usable block, we need to clear
the cache, if exists.

Here is the design I've implemented in the attached v6. There is more
code than v5, but there's a cleaner separation between freespace.c and
hio.c, as you preferred. I also think it's more robust. I've expended
some effort to avoid doing unnecessary system calls to get the number
of blocks.
--

For the local, in-memory map, maintain a static array of status
markers, of fixed-length HEAP_FSM_CREATION_THRESHOLD, indexed by block
number. This is populated every time we call GetPageWithFreeSpace() on
small tables with no FSM. The statuses are

'zero' (beyond the relation)
'available to try'
'tried already'

Example for a 4-page heap:

01234567
AAAA0000

If we try block 3 and there is no space, we set it to 'tried' and next
time through the loop we'll try 2, etc:

01234567
AAAT0000

If we try all available blocks, we will extend the relation. As in the
master branch, first we call GetPageWithFreeSpace() again to see if
another backend extended the relation to 5 blocks while we were
waiting for the lock. If we find a new block, we will mark the new
block available and leave the rest alone:

01234567
TTTTA000

On the odd chance we still can't insert into the new block, we'll skip
checking any others and we'll redo the logic to extend the relation.

If we're about to successfully return a buffer, whether from an
existing block, or by extension, we clear the local map.

Once this is in shape, I'll do some performance testing.

-John Naylor

Attachments:

v6-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchtext/x-patch; charset=US-ASCII; name=v6-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchDownload+344-84
#16John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#15)
Re: WIP: Avoid creation of the free space map for small tables

I wrote:

Once this is in shape, I'll do some performance testing.

On second thought, there's no point in waiting, especially if a
regression points to a design flaw.

I compiled patched postgres with HEAP_FSM_CREATION_THRESHOLD set to
32, then ran the attached script which populates 100 tables with
varying numbers of blocks. I wanted a test that created pages eagerly
and wrote to disk as little as possible. Config was stock, except for
fsync = off. I took the average of 10 runs after removing the slowest
and fastest run:

# blocks master patch
4 36.4ms 33.9ms
8 50.6ms 48.9ms
12 58.6ms 66.3ms
16 65.5ms 81.4ms

It seems under these circumstances a threshold of up to 8 performs
comparably to the master branch, with small block numbers possibly
faster than with the FSM, provided they're in shared buffers already.
I didn't bother testing higher values because it's clear there's a
regression starting around 10 or so, beyond which it helps to have the
FSM.

A case could be made for setting the threshold to 4, since not having
3 blocks of FSM in shared buffers exactly makes up for the 3 other
blocks of heap that are checked when free space runs out.

I can run additional tests if there's interest.

-John Naylor

Attachments:

fsm-copy-test.sqlapplication/sql; name=fsm-copy-test.sqlDownload
#17John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#16)
Re: WIP: Avoid creation of the free space map for small tables

Upthread I wrote:

-A possible TODO item is to teach pg_upgrade not to link FSMs for
small heaps. I haven't look into the feasibility of that, however.

This turned out to be relatively light weight (0002 attached). I had
to add relkind to the RelInfo struct and save the size of each heap as
its transferred. The attached SQL script will setup a couple test
cases to demonstrate pg_upgrade. Installations with large numbers of
small tables will be able to see space savings right away.

For 0001, I adjusted the README and docs, and also made some cosmetic
improvements to the code, mostly in the comments. I've set the
commitfest entry back to 'needs review'

One thing I noticed is that one behavior on master hasn't changed:
System catalogs created during bootstrap still have a FSM if they have
any data. Possibly related, catalogs also have a VM even if they have
no data at all. This isn't anything to get excited about, but it would
be nice to investigate, at least so it can be documented. A cursory
dig hasn't found the cause, but I'll keep doing that as time permits.

-John Naylor

Attachments:

v7-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchtext/x-patch; charset=US-ASCII; name=v7-0001-Avoid-creation-of-the-free-space-map-for-small-ta.patchDownload+360-95
v7-0002-During-pg_upgrade-skip-transfer-of-FSMs-if-they-w.patchtext/x-patch; charset=US-ASCII; name=v7-0002-During-pg_upgrade-skip-transfer-of-FSMs-if-they-w.patchDownload+47-26
setup-fsm-pg_upgrade.sqlapplication/sql; name=setup-fsm-pg_upgrade.sqlDownload
#18Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#17)
Re: WIP: Avoid creation of the free space map for small tables

On Wed, Oct 31, 2018 at 1:42 PM John Naylor <jcnaylor@gmail.com> wrote:

Upthread I wrote:

-A possible TODO item is to teach pg_upgrade not to link FSMs for
small heaps. I haven't look into the feasibility of that, however.

This turned out to be relatively light weight (0002 attached). I had
to add relkind to the RelInfo struct and save the size of each heap as
its transferred. The attached SQL script will setup a couple test
cases to demonstrate pg_upgrade. Installations with large numbers of
small tables will be able to see space savings right away.

For 0001, I adjusted the README and docs, and also made some cosmetic
improvements to the code, mostly in the comments. I've set the
commitfest entry back to 'needs review'

One thing I noticed is that one behavior on master hasn't changed:
System catalogs created during bootstrap still have a FSM if they have
any data. Possibly related, catalogs also have a VM even if they have
no data at all. This isn't anything to get excited about, but it would
be nice to investigate, at least so it can be documented. A cursory
dig hasn't found the cause, but I'll keep doing that as time permits.

Thanks for your work on this, I will try to review it during CF.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#19Robert Haas
robertmhaas@gmail.com
In reply to: John Naylor (#16)
Re: WIP: Avoid creation of the free space map for small tables

On Tue, Oct 23, 2018 at 9:42 AM John Naylor <jcnaylor@gmail.com> wrote:

A case could be made for setting the threshold to 4, since not having
3 blocks of FSM in shared buffers exactly makes up for the 3 other
blocks of heap that are checked when free space runs out.

That doesn't seem like an unreasonable argument. I'm not sure whether
the right threshold is 4 or something a little bigger, but I bet it's
not very large. It seems important to me that before anybody thinks
about committing this, we construct some kind of destruction case
where repeated scans of the whole table are triggered as frequently as
possible, and then run that test with varying thresholds. I might be
totally wrong, but I bet with a value as large as 32 you will be able
to find cases where it regresses in a big way.

We also need to think about what happens on the standby, where the FSM
is updated in a fairly different way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#19)
Re: WIP: Avoid creation of the free space map for small tables

On Wed, Oct 31, 2018 at 10:29 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 23, 2018 at 9:42 AM John Naylor <jcnaylor@gmail.com> wrote:

A case could be made for setting the threshold to 4, since not having
3 blocks of FSM in shared buffers exactly makes up for the 3 other
blocks of heap that are checked when free space runs out.

That doesn't seem like an unreasonable argument. I'm not sure whether
the right threshold is 4 or something a little bigger, but I bet it's
not very large. It seems important to me that before anybody thinks
about committing this, we construct some kind of destruction case
where repeated scans of the whole table are triggered as frequently as
possible, and then run that test with varying thresholds.

Why do you think repeated scans will be a destruction case when there
is no FSM for a small table?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#21Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#21)
#23Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#22)
#24John Naylor
john.naylor@enterprisedb.com
In reply to: Robert Haas (#23)
#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#21)
#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#22)
#27Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#23)
#28Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#27)
#29John Naylor
john.naylor@enterprisedb.com
In reply to: Robert Haas (#19)
#30John Naylor
john.naylor@enterprisedb.com
In reply to: Robert Haas (#23)
#31Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#15)
#32John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#31)
#33Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#32)
#34John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#33)
#35Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#34)
#36John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#34)
#37Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#36)
#38John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#31)
#39Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#38)
#40John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#39)
#41Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#40)
#42John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#41)
#43Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#42)
#44John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#42)
#45Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#44)
#46John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#45)
#47Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#42)
#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#47)
#49John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#47)
#50John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#48)
#51Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#50)
#52John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#51)
#53Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#52)
#54John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#53)
#55Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#54)
#56John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#55)
#57John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#39)
#58Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#57)
#59Mithun Cy
mithun.cy@enterprisedb.com
In reply to: Amit Kapila (#55)
#60John Naylor
john.naylor@enterprisedb.com
In reply to: Mithun Cy (#59)
#61Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#60)
#62Mithun Cy
mithun.cy@enterprisedb.com
In reply to: John Naylor (#52)
#63Mithun Cy
mithun.cy@enterprisedb.com
In reply to: John Naylor (#60)
#64Mithun Cy
mithun.cy@enterprisedb.com
In reply to: John Naylor (#60)
#65Amit Kapila
amit.kapila16@gmail.com
In reply to: Mithun Cy (#63)
#66John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#65)
#67John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#66)
#68Mithun Cy
mithun.cy@enterprisedb.com
In reply to: John Naylor (#66)
#69John Naylor
john.naylor@enterprisedb.com
In reply to: Mithun Cy (#68)
#70Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#69)
#71John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#70)
#72Mithun Cy
mithun.cy@enterprisedb.com
In reply to: John Naylor (#71)
#73Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#71)
#74Amit Kapila
amit.kapila16@gmail.com
In reply to: Mithun Cy (#72)
#75John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#73)
#76John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#75)
#77Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Naylor (#76)
#78Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#77)
#79Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#75)
#80John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#79)
#81Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#80)
#82John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#81)
#83Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#82)
#84John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#83)
#85Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#82)
#86John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#85)
#87John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#83)
#88Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#86)
#89Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#87)
#90Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#89)
#91John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#89)
#92Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#91)
#93John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#92)
#94John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#92)
#95Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#93)
#96John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#95)
#97Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#88)
#98John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#97)
#99Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#98)
#100John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#99)
#101Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#100)
#102Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#100)
#103Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#102)
#104Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Amit Kapila (#103)
#105John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#102)
#106Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#105)
#107Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#106)
#108Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#106)
#109Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#108)
#110John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#107)
#111Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#110)
#112John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#111)
#113Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#112)
#114John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#113)
#115Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#111)
#116Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#114)
#117Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#115)
#118John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#116)
#119John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#117)
#120Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#118)
#121Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#119)
#122John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#121)
#123John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#122)
#124Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#117)
#125Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#123)
#126John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#125)
#127Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#126)
#128Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#102)
#129Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#128)
#130Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#120)
#131John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#130)
#132Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#131)
#133Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#132)
#134Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#133)
#135John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#132)
#136Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#135)
#137Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#136)
#138John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#137)
#139Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#138)
#140John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#139)
#141Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#140)
#142John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#141)
#143Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#142)
#144John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#143)
#145Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#142)
#146Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Kapila (#145)
#147Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#146)
#148Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Kapila (#147)
#149Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#142)
#150John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#149)
#151Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#148)
#152Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Kapila (#151)
#153Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#152)
#154Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#153)
#155Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Kapila (#153)
In reply to: Alvaro Herrera (#155)
#157Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#156)
#158John Naylor
john.naylor@enterprisedb.com
In reply to: Alvaro Herrera (#152)
#159John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#153)
#160Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: John Naylor (#159)
#161Petr Jelinek
petr@2ndquadrant.com
In reply to: Alvaro Herrera (#160)
#162John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#92)
#163Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#158)
#164Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#96)
#165Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#162)
#166Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#164)
#167John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#164)
#168Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#163)
#169Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#167)
#170John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#164)
#171Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#170)
#172John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#171)
#173Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#172)
#174John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#173)
#175Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#174)
#176John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#175)
#177Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#176)
#178Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#177)
#179John Naylor
john.naylor@enterprisedb.com
In reply to: Amit Kapila (#178)
#180Amit Kapila
amit.kapila16@gmail.com
In reply to: John Naylor (#179)
#181Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#178)