Unhappy about API changes in the no-fsm-for-small-rels patch
Hi,
I'm somewhat unhappy about how much the no-fsm-for-small-rels patch
exposed complexity to callers that looks like it should live purely in
freespace.c.
extern Size GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk);
-extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded);
+extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded,
+ bool check_fsm_only);
So now freespace.c has an argument that says we should only check the
fsm. That's confusing. And it's not explained to callers what that
argument means, and when it should be set.
@@ -176,20 +269,44 @@ RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
* Note that if the new spaceAvail value is higher than the old value stored
* in the FSM, the space might not become visible to searchers until the next
* FreeSpaceMapVacuum call, which updates the upper level pages.
+ *
+ * Callers have no need for a local map.
*/
void
-RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
+RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk,
+ Size spaceAvail, BlockNumber nblocks)
There's no explanation as to what that "nblocks" argument is. One
basically has to search other callers to figure it out. It's not even
clear which fork it relates to, nor that one can set it to
InvalidBlockNumber if one doesn't have the relation size conveniently
reachable. But it's not exposed to RecordAndGetPageWithFreeSpace(), for
a basically unexplained reason. There's a comment above
fsm_allow_writes() - but that's a file-local function that external
callers shouldn't need to know about.
I can't figure out what "Callers have no need for a local map." is
supposed to mean.
+/*
+ * Clear the local map. We must call this when we have found a block with
+ * enough free space, when we extend the relation, or on transaction abort.
+ */
+void
+FSMClearLocalMap(void)
+{
+ if (FSM_LOCAL_MAP_EXISTS)
+ {
+ fsm_local_map.nblocks = 0;
+ memset(&fsm_local_map.map, FSM_LOCAL_NOT_AVAIL,
+ sizeof(fsm_local_map.map));
+ }
+}
+
So now there's a new function one needs to call after successfully using
the block returned by [RecordAnd]GetPageWithFreeSpace(). But it's not
referenced from those functions, so basically one has to just know that.
+/* Only create the FSM if the heap has greater than this many blocks */
+#define HEAP_FSM_CREATION_THRESHOLD 4
Hm, this seems to be tying freespace.c closer to heap than I think is
great - think of new AMs like zheap, that also want to use it.
I think this is mostly fallout from the prime issue I'm unhappy
about. There's now some global variable in freespace.c that code
using freespace.c has to know about and maintain.
+static void
+fsm_local_set(Relation rel, BlockNumber cur_nblocks)
+{
+ BlockNumber blkno,
+ cached_target_block;
+
+ /* The local map must not be set already. */
+ Assert(!FSM_LOCAL_MAP_EXISTS);
+
+ /*
+ * Starting at the current last block in the relation and working
+ * backwards, mark alternating blocks as available.
+ */
+ blkno = cur_nblocks - 1;
That comment explains very little about why this is done, and why it's a
good idea.
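For context, the quoted function goes on to mark every other block, newest
first. Here is a self-contained sketch of that alternating marking; the
constants and names mirror the quoted patch, but this is an illustration,
not the actual PostgreSQL code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HEAP_FSM_CREATION_THRESHOLD 4
#define FSM_LOCAL_NOT_AVAIL 0x00
#define FSM_LOCAL_AVAIL     0x01

typedef struct
{
    uint32_t nblocks;
    uint8_t  map[HEAP_FSM_CREATION_THRESHOLD];
} FSMLocalMapSketch;

static FSMLocalMapSketch fsm_local_map_sketch;

/*
 * Starting at the current last block and working backwards, mark
 * alternating blocks as available, so at most half the pages are
 * probed before giving up and extending the relation.
 */
static void
fsm_local_set_sketch(uint32_t cur_nblocks)
{
    uint32_t blkno;

    memset(fsm_local_map_sketch.map, FSM_LOCAL_NOT_AVAIL,
           sizeof(fsm_local_map_sketch.map));
    fsm_local_map_sketch.nblocks = cur_nblocks;

    if (cur_nblocks == 0)
        return;

    /* visit blkno = cur_nblocks-1, cur_nblocks-3, ... */
    for (blkno = cur_nblocks - 1;; blkno -= 2)
    {
        fsm_local_map_sketch.map[blkno] = FSM_LOCAL_AVAIL;
        if (blkno < 2)
            break;
    }
}
```

With four blocks this marks blocks 3 and 1 as available and skips 2 and 0,
which is the "check only every second page" behavior discussed below.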
+/* Status codes for the local map. */
+
+/* Either already tried, or beyond the end of the relation */
+#define FSM_LOCAL_NOT_AVAIL 0x00
+
+/* Available to try */
+#define FSM_LOCAL_AVAIL 0x01
+/* Local map of block numbers for small heaps with no FSM. */
+typedef struct
+{
+ BlockNumber nblocks;
+ uint8 map[HEAP_FSM_CREATION_THRESHOLD];
+} FSMLocalMap;
+
Hm, given realistic HEAP_FSM_CREATION_THRESHOLD, and the fact that we
really only need one bit per page, it seems like map should really
be just a uint32 with one bit per page.
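That suggestion could look roughly like this minimal sketch (hypothetical
names, not code from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the suggested representation: a single uint32 bitmask with
 * one bit per page, instead of a uint8 array sized by the heap-specific
 * HEAP_FSM_CREATION_THRESHOLD.  Supports thresholds up to 32 blocks
 * without extra memory.
 */
typedef struct
{
    uint32_t nblocks;   /* pages covered by the map */
    uint32_t map;       /* bit N set => block N is still worth trying */
} FSMLocalBitmap;

static inline void
local_map_set_avail(FSMLocalBitmap *m, uint32_t blkno)
{
    m->map |= ((uint32_t) 1) << blkno;
}

static inline void
local_map_clear_avail(FSMLocalBitmap *m, uint32_t blkno)
{
    m->map &= ~(((uint32_t) 1) << blkno);
}

static inline bool
local_map_is_avail(const FSMLocalBitmap *m, uint32_t blkno)
{
    return (m->map >> blkno) & 1;
}
```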
+static bool
+fsm_allow_writes(Relation rel, BlockNumber heapblk,
+ BlockNumber nblocks, BlockNumber *get_nblocks)
+ if (rel->rd_rel->relpages != InvalidBlockNumber &&
+ rel->rd_rel->relpages > HEAP_FSM_CREATION_THRESHOLD)
+ return true;
+ else
+ skip_get_nblocks = false;
+ }
This badly needs a comment explaining that these values can be
basically arbitrarily out of date, and why it's correct to rely on them
anyway (presumably because creating an fsm unnecessarily is ok; it just
avoids using this optimization).
+static bool
+fsm_allow_writes(Relation rel, BlockNumber heapblk,
+ BlockNumber nblocks, BlockNumber *get_nblocks)
+ RelationOpenSmgr(rel);
+ if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ return true;
Isn't this really expensive? mdexists() closes the relation and
reopens it from scratch. Shouldn't we at the very least check
smgr_fsm_nblocks beforehand, so this is only done once?
I'm kinda thinking that this is the wrong architecture.
1) Unless I miss something, this will trigger a
RelationGetNumberOfBlocks(), which in turn ends up doing an lseek(),
once for each page we add to the relation. That strikes me as pretty
suboptimal. I think it's even worse if multiple backends do the
insertion, because the RelationGetTargetBlock(relation) logic will
succeed less often.
2) We'll repeatedly re-encounter the first few pages, because we clear
the local map after each successful RelationGetBufferForTuple().
3) The global variable based approach means that we cannot easily do
better. Even if we had a cache that lives across
RelationGetBufferForTuple() calls.
4) fsm_allow_writes() does a smgrexists() for the FSM in some
cases. That's pretty darn expensive if it's already open.
I think if we want to keep something like this feature, we'd need to
cache the relation size in a variable similar to how we cache the FSM
size (like SMgrRelationData.smgr_fsm_nblocks) *and* stash the bitmap of
pages that we think are used/free as a bitmap somewhere below the
relcache. If we cleared that variable at truncations, I think we should
be able to make that work reasonably well?
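A minimal sketch of that proposed caching scheme, with hypothetical names
(the real relcache/smgr integration and invalidation plumbing would of
course differ):

```c
#include <assert.h>
#include <stdint.h>

#define InvalidBlockNumber ((uint32_t) 0xFFFFFFFF)

/*
 * Sketch of a per-relation cache: remember the last known relation size
 * (like SMgrRelationData.smgr_fsm_nblocks caches the FSM fork size),
 * plus a bitmap of pages believed to have free space.
 */
typedef struct
{
    uint32_t cached_nblocks;   /* InvalidBlockNumber if unknown */
    uint32_t free_pages;       /* bit N set => page N may have space */
} RelSizeCacheSketch;

static void
relsize_cache_invalidate(RelSizeCacheSketch *c)
{
    /* on truncation (or any invalidation), forget everything */
    c->cached_nblocks = InvalidBlockNumber;
    c->free_pages = 0;
}

static uint32_t
relsize_cache_get_nblocks(RelSizeCacheSketch *c, uint32_t actual_nblocks)
{
    /*
     * Only consult the (expensive) lseek-based size, passed in here as
     * actual_nblocks, when the cache is cold; otherwise reuse the
     * cached value and avoid the syscall.
     */
    if (c->cached_nblocks == InvalidBlockNumber)
        c->cached_nblocks = actual_nblocks;
    return c->cached_nblocks;
}
```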
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
I'm kinda thinking that this is the wrong architecture.
The bits of that patch that I've looked at seemed like a mess
to me too. AFAICT, it's trying to use a single global "map"
for all relations (strike 1) without any clear tracking of
which relation the map currently describes (strike 2).
This can only work at all if an inaccurate map is very fail-soft,
which I'm not convinced it is, and in any case it seems pretty
inefficient for workloads that insert into multiple tables.
I'd have expected any such map to be per-table and be stored in
the relcache.
regards, tom lane
Hi,
On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I'm kinda thinking that this is the wrong architecture.
The bits of that patch that I've looked at seemed like a mess
to me too. AFAICT, it's trying to use a single global "map"
for all relations (strike 1) without any clear tracking of
which relation the map currently describes (strike 2).
Well, strike 2 basically is not a problem right now, because the map is
cleared whenever a search for a target buffer succeeded. But that has
pretty obvious efficiency issues...
This can only work at all if an inaccurate map is very fail-soft,
which I'm not convinced it is
I think it needs to be fail-soft independent of the no-fsm patch.
Because the fsm is not WAL logged etc., it's pretty easy to end up with
a pretty corrupted version. And we'd better deal with that.
and in any case it seems pretty inefficient for workloads that insert
into multiple tables.
As is, it's inefficient for insertions into a *single* relation. The
RelationGetTargetBlock() makes it not crazily expensive, but it's still
plenty expensive.
I'd have expected any such map to be per-table and be stored in
the relcache.
Same.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
This can only work at all if an inaccurate map is very fail-soft,
which I'm not convinced it is
I think it needs to be fail-soft independent of the no-fsm patch.
Because the fsm is not WAL logged etc., it's pretty easy to end up with
a pretty corrupted version. And we'd better deal with that.
Yes, FSM has to be fail-soft from a *correctness* viewpoint; but it's
not fail-soft from a *performance* viewpoint. It can take a while for
us to self-heal a busted map. And this fake map spends almost all its
time busted and in need of (expensive) corrections. I think this may
actually be the same performance complaint you're making, in different
words.
regards, tom lane
On Wed, Apr 17, 2019 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
I'm somewhat unhappy about how much the no-fsm-for-small-rels patch
exposed complexity to callers that looks like it should live purely in
freespace.c.

extern Size GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk);
-extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded);
+extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded,
+                                        bool check_fsm_only);

So now freespace.c has an argument that says we should only check the
fsm. That's confusing. And it's not explained to callers what that
argument means, and when it should be set.
When first looking for free space, it's "false": Within
GetPageWithFreeSpace(), we call RelationGetNumberOfBlocks() if the FSM
returns invalid.
If we have to extend, after acquiring the lock to extend the relation,
we call GetPageWithFreeSpace() again to see if another backend already
extended while waiting on the lock. If there's no FSM, the thinking
is, it's not worth it to get the number of blocks again.
@@ -176,20 +269,44 @@ RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
  * Note that if the new spaceAvail value is higher than the old value stored
  * in the FSM, the space might not become visible to searchers until the next
  * FreeSpaceMapVacuum call, which updates the upper level pages.
+ *
+ * Callers have no need for a local map.
  */
 void
-RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
+RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk,
+                        Size spaceAvail, BlockNumber nblocks)

There's no explanation as to what that "nblocks" argument is. One
basically has to search other callers to figure it out. It's not even
clear which fork it relates to, nor that one can set it to
InvalidBlockNumber if one doesn't have the relation size conveniently
reachable. But it's not exposed to RecordAndGetPageWithFreeSpace(), for
a basically unexplained reason. There's a comment above
fsm_allow_writes() - but that's a file-local function that external
callers shouldn't need to know about.
Okay.
I can't figure out what "Callers have no need for a local map." is
supposed to mean.
It was meant to contrast with [RecordAnd]GetPageWithFreeSpace(), but I
see how it's confusing.
+/*
+ * Clear the local map. We must call this when we have found a block with
+ * enough free space, when we extend the relation, or on transaction abort.
+ */
+void
+FSMClearLocalMap(void)
+{
+    if (FSM_LOCAL_MAP_EXISTS)
+    {
+        fsm_local_map.nblocks = 0;
+        memset(&fsm_local_map.map, FSM_LOCAL_NOT_AVAIL,
+               sizeof(fsm_local_map.map));
+    }
+}

So now there's a new function one needs to call after successfully using
the block returned by [RecordAnd]GetPageWithFreeSpace(). But it's not
referenced from those functions, so basically one has to just know that.
Right.
+/* Only create the FSM if the heap has greater than this many blocks */
+#define HEAP_FSM_CREATION_THRESHOLD 4

Hm, this seems to be tying freespace.c closer to heap than I think is
great - think of new AMs like zheap, that also want to use it.
Amit and I kept zheap in mind when working on the patch. You'd have to
work around the metapage, but everything else should work the same.
I think this is mostly fallout from the prime issue I'm unhappy
about. There's now some global variable in freespace.c that code
using freespace.c has to know about and maintain.

+static void
+fsm_local_set(Relation rel, BlockNumber cur_nblocks)
+{
+    BlockNumber blkno,
+                cached_target_block;
+
+    /* The local map must not be set already. */
+    Assert(!FSM_LOCAL_MAP_EXISTS);
+
+    /*
+     * Starting at the current last block in the relation and working
+     * backwards, mark alternating blocks as available.
+     */
+    blkno = cur_nblocks - 1;

That comment explains very little about why this is done, and why it's a
good idea.
Short answer: performance -- it's too expensive to try every block.
The explanation is in storage/freespace/README -- maybe that should be
referenced here?
+/* Status codes for the local map. */
+
+/* Either already tried, or beyond the end of the relation */
+#define FSM_LOCAL_NOT_AVAIL 0x00
+
+/* Available to try */
+#define FSM_LOCAL_AVAIL 0x01

+/* Local map of block numbers for small heaps with no FSM. */
+typedef struct
+{
+    BlockNumber nblocks;
+    uint8       map[HEAP_FSM_CREATION_THRESHOLD];
+} FSMLocalMap;

Hm, given realistic HEAP_FSM_CREATION_THRESHOLD, and the fact that we
really only need one bit per page, it seems like map should really
be just a uint32 with one bit per page.
I fail to see the advantage of that.
+static bool
+fsm_allow_writes(Relation rel, BlockNumber heapblk,
+                 BlockNumber nblocks, BlockNumber *get_nblocks)

+    if (rel->rd_rel->relpages != InvalidBlockNumber &&
+        rel->rd_rel->relpages > HEAP_FSM_CREATION_THRESHOLD)
+        return true;
+    else
+        skip_get_nblocks = false;
+    }

This badly needs a comment explaining that these values can be
basically arbitrarily out of date, and why it's correct to rely on them
anyway (presumably because creating an fsm unnecessarily is ok; it just
avoids using this optimization).
Agreed, and yes, your presumption is what I had in mind.
I'm kinda thinking that this is the wrong architecture.
1) Unless I miss something, this will trigger a
RelationGetNumberOfBlocks(), which in turn ends up doing an lseek(),
once for each page we add to the relation.

That was true previously anyway if the FSM returned InvalidBlockNumber.
That strikes me as pretty
suboptimal. I think it's even worse if multiple backends do the
insertion, because the RelationGetTargetBlock(relation) logic will
succeed less often.
Could you explain why it would succeed less often?
2) We'll repeatedly re-encounter the first few pages, because we clear
the local map after each successful RelationGetBufferForTuple().
Not exactly sure what you mean? We only set the map if
RelationGetTargetBlock() returns InvalidBlockNumber, or if it returned
a valid block, and inserting there already failed. So, not terribly
often, I imagine.
3) The global variable based approach means that we cannot easily do
better. Even if we had a cache that lives across
RelationGetBufferForTuple() calls.

4) fsm_allow_writes() does a smgrexists() for the FSM in some
cases. That's pretty darn expensive if it's already open.
(from earlier)
Isn't this like really expensive? mdexists() closes the relations and
reopens it from scratch. Shouldn't we at the very least check
smgr_fsm_nblocks beforehand, so this is only done once?
Hmm, I can look into that.
On Wed, Apr 17, 2019 at 3:16 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
<snip>
and in any case it seems pretty inefficient for workloads that insert
into multiple tables.

As is, it's inefficient for insertions into a *single* relation. The
RelationGetTargetBlock() makes it not crazily expensive, but it's still
plenty expensive.
Performance testing didn't reveal any performance regression. If you
have a realistic benchmark in mind that stresses this logic more
heavily, I'd be happy to be convinced otherwise.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 17, 2019 at 10:39 AM John Naylor
<john.naylor@2ndquadrant.com> wrote:
On Wed, Apr 17, 2019 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:
+/* Only create the FSM if the heap has greater than this many blocks */
+#define HEAP_FSM_CREATION_THRESHOLD 4

Hm, this seems to be tying freespace.c closer to heap than I think is
great - think of new AMs like zheap, that also want to use it.

Amit and I kept zheap in mind when working on the patch. You'd have to
work around the metapage, but everything else should work the same.
I think we also need to take care of TPD pages along with the meta
page. This might be less effective if we encounter TPD pages in a
small relation, which shouldn't be a common scenario, but it won't
hurt otherwise. Those pages are anyway temporary and will be
removed.
BTW there is one other thing which is tied to heap in FSM for which we
might want some handling.
#define MaxFSMRequestSize MaxHeapTupleSize
In general, it would be good to find some pluggable way to handle both
of these defines, but even otherwise it shouldn't cause a big problem.
I'm kinda thinking that this is the wrong architecture.

1) Unless I miss something, this will trigger a
RelationGetNumberOfBlocks(), which in turn ends up doing an lseek(),
once for each page we add to the relation.

That was true previously anyway if the FSM returned InvalidBlockNumber.

That strikes me as pretty
suboptimal. I think it's even worse if multiple backends do the
insertion, because the RelationGetTargetBlock(relation) logic will
succeed less often.

Could you explain why it would succeed less often?

2) We'll repeatedly re-encounter the first few pages, because we clear
the local map after each successful RelationGetBufferForTuple().

Not exactly sure what you mean? We only set the map if
RelationGetTargetBlock() returns InvalidBlockNumber, or if it returned
a valid block, and inserting there already failed. So, not terribly
often, I imagine.

3) The global variable based approach means that we cannot easily do
better. Even if we had a cache that lives across
RelationGetBufferForTuple() calls.

4) fsm_allow_writes() does a smgrexists() for the FSM in some
cases. That's pretty darn expensive if it's already open.

(from earlier)
Isn't this like really expensive? mdexists() closes the relations and
reopens it from scratch. Shouldn't we at the very least check
smgr_fsm_nblocks beforehand, so this is only done once?

Hmm, I can look into that.
I think if we want to keep something like this feature, we'd need to
cache the relation size in a variable similar to how we cache the FSM
size (like SMgrRelationData.smgr_fsm_nblocks)
Makes sense. I think we should do this unless we face any problem with it.
*and* stash the bitmap of
pages that we think are used/free as a bitmap somewhere below the
relcache.
I think maintaining at relcache level will be tricky when there are
insertions and deletions happening in the small relation. We have
considered such a case during development wherein we don't want the
FSM to be created if there are insertions and deletions in small
relation. The current mechanism addresses both this and normal case
where there are not many deletions. Sure, there is some overhead of
building the map again and rechecking each page. The first is a
memory operation and takes a few cycles; for the second, we
optimized by checking alternating pages, which means we won't
check more than two pages at a time. This cost is paid for by not
checking the FSM, and it could be somewhat better in some cases [1].
If we cleared that variable at truncations, I think we should
be able to make that work reasonably well?
Not only that, I think it needs to be cleared whenever we create the
FSM as well, which could be tricky as it can be created by vacuum.
OTOH, if we want to extend it later for whatever reason to a relation
level cache, it shouldn't be that difficult as the implementation is
mostly contained in freespace.c (fsm* functions) and I think the
relation is accessible in most places. We might need to rip out some
calls to FSMClearLocalMap().
On Wed, Apr 17, 2019 at 3:16 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
<snip>
and in any case it seems pretty inefficient for workloads that insert
into multiple tables.

As is, it's inefficient for insertions into a *single* relation. The
RelationGetTargetBlock() makes it not crazily expensive, but it's still
plenty expensive.
During development, we were also worried about the performance
regression that could happen due to this patch, and we did many
rounds of performance tests where such a cache could be accessed
pretty frequently. In the original version, we did see a small
regression, as a result of which we came up with an alternate strategy
of not checking every page. If you want, I can share links to the
performance-testing emails.
Performance testing didn't reveal any performance regression. If you
have a realistic benchmark in mind that stresses this logic more
heavily, I'd be happy to be convinced otherwise.
In fact, we have seen some wins. See the performance testing done [1]
with various approaches during development.
Added this as an open item.
[1]: /messages/by-id/CAD__Oui5+qiVxJSJqiXq2jA60QV8PKxrZA8_W+cCxROGAFJMWA@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,
On 2019-04-17 13:09:05 +0800, John Naylor wrote:
On Wed, Apr 17, 2019 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
I'm somewhat unhappy about how much the no-fsm-for-small-rels patch
exposed complexity to callers that looks like it should live purely in
freespace.c.

extern Size GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk);
-extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded);
+extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded,
+                                        bool check_fsm_only);

So now freespace.c has an argument that says we should only check the
fsm. That's confusing. And it's not explained to callers what that
argument means, and when it should be set.

When first looking for free space, it's "false": Within
GetPageWithFreeSpace(), we call RelationGetNumberOfBlocks() if the FSM
returns invalid.

If we have to extend, after acquiring the lock to extend the relation,
we call GetPageWithFreeSpace() again to see if another backend already
extended while waiting on the lock. If there's no FSM, the thinking
is, it's not worth it to get the number of blocks again.
I can get that (after reading through the code, grepping through all
callers, etc), but it means that every callsite needs to understand
that. That's making the API more complicated than needed, especially
when we're going to grow more callers.
+/* Only create the FSM if the heap has greater than this many blocks */
+#define HEAP_FSM_CREATION_THRESHOLD 4

Hm, this seems to be tying freespace.c closer to heap than I think is
great - think of new AMs like zheap, that also want to use it.

Amit and I kept zheap in mind when working on the patch. You'd have to
work around the metapage, but everything else should work the same.
My complaint is basically that it's apparently AM specific (we don't use
the logic for e.g. indexes), and that the name suggests it's specific to
heap. And it's not controllable from the outside, which means it can't be
tuned for the specific usecase.
I think this is mostly fallout from the prime issue I'm unhappy
about. There's now some global variable in freespace.c that code
using freespace.c has to know about and maintain.

+static void
+fsm_local_set(Relation rel, BlockNumber cur_nblocks)
+{
+    BlockNumber blkno,
+                cached_target_block;
+
+    /* The local map must not be set already. */
+    Assert(!FSM_LOCAL_MAP_EXISTS);
+
+    /*
+     * Starting at the current last block in the relation and working
+     * backwards, mark alternating blocks as available.
+     */
+    blkno = cur_nblocks - 1;

That comment explains very little about why this is done, and why it's a
good idea.

Short answer: performance -- it's too expensive to try every block.
The explanation is in storage/freespace/README -- maybe that should be
referenced here?
Yes. Even just adding "for performance reasons, only try every second
block. See also the README" would be good.

But I'll note that the need to do this - and potentially waste space -
counters your claim that there are no performance considerations with
this patch.
+/* Status codes for the local map. */
+
+/* Either already tried, or beyond the end of the relation */
+#define FSM_LOCAL_NOT_AVAIL 0x00
+
+/* Available to try */
+#define FSM_LOCAL_AVAIL 0x01

+/* Local map of block numbers for small heaps with no FSM. */
+typedef struct
+{
+    BlockNumber nblocks;
+    uint8       map[HEAP_FSM_CREATION_THRESHOLD];
+} FSMLocalMap;

Hm, given realistic HEAP_FSM_CREATION_THRESHOLD, and the fact that we
really only need one bit per page, it seems like map should really
be just a uint32 with one bit per page.

I fail to see the advantage of that.
It'd allow different AMs to have different numbers of dont-create-fsm
thresholds without needing additional memory (up to 32 blocks).
I'm kinda thinking that this is the wrong architecture.

1) Unless I miss something, this will trigger a
RelationGetNumberOfBlocks(), which in turn ends up doing an lseek(),
once for each page we add to the relation.

That was true previously anyway if the FSM returned InvalidBlockNumber.
True. That was already pretty annoying though.
That strikes me as pretty
suboptimal. I think it's even worse if multiple backends do the
insertion, because the RelationGetTargetBlock(relation) logic will
succeed less often.

Could you explain why it would succeed less often?
Two aspects: 1) If more backends access a table, there'll be a higher
chance the target page is full; 2) there are more backends that don't
have a target page.
2) We'll repeatedly re-encounter the first few pages, because we clear
the local map after each successful RelationGetBufferForTuple().

Not exactly sure what you mean? We only set the map if
RelationGetTargetBlock() returns InvalidBlockNumber, or if it returned
a valid block, and inserting there already failed. So, not terribly
often, I imagine.
It's pretty common to have small tables that are modified by a number of
backends. A typical case is tables that implement locks for external
processes and such.
On Wed, Apr 17, 2019 at 3:16 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
<snip>
and in any case it seems pretty inefficient for workloads that insert
into multiple tables.

As is, it's inefficient for insertions into a *single* relation. The
RelationGetTargetBlock() makes it not crazily expensive, but it's still
plenty expensive.

Performance testing didn't reveal any performance regression. If you
have a realistic benchmark in mind that stresses this logic more
heavily, I'd be happy to be convinced otherwise.
Well, try a few hundred relations on nfs (where stat is much more
expensive). Or just pgbench a concurrent workload with a few tables with
one live row each, updated by backends (to simulate lock tables and
such).
But also, my concern here is to a significant degree architectural,
rather than already measurable performance regressions. We ought to work
towards eliminating unnecessary syscalls, not the opposite.
Greetings,
Andres Freund
Hi,
On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
*and* stash the bitmap of
pages that we think are used/free as a bitmap somewhere below the
relcache.

I think maintaining at relcache level will be tricky when there are
insertions and deletions happening in the small relation. We have
considered such a case during development wherein we don't want the
FSM to be created if there are insertions and deletions in small
relation. The current mechanism addresses both this and normal case
where there are not many deletions. Sure there is some overhead of
building the map again and rechecking each page. The first one is a
memory operation and takes a few cycles
Yea, I think creating / resetting the map is basically free.
I'm not sure I buy the concurrency issue - ISTM it'd be perfectly
reasonable to cache the local map (in the relcache) and use it for local
FSM queries, and rebuild it from scratch once no space is found. That'd
avoid a lot of repeated checking of relation size for small, but
commonly changed relations. Add a pre-check of smgr_fsm_nblocks (if >
0, there has to be an fsm), and there should be fewer syscalls.
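That pre-check could be sketched like this; the stub stands in for the
real smgrexists()/mdexists() probe, and all names here are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define InvalidBlockNumber ((uint32_t) 0xFFFFFFFF)

/* stand-in for SMgrRelationData, keeping only the cached FSM size */
typedef struct
{
    uint32_t smgr_fsm_nblocks;  /* InvalidBlockNumber if not yet known */
} SMgrRelStub;

static int probe_count = 0;

/* stand-in for the expensive smgrexists()/mdexists() probe */
static bool
expensive_fsm_exists(void)
{
    probe_count++;
    return true;
}

static bool
fsm_exists_with_precheck(SMgrRelStub *smgr)
{
    /*
     * A cached positive FSM size proves the fork exists, so we can skip
     * the expensive existence probe entirely.
     */
    if (smgr->smgr_fsm_nblocks != InvalidBlockNumber &&
        smgr->smgr_fsm_nblocks > 0)
        return true;
    return expensive_fsm_exists();
}
```

Once the cached size is known, repeated calls avoid the probe, which is
the "fewer syscalls" point being made above.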
and for the second we optimized by checking the pages alternatively
which means we won't check more than two pages at-a-time. This cost
is paid by not checking FSM and it could be somewhat better in some
cases [1].
Well, it's also paid by potentially higher bloat, because the
intermediate pages aren't tested.
If we cleared that variable at truncations, I think we should
be able to make that work reasonably well?

Not only that, I think it needs to be cleared whenever we create the
FSM as well, which could be tricky as it can be created by vacuum.

ISTM normal invalidation logic should just take care of that kind of thing.
OTOH, if we want to extend it later for whatever reason to a relation
level cache, it shouldn't be that difficult as the implementation is
mostly contained in freespace.c (fsm* functions) and I think the
relation is accessible in most places. We might need to rip out some
calls to FSMClearLocalMap().
But it really isn't contained to freespace.c. That's my primary
concern. You added new parameters (undocumented ones!),
FSMClearLocalMap() needs to be called by callers and xlog, etc.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
OTOH, if we want to extend it later for whatever reason to a relation
level cache, it shouldn't be that difficult as the implementation is
mostly contained in freespace.c (fsm* functions) and I think the
relation is accessible in most places. We might need to rip out some
calls to clearlocalmap.
But it really isn't contained to freespace.c. That's my primary
concern. You added new parameters (undocumented ones!),
FSMClearLocalMap() needs to be called by callers and xlog, etc.
Given where we are in the release cycle, and the major architectural
concerns that have been raised about this patch, should we just
revert it and try again in v13, rather than trying to fix it under
time pressure? It's not like there's nothing else on our
plates to fix before beta.
regards, tom lane
On Wed, Apr 17, 2019 at 9:46 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
*and* stash the bitmap of
pages that we think are used/free as a bitmap somewhere below the
relcache.

I think maintaining at relcache level will be tricky when there are
insertions and deletions happening in the small relation. We have
considered such a case during development wherein we don't want the
FSM to be created if there are insertions and deletions in small
relation. The current mechanism addresses both this and normal case
where there are not many deletions. Sure there is some overhead of
building the map again and rechecking each page. The first one is a
memory operation and takes a few cycles

Yea, I think creating / resetting the map is basically free.
I'm not sure I buy the concurrency issue - ISTM it'd be perfectly
reasonable to cache the local map (in the relcache) and use it for local
FSM queries, and rebuild it from scratch once no space is found. That'd
avoid a lot of repeated checking of relation size for small, but
commonly changed relations.
Okay, so you mean to say that we need to perform the additional system
call (to get the number of blocks) only when no space is found in the
existing set of pages? I think that is a fair point, but can't we
achieve that by updating relpages in the relation after a call to
RelationGetNumberOfBlocks?
Add a pre-check of smgr_fsm_nblocks (if >
0, there has to be an fsm), and there should be fewer syscalls.
Yes, that check is a good one and I see that we already do this check
in fsm code before calling smgrexists.
and for the second we optimized by checking the pages alternatively
which means we won't check more than two pages at-a-time. This cost
is paid by not checking FSM and it could be somewhat better in some
cases [1].

Well, it's also paid by potentially higher bloat, because the
intermediate pages aren't tested.

If we cleared that variable at truncations, I think we should
be able to make that work reasonably well?

Not only that, I think it needs to be cleared whenever we create the
FSM as well, which could be tricky as it can be created by vacuum.

ISTM normal invalidation logic should just take care of that kind of thing.
Do you mean to say that we don't need to add any new invalidation call
and the existing invalidation calls will automatically take care of
the same?
OTOH, if we want to extend it later for whatever reason to a relation
level cache, it shouldn't be that difficult as the implementation is
mostly contained in freespace.c (fsm* functions) and I think the
relation is accessible in most places. We might need to rip out some
calls to clearlocalmap.

But it really isn't contained to freespace.c. That's my primary
concern.
Okay, I get that point. I think among those, the need to call
FSMClearLocalMap seems to be your main worry, which is fair; but OTOH,
the places where it needs to be called shouldn't be many.
You added new parameters (undocumented ones!),
I think this is mostly for compatibility with the old code. I agree
that it is a wart, but without much input during development, it
didn't seem advisable to change the old behavior; that is why we have
added a new parameter to GetPageWithFreeSpace. However, if we want, we
can remove that parameter or document it in a better way.
FSMClearLocalMap() needs to be called by callers and xlog, etc.
Agreed that this is an additional requirement, but we have documented
the cases atop this function where it needs to be called. We might
have missed something, but we tried to cover all the cases that we are
aware of. Can we make it clearer by adding comments atop the
freespace.c API where this map is used?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Apr 17, 2019 at 9:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
OTOH, if we want to extend it later for whatever reason to a relation
level cache, it shouldn't be that difficult as the implementation is
mostly contained in freespace.c (fsm* functions) and I think the
relation is accessible in most places. We might need to rip out some
calls to clearlocalmap.

But it really isn't contained to freespace.c. That's my primary
concern. You added new parameters (undocumented ones!),
FSMClearLocalMap() needs to be called by callers and xlog, etc.

Given where we are in the release cycle, and the major architectural
concerns that have been raised about this patch, should we just
revert it and try again in v13, rather than trying to fix it under
time pressure?
I respect and will follow whatever the consensus is after the
discussion. However, I request you to wait for some time to let the
discussion conclude. If we can't reach an agreement, or neither John
nor I can implement what is decided, then we can revert it anyway.
On Thu, Apr 18, 2019 at 2:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I respect and will follow whatever the consensus is after the
discussion. However, I request you to wait for some time to let the
discussion conclude. If we can't reach an agreement, or neither John
nor I can implement what is decided, then we can revert it anyway.
Agreed. I suspect the most realistic way to address most of the
objections in a short amount of time would be to:
1. rip out the local map
2. restore hio.c to only checking the last block in the relation if
there is no FSM (and lower the threshold to reduce wasted space)
3. reduce calls to smgrexists()
Thoughts?
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2019-04-17 12:20:29 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
OTOH, if we want to extend it later for whatever reason to a relation
level cache, it shouldn't be that difficult as the implementation is
mostly contained in freespace.c (fsm* functions) and I think the
relation is accessible in most places. We might need to rip out some
calls to clearlocalmap.

But it really isn't contained to freespace.c. That's my primary
concern. You added new parameters (undocumented ones!),
FSMClearLocalMap() needs to be called by callers and xlog, etc.

Given where we are in the release cycle, and the major architectural
concerns that have been raised about this patch, should we just
revert it and try again in v13, rather than trying to fix it under
time pressure? It's not like there's not anything else on our
plates to fix before beta.
Hm. I'm of split mind here:
It's a nice improvement, and the fixes probably wouldn't be that
hard. And we could have piped up a bit earlier about these concerns (I
only noticed this when rebasing zheap onto the newest version of
postgres).
But as you say, it's also late, and there's other stuff to do. Although I
think neither Amit nor John is heavily involved in any...
My compromise suggestion would be to try to give John and Amit ~2 weeks
to come up with a cleanup proposal, and then decide whether to 1) revert,
2) apply the new patch, or 3) decide to live with the warts for 12 and
apply the patch in 13. As we would already have a patch, 3) seems like
it'd be more tenable than without.
Regards,
Andres
Andres Freund <andres@anarazel.de> writes:
My compromise suggestion would be to try to give John and Amit ~2 weeks
to come up with a cleanup proposal, and then decide whether to 1) revert,
2) apply the new patch, or 3) decide to live with the warts for 12 and
apply the patch in 13. As we would already have a patch, 3) seems like
it'd be more tenable than without.
Seems reasonable. I think we should shoot to have this resolved before
the end of the month, but it doesn't have to be done immediately.
regards, tom lane
On Thu, Apr 18, 2019 at 2:10 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
On Thu, Apr 18, 2019 at 2:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I respect and will follow whatever the consensus is after the
discussion. However, I request you to wait for some time to let the
discussion conclude. If we can't reach an agreement, or neither John
nor I can implement what is decided, then we can revert it anyway.

Agreed. I suspect the most realistic way to address most of the
objections in a short amount of time would be to:

1. rip out the local map
2. restore hio.c to only checking the last block in the relation if
there is no FSM (and lower the threshold to reduce wasted space)
3. reduce calls to smgrexists()
Won't you need an extra call to RelationGetNumberOfBlocks to find the
last block? Also, won't it be less efficient in terms of dealing with
bloat as compared to the current patch? I think if we go this route, then
we might need to revisit it in the future to optimize it, but maybe
that is the best alternative as of now.
I am thinking that we should at least give it a try to move the map to
rel cache level to see how easy or difficult it is and also let's wait
for a day or two to see if Andres/Tom has to say anything about this
or on the response by me above to improve the current patch.
Amit Kapila <amit.kapila16@gmail.com> writes:
I am thinking that we should at least give it a try to move the map to
rel cache level to see how easy or difficult it is and also let's wait
for a day or two to see if Andres/Tom has to say anything about this
or on the response by me above to improve the current patch.
FWIW, it's hard for me to see how moving the map to the relcache isn't
the right thing to do. You will lose state during a relcache flush,
but that's still miles better than how often the state gets lost now.
regards, tom lane
On April 18, 2019 7:53:58 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
I am thinking that we should at least give it a try to move the map to
rel cache level to see how easy or difficult it is and also let's wait
for a day or two to see if Andres/Tom has to say anything about this
or on the response by me above to improve the current patch.

FWIW, it's hard for me to see how moving the map to the relcache isn't
the right thing to do. You will lose state during a relcache flush,
but that's still miles better than how often the state gets lost now.
+1
On Fri, Apr 19, 2019 at 10:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Apr 18, 2019 at 2:10 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
Agreed. I suspect the most realistic way to address most of the
objections in a short amount of time would be to:

1. rip out the local map
2. restore hio.c to only checking the last block in the relation if
there is no FSM (and lower the threshold to reduce wasted space)
3. reduce calls to smgrexists()

Won't you need an extra call to RelationGetNumberOfBlocks to find the
last block?
If I understand you correctly, no, the call now in
GetPageWithFreeSpace() just moves it back to where it was in v11. In
the corner case where we just measured the table size and the last
block is full, we can pass nblocks to RecordAndGetPageWithFreeSpace().
There might be further optimizations available if we're not creating a
local map.
Also, won't it be less efficient in terms of dealing with
bloat as compared to the current patch?
Yes. The threshold would have to be 2 or 3 blocks, and it would stay
bloated until it passed the threshold. Not great, but perhaps not bad
either.
I think if we go this route, then
we might need to revisit it in the future to optimize it, but maybe
that is the best alternative as of now.
It's a much lighter-weight API, which has that much going for it.
I have a draft implementation, which I can share if it comes to that
-- it needs some more thought and polish first.
I am thinking that we should at least give it a try to move the map to
rel cache level to see how easy or difficult it is and also let's wait
for a day or two to see if Andres/Tom has to say anything about this
or on the response by me above to improve the current patch.
Since we have a definite timeline, I'm okay with that, although I'm
afraid I'm not quite knowledgeable enough to help much with the
relcache piece.
On Fri, Apr 19, 2019 at 1:17 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
On Fri, Apr 19, 2019 at 10:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
I think if we go this route, then
we might need to revisit it in the future to optimize it, but maybe
that is the best alternative as of now.

It's a much lighter-weight API, which has that much going for it.
I have a draft implementation, which I can share if it comes to that
-- it needs some more thought and polish first.
I understand that it is a lighter-weight API, but OTOH, it will be less
efficient as well. Also, the consensus seems to be that we should
move this to the relcache.
I am thinking that we should at least give it a try to move the map to
rel cache level to see how easy or difficult it is and also let's wait
for a day or two to see if Andres/Tom has to say anything about this
or on the response by me above to improve the current patch.

Since we have a definite timeline, I'm okay with that, although I'm
afraid I'm not quite knowledgeable enough to help much with the
relcache piece.
Okay, I can try to help. I think you can start by looking at the
RelationData members (for example, see how we cache an index's metapage
in rd_amcache) and study a bit the routines in relcache.h.
On Fri, Apr 19, 2019 at 2:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I am thinking that we should at least give it a try to move the map to
rel cache level to see how easy or difficult it is and also let's wait
for a day or two to see if Andres/Tom has to say anything about this
or on the response by me above to improve the current patch.

Since we have a definite timeline, I'm okay with that, although I'm
afraid I'm not quite knowledgeable enough to help much with the
relcache piece.

Okay, I can try to help. I think you can start by looking at the
RelationData members (for example, see how we cache an index's metapage
in rd_amcache) and study a bit the routines in relcache.h.
Attached is a hacky, work-in-progress patch to move the fsm map to the
relcache. This will give you some idea. I think we need to see if
there is a need to invalidate the relcache due to this patch. I think
some of Andres's other comments also need to be addressed; see if you
can attempt to fix some of them.