non-HOT update not looking at FSM for large tuple update
Hi hackers,
Recently we found a table that was slowly but consistently increasing in size. The table has a low fillfactor set and is updated very frequently. As expected, almost all updates are HOT updates, but some of the non-HOT updates always wanted to use a new page rather than reuse an existing empty page. This led to a steady growth in table size (and a steady growth in the number of empty pages in the table).
I've managed to create a very simple reproducing example that shows the problem (original problem occurred on 12.4, but I've tested this example on latest master). It only occurs for updates where the new tuple is larger than the size of what "fillfactor" would normally allow. In real life, this would only be a very small portion of the updates to a certain table of course, but in this example every update will be this large.
Create a table with a low fill-factor and insert one row into it. Note that, in this case, the row that we're inserting is by itself larger than the "max fill factor space".
create table t1 (a int primary key, b text) with (fillfactor=10);
insert into t1 select 1, (select string_agg('1', '') from generate_series(1,1000)); -- 1000 byte text field
vacuum t1;
postgres=# select * from pg_freespace('t1');
 blkno | avail
-------+-------
     0 |  7104
(1 row)
This looks alright: there's one page, and the available space is indeed roughly 1000 bytes less than a full page, because of our tuple and the page header.
Now, in a different backend, initiate a longer query.
select pg_sleep(600); -- just sleep 600 seconds so that we have enough time to do some updates during this
Then, in the original backend, update the tuple 7 times.
-- execute this 7 times
update t1 set b=(select string_agg((random()*9)::int::text, '') from generate_series(1,1000)) where a=1;
Cancel the pg_sleep call.
Then execute
vacuum t1; -- cleans rows and updates the fsm
postgres=# select * from pg_freespace('t1');
 blkno | avail
-------+-------
     0 |  8128
     1 |  7104
(2 rows)
This still looks OK. There's an extra page, because a total of 8 tuples needed to be kept alive for the pg_sleep query. These didn't fit on one page, so a new page was created.
Now, repeat it (the pg_sleep, update 7 times, cancel the pg_sleep and vacuum).
postgres=# select * from pg_freespace('t1');
 blkno | avail
-------+-------
     0 |  8128
     1 |  8128
     2 |  7104
(3 rows)
This does not look good anymore. The tuple was on page 1, so at first there were several HOT updates on page 1. Then, when page 1 was full, it needed to search for another page to put the tuple. It did not consider page 0, but instead decided to create a new page 2.
Repeating this process would create a new page each time, never reusing the empty old pages.
The reason it does not consider page 0 is because of this piece of code in function RelationGetBufferForTuple in hio.c:
/* Compute desired extra freespace due to fillfactor option */
saveFreeSpace = RelationGetTargetPageFreeSpace(relation, HEAP_DEFAULT_FILLFACTOR);
...
if (len + saveFreeSpace > MaxHeapTupleSize)
{
/* can't fit, don't bother asking FSM */
targetBlock = InvalidBlockNumber;
use_fsm = false;
}
The problem here is twofold. First, for any non-HOT update of a tuple that's larger than the free space the fillfactor leaves on a page, the FSM will not be used; instead a new page will be chosen.
This seems to rely on the false assumption that every existing page has at least one tuple on it.
Secondly, and this is a bit trickier: even if we asked the FSM to come up with a free page with "MaxHeapTupleSize" bytes of free space, it wouldn't find anything. This is because the FSM tracks free space excluding any unused line pointers. In this example, if we look at block 0:
postgres=# select * from page_header(get_raw_page('t1', 0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/16D75A0 |        0 |     5 |    52 |  8192 |    8192 |     8192 |       4 |         0
(1 row)
postgres=# select * from heap_page_items(get_raw_page('t1', 0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
1 | 0 | 0 | 0 | | | | | | | | | |
2 | 0 | 0 | 0 | | | | | | | | | |
3 | 0 | 0 | 0 | | | | | | | | | |
4 | 0 | 0 | 0 | | | | | | | | | |
5 | 0 | 0 | 0 | | | | | | | | | |
6 | 0 | 0 | 0 | | | | | | | | | |
7 | 0 | 0 | 0 | | | | | | | | | |
(7 rows)
There are 7 line pointers on this page, consuming 28 bytes. Plus the 24 byte header, that means that lower=52. However, all line pointers are unused, so the page really is empty. The FSM does not see the page as empty though, as it only looks at "upper-lower".
When asking the FSM for slightly less space (MaxHeapTupleSize - 50 for example), it does find the free pages. I've confirmed that with such a hack the table is not growing indefinitely anymore. However, this number 50 is rather arbitrary obviously, as it depends on the number of unused line items on a page, so that's not a proper way to fix things.
In any case, the behavior feels like a bug to me, but I don't know what the best way would be to fix it. Thoughts?
-Floris
On Wed, Feb 24, 2021 at 10:44 AM Floris Van Nee <florisvannee@optiver.com>
wrote:
The problem here is twofold: for any non-HOT update of a tuple that’s
larger than the free space the fillfactor leaves on a page, the FSM will
not be used; instead a new page will be chosen.
I confirmed this affects not only non-HOT updates, but also regular
inserts, which are the same thing in this context.
This seems to rely on the false assumption that every existing page has
at least one tuple on it.
Yep.
Secondly, and this is a bit trickier.. Even if we would ask the FSM to
come up with a free page with a free size of “MaxHeapTupleSize”, it
wouldn’t find anything… This is because the FSM tracks free space
excluding any unused line pointers.
There are 7 line pointers on this page, consuming 28 bytes. Plus the 24
byte header, that means that lower=52. However, all line pointers are
unused, so the page really is empty. The FSM does not see the page as empty
though, as it only looks at “upper-lower”.
When asking the FSM for slightly less space (MaxHeapTupleSize – 50 for
example), it does find the free pages. I’ve confirmed that with such a hack
the table is not growing indefinitely anymore. However, this number 50 is
rather arbitrary obviously, as it depends on the number of unused line
items on a page, so that’s not a proper way to fix things.
In any case, the behavior feels like a bug to me, but I don’t know what
the best way would be to fix it. Thoughts?
One idea is to take your -50 idea and make it more general and safe, by
scaling the fudge factor based on fillfactor, such that if fillfactor is
less than 100, the requested freespace is a bit smaller than the max. It's
still a bit of a hack, though. I've attached a draft of this idea.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
allow-inserting-tuples-into-almost-empty-pages.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index fb7ad0bab4..82c818bb16 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -337,7 +337,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
Buffer buffer = InvalidBuffer;
Page page;
Size pageFreeSpace = 0,
- saveFreeSpace = 0;
+ saveFreeSpace = 0,
+ targetFreeSpace = 0;
BlockNumber targetBlock,
otherBlock;
bool needLock;
@@ -360,6 +361,23 @@ RelationGetBufferForTuple(Relation relation, Size len,
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
+ if (len + saveFreeSpace > MaxHeapTupleSize)
+ {
+ /*
+ * When inserting the tuple in an empty page, we want it to succeed,
+ * even if it would violate the fillfactor. Otherwise, we would
+ * unnecessarily extend the relation. Instead, ask for
+ * MaxHeapTupleSize bytes, minus a small fudge factor. This factor
+ * will allow the FSM to return a page that is not quite empty because
+ * of free line pointers.
+ */
+ targetFreeSpace = MaxHeapTupleSize -
+ (HEAP_DEFAULT_FILLFACTOR -
+ RelationGetFillFactor(relation, HEAP_DEFAULT_FILLFACTOR));
+ }
+ else
+ targetFreeSpace = len + saveFreeSpace;
+
if (otherBuffer != InvalidBuffer)
otherBlock = BufferGetBlockNumber(otherBuffer);
else
@@ -378,13 +396,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* When use_fsm is false, we either put the tuple onto the existing target
* page or extend the relation.
*/
- if (len + saveFreeSpace > MaxHeapTupleSize)
- {
- /* can't fit, don't bother asking FSM */
- targetBlock = InvalidBlockNumber;
- use_fsm = false;
- }
- else if (bistate && bistate->current_buf != InvalidBuffer)
+ if (bistate && bistate->current_buf != InvalidBuffer)
targetBlock = BufferGetBlockNumber(bistate->current_buf);
else
targetBlock = RelationGetTargetBlock(relation);
@@ -395,7 +407,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* We have no cached target page, so ask the FSM for an initial
* target.
*/
- targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+ targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
}
/*
@@ -519,7 +531,7 @@ loop:
}
pageFreeSpace = PageGetHeapFreeSpace(page);
- if (len + saveFreeSpace <= pageFreeSpace)
+ if (targetFreeSpace <= pageFreeSpace)
{
/* use this page as future insert target, too */
RelationSetTargetBlock(relation, targetBlock);
@@ -552,7 +564,7 @@ loop:
targetBlock = RecordAndGetPageWithFreeSpace(relation,
targetBlock,
pageFreeSpace,
- len + saveFreeSpace);
+ targetFreeSpace);
}
/*
@@ -584,7 +596,7 @@ loop:
* Check if some other backend has extended a block for us while
* we were waiting on the lock.
*/
- targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+ targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
/*
* If some other waiter has already extended the relation, we
Hi John,
One idea is to take your -50 idea and make it more general and safe, by scaling the fudge factor based on fillfactor, such that if fillfactor is less than 100, the requested freespace is a bit smaller than the max. It's still a bit of a hack, though. I've attached a draft of this idea.
You’re right, that’d work better. Though, one thing I’d forgotten to mention earlier is that in the "wild" where this occurred, the UPDATEs with these large tuple sizes are much rarer than UPDATEs with much smaller tuples. This means that in reality, when this happens, the empty pages contain more unused line pointers, and we’d need to subtract more bytes in order to find those pages in the FSM.
This is the (partial) output of pg_freespace function, bucketed by free space, for a real-life table with fillfactor=10 under the mixed load that I've described.
│ free │ count   │
├──────┼─────────┤
│ 7750 │    2003 │
│ 7800 │    7113 │
│ 7850 │    1781 │
│ 7900 │    6803 │
│ 7950 │   13643 │
│ 8000 │   64779 │
│ 8050 │ 2469665 │
│ 8100 │   61869 │
└──────┴─────────┘
(163 rows)
The ‘free’ column is the lower limit of each bucket. So, free=7750 means between 7750 and 7799 bytes free, and count is the number of pages that have this amount free according to the FSM.
In this case, the vast majority has between 8050-8099 bytes free according to the FSM. That means that, for this particular case, for a fillfactor of 10, we’d need to subtract ~120 bytes or so in order to properly recycle the pages.
-Floris
In this case, the vast majority has between 8050-8099 bytes free according to the FSM. That means that, for this particular case, for a fillfactor of 10, we’d need to subtract ~120 bytes or so in order to properly recycle the pages.
Also, I think this "fudge" factor would need to be defined as a percentage of the page size as well. 100 bytes on an 8kB page is quite different from 100 bytes on a 1kB page (although I have no idea if people ever actually compile PG with a different page size, but it is supposed to be supported?).
I also understand the temptation to define it based on the relation's fillfactor, as you did in the patch. However, upon further thought I wonder if that's useful. A relation with a higher fillfactor will have a lower 'saveFreeSpace' value, so it's less likely to run into issues finding a fresh page, unless the tuple you're inserting/updating is even larger. If that happens, though, you'll still want to look for a page that's completely empty (except for the line items). So the only proper metric is 'how many unused line items do we expect on empty pages', and the fillfactor doesn't say much about that. Since this is probably difficult to estimate at all, we may be better off just defining it off MaxHeapTupleSize completely?
For example, if we expect 1.5% of the page could be line items, then:
targetFreeSpace = MaxHeapTupleSize * 0.985
-Floris
On Wed, Feb 24, 2021 at 4:52 PM Floris Van Nee <florisvannee@optiver.com>
wrote:
I also understand the temptation to define it based on the relation's
fill factor, as you did in the patch. However, upon some further thought I
wonder if that's useful? A relation with a higher fill factor will have a
lower 'saveFreeSpace' variable, so it's less likely to run into issues in
finding a fresh page, except if the tuple you're inserting/updating is even
larger. However, if that case happens, you'll still be wanting to look for
a page that's completely empty (except for the line items). So the only
proper metric is 'how many unused line items do we expect on empty pages'
and the fillfactor doesn't say much about this. Since this is probably
difficult to estimate at all, we may be better off just defining it off
MaxHeapTupleSize completely?
For example, we expect 1.5% of the page could be line items, then:
targetFreeSpace = MaxHeapTupleSize * 0.985
That makes sense, although the exact number seems precisely tailored to
your use case. 2% gives 164 bytes of slack and doesn't seem too large.
Updated patch attached.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v2-allow-inserting-tuples-into-almost-empty-pages.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index fb7ad0bab4..fbd95aedf8 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -337,7 +337,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
Buffer buffer = InvalidBuffer;
Page page;
Size pageFreeSpace = 0,
- saveFreeSpace = 0;
+ saveFreeSpace = 0,
+ targetFreeSpace = 0;
BlockNumber targetBlock,
otherBlock;
bool needLock;
@@ -360,6 +361,22 @@ RelationGetBufferForTuple(Relation relation, Size len,
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
+ if (len + saveFreeSpace > MaxHeapTupleSize)
+ {
+ /*
+ * We want to allow inserting a large tuple into an empty page
+ * even if that would violate the fillfactor. Otherwise, we would
+ * unnecessarily extend the relation. Instead, ask the FSM for
+ * MaxHeapTupleSize bytes, minus a small amount of slack. This
+ * will allow it to return a page that is not quite empty because
+ * of free line pointers. We allow 2% slack, which is somewhat
+ * arbitrary.
+ */
+ targetFreeSpace = MaxHeapTupleSize - (MaxHeapTupleSize * 2 / 100);
+ }
+ else
+ targetFreeSpace = len + saveFreeSpace;
+
if (otherBuffer != InvalidBuffer)
otherBlock = BufferGetBlockNumber(otherBuffer);
else
@@ -378,13 +395,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* When use_fsm is false, we either put the tuple onto the existing target
* page or extend the relation.
*/
- if (len + saveFreeSpace > MaxHeapTupleSize)
- {
- /* can't fit, don't bother asking FSM */
- targetBlock = InvalidBlockNumber;
- use_fsm = false;
- }
- else if (bistate && bistate->current_buf != InvalidBuffer)
+ if (bistate && bistate->current_buf != InvalidBuffer)
targetBlock = BufferGetBlockNumber(bistate->current_buf);
else
targetBlock = RelationGetTargetBlock(relation);
@@ -395,7 +406,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* We have no cached target page, so ask the FSM for an initial
* target.
*/
- targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+ targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
}
/*
@@ -519,7 +530,7 @@ loop:
}
pageFreeSpace = PageGetHeapFreeSpace(page);
- if (len + saveFreeSpace <= pageFreeSpace)
+ if (targetFreeSpace <= pageFreeSpace)
{
/* use this page as future insert target, too */
RelationSetTargetBlock(relation, targetBlock);
@@ -552,7 +563,7 @@ loop:
targetBlock = RecordAndGetPageWithFreeSpace(relation,
targetBlock,
pageFreeSpace,
- len + saveFreeSpace);
+ targetFreeSpace);
}
/*
@@ -584,7 +595,7 @@ loop:
* Check if some other backend has extended a block for us while
* we were waiting on the lock.
*/
- targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+ targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
/*
* If some other waiter has already extended the relation, we
That makes sense, although the exact number seems precisely tailored to your use case. 2% gives 164 bytes of slack and doesn't seem too large. Updated patch attached.
Yeah, I tried picking it as conservatively as I could, but 2% is obviously great too. :-) I can't think of any large drawbacks of having a slightly larger value either.
Thanks for posting the patch!
-Floris
On Wed, Feb 24, 2021 at 6:29 PM Floris Van Nee <florisvannee@optiver.com>
wrote:
That makes sense, although the exact number seems precisely tailored to
your use case. 2% gives 164 bytes of slack and doesn't seem too large.
Updated patch attached.
Yeah, I tried picking it as conservative as I could, but 2% is obviously
great too. :-) I can't think of any large drawbacks either of having a
slightly larger value.
Thanks for posting the patch!
I've added this to the commitfest as a bug fix and added you as an author.
--
John Naylor
EDB: http://www.enterprisedb.com
I've added this to the commitfest as a bug fix and added you as an author.
Thanks. Patch looks good to me, but I guess there needs to be someone else reviewing too?
Also, would this be a backpatchable bugfix?
-Floris
On Mon, 8 Mar 2021 at 16:25, Floris Van Nee <florisvannee@optiver.com> wrote:
I've added this to the commitfest as a bug fix and added you as an author.
Thanks. Patch looks good to me, but I guess there needs to be someone else reviewing too?
Also, would this be a backpatchable bugfix?
-Floris
This patch fails to consider that len may be bigger than
MaxHeapTupleSize * 0.98, which in this case triggers a reproducible
PANIC:
=# CREATE TABLE t_failure (a int, b text) WITH (fillfactor = 10); --
force the new FSM calculation for large tuples
CREATE TABLE
=# ALTER TABLE t_failure ALTER COLUMN b SET STORAGE plain;
ALTER TABLE
=# INSERT INTO t_failure (SELECT FROM generate_series(1, 32)); -- use
up 32 line pointers on the first page.
INSERT 0 32
=# DELETE FROM t_failure;
DELETE 32
=# VACUUM (TRUNCATE OFF) t_failure; -- we now have a page that has
MaxHeapTupleSize > free space > 98% MaxHeapTupleSize
VACUUM
=# INSERT INTO t_failure (select 1, string_agg('1', '') from
generate_series(1, 8126));
PANIC: failed to add tuple to page
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
A possible fix would be to always request at least the size of the
tuple being inserted, e.g.:
- targetFreeSpace = MaxHeapTupleSize - (MaxHeapTupleSize * 2 / 100);
+ targetFreeSpace = Max(len, MaxHeapTupleSize - (MaxHeapTupleSize * 2 / 100));
One different question I have, though, is why we can't "just" teach
vacuum to clean up trailing unused line pointers. As in, can't we trim
the line pointer array when vacuum detects that the trailing line
pointers on the page are all unused?
The only documentation that I could find that this doesn't happen is
in the comment on PageIndexTupleDelete and PageRepairFragmentation,
both not very descriptive on why we can't shrink the page->pd_linp
array. One is "Unlike heap pages, we compact out the line pointer for
the removed tuple." (Jan. 2002), and the other is "It doesn't remove
unused line pointers! Please don't change this." (Oct. 2000), but I
can't seem to find the documentation / conversations on the
implications that such shrinking would have.
With regards,
Matthias van de Meent.
Hi,
This patch fails to consider that len may be bigger than MaxHeapTupleSize *
0.98, which in this case triggers a reproducible
PANIC:
Good catch! I've adapted the patch with your suggested fix.
One different question I have, though, is why we can't "just" teach vacuum
to clean up trailing unused line pointers. As in, can't we trim the line pointer
array when vacuum detects that the trailing line pointers on the page are all
unused?
The only documentation that I could find that this doesn't happen is in the
comment on PageIndexTupleDelete and PageRepairFragmentation, both not
very descriptive on why we can't shrink the page->pd_linp array. One is
"Unlike heap pages, we compact out the line pointer for the removed tuple."
(Jan. 2002), and the other is "It doesn't remove unused line pointers! Please
don't change this." (Oct. 2000), but I can't seem to find the documentation /
conversations on the implications that such shrinking would have.
This is an interesting alternative indeed. I also can't find any documentation/conversation about this and the message is rather cryptic.
Hopefully someone on the list still remembers the reasoning behind this rather cryptic comment in PageRepairFragmentation.
-Floris
Attachments:
v3-Allow-inserting-tuples-into-almost-empty-pages.patch (application/octet-stream)
From 5c35e641de47fa23ed86924c20076919321b6444 Mon Sep 17 00:00:00 2001
From: Floris van Nee <florisvannee@optiver.com>
Date: Tue, 9 Mar 2021 14:21:00 +0100
Subject: [PATCH] Allow inserting tuples into almost-empty pages
When selecting a new page to insert tuples in, the FSM was not
always taken into account. This happens when the tuple to be
inserted is larger than the maximum number of bytes that should
be on the page according to the fill factor. This leads to ever-
growing tables when large tuples are frequently updated, as
these will always be put on a new page. This patch addresses the
bug by properly selecting empty pages for reuse.
---
src/backend/access/heap/hio.c | 35 +++++++++++++++++++++++------------
1 file changed, 23 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 75223c9bea..4433e63b30 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -335,7 +335,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
Buffer buffer = InvalidBuffer;
Page page;
Size pageFreeSpace = 0,
- saveFreeSpace = 0;
+ saveFreeSpace = 0,
+ targetFreeSpace = 0;
BlockNumber targetBlock,
otherBlock;
bool needLock;
@@ -358,6 +359,22 @@ RelationGetBufferForTuple(Relation relation, Size len,
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
+ if (len + saveFreeSpace > MaxHeapTupleSize)
+ {
+ /*
+ * We want to allow inserting a large tuple into an empty page
+ * even if that would violate the fillfactor. Otherwise, we would
+ * unnecessarily extend the relation. Instead, ask the FSM for
+ * MaxHeapTupleSize bytes, minus a small amount of slack. This
+ * will allow it to return a page that is not quite empty because
+ * of free line pointers. We allow 2% slack, which is somewhat
+ * arbitrary.
+ */
+ targetFreeSpace = Max(len, MaxHeapTupleSize - (MaxHeapTupleSize * 2 / 100));
+ }
+ else
+ targetFreeSpace = len + saveFreeSpace;
+
if (otherBuffer != InvalidBuffer)
otherBlock = BufferGetBlockNumber(otherBuffer);
else
@@ -376,13 +393,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* When use_fsm is false, we either put the tuple onto the existing target
* page or extend the relation.
*/
- if (len + saveFreeSpace > MaxHeapTupleSize)
- {
- /* can't fit, don't bother asking FSM */
- targetBlock = InvalidBlockNumber;
- use_fsm = false;
- }
- else if (bistate && bistate->current_buf != InvalidBuffer)
+ if (bistate && bistate->current_buf != InvalidBuffer)
targetBlock = BufferGetBlockNumber(bistate->current_buf);
else
targetBlock = RelationGetTargetBlock(relation);
@@ -393,7 +404,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* We have no cached target page, so ask the FSM for an initial
* target.
*/
- targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+ targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
}
/*
@@ -517,7 +528,7 @@ loop:
}
pageFreeSpace = PageGetHeapFreeSpace(page);
- if (len + saveFreeSpace <= pageFreeSpace)
+ if (targetFreeSpace <= pageFreeSpace)
{
/* use this page as future insert target, too */
RelationSetTargetBlock(relation, targetBlock);
@@ -550,7 +561,7 @@ loop:
targetBlock = RecordAndGetPageWithFreeSpace(relation,
targetBlock,
pageFreeSpace,
- len + saveFreeSpace);
+ targetFreeSpace);
}
/*
@@ -582,7 +593,7 @@ loop:
* Check if some other backend has extended a block for us while
* we were waiting on the lock.
*/
- targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+ targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
/*
* If some other waiter has already extended the relation, we
--
2.29.2
On Tue, Mar 9, 2021 at 9:40 AM Floris Van Nee <florisvannee@optiver.com>
wrote:
Hi,
This patch fails to consider that len may be bigger than
MaxHeapTupleSize *
0.98, which in this case triggers a reproducible
PANIC:
Good catch! I've adapted the patch with your suggested fix.
Thank you both for testing and for the updated patch. It seems we should
add a regression test, but it's not clear which file it belongs in.
Possibly insert.sql?
One different question I have, though, is why we can't "just" teach
vacuum
to clean up trailing unused line pointers. As in, can't we trim the
line pointer
array when vacuum detects that the trailing line pointers on the page
are all
unused?
That seems like the proper fix, and I see you've started a thread for that.
I don't think that change in behavior would be backpatchable, but the
patch here might have a chance at that.
--
John Naylor
EDB: http://www.enterprisedb.com
I wrote:
That seems like the proper fix, and I see you've started a thread for
that. I don't think that change in behavior would be backpatchable, but
patch here might have a chance at that.
I remembered after the fact that truncating line pointers would only allow
for omitting the 2% slack logic (and has other benefits), but the rest of
this patch would be needed regardless.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, 9 Mar 2021 at 18:39, John Naylor <john.naylor@enterprisedb.com> wrote:
I wrote:
That seems like the proper fix, and I see you've started a thread for that. I don't think that change in behavior would be backpatchable, but patch here might have a chance at that.
I remembered after the fact that truncating line pointers would only allow for omitting the 2% slack logic (and has other benefits), but the rest of this patch would be needed regardless.
Regarding the 2% slack logic, could we change it to use increments of
line pointers instead? That makes it more clear what problem this
solution is trying to work around; that is to say line pointers not
(all) being truncated away. The currently subtracted value accounts
for the size of 40 line pointers on 8k-pages (~ 13.7% of
MaxHeapTuplesPerPage), and slightly higher fractions (up to 13.94%)
for larger page sizes, even though the to-be-inserted tuple is already
_at least_ 10% of MaxHeapTupleSize when it hits this new code.
Also, even with this patch, we do FSM requests for sizes between
MaxHeapTupleSize - 2% and MaxHeapTupleSize if len + saveFreeSpace falls
between those two numbers. I think we had better clamp the FSM request
between `len` and `MaxHeapTupleSize - PAGE_SIZE_DEPENDENT_FACTOR`.
So, I suggest the following incremental patch:
bool needLock;
+ const Size maxPaddedFsmRequest = MaxHeapTupleSize -
(MaxHeapTuplesPerPage / 8 * sizeof(ItemIdData));
...
- if (len + saveFreeSpace > MaxHeapTupleSize)
+ if (len + saveFreeSpace > maxPaddedFsmRequest)
...
- targetFreeSpace = Max(len, MaxHeapTupleSize - (MaxHeapTupleSize * 2 / 100));
+ targetFreeSpace = Max(len, maxPaddedFsmRequest);
...
Other than this, I think this is a good fix.
With regards,
Matthias van de Meent.
On Thu, Mar 11, 2021 at 9:46 AM Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:
Regarding the 2% slack logic, could we change it to use increments of
line pointers instead? That makes it more clear what problem this
solution is trying to work around; that is to say line pointers not
(all) being truncated away. The currently subtracted value accounts
That makes sense.
...
- if (len + saveFreeSpace > MaxHeapTupleSize)
+ if (len + saveFreeSpace > maxPaddedFsmRequest)
...
- targetFreeSpace = Max(len, MaxHeapTupleSize - (MaxHeapTupleSize * 2 / 100));
+ targetFreeSpace = Max(len, maxPaddedFsmRequest);
...
If we have that convenient constant, it seems equivalent (I think) and a
bit more clear to write it this way, but I'm not wedded to it:
if (len + saveFreeSpace > MaxHeapTupleSize &&
len <= maxPaddedFsmRequest)
{
...
targetFreeSpace = maxPaddedFsmRequest;
}
else
targetFreeSpace = len + saveFreeSpace;
Also, should I write a regression test for it? The test case is already
available, just no obvious place to put it.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, 11 Mar 2021 at 16:16, John Naylor <john.naylor@enterprisedb.com> wrote:
On Thu, Mar 11, 2021 at 9:46 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
Regarding the 2% slack logic, could we change it to use increments of
line pointers instead? That makes it more clear what problem this
solution is trying to work around; that is to say line pointers not
(all) being truncated away. The currently subtracted value accounts
That makes sense.
...
- if (len + saveFreeSpace > MaxHeapTupleSize)
+ if (len + saveFreeSpace > maxPaddedFsmRequest)
...
- targetFreeSpace = Max(len, MaxHeapTupleSize - (MaxHeapTupleSize * 2 / 100));
+ targetFreeSpace = Max(len, maxPaddedFsmRequest);
...
If we have that convenient constant, it seems equivalent (I think) and a bit more clear to write it this way, but I'm not wedded to it:
if (len + saveFreeSpace > MaxHeapTupleSize &&
len <= maxPaddedFsmRequest)
{
...
targetFreeSpace = maxPaddedFsmRequest;
}
+ else if (len > maxPaddedFsmRequest)
+ {
+     /* request len amount of space; it might still fit on
+      * not-quite-empty pages */
+     targetFreeSpace = len;
+ }
If this case isn't added, the lower else branch will fail to find
fitting pages for len > maxPaddedFsmRequest tuples, potentially
extending the relation when there is actually still enough space
available.
else
targetFreeSpace = len + saveFreeSpace;
Also, should I write a regression test for it? The test case is already available, just no obvious place to put it.
I think it would be difficult to write tests that exhibit the correct
behaviour on BLCKSZ != 8192. On the other hand, I see there are some
tests that explicitly call out that they expect BLCKSZ to be 8192, so
that has not really been a hard blocker before.
The previous code I sent had initial INSERT + DELETE + VACUUM. These
statements can be replaced with `INSERT INTO t_failure (b) VALUES
(repeat('1', 95)); VACUUM;` for simplicity. The vacuum is still needed
to populate the FSM for the new page.
With regards,
Matthias van de Meent
On Fri, Mar 12, 2021 at 8:45 AM Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:
If this case isn't added, the lower else branch will fail to find
fitting pages for len > maxPaddedFsmRequest tuples; potentially
extending the relation when there is actually still enough space
available.
Okay, with that it looks better to go back to using Max().
Also in v4:
- With the separate constant you suggested, I split up the comment into two
parts.
- I've added a regression test to insert.sql similar to your earlier test,
but we cannot use vacuum, since in parallel tests there could still be
tuples visible to other transactions. It's still possible to test
almost-all-free by inserting a small tuple.
- Draft commit message
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v4-0001-Fix-bug-in-heap-space-management-that-was-overly-.patch
From b15db4fb408eb3e642b123e5aaf64f7d8a269a23 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Wed, 17 Mar 2021 16:48:17 -0400
Subject: [PATCH v4] Fix bug in heap space management that was overly cautious
about fillfactor
Previously, if there were leftover unused line pointers in a page,
a table with a low fill factor would not be available for new tuples
that were large enough to exceed fillfactor, even though the page
contained no tuples at all. This led to unnecessary relation
extension.
Fix by allowing some slack space equal to 1/8 the maximum space that
could be taken up by line pointers. This is somewhat arbitrary, but
should allow many cases to succeed where they didn't before.
Per report from Floris van Nee.
---
src/backend/access/heap/hio.c | 40 +++++++++++++++++++---------
src/test/regress/expected/insert.out | 22 +++++++++++++++
src/test/regress/sql/insert.sql | 23 ++++++++++++++++
3 files changed, 73 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 75223c9bea..ff2890e9e1 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -335,11 +335,21 @@ RelationGetBufferForTuple(Relation relation, Size len,
Buffer buffer = InvalidBuffer;
Page page;
Size pageFreeSpace = 0,
- saveFreeSpace = 0;
+ saveFreeSpace = 0,
+ targetFreeSpace = 0;
BlockNumber targetBlock,
otherBlock;
bool needLock;
+ /*
+ * The minimum space on a page for it to be considered "empty" regardless
+ * of fillfactor. We base this on MaxHeapTupleSize, minus a small amount
+ * of slack. We allow slack equal to 1/8 the maximum space that could be
+ * taken by line pointers, which is somewhat arbitrary.
+ */
+ const Size maxPaddedFsmRequest = MaxHeapTupleSize -
+ (MaxHeapTuplesPerPage / 8 * sizeof(ItemIdData));
+
len = MAXALIGN(len); /* be conservative */
/* Bulk insert is not supported for updates, only inserts. */
@@ -358,6 +368,18 @@ RelationGetBufferForTuple(Relation relation, Size len,
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
+ /*
+ * We want to allow inserting a large tuple into an empty page even if
+ * that would violate the fillfactor. Otherwise, we would unnecessarily
+ * extend the relation. Instead, ask the FSM for maxPaddedFsmRequest
+ * bytes. This will allow it to return a page that is not quite empty
+ * because of unused line pointers.
+ */
+ if (len + saveFreeSpace > maxPaddedFsmRequest)
+ targetFreeSpace = Max(len, maxPaddedFsmRequest);
+ else
+ targetFreeSpace = len + saveFreeSpace;
+
if (otherBuffer != InvalidBuffer)
otherBlock = BufferGetBlockNumber(otherBuffer);
else
@@ -376,13 +398,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* When use_fsm is false, we either put the tuple onto the existing target
* page or extend the relation.
*/
- if (len + saveFreeSpace > MaxHeapTupleSize)
- {
- /* can't fit, don't bother asking FSM */
- targetBlock = InvalidBlockNumber;
- use_fsm = false;
- }
- else if (bistate && bistate->current_buf != InvalidBuffer)
+ if (bistate && bistate->current_buf != InvalidBuffer)
targetBlock = BufferGetBlockNumber(bistate->current_buf);
else
targetBlock = RelationGetTargetBlock(relation);
@@ -393,7 +409,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* We have no cached target page, so ask the FSM for an initial
* target.
*/
- targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+ targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
}
/*
@@ -517,7 +533,7 @@ loop:
}
pageFreeSpace = PageGetHeapFreeSpace(page);
- if (len + saveFreeSpace <= pageFreeSpace)
+ if (targetFreeSpace <= pageFreeSpace)
{
/* use this page as future insert target, too */
RelationSetTargetBlock(relation, targetBlock);
@@ -550,7 +566,7 @@ loop:
targetBlock = RecordAndGetPageWithFreeSpace(relation,
targetBlock,
pageFreeSpace,
- len + saveFreeSpace);
+ targetFreeSpace);
}
/*
@@ -582,7 +598,7 @@ loop:
* Check if some other backend has extended a block for us while
* we were waiting on the lock.
*/
- targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+ targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
/*
* If some other waiter has already extended the relation, we
diff --git a/src/test/regress/expected/insert.out b/src/test/regress/expected/insert.out
index da50ee3b67..400a45f799 100644
--- a/src/test/regress/expected/insert.out
+++ b/src/test/regress/expected/insert.out
@@ -82,6 +82,28 @@ select col1, col2, char_length(col3) from inserttest;
drop table inserttest;
--
+-- test using FSM with large tuples and low fillfactor
+--
+CREATE TABLE large_tuple_test (a int, b text) WITH (fillfactor = 10);
+ALTER TABLE large_tuple_test ALTER COLUMN b SET STORAGE plain;
+-- simulate having a few line pointers on the page so that the
+-- page has free space slightly less than MaxHeapTupleSize
+INSERT INTO large_tuple_test (select 1, NULL);
+-- should still fit on the page
+INSERT INTO large_tuple_test (select 2, repeat('a', 1000));
+SELECT pg_size_pretty(pg_relation_size('large_tuple_test'::regclass, 'main'));
+ pg_size_pretty
+----------------
+ 8192 bytes
+(1 row)
+
+-- add small record to the second page
+INSERT INTO large_tuple_test (select 3, NULL);
+-- now this tuple won't fit on the second page, but the insert should
+-- still succeed by extending the relation
+INSERT INTO large_tuple_test (select 4, repeat('a', 8126));
+DROP TABLE large_tuple_test;
+--
-- check indirection (field/array assignment), cf bug #14265
--
-- these tests are aware that transformInsertStmt has 3 separate code paths
diff --git a/src/test/regress/sql/insert.sql b/src/test/regress/sql/insert.sql
index 963faa1614..560ccbc652 100644
--- a/src/test/regress/sql/insert.sql
+++ b/src/test/regress/sql/insert.sql
@@ -37,6 +37,29 @@ select col1, col2, char_length(col3) from inserttest;
drop table inserttest;
+--
+-- test using FSM with large tuples and low fillfactor
+--
+CREATE TABLE large_tuple_test (a int, b text) WITH (fillfactor = 10);
+ALTER TABLE large_tuple_test ALTER COLUMN b SET STORAGE plain;
+
+-- simulate having a few line pointers on the page so that the
+-- page has free space slightly less than MaxHeapTupleSize
+INSERT INTO large_tuple_test (select 1, NULL);
+
+-- should still fit on the page
+INSERT INTO large_tuple_test (select 2, repeat('a', 1000));
+SELECT pg_size_pretty(pg_relation_size('large_tuple_test'::regclass, 'main'));
+
+-- add small record to the second page
+INSERT INTO large_tuple_test (select 3, NULL);
+
+-- now this tuple won't fit on the second page, but the insert should
+-- still succeed by extending the relation
+INSERT INTO large_tuple_test (select 4, repeat('a', 8126));
+
+DROP TABLE large_tuple_test;
+
--
-- check indirection (field/array assignment), cf bug #14265
--
--
2.22.0
On Wed, 17 Mar 2021 at 21:52, John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Mar 12, 2021 at 8:45 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
If this case isn't added, the lower else branch will fail to find
fitting pages for len > maxPaddedFsmRequest tuples; potentially
extending the relation when there is actually still enough space
available.

Okay, with that it looks better to go back to using Max().
Also in v4:
- With the separate constant you suggested, I split up the comment into two parts.
+ * The minimum space on a page for it to be considered "empty" regardless
+ * of fillfactor. We base this on MaxHeapTupleSize, minus a small amount
+ * of slack. We allow slack equal to 1/8 the maximum space that could be
+ * taken by line pointers, which is somewhat arbitrary.
+ * We want to allow inserting a large tuple into an empty page even if
+ * that would violate the fillfactor. Otherwise, we would unnecessarily
+ * extend the relation. Instead, ask the FSM for maxPaddedFsmRequest
+ * bytes. This will allow it to return a page that is not quite empty
+ * because of unused line pointers.
How about
+ * Because pages that have no items left can still have space allocated
+ * to item pointers, we consider pages "empty" for FSM requests when they
+ * have at most 1/8 of their MaxHeapTuplesPerPage item pointers' space
+ * allocated. This is a somewhat arbitrary number, but should prevent
+ * most unnecessary relation extensions due to not finding "empty" pages
+ * while inserting combinations of large tuples with low fillfactors.
+ * When the free space to be requested from the FSM is greater than
+ * maxPaddedFsmRequest, this is considered equivalent to requesting
+ * insertion on an "empty" page, so instead of failing to find a page
+ * with more empty space than an "empty" page and extend the relation,
+ * we try to find a page which is considered "empty".
This is slightly more verbose, but I think this clarifies the
reasoning why we need this a bit better. Feel free to reject or adapt
as needed.
- I've added a regression test to insert.sql similar to your earlier test, but we cannot use vacuum, since in parallel tests there could still be tuples visible to other transactions. It's still possible to test almost-all-free by inserting a small tuple.
- Draft commit message
Apart from these mainly readability changes in comments, I think this is ready.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Mar 18, 2021 at 5:30 PM Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:
+ * The minimum space on a page for it to be considered "empty" regardless
+ * of fillfactor. We base this on MaxHeapTupleSize, minus a small amount
+ * of slack. We allow slack equal to 1/8 the maximum space that could be
+ * taken by line pointers, which is somewhat arbitrary.

+ * We want to allow inserting a large tuple into an empty page even if
+ * that would violate the fillfactor. Otherwise, we would unnecessarily
+ * extend the relation. Instead, ask the FSM for maxPaddedFsmRequest
+ * bytes. This will allow it to return a page that is not quite empty
+ * because of unused line pointers.
How about

+ * Because pages that have no items left can still have space allocated
+ * to item pointers, we consider pages "empty" for FSM requests when they
+ * have at most 1/8 of their MaxHeapTuplesPerPage item pointers' space
+ * allocated. This is a somewhat arbitrary number, but should prevent
+ * most unnecessary relation extensions due to not finding "empty" pages
+ * while inserting combinations of large tuples with low fillfactors.

+ * When the free space to be requested from the FSM is greater than
+ * maxPaddedFsmRequest, this is considered equivalent to requesting
+ * insertion on an "empty" page, so instead of failing to find a page
+ * with more empty space than an "empty" page and extend the relation,
+ * we try to find a page which is considered "empty".

This is slightly more verbose, but I think this clarifies the
reasoning why we need this a bit better. Feel free to reject or adapt
as needed.
I like this in general, but still has some rough edges. I've made another
attempt in v5 incorporating your suggestions. Let me know what you think.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v5-0001-Fix-bug-in-heap-space-management-that-was-overly-.patch
From 4857979a461b14fcc15316e0047cfd7c9401d1f6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Fri, 19 Mar 2021 14:13:19 -0400
Subject: [PATCH v5] Fix bug in heap space management that was overly cautious
about fillfactor
Previously, if there were leftover unused line pointers in a page,
a table with a low fill factor would not be available for new tuples
that were large enough to exceed fillfactor, even though the page
contained no tuples at all. This led to unnecessary relation
extension.
Fix by allowing some slack space equal to 1/8 the maximum space that
could be taken up by line pointers. This is somewhat arbitrary, but
should allow many cases to succeed where they didn't before.
Per report from Floris van Nee.
---
src/backend/access/heap/hio.c | 43 ++++++++++++++++++++--------
src/test/regress/expected/insert.out | 22 ++++++++++++++
src/test/regress/sql/insert.sql | 23 +++++++++++++++
3 files changed, 76 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 75223c9bea..2a618469db 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -335,11 +335,24 @@ RelationGetBufferForTuple(Relation relation, Size len,
Buffer buffer = InvalidBuffer;
Page page;
Size pageFreeSpace = 0,
- saveFreeSpace = 0;
+ saveFreeSpace = 0,
+ targetFreeSpace = 0;
BlockNumber targetBlock,
otherBlock;
bool needLock;
+ /*
+ * Since pages without tuples as a result of vacuuming can still have some
+ * space occupied by unused item pointers, we consider pages "empty" for
+ * FSM requests when the unavailable space is equivalent to some number of
+ * line pointers. We use 1/8 of MaxHeapTuplesPerPage for the threshold
+ * calculation, which is somewhat arbitrary, but should prevent most
+ * unnecessary relation extensions due to not finding "empty" pages while
+ * inserting large tuples into tables with low fillfactors.
+ */
+ const Size maxPaddedFsmRequest = MaxHeapTupleSize -
+ (MaxHeapTuplesPerPage / 8 * sizeof(ItemIdData));
+
len = MAXALIGN(len); /* be conservative */
/* Bulk insert is not supported for updates, only inserts. */
@@ -358,6 +371,18 @@ RelationGetBufferForTuple(Relation relation, Size len,
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
+ /*
+ * When the free space to be requested from the FSM is greater than
+ * maxPaddedFsmRequest, this is considered equivalent to requesting
+ * insertion on an "empty" page. If the tuple itself is larger than this,
+ * we request its full length, otherwise we use maxPaddedFsmRequest as a
+ * proxy for "empty". This prevents unnecessarily extending the relation.
+ */
+ if (len + saveFreeSpace > maxPaddedFsmRequest)
+ targetFreeSpace = Max(len, maxPaddedFsmRequest);
+ else
+ targetFreeSpace = len + saveFreeSpace;
+
if (otherBuffer != InvalidBuffer)
otherBlock = BufferGetBlockNumber(otherBuffer);
else
@@ -376,13 +401,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* When use_fsm is false, we either put the tuple onto the existing target
* page or extend the relation.
*/
- if (len + saveFreeSpace > MaxHeapTupleSize)
- {
- /* can't fit, don't bother asking FSM */
- targetBlock = InvalidBlockNumber;
- use_fsm = false;
- }
- else if (bistate && bistate->current_buf != InvalidBuffer)
+ if (bistate && bistate->current_buf != InvalidBuffer)
targetBlock = BufferGetBlockNumber(bistate->current_buf);
else
targetBlock = RelationGetTargetBlock(relation);
@@ -393,7 +412,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* We have no cached target page, so ask the FSM for an initial
* target.
*/
- targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+ targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
}
/*
@@ -517,7 +536,7 @@ loop:
}
pageFreeSpace = PageGetHeapFreeSpace(page);
- if (len + saveFreeSpace <= pageFreeSpace)
+ if (targetFreeSpace <= pageFreeSpace)
{
/* use this page as future insert target, too */
RelationSetTargetBlock(relation, targetBlock);
@@ -550,7 +569,7 @@ loop:
targetBlock = RecordAndGetPageWithFreeSpace(relation,
targetBlock,
pageFreeSpace,
- len + saveFreeSpace);
+ targetFreeSpace);
}
/*
@@ -582,7 +601,7 @@ loop:
* Check if some other backend has extended a block for us while
* we were waiting on the lock.
*/
- targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+ targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
/*
* If some other waiter has already extended the relation, we
diff --git a/src/test/regress/expected/insert.out b/src/test/regress/expected/insert.out
index da50ee3b67..400a45f799 100644
--- a/src/test/regress/expected/insert.out
+++ b/src/test/regress/expected/insert.out
@@ -82,6 +82,28 @@ select col1, col2, char_length(col3) from inserttest;
drop table inserttest;
--
+-- test using FSM with large tuples and low fillfactor
+--
+CREATE TABLE large_tuple_test (a int, b text) WITH (fillfactor = 10);
+ALTER TABLE large_tuple_test ALTER COLUMN b SET STORAGE plain;
+-- simulate having a few line pointers on the page so that the
+-- page has free space slightly less than MaxHeapTupleSize
+INSERT INTO large_tuple_test (select 1, NULL);
+-- should still fit on the page
+INSERT INTO large_tuple_test (select 2, repeat('a', 1000));
+SELECT pg_size_pretty(pg_relation_size('large_tuple_test'::regclass, 'main'));
+ pg_size_pretty
+----------------
+ 8192 bytes
+(1 row)
+
+-- add small record to the second page
+INSERT INTO large_tuple_test (select 3, NULL);
+-- now this tuple won't fit on the second page, but the insert should
+-- still succeed by extending the relation
+INSERT INTO large_tuple_test (select 4, repeat('a', 8126));
+DROP TABLE large_tuple_test;
+--
-- check indirection (field/array assignment), cf bug #14265
--
-- these tests are aware that transformInsertStmt has 3 separate code paths
diff --git a/src/test/regress/sql/insert.sql b/src/test/regress/sql/insert.sql
index 963faa1614..560ccbc652 100644
--- a/src/test/regress/sql/insert.sql
+++ b/src/test/regress/sql/insert.sql
@@ -37,6 +37,29 @@ select col1, col2, char_length(col3) from inserttest;
drop table inserttest;
+--
+-- test using FSM with large tuples and low fillfactor
+--
+CREATE TABLE large_tuple_test (a int, b text) WITH (fillfactor = 10);
+ALTER TABLE large_tuple_test ALTER COLUMN b SET STORAGE plain;
+
+-- simulate having a few line pointers on the page so that the
+-- page has free space slightly less than MaxHeapTupleSize
+INSERT INTO large_tuple_test (select 1, NULL);
+
+-- should still fit on the page
+INSERT INTO large_tuple_test (select 2, repeat('a', 1000));
+SELECT pg_size_pretty(pg_relation_size('large_tuple_test'::regclass, 'main'));
+
+-- add small record to the second page
+INSERT INTO large_tuple_test (select 3, NULL);
+
+-- now this tuple won't fit on the second page, but the insert should
+-- still succeed by extending the relation
+INSERT INTO large_tuple_test (select 4, repeat('a', 8126));
+
+DROP TABLE large_tuple_test;
+
--
-- check indirection (field/array assignment), cf bug #14265
--
--
2.22.0
On Fri, 19 Mar 2021 at 19:16, John Naylor <john.naylor@enterprisedb.com> wrote:
On Thu, Mar 18, 2021 at 5:30 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
This is slightly more verbose, but I think this clarifies the
reasoning why we need this a bit better. Feel free to reject or adapt
as needed.

I like this in general, but still has some rough edges. I've made another attempt in v5 incorporating your suggestions. Let me know what you think.
That is indeed better.
I believe this is ready, so I've marked it as RFC in the commitfest application.
With regards,
Matthias van de Meent.
I gather this is important when large_upd_rate=rate(cross-page update bytes
for tuples larger than fillfactor) exceeds small_ins_rate=rate(insert bytes
for tuples NOT larger than fillfactor). That is a plausible outcome when
inserts are rare, and table bloat then accrues at
large_upd_rate-small_ins_rate. I agree this patch improves behavior.
Does anyone have a strong opinion on whether to back-patch? I am weakly
inclined not to back-patch, because today's behavior might happen to perform
better when large_upd_rate-small_ins_rate<0. Besides the usual choices of
back-patching or not back-patching, we could back-patch with a stricter
threshold. Suppose we accepted pages for larger-than-fillfactor tuples when
the pages have at least
BLCKSZ-SizeOfPageHeaderData-sizeof(ItemIdData)-MAXALIGN(MAXALIGN(SizeofHeapTupleHeader)+1)+1
bytes of free space. That wouldn't reuse a page containing a one-column
tuple, but it would reuse a page having up to eight line pointers.
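As a sanity check on that expression, the sketch below evaluates it with assumed 8 kB-page constants (page header 24 bytes, line pointer 4 bytes, tuple header 23 bytes, MAXALIGN 8), hard-coded here as assumptions rather than taken from the headers; the resulting threshold indeed tolerates exactly eight leftover line pointers on an otherwise-empty page:

```python
# Assumed 8 kB-page constants; an actual build derives these from the headers.
MAXIMUM_ALIGNOF = 8
BLCKSZ = 8192
PAGE_HEADER, ITEM_ID, TUPLE_HEADER = 24, 4, 23

def maxalign(n):
    return (n + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1)

# the stricter back-patch threshold proposed above
threshold = (BLCKSZ - PAGE_HEADER - ITEM_ID
             - maxalign(maxalign(TUPLE_HEADER) + 1) + 1)
print(threshold)  # 8133

# An otherwise-empty page with n leftover line pointers offers
# BLCKSZ - PAGE_HEADER - n*ITEM_ID bytes of free space; count how
# many such pointers the threshold tolerates.
n = 0
while BLCKSZ - PAGE_HEADER - (n + 1) * ITEM_ID >= threshold:
    n += 1
print(n)  # 8, i.e. "up to eight line pointers"
```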
On Fri, Mar 19, 2021 at 02:16:22PM -0400, John Naylor wrote:
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -335,11 +335,24 @@ RelationGetBufferForTuple(Relation relation, Size len,

+ const Size maxPaddedFsmRequest = MaxHeapTupleSize -
+     (MaxHeapTuplesPerPage / 8 * sizeof(ItemIdData));
In evaluating whether this is a good choice of value, I think about the
expected page lifecycle. A tuple barely larger than fillfactor (roughly
len=1+BLCKSZ*fillfactor/100) will start on a roughly-empty page. As long as
the tuple exists, the server will skip that page for inserts. Updates can
cause up to floor(99/fillfactor) same-size versions of the tuple to occupy the
page simultaneously, creating that many line pointers. At the fillfactor=10
minimum, it's good to accept otherwise-empty pages having at least nine line
pointers, so a page can restart the aforementioned lifecycle. Tolerating even
more line pointers helps when updates reduce tuple size or when the page was
used for smaller tuples before it last emptied. At the BLCKSZ=8192 default,
this maxPaddedFsmRequest allows 36 line pointers (good or somewhat high). At
the BLCKSZ=1024 minimum, it allows 4 line pointers (low). At the BLCKSZ=32768
maximum, 146 (likely excessive). I'm not concerned about optimizing
non-default block sizes, so let's keep your proposal.
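The per-BLCKSZ line-pointer counts quoted above can be reproduced with the same assumed constants (page header 24 bytes, line pointer 4 bytes, tuple header 23 bytes, MAXALIGN 8); an actual build could differ slightly:

```python
# Assumed constants; an actual build derives these from the headers.
MAXIMUM_ALIGNOF = 8
PAGE_HEADER, ITEM_ID, TUPLE_HEADER = 24, 4, 23

def maxalign(n):
    return (n + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1)

def tolerated_line_pointers(blcksz):
    # MaxHeapTuplesPerPage: how many minimal tuples (header + pointer) fit
    max_tuples_per_page = (blcksz - PAGE_HEADER) // (
        maxalign(TUPLE_HEADER) + ITEM_ID)
    # slack granted by maxPaddedFsmRequest, expressed in line pointers
    return max_tuples_per_page // 8

for blcksz in (1024, 8192, 32768):
    print(blcksz, tolerated_line_pointers(blcksz))
# 1024 4
# 8192 36
# 32768 146
```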
Comments and the maxPaddedFsmRequest variable name use term "fsm" for things
not specific to the FSM. For example, the patch's test case doesn't use the
FSM. (That is fine. Ordinarily, RelationGetTargetBlock() furnishes its
block. Under CLOBBER_CACHE_ALWAYS, the "try the last page" logic does so. An
FSM-using test would contain a VACUUM.) I plan to commit the attached
version; compared to v5, it updates comments and renames this variable.
Thanks,
nm
Attachments:
fillfactor-insert-large-v6.patch
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
Accept slightly-filled pages for tuples larger than fillfactor.
We always inserted a larger-than-fillfactor tuple into a newly-extended
page, even when existing pages were empty or contained nothing but an
unused line pointer. This was unnecessary relation extension. Start
tolerating page usage up to 1/8 the maximum space that could be taken up
by line pointers. This is somewhat arbitrary, but it should allow more
cases to reuse pages. This has no effect on tables with fillfactor=100
(the default).
John Naylor and Floris van Nee. Reviewed by Matthias van de Meent.
Reported by Floris van Nee.
Discussion: https://postgr.es/m/6e263217180649339720afe2176c50aa@opammb0562.comp.optiver.com
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 75223c9..08b4e1b 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -317,10 +317,10 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
* BULKWRITE buffer selection strategy object to the buffer manager.
* Passing NULL for bistate selects the default behavior.
*
- * We always try to avoid filling existing pages further than the fillfactor.
- * This is OK since this routine is not consulted when updating a tuple and
- * keeping it on the same page, which is the scenario fillfactor is meant
- * to reserve space for.
+ * We don't fill existing pages further than the fillfactor, except for large
+ * tuples in nearly-empty tables. This is OK since this routine is not
+ * consulted when updating a tuple and keeping it on the same page, which is
+ * the scenario fillfactor is meant to reserve space for.
*
* ereport(ERROR) is allowed here, so this routine *must* be called
* before any (unlogged) changes are made in buffer pool.
@@ -334,8 +334,10 @@ RelationGetBufferForTuple(Relation relation, Size len,
bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
Buffer buffer = InvalidBuffer;
Page page;
- Size pageFreeSpace = 0,
- saveFreeSpace = 0;
+ Size nearlyEmptyFreeSpace,
+ pageFreeSpace = 0,
+ saveFreeSpace = 0,
+ targetFreeSpace = 0;
BlockNumber targetBlock,
otherBlock;
bool needLock;
@@ -358,6 +360,19 @@ RelationGetBufferForTuple(Relation relation, Size len,
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
+ /*
+ * Since pages without tuples can still have line pointers, we consider
+ * pages "empty" when the unavailable space is slight. This threshold is
+ * somewhat arbitrary, but it should prevent most unnecessary relation
+ * extensions while inserting large tuples into low-fillfactor tables.
+ */
+ nearlyEmptyFreeSpace = MaxHeapTupleSize -
+ (MaxHeapTuplesPerPage / 8 * sizeof(ItemIdData));
+ if (len + saveFreeSpace > nearlyEmptyFreeSpace)
+ targetFreeSpace = Max(len, nearlyEmptyFreeSpace);
+ else
+ targetFreeSpace = len + saveFreeSpace;
+
if (otherBuffer != InvalidBuffer)
otherBlock = BufferGetBlockNumber(otherBuffer);
else
@@ -376,13 +391,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* When use_fsm is false, we either put the tuple onto the existing target
* page or extend the relation.
*/
- if (len + saveFreeSpace > MaxHeapTupleSize)
- {
- /* can't fit, don't bother asking FSM */
- targetBlock = InvalidBlockNumber;
- use_fsm = false;
- }
- else if (bistate && bistate->current_buf != InvalidBuffer)
+ if (bistate && bistate->current_buf != InvalidBuffer)
targetBlock = BufferGetBlockNumber(bistate->current_buf);
else
targetBlock = RelationGetTargetBlock(relation);
@@ -393,7 +402,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* We have no cached target page, so ask the FSM for an initial
* target.
*/
- targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+ targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
}
/*
@@ -517,7 +526,7 @@ loop:
}
pageFreeSpace = PageGetHeapFreeSpace(page);
- if (len + saveFreeSpace <= pageFreeSpace)
+ if (targetFreeSpace <= pageFreeSpace)
{
/* use this page as future insert target, too */
RelationSetTargetBlock(relation, targetBlock);
@@ -550,7 +559,7 @@ loop:
targetBlock = RecordAndGetPageWithFreeSpace(relation,
targetBlock,
pageFreeSpace,
- len + saveFreeSpace);
+ targetFreeSpace);
}
/*
@@ -582,7 +591,7 @@ loop:
* Check if some other backend has extended a block for us while
* we were waiting on the lock.
*/
- targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+ targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
/*
* If some other waiter has already extended the relation, we
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index cf13e60..1aff62c 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -676,7 +676,11 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
if (len + saveFreeSpace > pageFreeSpace)
{
- /* Doesn't fit, so write out the existing page */
+ /*
+ * Doesn't fit, so write out the existing page. It always
+ * contains a tuple. Hence, unlike RelationGetBufferForTuple(),
+ * enforce saveFreeSpace unconditionally.
+ */
/* XLOG stuff */
if (RelationNeedsWAL(state->rs_new_rel))
diff --git a/src/test/regress/expected/insert.out b/src/test/regress/expected/insert.out
index da50ee3..5063a3d 100644
--- a/src/test/regress/expected/insert.out
+++ b/src/test/regress/expected/insert.out
@@ -82,6 +82,27 @@ select col1, col2, char_length(col3) from inserttest;
drop table inserttest;
--
+-- tuple larger than fillfactor
+--
+CREATE TABLE large_tuple_test (a int, b text) WITH (fillfactor = 10);
+ALTER TABLE large_tuple_test ALTER COLUMN b SET STORAGE plain;
+-- create page w/ free space in range [nearlyEmptyFreeSpace, MaxHeapTupleSize)
+INSERT INTO large_tuple_test (select 1, NULL);
+-- should still fit on the page
+INSERT INTO large_tuple_test (select 2, repeat('a', 1000));
+SELECT pg_size_pretty(pg_relation_size('large_tuple_test'::regclass, 'main'));
+ pg_size_pretty
+----------------
+ 8192 bytes
+(1 row)
+
+-- add small record to the second page
+INSERT INTO large_tuple_test (select 3, NULL);
+-- now this tuple won't fit on the second page, but the insert should
+-- still succeed by extending the relation
+INSERT INTO large_tuple_test (select 4, repeat('a', 8126));
+DROP TABLE large_tuple_test;
+--
-- check indirection (field/array assignment), cf bug #14265
--
-- these tests are aware that transformInsertStmt has 3 separate code paths
diff --git a/src/test/regress/sql/insert.sql b/src/test/regress/sql/insert.sql
index 963faa1..bfaa8a3 100644
--- a/src/test/regress/sql/insert.sql
+++ b/src/test/regress/sql/insert.sql
@@ -38,6 +38,28 @@ select col1, col2, char_length(col3) from inserttest;
drop table inserttest;
--
+-- tuple larger than fillfactor
+--
+CREATE TABLE large_tuple_test (a int, b text) WITH (fillfactor = 10);
+ALTER TABLE large_tuple_test ALTER COLUMN b SET STORAGE plain;
+
+-- create page w/ free space in range [nearlyEmptyFreeSpace, MaxHeapTupleSize)
+INSERT INTO large_tuple_test (select 1, NULL);
+
+-- should still fit on the page
+INSERT INTO large_tuple_test (select 2, repeat('a', 1000));
+SELECT pg_size_pretty(pg_relation_size('large_tuple_test'::regclass, 'main'));
+
+-- add small record to the second page
+INSERT INTO large_tuple_test (select 3, NULL);
+
+-- now this tuple won't fit on the second page, but the insert should
+-- still succeed by extending the relation
+INSERT INTO large_tuple_test (select 4, repeat('a', 8126));
+
+DROP TABLE large_tuple_test;
+
+--
-- check indirection (field/array assignment), cf bug #14265
--
-- these tests are aware that transformInsertStmt has 3 separate code paths
Hi Noah,
Thanks for taking a look at this patch.
In evaluating whether this is a good choice of value, I think about the
expected page lifecycle. A tuple barely larger than fillfactor (roughly
len=1+BLCKSZ*fillfactor/100) will start on a roughly-empty page. As long as
the tuple exists, the server will skip that page for inserts. Updates can cause
up to floor(99/fillfactor) same-size versions of the tuple to occupy the page
simultaneously, creating that many line pointers. At the fillfactor=10
minimum, it's good to accept otherwise-empty pages having at least nine line
pointers, so a page can restart the aforementioned lifecycle. Tolerating even
more line pointers helps when updates reduce tuple size or when the page
was used for smaller tuples before it last emptied. At the BLCKSZ=8192
default, this maxPaddedFsmRequest allows 36 line pointers (good or
somewhat high). At the BLCKSZ=1024 minimum, it allows 4 line pointers
(low). At the BLCKSZ=32768 maximum, 146 (likely excessive). I'm not
concerned about optimizing non-default block sizes, so let's keep your
proposal.
Agreed. You briefly mention this already, but the case that caused me to report this was exactly the one where under normal circumstances each UPDATE would be small. However, in rare cases, the tuple that is updated grows in size to 1k bytes (the specific case we encountered sometimes would under specific circumstances write extra info in a field, which would otherwise be NULL). Suppose that this 1k UPDATE does not fit into the current page (so no HOT update), then a new page would be created (HEAD behavior). However, it is very likely that the next updates to this same tuple will be the regular size again. This causes the higher number of line pointers on the page.
-Floris
On Sat, Mar 27, 2021 at 3:00 AM Noah Misch <noah@leadboat.com> wrote:
Does anyone have a strong opinion on whether to back-patch? I am weakly
inclined not to back-patch, because today's behavior might happen to perform
better when large_upd_rate-small_ins_rate<0.
It's not a clear case. The present behavior is clearly a bug, but only
manifests in rare situations. The risk of the fix affecting other
situations is not zero, as you mention, but (thinking briefly about this
and I could be wrong) the consequences don't seem as big as the reported
case of growing table size.
Besides the usual choices of back-patching or not back-patching, we could
back-patch with a stricter threshold. Suppose we accepted pages for
larger-than-fillfactor tuples when the pages have at least
BLCKSZ-SizeOfPageHeaderData-sizeof(ItemIdData)-MAXALIGN(MAXALIGN(SizeofHeapTupleHeader)+1)+1
bytes of free space. That wouldn't reuse a page containing a one-column
tuple, but it would reuse a page having up to eight line pointers.
I'm not sure how much that would help in the reported case that started
this thread.
Comments and the maxPaddedFsmRequest variable name use the term "fsm" for
things not specific to the FSM. For example, the patch's test case doesn't
use the FSM. (That is fine. Ordinarily, RelationGetTargetBlock() furnishes
its block. Under CLOBBER_CACHE_ALWAYS, the "try the last page" logic does
so. An FSM-using test would contain a VACUUM.) I plan to commit the attached
version; compared to v5, it updates comments and renames this variable.
Looks good to me, thanks!
--
John Naylor
EDB: http://www.enterprisedb.com
On Sat, Mar 27, 2021 at 11:26:47AM -0400, John Naylor wrote:
On Sat, Mar 27, 2021 at 3:00 AM Noah Misch <noah@leadboat.com> wrote:
Does anyone have a strong opinion on whether to back-patch? I am weakly
inclined not to back-patch, because today's behavior might happen to perform
better when large_upd_rate-small_ins_rate<0.

It's not a clear case. The present behavior is clearly a bug, but only
manifests in rare situations. The risk of the fix affecting other situations
is not zero, as you mention, but (thinking briefly about this and I could be
wrong) the consequences don't seem as big as the reported case of growing
table size.
I agree sites that are hurting now will see a net benefit. I can call it a
bug that we treat just-extended pages differently from existing
zero-line-pointer pages (e.g. pages left by RelationAddExtraBlocks()).
Changing how we treat pages having 100 bytes of data feels different to me.
It's more like a policy decision, not a clear bug fix.
I'm open to back-patching, but I plan to do so only if a few people report
being firmly in favor.
Besides the usual choices of back-patching or not back-patching, we could
back-patch with a stricter threshold. Suppose we accepted pages for
larger-than-fillfactor tuples when the pages have at least
BLCKSZ-SizeOfPageHeaderData-sizeof(ItemIdData)-MAXALIGN(MAXALIGN(SizeofHeapTupleHeader)+1)+1
bytes of free space. That wouldn't reuse a page containing a one-column
tuple, but it would reuse a page having up to eight line pointers.

I'm not sure how much that would help in the reported case that started this thread.
I'm not sure, either. The thread email just before yours (27 Mar 2021
10:24:00 +0000) does suggest it would help less than the main proposal.