Disabling Heap-Only Tuples
Hi,
Heap-Only Tuple (HOT) updates are a significant performance
enhancement, as they prevent unnecessary page writes. However, HOT
comes with a caveat: if we have lots of available space earlier on in
the relation, that space can only be used for new tuples, or by an
UPDATE in cases where there's insufficient space on the tuple's
current page for it to use HOT.
This mechanism limits our options for condensing tables, forcing us to
resort to methods like running VACUUM FULL/CLUSTER or using external
tools like pg_repack. These either require exclusive locks (which will
be a deal-breaker on large tables in a production system) or carry
risks of their own. Of course, we can always flood pages with new
versions of a row until it's forced onto an earlier page, but that
shouldn't be necessary.
Considering these trade-offs, I'd like to propose an option to allow
superusers to disable HOT on tables. The intent is to trade some
performance benefits for the ability to reduce the size of a table
without the typical locking associated with it.
This feature could be used to shrink tables in one of two ways:
temporarily disabling HOT until DML operations have compacted the data
into a smaller area, or performing a mass update on later rows to
relocate them to an earlier location, probably in stages. Of course,
this would need to be used in conjunction with a VACUUM operation.
Admittedly this isn't ideal, and it would be better if we had an
operation that could do this (e.g. VACUUM COMPACT <table_name>), or an
option that causes some operations to avoid HOT when free space above
a threshold is detected, but in lieu of those, I thought
this would at least allow users to help themselves when running into
disk space issues.
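(As an aside, the free space scattered through a relation — the space
such a feature would try to reuse — can be inspected with the
pg_freespacemap contrib extension; "bigtable" here is a placeholder
table name:)

```sql
-- Requires the pg_freespacemap contrib extension.
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- Free space per page as tracked by the free space map (FSM).
-- Large "avail" values on early pages are exactly the space that
-- HOT updates of tuples on later pages will never migrate into.
SELECT blkno, avail
FROM pg_freespace('bigtable')
ORDER BY avail DESC
LIMIT 10;
```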
Thoughts?
Thom
On Wed, 5 Jul 2023 at 12:45, Thom Brown <thom@linux.com> wrote:
Considering these trade-offs, I'd like to propose an option to allow
superusers to disable HOT on tables. The intent is to trade some
performance benefits for the ability to reduce the size of a table
without the typical locking associated with it.
Interesting use case, but I think that disabling HOT would be missing
the forest for the trees. I think that a feature that disables
block-local updates for pages > some offset would be a better solution
to your issue: Normal updates also prefer the new tuple to be stored
in the same pages as the old tuple if at all possible, so disabling
HOT wouldn't solve the issue of tuples residing in the tail of your
table - at least not while there is still empty space in those pages.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech/)
On Wed, 5 Jul 2023 at 11:57, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Interesting use case, but I think that disabling HOT would be missing
the forest for the trees. I think that a feature that disables
block-local updates for pages > some offset would be a better solution
to your issue: Normal updates also prefer the new tuple to be stored
in the same pages as the old tuple if at all possible, so disabling
HOT wouldn't solve the issue of tuples residing in the tail of your
table - at least not while there is still empty space in those pages.
Hmm... I see your point. It's when an UPDATE isn't going to land on
the same page that it relocates to the earlier available page. So I
guess I'm after whatever mechanism would allow that to happen reliably
and predictably.
So $subject should really be "Allow forcing UPDATEs off the same page".
Thom
On Wed, 5 Jul 2023 at 13:03, Thom Brown <thom@linux.com> wrote:
Hmm... I see your point. It's when an UPDATE isn't going to land on
the same page that it relocates to the earlier available page. So I
guess I'm after whatever mechanism would allow that to happen reliably
and predictably.

So $subject should really be "Allow forcing UPDATEs off the same page".
You'd probably want to do that only for a certain range of the table -
for a table with 1GB of data and 3GB of bloat there is no good reason
to force page-crossing updates in the first 1GB of the table - all
tuples of the table will eventually reside there, so why would you
take a performance penalty and move the tuples from inside that range
to inside that same range?
Something else to note: Indexes would suffer some (large?) amount of
bloat in this process, as you would be updating a lot of tuples
without the HOT optimization, thus increasing the work to be done by
VACUUM.
This may result in more bloat in indexes than what you get back from
shrinking the table.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech/)
On Wed, 5 Jul 2023 at 13:12, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
You'd probably want to do that only for a certain range of the table -
for a table with 1GB of data and 3GB of bloat there is no good reason
to force page-crossing updates in the first 1GB of the table - all
tuples of the table will eventually reside there, so why would you
take a performance penalty and move the tuples from inside that range
to inside that same range?
I'm thinking more of a case of:
<magic to stop UPDATES from landing on same page>
UPDATE bigtable
SET id = id  -- "id" standing in for the primary-key column
WHERE ctid IN (
    SELECT ctid
    FROM bigtable
    ORDER BY ctid DESC
    LIMIT 100000);
Something else to note: Indexes would suffer some (large?) amount of
bloat in this process, as you would be updating a lot of tuples
without the HOT optimization, thus increasing the work to be done by
VACUUM.
This may result in more bloat in indexes than what you get back from
shrinking the table.
This could be the case, but I guess indexes are expendable to an
extent, unlike tables.
Thom
On Wed, 5 Jul 2023 at 14:39, Thom Brown <thom@linux.com> wrote:
I'm thinking more of a case of:

<magic to stop UPDATES from landing on same page>

UPDATE bigtable
SET id = id  -- "id" standing in for the primary-key column
WHERE ctid IN (
    SELECT ctid
    FROM bigtable
    ORDER BY ctid DESC
    LIMIT 100000);
So what were you thinking of? A session GUC? A table option?
The benefit of a table option is that it is retained across sessions
and thus allows tables that get enough updates to eventually get to a
cleaner state. The main downside of such a table option is that it
requires a temporary table-level lock to update the parameter.
The benefit of a session GUC is that you can set it without impacting
other sessions, but the downside is that you need to do the
maintenance in that session, and risk that cascading updates to other
tables (e.g. through AFTER UPDATE triggers) are also impacted by this
non-local update GUC.
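To make the two shapes concrete (purely hypothetical syntax — neither
the reloption nor the GUC exists, and the name is made up for
illustration):

```sql
-- As a table option: persists across sessions, so any sufficiently
-- updated table gradually compacts, but setting it takes a brief
-- table-level lock.
ALTER TABLE bigtable SET (force_nonlocal_updates = on);

-- As a session GUC: affects only this session, including updates
-- cascaded into other tables via AFTER UPDATE triggers.
SET force_nonlocal_updates = on;
```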
Something else to note: Indexes would suffer some (large?) amount of
bloat in this process, as you would be updating a lot of tuples
without the HOT optimization, thus increasing the work to be done by
VACUUM.
This may result in more bloat in indexes than what you get back from
shrinking the table.

This could be the case, but I guess indexes are expendable to an
extent, unlike tables.
I don't think that's accurate - index rebuilds are quite expensive.
But, that's beside the point of this thread.
Somewhat related: did you consider using pg_repack instead of this
potential feature?
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
So what were you thinking of? A session GUC? A table option?
Both.
The benefit of a table option is that it is retained across sessions
and thus allows tables that get enough updates to eventually get to a
cleaner state. The main downside of such a table option is that it
requires a temporary table-level lock to update the parameter.
Yes, but the maintenance window to make such a change would be extremely brief.
Somewhat related: did you consider using pg_repack instead of this
potential feature?
pg_repack isn't exactly innocuous, and can potentially leave the
database in an irrecoverable state. Plus, if disk space is an issue,
it doesn't help.
Thom
On Wed, 2023-07-05 at 12:02 +0100, Thom Brown wrote:
So $subject should really be "Allow forcing UPDATEs off the same page".
I've been thinking about the same thing - an option that changes the update
strategy to always use the lowest block with enough free space.
That would allow consolidating bloated tables with no downtime.
Yours,
Laurenz Albe
On Wed, 5 Jul 2023 at 19:55, Thom Brown <thom@linux.com> wrote:
On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

So what were you thinking of? A session GUC? A table option?
Both.
Here's a small patch implementing a new table option max_local_update
(the name is very much bikesheddable). The value is -1 (default,
disabled) or the size, in MiB, of the table prefix within which
updates are still allowed to land on the same page. I didn't yet go
for a GUC, as I think that gives too little control over the impact on
the system.

I decided that max_local_update should be in MiB because there is no
reloption value type that can hold both MaxBlockNumber and
-1/disabled, and 1 MiB seems like enough granularity for essentially
all use cases.
The added regression tests show how the feature is used, verify that
it works, and validate that the lock levels are acceptable
(ShareUpdateExclusiveLock, the same as for updating fillfactor).
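Assuming the reloption is set like any other storage parameter, usage
might look like this (sketch only; the table name, "id" column, and
the 512 MiB threshold are arbitrary placeholders):

```sql
-- With the patch applied: allow same-page (local) updates only within
-- the first 512 MiB of the table.
ALTER TABLE bigtable SET (max_local_update = 512);

-- Rewrite tuples at the physical tail so their new versions are
-- placed via the FSM, preferring earlier pages.
UPDATE bigtable SET id = id  -- "id" standing in for the primary key
WHERE ctid IN (SELECT ctid FROM bigtable
               ORDER BY ctid DESC LIMIT 100000);

-- Reclaim the emptied tail pages, then restore the default.
VACUUM bigtable;
ALTER TABLE bigtable RESET (max_local_update);
```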
Kind regards,
Matthias van de Meent
Neon (https://neon.tech/)
Attachments:
v1-0001-Implement-a-reloption-that-forces-updated-tuples-.patch
On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Here's a small patch implementing a new table option max_local_update
(name very much bikesheddable). Value is -1 (default, disabled) or the
size of the table in MiB that you still want to allow to update on the
same page. I didn't yet go for a GUC as I think that has too little
control on the impact on the system.
So IIUC, with this parameter we can control whether, instead of
putting the new version of the tuple on the same page, the update
should choose a page using RelationGetBufferForTuple(), which can
reduce fragmentation because, where there is space, most updated
tuples will be inserted into earlier pages. But this still cannot
truncate pages from the heap, right? Because we cannot guarantee that
the page selected by RelationGetBufferForTuple() is not at the end of
the heap, and until pages at the end of the heap are freed, VACUUM
cannot truncate anything. Is my understanding correct?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, 7 Jul 2023 at 06:53, Dilip Kumar <dilipbalaut@gmail.com> wrote:
So IIUC, with this parameter we can control whether, instead of
putting the new version of the tuple on the same page, the update
should choose a page using RelationGetBufferForTuple(), which can
reduce fragmentation because, where there is space, most updated
tuples will be inserted into earlier pages. But this still cannot
truncate pages from the heap, right? Because we cannot guarantee that
the page selected by RelationGetBufferForTuple() is not at the end of
the heap, and until pages at the end of the heap are freed, VACUUM
cannot truncate anything. Is my understanding correct?
Yes. If you don't have pages with (enough) free space for the updated
tuples in your table, or if the FSM doesn't accurately reflect the
actual state of free space in your table, this won't help (which is
also the reason why I run vacuum in the tests). It also won't help if
you don't update the tuples physically located at the end of your
table, but in the targeted workload this would introduce a bias where
new tuple versions are moved to the front of the table.
Something to note is that this may result in very bad bloat when this
is combined with a low fillfactor: All blocks past max_local_update
will be unable to use space reserved by fillfactor because FSM lookups
always take fillfactor into account, and all updates (which ignore
fillfactor when local) would go through the FSM instead, thus reducing
the space available on each block to exactly the fillfactor. So, this
might need some extra code to make sure we don't accidentally blow up
the table's size with UPDATEs when max_local_update is combined with
low fillfactors. I'm not sure where that would fit best.
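An illustration of that interaction (hypothetical values, and
max_local_update is of course only available with the patch applied):

```sql
-- Reserve 50% of every page for local updates, and disable local
-- updates beyond the first 512 MiB of the table.
ALTER TABLE bigtable SET (fillfactor = 50, max_local_update = 512);
-- Pages beyond 512 MiB can now only be filled to 50% via the FSM,
-- while the same-page updates that could have used the reserved
-- half are forbidden there - so updates may extend the table
-- instead of compacting it.
```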
Kind regards,
Matthias van de Meent
Neon (https://neon.tech/)
On 7/7/23 11:55, Matthias van de Meent wrote:
Here's a small patch implementing a new table option max_local_update
(name very much bikesheddable). Value is -1 (default, disabled) or the
size of the table in MiB that you still want to allow to update on the
same page. I didn't yet go for a GUC as I think that has too little
control on the impact on the system.
I know the thread started as "let's disable HOT" and this essentially
just proposes to do that using a table option. But I wonder if that's
far too simple to be reliable, because hoping RelationGetBufferForTuple
happens to do the right thing does not seem great.
I wonder if we should invent some definition of "strategy" that would
tell RelationGetBufferForTuple what it should aim for ...
I'm imagining either a table option with a couple possible values
(default, non-hot, first-page, ...) or maybe something even more
elaborate (perhaps even a callback?).
Now, it's not my intention to hijack this thread, but this discussion
reminds me of one of the ideas from my "BRIN improvements" talk, about
maybe using BRIN indexes for routing. UPDATEs may be a major issue for
BRIN, making them gradually worse over time. If we could "tell"
RelationGetBufferForTuple() which buffers are more suitable (by looking
at an index, histogram or some approximate mapping), that might help.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jul 7, 2023 at 3:48 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 7/7/23 11:55, Matthias van de Meent wrote:
On Fri, 7 Jul 2023 at 06:53, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:On Wed, 5 Jul 2023 at 19:55, Thom Brown <thom@linux.com> wrote:
On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:So what were you thinking of? A session GUC? A table option?
Both.
Here's a small patch implementing a new table option max_local_update
(name very much bikesheddable). Value is -1 (default, disabled) or the
size of the table in MiB that you still want to allow to update on the
same page. I didn't yet go for a GUC as I think that has too little
control on the impact on the system.So IIUC, this parameter we can control that instead of putting the new
version of the tuple on the same page, it should choose using
RelationGetBufferForTuple(), and that can reduce the fragmentation
because now if there is space then most of the updated tuple will be
inserted in same pages. But this still can not truncate the pages
from the heap right? because we can not guarantee that the new page
selected by RelationGetBufferForTuple() is not from the end of the
heap, and until we free the pages from the end of the heap, the vacuum
can not truncate any page. Is my understanding correct?Yes. If you don't have pages with (enough) free space for the updated
tuples in your table, or if the FSM doesn't accurately reflect the
actual state of free space in your table, this won't help (which is
also the reason why I run vacuum in the tests). It also won't help if
you don't update the tuples physically located at the end of your
table, but in the targeted workload this would introduce a bias where
new tuple versions are moved to the front of the table.Something to note is that this may result in very bad bloat when this
is combined with a low fillfactor: All blocks past max_local_update
will be unable to use space reserved by fillfactor because FSM lookups
always take fillfactor into account, and all updates (which ignore
fillfactor when local) would go through the FSM instead, thus reducing
the space available on each block to exactly the fillfactor. So, this
might need some extra code to make sure we don't accidentally blow up
the table's size with UPDATEs when max_local_update is combined with
low fillfactors. I'm not sure where that would fit best.

I know the thread started as "let's disable HOT" and this essentially
just proposes to do that using a table option. But I wonder if that's
far too simple to be reliable, because hoping RelationGetBufferForTuple
happens to do the right thing does not seem great.

I wonder if we should invent some definition of "strategy" that would
tell RelationGetBufferForTuple what it should aim for ...

I'm imagining either a table option with a couple of possible values
(default, non-hot, first-page, ...) or maybe something even more
elaborate (perhaps even a callback?).

Now, it's not my intention to hijack this thread, but this discussion
reminds me of one of the ideas from my "BRIN improvements" talk, about
maybe using BRIN indexes for routing. UPDATEs may be a major issue for
BRIN, making them gradually worse over time. If we could "tell"
RelationGetBufferForTuple() which buffers are more suitable (by looking
at an index, histogram, or some approximate mapping), that might help.
IMHO that seems like the right direction for this feature to be
useful. Otherwise, just forcing it to select a page using
RelationGetBufferForTuple() without any input or direction to this
function can behave pretty randomly. In fact, there should be some
way to say "insert the new tuple into a lower-numbered block first"
(provided it has free space), and with that we might get an
opportunity for vacuum to truncate some heap pages.
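Dilip's "lower block numbers first" idea can be sketched as a free-space
search biased toward the start of the relation (a toy Python model, not
actual FSM code; the real FSM is organized by free space, not block order):

```python
def lowest_block_with_space(free_space, needed):
    """Return the lowest-numbered block that can hold the new tuple.

    free_space: dict mapping block number -> free bytes (a toy stand-in
    for the FSM); needed: tuple size in bytes.
    """
    for block in sorted(free_space):
        if free_space[block] >= needed:
            return block
    return None  # no page fits: extend the relation instead

# Filling early pages first leaves the tail of the relation empty:
fsm = {0: 100, 1: 4000, 2: 8000, 3: 8000}
print(lowest_block_with_space(fsm, 500))   # 1
```

An empty tail is exactly what VACUUM's truncation phase needs in order to
shrink the file, which is the point of the suggestion.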
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, 2023-07-07 at 16:27 +0530, Dilip Kumar wrote:
On Fri, Jul 7, 2023 at 3:48 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
I'm imagining either a table option with a couple possible values
(default, non-hot, first-page, ...) or maybe something even more
elaborate (perhaps even a callback?).

Now, it's not my intention to hijack this thread, but this discussion
reminds me of one of the ideas from my "BRIN improvements" talk, about
maybe using BRIN indexes for routing. UPDATEs may be a major issue for
BRIN, making them gradually worse over time. If we could "tell"
RelationGetBufferForTuple() which buffers are more suitable (by looking
at an index, histogram, or some approximate mapping), that might help.

IMHO that seems like the right direction for this feature to be
useful.
Right, I agree. A GUC/storage parameter like "update_strategy"
that is an enum (try-hot | first-page | ...).
To preserve BRIN indexes or CLUSTERed tables, there could be an additional
"insert_strategy", but that would somehow have to be tied to a certain
index. I think that is out of scope for this effort.
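A minimal sketch of what such an enum parameter could look like, in Python
for brevity (the name "update_strategy" and its values come from the
suggestion above; nothing here is from a posted patch):

```python
from enum import Enum

# Hypothetical enum storage parameter, as suggested in this thread.
class UpdateStrategy(Enum):
    TRY_HOT = "try-hot"        # current behaviour: prefer same-page HOT updates
    FIRST_PAGE = "first-page"  # prefer the earliest page with free space

def parse_update_strategy(value: str) -> UpdateStrategy:
    # Reloption parsing would reject anything outside the allowed values.
    return UpdateStrategy(value)

print(parse_update_strategy("first-page"))  # UpdateStrategy.FIRST_PAGE
```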
Yours,
Laurenz Albe
On Thu, 6 Jul 2023 at 21:18, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Wed, 5 Jul 2023 at 19:55, Thom Brown <thom@linux.com> wrote:
On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
So what were you thinking of? A session GUC? A table option?

Both.

Here's a small patch implementing a new table option max_local_update
(name very much bikesheddable). Value is -1 (default, disabled) or the
size of the table in MiB within which you still want to allow updates on the
same page. I didn't yet go for a GUC as I think that gives too little
control over the impact on the system.

I decided that max_local_update would be in MiB because there is no
reloption value that can contain MaxBlockNumber and -1/disabled; and 1
MiB seems like enough granularity for essentially all use cases.

The added regression tests demonstrate how the feature works, validate that the
new feature works, and confirm that the lock level is acceptable
(ShareUpdateExclusiveLock, same as for updating fillfactor).
Wow, thanks for working on this.
I've given it a test, and it does what I would expect it to do.
I'm aware of the concerns about the potential for the relocation to
land in an undesirable location, so perhaps that needs addressing.
But this is already considerably better than the current need to
update a row until it gets pushed off its current page. Ideally there
would be tooling built around this where the user wouldn't need to
figure out how much of the table to UPDATE, or deal with VACUUMing
concerns.
But here's my quick test:
CREATE OR REPLACE FUNCTION compact_table(table_name IN TEXT)
RETURNS VOID AS $$
DECLARE
    current_row RECORD;
    old_ctid TID;
    new_ctid TID;
    keys TEXT;
    update_query TEXT;
    row_counter INTEGER := 0;
BEGIN
    SELECT string_agg(a.attname || ' = ' || a.attname, ', ')
    INTO keys
    FROM pg_index i
    JOIN pg_attribute a ON a.attnum = ANY (i.indkey)
    WHERE i.indrelid = table_name::regclass
      AND a.attrelid = table_name::regclass
      AND i.indisprimary;

    IF keys IS NULL THEN
        RAISE EXCEPTION 'Table % does not have a primary key.', table_name;
    END IF;

    FOR current_row IN
        EXECUTE FORMAT('SELECT ctid, * FROM %I ORDER BY ctid DESC', table_name)
    LOOP
        old_ctid := current_row.ctid;
        update_query := FORMAT('UPDATE %I SET %s WHERE ctid = $1 RETURNING ctid',
                               table_name, keys);
        EXECUTE update_query USING old_ctid INTO new_ctid;
        row_counter := row_counter + 1;

        IF row_counter % 1000 = 0 THEN
            RAISE NOTICE '% rows relocated.', row_counter;
        END IF;

        IF new_ctid <= old_ctid THEN
            CONTINUE;
        ELSE
            RAISE NOTICE 'All non-contiguous rows relocated.';
            EXIT;
        END IF;
    END LOOP;
END; $$
LANGUAGE plpgsql;
postgres=# CREATE TABLE bigtable (id int, content text);
CREATE TABLE
postgres=# INSERT INTO bigtable SELECT x, 'This is just a way to fill
up space.' FROM generate_series(1,10000000) a(x);
INSERT 0 10000000
postgres=# DELETE FROM bigtable WHERE id % 7 = 0;
DELETE 1428571
postgres=# VACUUM bigtable;
VACUUM
postgres=# ALTER TABLE bigtable SET (max_local_update = 0);
ALTER TABLE
postgres=# ALTER TABLE bigtable ADD PRIMARY KEY (id);
ALTER TABLE
postgres=# \dt+ bigtable
                                      List of relations
 Schema |   Name   | Type  | Owner | Persistence | Access method |  Size  | Description
--------+----------+-------+-------+-------------+---------------+--------+-------------
 public | bigtable | table | thom  | permanent   | heap          | 730 MB |
(1 row)
postgres=# SELECT * FROM pgstattuple('bigtable');
 table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
-----------+-------------+-----------+---------------+------------------+----------------+--------------------+------------+--------------
 765607936 |     8571429 | 557142885 |         72.77 |                0 |              0 |                  0 |  105901628 |        13.83
(1 row)
postgres=# SELECT compact_table('bigtable');
NOTICE: 1000 rows relocated.
NOTICE: 2000 rows relocated.
NOTICE: 3000 rows relocated.
NOTICE: 4000 rows relocated.
...
NOTICE: 1221000 rows relocated.
NOTICE: 1222000 rows relocated.
NOTICE: 1223000 rows relocated.
NOTICE: 1224000 rows relocated.
NOTICE: All non-contiguous rows relocated.
 compact_table
---------------

(1 row)
postgres=# VACUUM bigtable;
VACUUM
postgres=# \dt+ bigtable;
                                      List of relations
 Schema |   Name   | Type  | Owner | Persistence | Access method |  Size  | Description
--------+----------+-------+-------+-------------+---------------+--------+-------------
 public | bigtable | table | thom  | permanent   | heap          | 626 MB |
(1 row)
postgres=# SELECT * FROM pgstattuple('bigtable');
 table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
-----------+-------------+-----------+---------------+------------------+----------------+--------------------+------------+--------------
 656236544 |     8571429 | 557142885 |          84.9 |                0 |              0 |                  0 |    2564888 |         0.39
(1 row)
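As a sanity check, the numbers above are internally consistent: the drop in
table_len roughly matches the free_space pgstattuple reported before
compaction:

```python
# Cross-checking the pgstattuple output quoted above.
before_len, after_len = 765607936, 656236544   # table_len before/after
free_before = 105901628                        # free_space before

reclaimed = before_len - after_len
print(reclaimed // 1024**2)     # 104 (MiB actually reclaimed)
print(free_before // 1024**2)   # 100 (MiB reported free beforehand)
```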
Works for me.
Thom
On Fri, 7 Jul 2023 at 13:18, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 7/7/23 11:55, Matthias van de Meent wrote:
On Fri, 7 Jul 2023 at 06:53, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Wed, 5 Jul 2023 at 19:55, Thom Brown <thom@linux.com> wrote:
On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
So what were you thinking of? A session GUC? A table option?

Both.

Here's a small patch implementing a new table option max_local_update
(name very much bikesheddable). Value is -1 (default, disabled) or the
size of the table in MiB within which you still want to allow updates on the
same page. I didn't yet go for a GUC as I think that gives too little
control over the impact on the system.

So IIUC, with this parameter we can control whether, instead of putting the new
version of the tuple on the same page, it should choose a page using
RelationGetBufferForTuple(), and that can reduce fragmentation
because, if there is space, most of the updated tuples will be
inserted into existing pages. But this still cannot truncate pages
from the heap, right? Because we cannot guarantee that the new page
selected by RelationGetBufferForTuple() is not at the end of the
heap, and until we free the pages at the end of the heap, vacuum
cannot truncate any pages. Is my understanding correct?

Yes. If you don't have pages with (enough) free space for the updated
tuples in your table, or if the FSM doesn't accurately reflect the
actual state of free space in your table, this won't help (which is
also the reason why I run vacuum in the tests). It also won't help if
you don't update the tuples physically located at the end of your
table, but in the targeted workload this would introduce a bias where
new tuple versions are moved to the front of the table.

Something to note is that this may result in very bad bloat when this
is combined with a low fillfactor: All blocks past max_local_update
will be unable to use space reserved by fillfactor because FSM lookups
always take fillfactor into account, and all updates (which ignore
fillfactor when local) would go through the FSM instead, thus reducing
the space available on each block to exactly the fillfactor. So, this
might need some extra code to make sure we don't accidentally blow up
the table's size with UPDATEs when max_local_update is combined with
low fillfactors. I'm not sure where that would fit best.

I know the thread started as "let's disable HOT" and this essentially
just proposes to do that using a table option. But I wonder if that's
far too simple to be reliable, because hoping RelationGetBufferForTuple
happens to do the right thing does not seem great.

I wonder if we should invent some definition of "strategy" that would
tell RelationGetBufferForTuple what it should aim for ...

I'm imagining either a table option with a couple of possible values
(default, non-hot, first-page, ...) or maybe something even more
elaborate (perhaps even a callback?).

Now, it's not my intention to hijack this thread, but this discussion
reminds me of one of the ideas from my "BRIN improvements" talk, about
maybe using BRIN indexes for routing. UPDATEs may be a major issue for
BRIN, making them gradually worse over time. If we could "tell"
RelationGetBufferForTuple() which buffers are more suitable (by looking
at an index, histogram, or some approximate mapping), that might help.
Just as another point in support of strategy-based/extensible tuple
placement: I would at some point like to try placing INSERT ON CONFLICT
tuples on the same page as the preceding key in the index. The use case
is tables with a (series, timestamp) primary key, to get locality of
access when range-scanning a single series. Placement will always be a
trade-off that is dependent on hardware and workload, and the effect
can be pretty large. For the mentioned use case, if placement can
maintain some semblance of clustering, there will be a 10-100x
reduction in buffers accessed for a relatively minor increase in
bloat.
--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
On Thu, 2023-07-06 at 22:18 +0200, Matthias van de Meent wrote:
On Wed, 5 Jul 2023 at 19:55, Thom Brown <thom@linux.com> wrote:
On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
So what were you thinking of? A session GUC? A table option?

Both.

Here's a small patch implementing a new table option max_local_update
(name very much bikesheddable). Value is -1 (default, disabled) or the
size of the table in MiB within which you still want to allow updates on the
same page. I didn't yet go for a GUC as I think that gives too little
control over the impact on the system.

I decided that max_local_update would be in MiB because there is no
reloption value that can contain MaxBlockNumber and -1/disabled; and 1
MiB seems like enough granularity for essentially all use cases.

The added regression tests demonstrate how the feature works, validate that the
new feature works, and confirm that the lock level is acceptable
(ShareUpdateExclusiveLock, same as for updating fillfactor).
I have looked at your patch, and I must say that I like it. Having
a size limit is better than my original idea of just "on" or "off".
Essentially, it is "try to shrink the table if it grows above a limit".
The patch builds fine and passes all regression tests.
Documentation is missing.
I agree that the name "max_local_update" could be improved.
Perhaps "avoid_hot_above_size_mb".
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -342,6 +342,7 @@ typedef struct StdRdOptions
int parallel_workers; /* max number of parallel workers */
StdRdOptIndexCleanup vacuum_index_cleanup; /* controls index vacuuming */
bool vacuum_truncate; /* enables vacuum to truncate a relation */
+ int max_local_update; /* Updates to pages after this block must go through the VM */
} StdRdOptions;
#define HEAP_MIN_FILLFACTOR 10
In the comment, it should be FSM, not VM, right?
Other than that, I see nothing wrong.
Yours,
Laurenz Albe
On Wed, 19 Jul 2023, 13:58 Laurenz Albe, <laurenz.albe@cybertec.at> wrote:
On Thu, 2023-07-06 at 22:18 +0200, Matthias van de Meent wrote:
On Wed, 5 Jul 2023 at 19:55, Thom Brown <thom@linux.com> wrote:
On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
So what were you thinking of? A session GUC? A table option?

Both.

Here's a small patch implementing a new table option max_local_update
(name very much bikesheddable). Value is -1 (default, disabled) or the
size of the table in MiB within which you still want to allow updates on the
same page. I didn't yet go for a GUC as I think that gives too little
control over the impact on the system.

I decided that max_local_update would be in MiB because there is no
reloption value that can contain MaxBlockNumber and -1/disabled; and 1
MiB seems like enough granularity for essentially all use cases.

The added regression tests demonstrate how the feature works, validate that the
new feature works, and confirm that the lock level is acceptable
(ShareUpdateExclusiveLock, same as for updating fillfactor).

I have looked at your patch, and I must say that I like it. Having
a size limit is better than my original idea of just "on" or "off".
Essentially, it is "try to shrink the table if it grows above a limit".

The patch builds fine and passes all regression tests.
Documentation is missing.

I agree that the name "max_local_update" could be improved.
Perhaps "avoid_hot_above_size_mb".
Or "hot_table_size_threshold" or "hot_update_limit"?
Thom
On Fri, 7 Jul 2023 at 12:18, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 7/7/23 11:55, Matthias van de Meent wrote:
On Fri, 7 Jul 2023 at 06:53, Dilip Kumar <dilipbalaut@gmail.com> wrote:
So IIUC, with this parameter we can control whether, instead of putting the new
version of the tuple on the same page, it should choose a page using
RelationGetBufferForTuple(), and that can reduce fragmentation
because, if there is space, most of the updated tuples will be
inserted into existing pages. But this still cannot truncate pages
from the heap, right? Because we cannot guarantee that the new page
selected by RelationGetBufferForTuple() is not at the end of the
heap, and until we free the pages at the end of the heap, vacuum
cannot truncate any pages. Is my understanding correct?

Yes. If you don't have pages with (enough) free space for the updated
tuples in your table, or if the FSM doesn't accurately reflect the
actual state of free space in your table, this won't help (which is
also the reason why I run vacuum in the tests). It also won't help if
you don't update the tuples physically located at the end of your
table, but in the targeted workload this would introduce a bias where
new tuple versions are moved to the front of the table.

Something to note is that this may result in very bad bloat when this
is combined with a low fillfactor: All blocks past max_local_update
will be unable to use space reserved by fillfactor because FSM lookups
always take fillfactor into account, and all updates (which ignore
fillfactor when local) would go through the FSM instead, thus reducing
the space available on each block to exactly the fillfactor. So, this
might need some extra code to make sure we don't accidentally blow up
the table's size with UPDATEs when max_local_update is combined with
low fillfactors. I'm not sure where that would fit best.

I know the thread started as "let's disable HOT" and this essentially
just proposes to do that using a table option. But I wonder if that's
far too simple to be reliable, because hoping RelationGetBufferForTuple
happens to do the right thing does not seem great.

I wonder if we should invent some definition of "strategy" that would
tell RelationGetBufferForTuple what it should aim for ...

I'm imagining either a table option with a couple of possible values
(default, non-hot, first-page, ...) or maybe something even more
elaborate (perhaps even a callback?).
I mostly agree, but the point is that first we have to get the update
away from the page. Once we've done that, we can start getting smart
about placement in RelationGetBufferForTuple, but unless we decide
not to put the tuple on the old tuple's page, no code from
RelationGetBufferForTuple is executed.

We could change the update code to always go through
RelationGetBufferForTuple to determine the target buffer, and make
that function consider page-local updates (instead of heap_update,
which does that now), but I think that'd need significant extra work in
other call sites of RelationGetBufferForTuple as well as in that
function itself.
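The control flow described above can be modelled roughly as follows (a toy
Python sketch, not PostgreSQL source; the parameter and function names are
illustrative): heap_update keeps qualifying updates page-local, and only
otherwise consults RelationGetBufferForTuple and the FSM.

```python
def choose_target_block(old_block, old_page_free, tuple_len,
                        max_local_update_blocks, fsm_lookup):
    # Page-local update: heap_update keeps the new version on the old
    # page when there is room and (with the patch) the old block lies
    # within the max_local_update limit. In that case
    # RelationGetBufferForTuple is never called.
    if old_page_free >= tuple_len and old_block < max_local_update_blocks:
        return old_block
    # Otherwise the FSM is consulted, which is the only path that can
    # move the new version to a different (possibly earlier) page.
    return fsm_lookup(tuple_len)

# With a limit of 0 blocks, every update goes through the FSM:
print(choose_target_block(10, 8000, 100, 0, lambda n: 3))  # 3
```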
Now, it's not my intention to hijack this thread, but this discussion
reminds me of one of the ideas from my "BRIN improvements" talk, about
maybe using BRIN indexes for routing. UPDATEs may be a major issue for
BRIN, making them gradually worse over time. If we could "tell"
RelationGetBufferForTuple() which buffers are more suitable (by looking
at an index, histogram, or some approximate mapping), that might help.
Improved tuple routing sounds like a great idea, and I've thought
about it as well. I'm not sure whether BRIN (as-is) is the best
candidate though, considering its O(N) scan complexity: 100 GB-scale
tables can reasonably have BRIN indexes measuring megabytes, and
running a scan on that is not likely to perform well.

If BRIN had hierarchical summaries (e.g. if we had range summaries for
data stored in every nonnegative power of 16 of page ranges), then we
could reduce that to something more reasonable, but that's not
currently implemented, so I don't think it's quite relevant yet.
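Some back-of-envelope numbers behind the O(N) concern, assuming 8 KiB pages
and the default BRIN pages_per_range of 128:

```python
import math

PAGE = 8192
pages = 100 * 1024**3 // PAGE        # heap pages in a 100 GB table
ranges = pages // 128                # BRIN range summaries to scan
print(pages, ranges)                 # 13107200 102400

# With hierarchical summaries fanning out by powers of 16, a lookup
# would descend ~log16(ranges) levels instead of scanning them all:
print(math.ceil(math.log(ranges, 16)))  # 5
```

So a linear scan touches on the order of a hundred thousand summaries,
whereas a hierarchical lookup would need only a handful of levels.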
Kind regards,
Matthias van de Meent
Neon (https://neon.tech/)
On Wed, 19 Jul 2023 at 15:13, Thom Brown <thom@linux.com> wrote:
On Wed, 19 Jul 2023, 13:58 Laurenz Albe, <laurenz.albe@cybertec.at> wrote:
I agree that the name "max_local_update" could be improved.
Perhaps "avoid_hot_above_size_mb".

Or "hot_table_size_threshold" or "hot_update_limit"?

Although I like these names, they don't quite cover the use of the
parameter for me, as updated tuples prefer to be inserted on the same
page as the old tuple regardless of whether HOT applies.
Example: a bloated table

    test(
        id int primary key,
        num_updates int,
        unique (id, num_updates)
    )

would be assumed to remain bloated if I'd set a parameter named
something_hot_something, as all updates would be non-hot and thus
should not be influenced by the GUC/parameter.
How about 'local_update_limit'?
Kind regards,
Matthias van de Meent