Open issues for HOT patch
I have finished a first review pass over all of the HOT patch
(updated code is posted on -patches). I haven't found any showstoppers,
but there still seem to be several areas that need discussion:
* The patch makes undocumented changes that cause autovacuum's decisions
to be driven by total estimated dead space rather than total number of
dead tuples. Do we like this? What should happen to the default
threshold parameters (they are not even in the same units as before...)?
Is there any value in even continuing to track dead tuple counts, per
se, in the pgstats machinery? It seems redundant/expensive to track
both tuple counts and byte counts, and it's not like the size of the
stats file is not already known to be a performance issue ...
* I'm still pretty unhappy about the patch's use of a relcache copy of
GetAvgFSMRequestSize()'s result. The fact that there's no provision for
ever updating the value while the relcache entry lives is part of it,
but the bigger part is that I'd rather not have anything at all
depending on that number. FSM in its current form desperately needs to
die; and once it's replaced by some form of distributed on-disk storage,
it's unlikely that we will have any simple means of getting an
equivalent number. The average request size was never meant for
external use anyway, but only as a filter to help reject useless entries
from getting into the limited shared-memory FSM space. Perhaps we could
replace that heuristic with something that is page-local; seems like
dividing the total used space by the number of item pointers would give
at least a rough approximation of the page's average tuple size.
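A rough sketch of that page-local estimate, using a simplified stand-in for the page header (BLCKSZ, the field names, and the sizes here are illustrative assumptions, not the actual PageHeaderData layout):

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192

/* Simplified stand-in for the page header fields involved; the real
 * layout lives in PostgreSQL's bufpage.h. */
typedef struct PageHdrModel
{
    uint16_t pd_lower;     /* end of the line pointer array */
    uint16_t pd_upper;     /* start of tuple storage (tuples grow down) */
    uint16_t header_size;  /* bytes before the line pointer array */
    uint16_t item_id_size; /* bytes per line pointer */
} PageHdrModel;

/* Heuristic from the text: total used tuple space divided by the
 * number of item pointers gives a rough average tuple size. */
static int
avg_tuple_size(const PageHdrModel *p)
{
    int nitems = (p->pd_lower - p->header_size) / p->item_id_size;
    int used = BLCKSZ - p->pd_upper;  /* bytes consumed by tuple data */

    return nitems > 0 ? used / nitems : 0;
}
```

As noted later in the thread, this skews low once line pointers bloat, since redirected, dead, and unused pointers consume no tuple storage.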
* We also need to think harder about when to invoke the page pruning
code. As the patch stands, if you set a breakpoint at
heap_page_prune_opt it'll seem to be hit constantly (eg, once for every
system catalog probe), which seems uselessly often. And yet it also
seems not often enough, because one thing I found out real fast is that
the "prune if free space < 1.2 average tuple size" heuristic fails badly
when you look at queries that execute multiple updates within the same
heap page. We only prune when we first pin a particular target page,
and so the additional updates don't afford another chance to see if it's
time to prune.
I'd like to see if we can arrange to only do pruning when reading a page
that is known to be an update target (ie, never during plain SELECTs);
I suspect this would be relatively easy with some executor and perhaps
planner changes. But that only fixes the first half of the gripe above;
I'm not at all sure what to do about the multiple-updates-per-page
issue.
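A toy model of that failure mode (all names and numbers here are invented for illustration) shows why a one-shot decision at pin time is not enough:

```c
#include <assert.h>
#include <stdbool.h>

/* The prune-or-not decision is taken once, when the page is first
 * pinned, using "prune if free space < 1.2 * average tuple size". */
typedef struct {
    int free_space;
    int reclaimable;    /* space pruning would recover */
    bool pinned;
} DemoPage;

static void pin_and_maybe_prune(DemoPage *pg, int avg_tuple_size)
{
    /* free_space < 1.2 * avg_tuple_size, in integer arithmetic */
    if (!pg->pinned && pg->free_space * 10 < avg_tuple_size * 12)
    {
        pg->free_space += pg->reclaimable;  /* prune */
        pg->reclaimable = 0;
    }
    pg->pinned = true;  /* later updates see the page already pinned */
}

static bool hot_update(DemoPage *pg, int tuple_size)
{
    pin_and_maybe_prune(pg, tuple_size);
    if (pg->free_space < tuple_size)
        return false;   /* stuck: would have to go COLD */
    pg->free_space -= tuple_size;
    return true;
}
```

With free space at 1.5x the tuple size the heuristic declines to prune, the first update eats most of the room, and the second update on the same page is stuck because the decision is never revisited.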
Comments?
regards, tom lane
Tom Lane wrote:
* We also need to think harder about when to invoke the page pruning
code. As the patch stands, if you set a breakpoint at
heap_page_prune_opt it'll seem to be hit constantly (eg, once for every
system catalog probe), which seems uselessly often. And yet it also
seems not often enough, because one thing I found out real fast is that
the "prune if free space < 1.2 average tuple size" heuristic fails badly
when you look at queries that execute multiple updates within the same
heap page. We only prune when we first pin a particular target page,
and so the additional updates don't afford another chance to see if it's
time to prune.
I'd like to see if we can arrange to only do pruning when reading a page
that is known to be an update target (ie, never during plain SELECTs);
I suspect this would be relatively easy with some executor and perhaps
planner changes. But that only fixes the first half of the gripe above;
I'm not at all sure what to do about the multiple-updates-per-page
issue.
If we only prune on an update (or insert) why not just do prune every
time? I figure the prune/defrag has to be lighter than the
update/insert itself.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
If we only prune on an update (or insert) why not just do prune every
time?
The problem is you can't prune anymore once you have existing pin on the
target page. I'd really like to get around that, but so far it seems
unacceptably fragile --- the executor really doesn't expect tuples to
get moved around underneath it.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
If we only prune on an update (or insert) why not just do prune every
time?
The problem is you can't prune anymore once you have existing pin on the
target page. I'd really like to get around that, but so far it seems
unacceptably fragile --- the executor really doesn't expect tuples to
get moved around underneath it.
I thought you could do the pruning before you pin the page only in
update/insert cases.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
Tom Lane wrote:
The problem is you can't prune anymore once you have existing pin on the
target page. I'd really like to get around that, but so far it seems
unacceptably fragile --- the executor really doesn't expect tuples to
get moved around underneath it.
I thought you could do the pruning before you pin the page only in
update/insert cases.
But then what happens when you want to update a second tuple on the same
page? None of our existing plan types release and reacquire pin if they
don't have to, and I really doubt that we want to give up that
optimization.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Tom Lane wrote:
The problem is you can't prune anymore once you have existing pin on the
target page. I'd really like to get around that, but so far it seems
unacceptably fragile --- the executor really doesn't expect tuples to
get moved around underneath it.
I thought you could do the pruning before you pin the page only in
update/insert cases.
But then what happens when you want to update a second tuple on the same
page? None of our existing plan types release and reacquire pin if they
don't have to, and I really doubt that we want to give up that
optimization.
You will prune when you lock the page and at that point unless you got
enough room for both tuples I doubt trying just before the second tuple
is going to help.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:
If we only prune on an update (or insert) why not just do prune every
time? I figure the prune/defrag has to be lighter than the
update/insert itself.
Pruning is a quite costly operation. You need to check the visibility of
each tuple on the page, following tuple chains as you go, mark line
pointers as not used or redirected, and finally memmove all the tuples
to remove the gaps between them. And it needs to be WAL-logged.
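The defragmentation step alone can be sketched with a toy page model (names and layout are simplified stand-ins, not PostgreSQL's actual page format; visibility checks, tuple-chain following, redirection, and WAL logging are all omitted):

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 256
#define MAX_ITEMS 8

/* Toy heap page: tuples grow downward from the end, line pointers
 * record each tuple's offset and length. */
typedef struct { int offset; int len; int dead; } ItemIdModel;

typedef struct {
    char data[PAGE_SIZE];
    ItemIdModel items[MAX_ITEMS];
    int nitems;
    int pd_upper;   /* lowest byte offset occupied by tuple data */
} PageModel;

/* Dead items lose their storage (their line pointers become "not
 * used"), and surviving tuples are moved to close the gaps. */
static void compact_page(PageModel *pg)
{
    char tmp[PAGE_SIZE];
    int upper = PAGE_SIZE;

    for (int i = 0; i < pg->nitems; i++)
    {
        ItemIdModel *it = &pg->items[i];

        if (it->dead)
        {
            it->offset = 0;
            it->len = 0;            /* mark line pointer unused */
            continue;
        }
        upper -= it->len;           /* repack from the end of the page */
        memcpy(tmp + upper, pg->data + it->offset, it->len);
        it->offset = upper;
    }
    /* Copy the repacked tuples back in one move */
    memcpy(pg->data + upper, tmp + upper, PAGE_SIZE - upper);
    pg->pd_upper = upper;
}
```

Even in this stripped-down form the tuple moves are visible; the real operation additionally pays for the visibility checks and WAL record described above.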
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote:
* We also need to think harder about when to invoke the page pruning
code. As the patch stands, if you set a breakpoint at
heap_page_prune_opt it'll seem to be hit constantly (eg, once for every
system catalog probe), which seems uselessly often. And yet it also
seems not often enough, because one thing I found out real fast is that
the "prune if free space < 1.2 average tuple size" heuristic fails badly
when you look at queries that execute multiple updates within the same
heap page. We only prune when we first pin a particular target page,
and so the additional updates don't afford another chance to see if it's
time to prune.
I'd like to see if we can arrange to only do pruning when reading a page
that is known to be an update target (ie, never during plain SELECTs);
I suspect this would be relatively easy with some executor and perhaps
planner changes. But that only fixes the first half of the gripe above;
I'm not at all sure what to do about the multiple-updates-per-page
issue.
There is one wacky idea I haven't dared to propose yet:
We could lift the limitation that you can't defragment a page that's
pinned, if we play some smoke and mirrors in the buffer manager. When
you prune a page, make a *copy* of the page you're pruning, and keep
both versions in the buffer cache. Old pointers keep pointing to the old
version. Any new calls to ReadBuffer will return the new copy, and the
old copy can be dropped when its pin count drops to zero.
Tracking multiple copies of a page requires some changes to the buffer
manager. LockBuffer would need to return the latest version of the page,
because anything that checks visibility or does updates would need to
use the latest copy, and callers of LockBuffer would need to be changed
accordingly.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Bruce Momjian <bruce@momjian.us> writes:
Tom Lane wrote:
But then what happens when you want to update a second tuple on the same
page? None of our existing plan types release and reacquire pin if they
don't have to, and I really doubt that we want to give up that
optimization.
You will prune when you lock the page and at that point unless you got
enough room for both tuples I doubt trying just before the second tuple
is going to help.
No, you're missing the point completely. If the free space on the page
is, say, 1.5x the average tuple size, the code *won't* prune, and then
it will be stuck when it goes to do the second tuple update, because
there is no chance to reconsider the prune/no-prune decision after some
space is eaten by the first update.
regards, tom lane
On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
* I'm still pretty unhappy about the patch's use of a relcache copy of
GetAvgFSMRequestSize()'s result. The fact that there's no provision for
ever updating the value while the relcache entry lives is part of it,
but the bigger part is that I'd rather not have anything at all
depending on that number.
We could fix the first part by adding relcache invalidation whenever
the average FSM request size changes by a certain margin. But
I am not insisting on using the avgFSM mechanism to decide when
to prune.
Perhaps we could
replace that heuristic with something that is page-local; seems like
dividing the total used space by the number of item pointers would give
at least a rough approximation of the page's average tuple size.
We might get it completely wrong unless we know the number of
normal line pointers (redirected, dead and unused line pointers
do not take any real storage).
Another option would be to prune whenever the free space goes
below table fillfactor and hope that users would set fillfactor so that
at least one updated tuple can fit in the block. I know it's not best to
rely on the users, though. But it can be a good hint.
Yet another option would be to set a hint on the page whenever we
fail to do HOT update because of not-enough-free-space in the
block. Next time we shall prune, and so the subsequent updates
would be HOT updates.
None of these are perfect though.
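The hint-bit option can be sketched with a toy page model (PD_PRUNE_WANTED is an invented flag name, and the structure here is a simplified assumption, not the patch's actual code):

```c
#include <assert.h>
#include <stdbool.h>

#define PD_PRUNE_WANTED 0x0001  /* invented page-level hint flag */

typedef struct {
    unsigned flags;
    int free_space;
    int dead_space;     /* space pruning could reclaim */
} ToyPage;

/* When a HOT update fails for lack of free space, remember that on
 * the page so the next visitor prunes first. */
static bool try_hot_update(ToyPage *pg, int tuple_len)
{
    if (pg->free_space >= tuple_len)
    {
        pg->free_space -= tuple_len;
        return true;            /* HOT update fit on the page */
    }
    pg->flags |= PD_PRUNE_WANTED;   /* prune on next visit */
    return false;               /* caller falls back to a COLD update */
}

static void maybe_prune(ToyPage *pg)
{
    if (pg->flags & PD_PRUNE_WANTED)
    {
        pg->free_space += pg->dead_space;   /* reclaim dead tuples */
        pg->dead_space = 0;
        pg->flags &= ~PD_PRUNE_WANTED;
    }
}
```

The first failed HOT update goes COLD, but it arms the hint, so the page is pruned before the next update arrives.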
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
There is one wacky idea I haven't dared to propose yet:
We could lift the limitation that you can't defragment a page that's
pinned, if we play some smoke and mirrors in the buffer manager. When
you prune a page, make a *copy* of the page you're pruning, and keep
both versions in the buffer cache. Old pointers keep pointing to the old
version. Any new calls to ReadBuffer will return the new copy, and the
old copy can be dropped when its pin count drops to zero.
No, that's way too wacky. How do you prevent people from making further
changes to the "old" version? For instance, marking a tuple deleted?
The actual practical application we have, I think, would only require
being able to defrag a page that our own backend has pins on, which is
something that might be more workable --- but it still seems awfully
fragile. It could maybe be made to work in the simplest case of a
plain UPDATE, because in practice I think the executor will never
reference the old tuple's contents after heap_update() returns. But
this falls down in more complex situations involving joins --- we might
continue to try to join the same "old" tuple to other rows, and then any
pass-by-reference Datums we are using are corrupt if the tuple got
moved.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Tom Lane wrote:
But then what happens when you want to update a second tuple on the same
page? None of our existing plan types release and reacquire pin if they
don't have to, and I really doubt that we want to give up that
optimization.
You will prune when you lock the page and at that point unless you got
enough room for both tuples I doubt trying just before the second tuple
is going to help.
No, you're missing the point completely. If the free space on the page
is, say, 1.5x the average tuple size, the code *won't* prune, and then
it will be stuck when it goes to do the second tuple update, because
there is no chance to reconsider the prune/no-prune decision after some
space is eaten by the first update.
My point is that if you only do this for INSERT/UPDATE, you can prune
when you have less than enough room for 3-4 tuples, and if you add the
xmin of the earliest prune xact you can prune even more aggressively.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Perhaps we could
replace that heuristic with something that is page-local; seems like
dividing the total used space by the number of item pointers would give
at least a rough approximation of the page's average tuple size.
We might get it completely wrong unless we know the number of
normal line pointers (redirected, dead and unused line pointers
do not take any real storage).
Sure, but it's only a heuristic anyway. Probably a more serious
objection is that it fails in the wrong direction: if you start to get
line pointer bloat then the estimated average tuple size goes down,
making it less likely to prune instead of more. But maybe do something
that looks at free space and number of pointers independently, rather
than operating in terms of average tuple size?
Another option would be to prune whenever the free space goes
below table fillfactor and hope that users would set fillfactor so that
at least one updated tuple can fit in the block. I know it's not best to
rely on the users, though. But it can be a good hint.
If default fillfactor weren't 100% then this might be good ;-). But
we could use max(1-fillfactor, BLCKSZ/8) or some such.
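One possible reading of that shorthand, expressed in bytes (this interpretation is an assumption, not taken from the patch): prune once free space drops below the larger of the space fillfactor reserves and one-eighth of a block.

```c
#include <assert.h>

#define BLCKSZ 8192

static int imax(int a, int b) { return a > b ? a : b; }

/* fillfactor_pct is the table's fillfactor as a percentage
 * (100 is the default). */
static int prune_threshold(int fillfactor_pct)
{
    int reserved = (BLCKSZ * (100 - fillfactor_pct)) / 100;

    return imax(reserved, BLCKSZ / 8);
}
```

With the default fillfactor of 100 the reserved space is zero, so the BLCKSZ/8 floor is what keeps the heuristic from degenerating.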
Yet another option would be to set a hint on the page whenever we
fail to do HOT update because of not-enough-free-space in the
block. Next time we shall prune and so the subsequent updates
would be HOT update.
This would be a good idea independent of anything else, I think.
There's plenty of room for page hint bits, and the write will be
"free" since we must set the old tuple XMAX anyway.
regards, tom lane
On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Another option would be to prune whenever the free space goes
below table fillfactor
If default fillfactor weren't 100% then this might be good ;-). But
we could use max(1-fillfactor, BLCKSZ/8) or some such.
Yet another option would be to set a hint on the page whenever we
fail to do HOT update
This would be a good idea independent of anything else, I think.
Or may be a combination of the above two would work well. If the user
has not specified any fillfactor (and so table is using default 100), the
first update is most likely a COLD update and we set the hint bit. The
retired tuple is then pruned before the next update comes and we shall
do HOT update. OTOH if the user has specified a fillfactor less than
100 to avoid any COLD update, the first update will be HOT. At that point,
the used space would go above fillfactor and hence we shall prune
before the next update comes, making room for the next HOT update.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
I wrote:
* The patch makes undocumented changes that cause autovacuum's decisions
to be driven by total estimated dead space rather than total number of
dead tuples. Do we like this?
No one seems to have picked up on this point, but after reflection
I think there's actually a pretty big problem here. Per-page pruning
is perfectly capable of keeping dead space in check. In a system with
HOT running well, the reasons to vacuum a table will be:
1. Remove dead index entries.
2. Remove LP_DEAD line pointers.
3. Truncate off no-longer-used end pages.
4. Transfer knowledge about free space into FSM.
Pruning cannot accomplish #1, #2, or #3, and without significant changes
in the FSM infrastructure it has no hope about #4 either. What I'm
afraid of is that steady page-level pruning will keep the amount of dead
space low, causing autovacuum never to fire, causing the indexes to
bloat indefinitely because of #1 and the table itself to bloat
indefinitely because of #2 and #4. Thus, the proposed change in
autovacuum seems badly misguided: instead of making autovacuum trigger
on things that only it can fix, it makes autovacuum trigger on something
that per-page pruning can deal with perfectly well.
I'm inclined to think that we should continue to drive autovac off a
count of dead rows, as this is directly related to points #1 and #2,
and doesn't seem any worse for #3 and #4 than an estimate based on space
would be. Possibly it would be sensible for per-page pruning to report
a reduction in number of dead rows when it removes heap-only tuples,
but I'm not entirely sure --- any thoughts?
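For reference, the dead-row trigger being argued for here is the existing autovacuum formula: vacuum a table once its dead-tuple count exceeds autovacuum_vacuum_threshold plus autovacuum_vacuum_scale_factor times reltuples. A minimal sketch:

```c
#include <assert.h>
#include <stdbool.h>

/* The standard autovacuum dead-row check; parameter values come from
 * the corresponding GUCs. */
static bool needs_autovacuum(double reltuples, double n_dead_tup,
                             double base_thresh, double scale_factor)
{
    double vacthresh = base_thresh + scale_factor * reltuples;

    return n_dead_tup > vacthresh;
}
```

The question in the text is only what should feed n_dead_tup once per-page pruning is removing heap-only tuples on its own.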
If we do this, then it's not clear that having pgstats track dead space
is worth the trouble at all. It might possibly be of value for testing
purposes to see how well pruning is doing, but I'm unconvinced that it's
worth bloating stats messages and files to have this number in a
production system. An alternative that would serve as well for testing
would be to teach contrib/pgstattuple to measure dead space.
Comments?
regards, tom lane
Tom Lane wrote:
I wrote:
* The patch makes undocumented changes that cause autovacuum's decisions
to be driven by total estimated dead space rather than total number of
dead tuples. Do we like this?
If we do this, then it's not clear that having pgstats track dead space
is worth the trouble at all. It might possibly be of value for testing
purposes to see how well pruning is doing, but I'm unconvinced that it's
worth bloating stats messages and files to have this number in a
production system. An alternative that would serve as well for testing
would be to teach contrib/pgstattuple to measure dead space.
As a DBA, I can say it doesn't really matter to me *how we track* the
dead space, as long as tracking it is:
1. Clear
2. Simple
3. Available by default (thus pgstattuple needs to push into core)
Sincerely,
Joshua D. Drake
--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997 http://www.commandprompt.com/
UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
* The patch makes undocumented changes that cause autovacuum's decisions
to be driven by total estimated dead space rather than total number of
dead tuples. Do we like this?
No one seems to have picked up on this point, but after reflection
I think there's actually a pretty big problem here. Per-page pruning
is perfectly capable of keeping dead space in check. In a system with
HOT running well, the reasons to vacuum a table will be:
1. Remove dead index entries.
2. Remove LP_DEAD line pointers.
3. Truncate off no-longer-used end pages.
4. Transfer knowledge about free space into FSM.
Pruning cannot accomplish #1, #2, or #3, and without significant changes
in the FSM infrastructure it has no hope about #4 either.
I guess we already have mechanism to remove dead index entries
outside vacuum. So my take would be to handle vacuum based on dead
index entries separately. Maybe we can track the number of dead index
entries and trigger vacuum on the base relation if it goes beyond a
threshold.
For LP_DEAD line pointers, with some adjustments to the patch, we
can make it track dead space in a page by accounting for
the LP_DEAD pointers. So if there is bloat because of LP_DEAD
pointers, that will be reflected in the dead space and help us trigger
vacuum on the table.
I'm inclined to think that we should continue to drive autovac off a
count of dead rows
If we do that, I guess it would make sense to count only non-HOT
dead tuples because HOT tuples neither create LP_DEAD line
pointers nor cause index bloat.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
In a system with
HOT running well, the reasons to vacuum a table will be:
1. Remove dead index entries.
2. Remove LP_DEAD line pointers.
3. Truncate off no-longer-used end pages.
4. Transfer knowledge about free space into FSM.
Pruning cannot accomplish #1, #2, or #3, and without significant changes
in the FSM infrastructure it has no hope about #4 either.
I guess we already have mechanism to remove dead index entries
outside vacuum.
Not a trustworthy one --- unless you have a solid proposal for making it
work with bitmap indexscans, it would be foolish to design autovacuum
behavior on the assumption that dead index entries aren't a problem.
(Also, IIRC only btree has been taught to recover dead entries at all.)
regards, tom lane
On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
On 9/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
In a system with
HOT running well, the reasons to vacuum a table will be:
1. Remove dead index entries.
2. Remove LP_DEAD line pointers.
3. Truncate off no-longer-used end pages.
4. Transfer knowledge about free space into FSM.
Pruning cannot accomplish #1, #2, or #3, and without significant changes
in the FSM infrastructure it has no hope about #4 either.
I guess we already have mechanism to remove dead index entries
outside vacuum.
Not a trustworthy one --- unless you have a solid proposal for making it
work with bitmap indexscans, it would be foolish to design autovacuum
behavior on the assumption that dead index entries aren't a problem.
Hmm.. I think we need to drop this for now because I am sure any
such proposal would need a lot more discussion. Maybe something
we can pick up for 8.4.
So we go back to tracking dead tuples. I would still be inclined
towards tracking non-HOT dead tuples, or subtracting the count of
pruned HOT tuples, because we don't want to trigger autovacuum
too often; rather, let pruning clean as much dead space as possible.
What it also means is that the tuple storage reclaimed by pruning
non-HOT dead tuples does not impact the autovacuum behavior,
positively or negatively. And ISTM that this does not address #4?
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On Tue, Sep 18, 2007 at 11:32:52AM -0400, Tom Lane wrote:
Another option would be to prune whenever the free space goes
below table fillfactor and hope that users would set fillfactor so that
at least one updated tuple can fit in the block. I know it's not best to
rely on the users, though. But it can be a good hint.
If default fillfactor weren't 100% then this might be good ;-). But
Erik Jones and I were just talking about FILLFACTOR...
Is the plan to keep it at 100% with HOT? ISTM that's not such a great
idea, since it forces at least the first update (if not many more) to be
COLD.
I realize that ideally we'd probably want FILLFACTOR to take things like
average tuple size and average number of updates per page into account,
but for a first pass 90% would likely be a good compromise...
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828