[PATCH] Vacuum: Update FSM more frequently
As discussed in the vacuum ring buffer thread:
Make Vacuum update the FSM more frequently, to avoid the case where
autovacuum fails to reach the point where it updates the FSM in
highly contended tables.
Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
recursing into branches that already contain enough free space, to
avoid having to traverse the whole FSM and thus induce quadratic
costs. Intermediate FSM vacuums are only supposed to make enough
free space visible to avoid extension until the final (non-partial)
FSM vacuum.
Run partial FSM vacuums after each index pass, which is a point
at which whole ranges of the heap have been thoroughly cleaned, and
we can expect no further updates to those ranges of the FSM save
for concurrent activity. When there are no indexes, and thus no
index passes, run partial FSM vacuums every 8GB of dirtied pages
or 1/8th of the relation, whichever is higher. This allows some
partial work to be made visible without incurring quadratic cost.
In any case, FSMs are small in relation to the table, so even when
quadratic cost is involved, it should not be problematic. Index
passes already incur quadratic cost, and the added FSM work
is unlikely to be measurable.
Run a partial FSM vacuum with a low pruning threshold right at
the beginning of the VACUUM to finish up any work left over from
prior canceled vacuum runs, something that is common in highly
contended tables when running only autovacuum.
Autovacuum canceling is thus handled by updating the FSM first-thing
when autovacuum retries vacuuming a relation. I attempted to add
an autovacuum work item for performing the FSM update shortly
after the cancel, but that had some issues so I abandoned the approach.
For one, it would sometimes crash when adding the work item from
inside autovacuum itself. I didn't find the cause of the crash, but I suspect
AutoVacuumRequestWork was being called in a context where it
was not safe. In any case, the work-item mechanism didn't seem suited
to our purposes anyway: work items are only processed after processing
all tables in the database, which could take ages; they are limited to
256, which would make the approach troublesome for databases with more
than 256 tables that trigger this case; and de-duplicating work items
had quadratic cost without major refactoring of the feature.
So, patch attached. I'll add it to the next CF as well.
Attachments:
0001-Vacuum-Update-FSM-more-frequently.patch (text/x-patch)
A few general comments.
+ FreeSpaceMapVacuum(onerel, 64);
Why is '64' used here? It would be better to add a comment describing it.
+ else
+ {
+ newslot = fsm_get_avail(page, 0);
+ }
Since there is only one line in the else block, the braces are not needed.
Also, there is an extra space before the else that should be removed.
+ if (nindexes == 0 &&
+ (vacuumed_pages_at_fsm_vac - vacuumed_pages) >
vacuum_fsm_every_pages)
+ {
+ /* Vacuum the Free Space Map */
+ FreeSpaceMapVacuum(onerel, 0);
+ vacuumed_pages_at_fsm_vac = vacuumed_pages;
+ }
vacuumed_pages_at_fsm_vac is initialised to zero, so it seems this
branch can never be entered.
Regards,
Jing Wang
Fujitsu Australia
On Mon, Nov 27, 2017 at 2:39 PM, Jing Wang <jingwangian@gmail.com> wrote:
> A few general comments.
> [...]
The patch presented still applies, and there has been a review two
days ago. This is a short delay to answer for the author, so I am
moving this patch to next CF with the same status of waiting on
author.
--
Michael
Greetings Claudio,
* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Mon, Nov 27, 2017 at 2:39 PM, Jing Wang <jingwangian@gmail.com> wrote:
> > A few general comments.
> > [...]
>
> The patch presented still applies, and there has been a review two
> days ago. This is a short delay to answer for the author, so I am
> moving this patch to next CF with the same status of waiting on
> author.
Looks like this patch probably still applies and the changes suggested
above seem pretty small, any chance you can provide an updated patch
soon Claudio, so that it can be reviewed properly during this
commitfest?
Thanks!
Stephen
On Sat, Jan 6, 2018 at 7:33 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Greetings Claudio,
>
> * Michael Paquier (michael.paquier@gmail.com) wrote:
> > [...]
> > The patch presented still applies, and there has been a review two
> > days ago. This is a short delay to answer for the author, so I am
> > moving this patch to next CF with the same status of waiting on
> > author.
>
> Looks like this patch probably still applies and the changes suggested
> above seem pretty small, any chance you can provide an updated patch
> soon Claudio, so that it can be reviewed properly during this
> commitfest?
I failed to notice the comments among the list's noise, sorry.
I'll get on to it now.
On Mon, Nov 27, 2017 at 2:39 AM, Jing Wang <jingwangian@gmail.com> wrote:
> A few general comments.
>
> + FreeSpaceMapVacuum(onerel, 64);
>
> Just want to know why '64' is used here? It's better to give a
> description.
It's just a random cutoff point.
The point of that vacuum run is to mark free space early, to avoid needless
relation extension before long vacuum runs finish. This only happens if the
FSM is mostly zeroes; if there's free space in there, it will be used.
To make this run fast, since it's all redundant work before the final FSM
vacuum pass, branches with more than that amount of free space are skipped.
There's little point in vacuuming those early since they already contain
free space. This helps avoid the quadratic cost of vacuuming the FSM every
N dead tuples, since already-vacuumed branches will have free space, and
will thus be skipped. It could still be quadratic in the worst case, but it
avoids it often enough.
As for the value itself, it's an arbitrary choice. In-between index passes,
we have a rather precise cutoff we can use to avoid this redundant work.
But in the first run, we don't, so I made an arbitrary choice there that
felt right. I have no empirical performance data about alternative values.
> + else
> + {
> +     newslot = fsm_get_avail(page, 0);
> + }
>
> Since there is only one line in the else the bracket will not be needed.
> And there in one more space ahead of else which should be removed.
Ok
> + if (nindexes == 0 &&
> +     (vacuumed_pages_at_fsm_vac - vacuumed_pages) > vacuum_fsm_every_pages)
> + {
> +     /* Vacuum the Free Space Map */
> +     FreeSpaceMapVacuum(onerel, 0);
> +     vacuumed_pages_at_fsm_vac = vacuumed_pages;
> + }
>
> vacuumed_pages_at_fsm_vac is initialised with value zero and seems no
> chance to go into the bracket.
You're right, the difference there is backwards
Attached patch with the proposed changes.
Attachments:
0001-Vacuum-Update-FSM-more-frequently-v2.patch (text/x-patch)
On Sat, Jul 29, 2017 at 9:42 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
> recursing into branches that already contain enough free space, to
> avoid having to traverse the whole FSM and thus induce quadratic
> costs. Intermediate FSM vacuums are only supposed to make enough
> free space visible to avoid extension until the final (non-partial)
> FSM vacuum.
Hmm, I think this resolves a part of the issue. How about calling
AutoVacuumRequestWork() in PG_CATCH() if VACOPT_VACUUM is specified,
giving the relid that we were vacuuming but could not complete as a
new autovacuum work-item? The new autovacuum work-item makes the
worker vacuum the FSMs of the given relation and its indexes. That way, we
can ensure that the FSM gets vacuumed by the cancelled autovacuum process
or another autovacuum process. Since a work-item can be handled by
another autovacuum process, I think the 256 work-item limit would not be a
problem. Or it might be better to make the autovacuum launcher launch a
worker process if there is a pending work-item in a database, even if
no table needs to be vacuumed/analyzed.
> For one, it would sometimes crash when adding the work item from
> inside autovacuum itself. I didn't find the cause of the crash, but I suspect
> AutoVacuumRequestWork was being called in a context where it
> was not safe.
Perhaps the cause of this is that autovacuum work-items are
implemented using DSA, which was changed by commit
31ae1638ce35c23979f9bcbb92c6bb51744dbccb.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Jan 29, 2018 at 4:12 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, Jul 29, 2017 at 9:42 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > [commit message trimmed]
>
> Hmm, I think this resolve a part of the issue. How about calling
> AutoVacuumRequestWork() in PG_CATCH() if VACOPT_VACUUM is specified
> and give the relid that we were vacuuming but could not complete as a
> new autovacuum work-item? The new autovacuum work-item makes the
> worker vacuum FSMs of the given relation and its indices.
Well, I tried that in fact, as I mentioned in the OP.
I abandoned it due to the conjunction of the 2 main blockers I found
and mentioned there. In essence, those issues defeat the purpose of
the patch (to get the free space visible ASAP).
Don't forget, this is aimed at cases where autovacuum of a single
relation takes a very long time. That is, very big relations. Maybe
days, like in my case. A whole autovacuum cycle can take weeks, so
delaying FSM vacuum that much is not good, and using work items still
causes those delays, not to mention the segfaults.
> That way, we
> can ensure that FSM gets vacuumed by the cancelled autovacuum process
> or other autovacuum processes. Since a work-item can be handled by
> other autovacuum process I think 256 work-item limit would not be a
> problem.
Why do you think it wouldn't? In particular if you take into account
the above. If you have more than 256 relations in the cluster, it
could very well happen that you've queued the maximum amount and no
autovac worker has had a chance to take a look at them, because
they're all stuck vacuuming huge relations.
Not to mention the need to de-duplicate work items. We wouldn't want
to request repeated FSM vacuums, or worse, queue an FSM vacuum of a
single table 256 times and fill up the queue with redundant items.
With the current structure, de-duplication is O(N), so if we wanted to
raise the limit of 256 work items, we'd need a structure that would
let us de-duplicate in less than O(N). In essence, it's a ton of work
for very little gain. Hence why I abandoned it.
> Or it might be better to make autovacuum launcher launch
> worker process if there is pending work-item in a database even if
> there is no table needs to be vacuumed/analyzed.
Quite a must, in fact.
> > For one, it would sometimes crash when adding the work item from
> > inside autovacuum itself. I didn't find the cause of the crash, but I suspect
> > AutoVacuumRequestWork was being called in a context where it
> > was not safe.
>
> Perhaps the cause of this might be that autovacuum work-item is
> implemented using DSA, which has been changed before by commit
> 31ae1638ce35c23979f9bcbb92c6bb51744dbccb.
I thought about that. Perhaps. But the work-item approach didn't fail
just because of the segfaults. It's also that it doesn't address the
free space visibility issue quickly enough.
On Mon, Jan 29, 2018 at 11:31 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Jan 29, 2018 at 4:12 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > [...]
>
> Well, I tried that in fact, as I mentioned in the OP.
> [...]
>
> Don't forget, this is aimed at cases where autovacuum of a single
> relation takes a very long time. That is, very big relations. Maybe
> days, like in my case. A whole autovacuum cycle can take weeks, so
> delaying FSM vacuum that much is not good, and using work items still
> cause those delays, not to mention the segfaults.
Yeah, I agree with vacuuming the FSM more frequently, because it can prevent
table bloat from concurrent modifications. But considering how to
prevent table bloat caused by cancellation of autovacuum, I guess we
need more. This proposal gives us an ability that "might"
prevent table bloat due to cancellation of autovacuum. Since we can
know that autovacuum was cancelled, I'd like to have a way to make
sure the FSM gets vacuumed even if vacuum is cancelled. Thoughts?
Also, the patch always vacuums the FSM at the beginning of the vacuum with a
threshold, but that is useless if the table has already been properly vacuumed.
I don't think it's a good idea to add an additional step that only "might be"
efficient, because vacuum is already heavy.
> > That way, we can ensure that FSM gets vacuumed by the cancelled
> > autovacuum process or other autovacuum processes. Since a work-item
> > can be handled by other autovacuum process I think 256 work-item
> > limit would not be a problem.
>
> Why do you think it wouldn't? In particular if you take into account
> the above. If you have more than 256 relations in the cluster, it
> could very well happen that you've queued the maximum amount and no
> autovac worker has had a chance to take a look at them, because
> they're all stuck vacuuming huge relations.
>
> Not to mention the need to de-duplicate work items. We wouldn't want
> to request repeated FSM vacuums, or worst, queue an FSM vacuum of a
> single table 256 times and fill up the queue with redundant items.
> With the current structure, de-duplication is O(N), so if we wanted to
> raise the limit of 256 work items, we'd need a structure that would
> let us de-duplicate in less than O(N). In essence, it's a ton of work
> for very little gain. Hence why I abandoned it.
I'd missed something. Agreed with you.
On the details of the patch:
--- a/src/backend/storage/freespace/indexfsm.c
+++ b/src/backend/storage/freespace/indexfsm.c
@@ -70,5 +70,5 @@ RecordUsedIndexPage(Relation rel, BlockNumber usedBlock)
void
IndexFreeSpaceMapVacuum(Relation rel)
{
- FreeSpaceMapVacuum(rel);
+ FreeSpaceMapVacuum(rel, 0);
}
@@ -816,11 +820,19 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
bool *eof_p)
{
int child_avail;
+ /* Tree pruning for partial vacuums */
+ if (threshold)
+ {
+ child_avail = fsm_get_avail(page, slot);
+ if (child_avail >= threshold)
+ continue;
+ }
Don't we skip all fsm pages if we set the threshold to 0?
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Feb 1, 2018 at 9:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [...]
>
> Yeah, I agree to vacuum fsm more frequently because it can prevent
> table bloating from concurrent modifying. But considering the way to
> prevent the table bloating from cancellation of autovacuum, I guess we
> need more things. This proposal seems to provide us an ability that is
> we "might be" able to prevent table bloating due to cancellation of
> autovacuum. Since we can know that autovacuum is cancelled, I'd like
> to have a way so that we can surely vacuum fsm even if vacuum is
> cancelled. Thoughts?
After autovacuum gets cancelled, the next time it wakes up it will
retry vacuuming the cancelled relation. That's because a cancelled
autovacuum doesn't update the last-vacuumed stats.
So the timing between an autovacuum work item and the next retry for
that relation is more or less an autovacuum nap time, except perhaps
in the case where many vacuums get cancelled, and they have to be
queued.
That's why an initial FSM vacuum makes sense. It has a similar timing
to the autovacuum work item, it has the benefit that it can be
triggered manually (manual vacuum), and it's cheap and efficient.
> Also the patch always vacuums fsm at beginning of the vacuum with a
> threshold but it is useless if the table has been properly vacuumed. I
> don't think it's good idea to add an additional step that "might be"
> efficient, because current vacuum is already heavy.
FSMs are several orders of magnitude smaller than the tables
themselves. A TB-sized table I have here has a 95M FSM. If you add
threshold skipping, that initial FSM vacuum *will* be efficient.
By definition, the FSM will be less than 1/4000th of the table, so
that initial FSM pass takes less than 1/4000th of the whole vacuum.
Considerably less considering the simplicity of the task.
> On detail of the patch,
>
> --- a/src/backend/storage/freespace/indexfsm.c
> +++ b/src/backend/storage/freespace/indexfsm.c
> @@ -70,5 +70,5 @@ RecordUsedIndexPage(Relation rel, BlockNumber usedBlock)
>  void
>  IndexFreeSpaceMapVacuum(Relation rel)
>  {
> -	FreeSpaceMapVacuum(rel);
> +	FreeSpaceMapVacuum(rel, 0);
>  }
>
> @@ -816,11 +820,19 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
>  {
>  	int		child_avail;
>
> +	/* Tree pruning for partial vacuums */
> +	if (threshold)
> +	{
> +		child_avail = fsm_get_avail(page, slot);
> +		if (child_avail >= threshold)
> +			continue;
> +	}
>
> Don't we skip all fsm pages if we set the threshold to 0?
No, that doesn't skip anything if threshold is 0, since it doesn't get
to the continue, which is the one skipping nodes.
On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> [...]
>
> After autovacuum gets cancelled, the next time it wakes up it will
> retry vacuuming the cancelled relation. That's because a cancelled
> autovacuum doesn't update the last-vacuumed stats.
>
> So the timing between an autovacuum work item and the next retry for
> that relation is more or less an autovacuum nap time, except perhaps
> in the case where many vacuums get cancelled, and they have to be
> queued.
I think that's not true if there are multiple databases.
> That's why an initial FSM vacuum makes sense. It has a similar timing
> to the autovacuum work item, it has the benefit that it can be
> triggered manually (manual vacuum), and it's cheap and efficient.
> [...]
>
> By definition, the FSM will be less than 1/4000th of the table, so
> that initial FSM pass takes less than 1/4000th of the whole vacuum.
> Considerably less considering the simplicity of the task.
I agree the FSM is much smaller than the heap and vacuuming the FSM will not
be comparatively heavy, but I'm afraid that vacuum will get heavier
in the future if we pile up such improvements that are small but
might not be efficient. For example, a feature for reporting the last
vacuum status has been proposed[1]. I wonder if we can use it to
determine whether we do the FSM vacuum at the beginning of vacuum.
> > On detail of the patch,
> > [...]
> > Don't we skip all fsm pages if we set the threshold to 0?
>
> No, that doesn't skip anything if threshold is 0, since it doesn't get
> to the continue, which is the one skipping nodes.
Oops, yes you're right. Sorry for the noise.
[1]: /messages/by-id/20171010.192616.108347483.horiguchi.kyotaro@lab.ntt.co.jp
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Feb 5, 2018 at 1:53 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > So the timing between an autovacuum work item and the next retry for
> > that relation is more or less an autovacuum nap time, except perhaps
> > in the case where many vacuums get cancelled, and they have to be
> > queued.
>
> I think that's not true if there are multiple databases.
I'd have to re-check.
The loop is basically, IIRC:
while(1) { vacuum db ; work items ; nap }
Now, if that's vacuum one db, not all, and if the decision on the
second run doesn't pick the same db because that big table failed to
be vacuumed, then you're right.
In that case we could add the FSM vacuum as a work item *in addition*
to what this patch does. If the work item queue is full and the FSM
vacuum doesn't happen, it'd be no worse than with the patch as-is.
Is that what you suggest?
With that in mind, I'm noticing WorkItems have an avw_database that
isn't checked by do_autovacuum. Is that right? Shouldn't only work
items that belong to the database being autovacuumed be processed?
> [...]
>
> I agree the fsm is very smaller than heap and vacuum on fsm will not
> be comparatively heavy but I'm afraid that the vacuum will get more
> heavy in the future if we pile up such improvement that are small but
> might not be efficient. For example, a feature for reporting the last
> vacuum status has been proposed[1]. I wonder if we can use it to
> determine whether we do the fsm vacuum at beginning of vacuum.
Yes, such a feature would allow skipping that initial FSM vacuum. That
can be improved in a future patch if that proposed feature gets
merged. This patch can be treated independently from that in the
meanwhile, don't you think?
On Mon, Feb 5, 2018 at 2:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> With that in mind, I'm noticing WorkItems have a avw_database that
> isn't checked by do_autovacuum. Is that right? Shouldn't only work
> items that belong to the database being autovacuumed be processed?
NVM. I had to git pull, it's fixed in head.
On Tue, Feb 6, 2018 at 2:55 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> [...]
>
> In that case we could add the FSM vacuum as a work item *in addition*
> to what this patch does. If the work item queue is full and the FSM
> vacuum doesn't happen, it'd be no worse than with the patch as-is.
>
> Is that what you suggest?
That's one of my suggestions. I might have to consider this issue
for each case. To be clear, let me summarize each case.
For a table with indexes, vacuuming the heap's FSM is done after
lazy_vacuum_heap(). Therefore I think the case where a table got
vacuumed but its FSM couldn't get vacuumed doesn't happen unless
autovacuum gets cancelled before or while vacuuming the FSM. So it can
solve the problem in most cases. Also, we can use the vacuum progress
information to check whether we should vacuum the table's FSM after
getting cancelled. For index FSMs, we might have to consider vacuuming
them after lazy_vacuum_index().
For a table with no indexes, it would be more complicated; similar to
tables with indexes, we can vacuum the table's FSM more frequently, but
table bloat can still happen. To make sure the table's FSM gets
vacuumed, there are some approaches:
(1) using an autovacuum work-item, (2) vacuuming the table's FSM in
PG_CATCH, (3) remembering the tables that got cancelled and vacuuming
them after finishing the loop over table_oids.
For (1), the issue is that the work-item queue will be full when
many tables get cancelled, and it's not a good idea to queue many
redundant work-items. For (2), vacuuming the table's FSM immediately
after cancellation would not be good, because at that point the table
is still likely to be locked by others. For (3), that might work fine,
but it can happen that another autovacuum worker vacuums the table's
FSM before the list is processed. Still, it would be better than
always vacuuming them at the beginning of vacuum, so I'm in favor of
(3). Even when processing tables in the list, we should take the lock
on the table conditionally so that autovacuum doesn't block any
foreground work. However, it's quite possible that I'm not seeing
the whole picture here.
With that in mind, I'm noticing WorkItems have an avw_database that
isn't checked by do_autovacuum. Is that right? Shouldn't only work
items that belong to the database being autovacuumed be processed?

That's why an initial FSM vacuum makes sense. It has a similar timing
to the autovacuum work item, it has the benefit that it can be
triggered manually (manual vacuum), and it's cheap and efficient.

Also, the patch always vacuums the FSM at the beginning of the
vacuum with a threshold, but that is useless if the table has
already been properly vacuumed. I don't think it's a good idea to
add an additional step that "might be" efficient, because vacuum is
already heavy.

FSMs are several orders of magnitude smaller than the tables
themselves. A TB-sized table I have here has a 95M FSM. If you add
threshold skipping, that initial FSM vacuum *will* be efficient.

By definition, the FSM will be less than 1/4000th of the table, so
that initial FSM pass takes less than 1/4000th of the whole vacuum.
Considerably less, considering the simplicity of the task.

I agree the FSM is much smaller than the heap and vacuuming it will
not be comparatively heavy, but I'm afraid that vacuum will get
heavier over time if we pile up improvements that are small but
might not be efficient. For example, a feature for reporting the
last vacuum status has been proposed[1]. I wonder if we can use it
to determine whether to do the FSM vacuum at the beginning of
vacuum.

Yes, such a feature would allow skipping that initial FSM vacuum. That
can be improved in a future patch if that proposed feature gets
merged. This patch can be treated independently from that in the
meanwhile, don't you think?
IMO, I'd like to have this after we have such an optimization.
Otherwise it adds an extra step to every vacuum until we get the
optimization. But, as you said, we can also consider it not a
problem because the FSM vacuum at the beginning of the vacuum has
the threshold. I'd like to hear more opinions about this.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Feb 6, 2018 at 4:56 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 6, 2018 at 2:55 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Mon, Feb 5, 2018 at 1:53 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
After autovacuum gets cancelled, the next time it wakes up it will
retry vacuuming the cancelled relation. That's because a cancelled
autovacuum doesn't update the last-vacuumed stats.

So the timing between an autovacuum work item and the next retry for
that relation is more or less an autovacuum nap time, except perhaps
in the case where many vacuums get cancelled, and they have to be
queued.

I think that's not true if there are multiple databases.
I'd have to re-check.
The loop is basically, IIRC:
while(1) { vacuum db ; work items ; nap }
Now, if that's vacuum one db, not all, and if the decision on the
second run doesn't pick the same db because that big table failed to
be vacuumed, then you're right.

In that case we could add the FSM vacuum as a work item *in addition*
to what this patch does. If the work item queue is full and the FSM
vacuum doesn't happen, it'd be no worse than with the patch as-is.

Is that what you suggest?
That's one of my suggestions. I might have to consider this issue
for each case. To be clear, let me summarize each case.

For a table with indexes, the heap's FSM is vacuumed after
lazy_vacuum_heap(). Therefore I think the case where a table got
vacuumed but its FSM didn't can only happen if autovacuum gets
cancelled before or while vacuuming the FSM.
Well, that's the whole point. Autovacuums get cancelled all the time
in highly contended tables. I have more than a handful of tables in
production that never finish autovacuum, so I have to trigger manual
vacuums periodically.
(1) using an autovacuum work item, (2) vacuuming the table's FSM in
PG_CATCH, (3) remembering which tables got cancelled and vacuuming
their FSMs after finishing the loop over table_oids.
...
So I'm in favor of (3).
...
However, it's quite possible that I'm not seeing the whole picture here.
Well, none of that handles the case where the autovacuum of a table
doesn't get cancelled, but takes a very long time.
No free space becomes visible during long-running vacuums. That means
bloat keeps accumulating even though vacuum is freeing space, because
the FSM doesn't expose that free space.
The extra work incurred in those FSM vacuums isn't useless; it's
preventing bloat in long-running vacuums.
I can look into doing 3, that *might* get rid of the need to do that
initial FSM vacuum, but all other intermediate vacuums are still
needed.
On Tue, Feb 6, 2018 at 7:51 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
No free space becomes visible during long-running vacuums. That means
bloat keeps accumulating even though vacuum is freeing space, because
the FSM doesn't expose that free space.

The extra work incurred in those FSM vacuums isn't useless; it's
preventing bloat in long-running vacuums.
+1.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Tue, Feb 6, 2018 at 4:56 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 6, 2018 at 2:55 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Mon, Feb 5, 2018 at 1:53 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
After autovacuum gets cancelled, the next time it wakes up it will
retry vacuuming the cancelled relation. That's because a cancelled
autovacuum doesn't update the last-vacuumed stats.

So the timing between an autovacuum work item and the next retry for
that relation is more or less an autovacuum nap time, except perhaps
in the case where many vacuums get cancelled, and they have to be
queued.

I think that's not true if there are multiple databases.
I'd have to re-check.
The loop is basically, IIRC:
while(1) { vacuum db ; work items ; nap }
Now, if that's vacuum one db, not all, and if the decision on the
second run doesn't pick the same db because that big table failed to
be vacuumed, then you're right.

In that case we could add the FSM vacuum as a work item *in addition*
to what this patch does. If the work item queue is full and the FSM
vacuum doesn't happen, it'd be no worse than with the patch as-is.

Is that what you suggest?
That's one of my suggestions. I might have to consider this issue
for each case. To be clear, let me summarize each case.

For a table with indexes, the heap's FSM is vacuumed after
lazy_vacuum_heap(). Therefore I think the case where a table got
vacuumed but its FSM didn't can only happen if autovacuum gets
cancelled before or while vacuuming the FSM.

Well, that's the whole point. Autovacuums get cancelled all the time
in highly contended tables. I have more than a handful of tables in
production that never finish autovacuum, so I have to trigger manual
vacuums periodically.

(1) using an autovacuum work item, (2) vacuuming the table's FSM in
PG_CATCH, (3) remembering which tables got cancelled and vacuuming
their FSMs after finishing the loop over table_oids.
...
So I'm in favor of (3).
...
However, it's quite possible that I'm not seeing the whole picture here.
Well, none of that handles the case where the autovacuum of a table
doesn't get cancelled, but takes a very long time.

No free space becomes visible during long-running vacuums. That means
bloat keeps accumulating even though vacuum is freeing space, because
the FSM doesn't expose that free space.

The extra work incurred in those FSM vacuums isn't useless; it's
preventing bloat in long-running vacuums.

I can look into doing 3, that *might* get rid of the need to do that
initial FSM vacuum, but all other intermediate vacuums are still
needed.
Understood. So how about this patch focusing only on making FSM
vacuuming more frequent, and leaving out the initial FSM vacuum and
the handling of cancellation cases? The rest can be a separate patch.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
I can look into doing 3, that *might* get rid of the need to do that
initial FSM vacuum, but all other intermediate vacuums are still
needed.

Understood. So how about this patch focusing only on making FSM
vacuuming more frequent, and leaving out the initial FSM vacuum and
the handling of cancellation cases? The rest can be a separate patch.
Ok.
Attached split patches. All but the initial FSM pass is in 1, the
initial FSM pass as in the original patch is in 2.
I will post an alternative patch for 2 when/if I have one. In the
meanwhile, we can talk about 1.
Attachments:
0002-Vacuum-do-a-partial-FSM-vacuum-at-the-beginning.patch (text/x-patch, +24 -1)
0001-Vacuum-Update-FSM-more-frequently-v3.patch (text/x-patch, +78 -18)
On Fri, Feb 9, 2018 at 12:45 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
I can look into doing 3, that *might* get rid of the need to do that
initial FSM vacuum, but all other intermediate vacuums are still
needed.

Understood. So how about this patch focusing only on making FSM
vacuuming more frequent, and leaving out the initial FSM vacuum and
the handling of cancellation cases? The rest can be a separate patch.

Ok.
Attached split patches. All but the initial FSM pass is in 1, the
initial FSM pass as in the original patch is in 2.

I will post an alternative patch for 2 when/if I have one. In the
meanwhile, we can talk about 1.
Thank you for updating the patch!
+ /* Tree pruning for partial vacuums */
+ if (threshold)
+ {
+ child_avail = fsm_get_avail(page, slot);
+ if (child_avail >= threshold)
+ continue;
+ }
I think slots in a non-leaf page might not have a correct value
because fsm_set_avail doesn't propagate up beyond the page. So do we
need to check the root of the child page rather than the slot in the
non-leaf page? I might be missing something though.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Feb 9, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Feb 9, 2018 at 12:45 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
I can look into doing 3, that *might* get rid of the need to do that
initial FSM vacuum, but all other intermediate vacuums are still
needed.

Understood. So how about this patch focusing only on making FSM
vacuuming more frequent, and leaving out the initial FSM vacuum and
the handling of cancellation cases? The rest can be a separate patch.

Ok.
Attached split patches. All but the initial FSM pass is in 1, the
initial FSM pass as in the original patch is in 2.

I will post an alternative patch for 2 when/if I have one. In the
meanwhile, we can talk about 1.

Thank you for updating the patch!
+ /* Tree pruning for partial vacuums */
+ if (threshold)
+ {
+ child_avail = fsm_get_avail(page, slot);
+ if (child_avail >= threshold)
+ continue;
+ }

I think slots in a non-leaf page might not have a correct value
because fsm_set_avail doesn't propagate up beyond the page. So do we
need to check the root of the child page rather than the slot in the
non-leaf page? I might be missing something though.
I'm not sure I understand what you're asking.
Yes, a partial FSM vacuum won't correct those inconsistencies. It
doesn't matter though, because as long as the root node (or whichever
inner node we're examining at the time) reports free space, the lookup
for free space will recurse and find the lower nodes, even if they
report more space than the inner node reported. The goal of making
free space discoverable will have been achieved nonetheless.
The final FSM vacuum pass isn't partial, to finally correct all those
small inconsistencies.
On Fri, Feb 9, 2018 at 11:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Fri, Feb 9, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Feb 9, 2018 at 12:45 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
I can look into doing 3, that *might* get rid of the need to do that
initial FSM vacuum, but all other intermediate vacuums are still
needed.

Understood. So how about this patch focusing only on making FSM
vacuuming more frequent, and leaving out the initial FSM vacuum and
the handling of cancellation cases? The rest can be a separate patch.

Ok.
Attached split patches. All but the initial FSM pass is in 1, the
initial FSM pass as in the original patch is in 2.

I will post an alternative patch for 2 when/if I have one. In the
meanwhile, we can talk about 1.

Thank you for updating the patch!
+ /* Tree pruning for partial vacuums */
+ if (threshold)
+ {
+ child_avail = fsm_get_avail(page, slot);
+ if (child_avail >= threshold)
+ continue;
+ }

I think slots in a non-leaf page might not have a correct value
because fsm_set_avail doesn't propagate up beyond the page. So do we
need to check the root of the child page rather than the slot in the
non-leaf page? I might be missing something though.

I'm not sure I understand what you're asking.
Yes, a partial FSM vacuum won't correct those inconsistencies. It
doesn't matter though, because as long as the root node (or whichever
inner node we're examining at the time) reports free space, the lookup
for free space will recurse and find the lower nodes, even if they
report more space than the inner node reported. The goal of making
free space discoverable will have been achieved nonetheless.
For example, if a slot in an internal node of the FSM tree has a
value greater than the threshold while the root of the child leaf
page has a value less than the threshold, we would want to update
the internal node, but with your patch we will skip it. I wonder if
this can happen.
The final FSM vacuum pass isn't partial, to finally correct all those
small inconsistencies.
Yes, but isn't the purpose of this patch to prevent table bloat
from concurrent modifications?
Here are some other review comments.
+ /* Tree pruning for partial vacuums */
+ if (threshold)
+ {
+ child_avail = fsm_get_avail(page, slot);
+ if (child_avail >= threshold)
+ continue;
+ }
'child_avail' is a category value, while 'threshold' is an actual
size of free space, so the two can't be compared directly.
The logic for finding max_freespace doesn't seem to work correctly
for tables with indexes, because max_freespace is not updated based
on the post-compaction free space. I think we need to get the
maximum free space size while vacuuming the heap (i.e. in
lazy_vacuum_heap) and use that as the threshold.
+ vacuum_fsm_every_pages = nblocks / 8;
+ vacuum_fsm_every_pages = Max(vacuum_fsm_every_pages, 1048576);
I think a comment for 1048576 is needed. And perhaps we can define it
as a macro.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center