Logging idle checkpoints
I recently had a sad because I noticed that checkpoint counts were
increasing in pg_stat_bgwriter, but weren't accounted for in my logs
with log_checkpoints enabled.
After some searching, I found that it was the idle checkpoints that
weren't being logged. I think this is a missed trick in 6ef2eba3f57.
Attached is a one-liner fix. I realize how imminent we are from
releasing v10 but I hope there is still time for such a minor issue as this.
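For reference, the counters in question can be read with a query along these lines (standard pg_stat_bgwriter columns only; just a sketch, nothing from the patch):

-- Total checkpoints counted by the stats collector since the last reset.
-- Compare this against the checkpoint lines written with log_checkpoints = on;
-- idle (skipped) checkpoints show up in these counters but, before this fix,
-- never in the log.
SELECT checkpoints_timed,
       checkpoints_req,
       checkpoints_timed + checkpoints_req AS total_checkpoints,
       stats_reset
FROM pg_stat_bgwriter;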
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

Hi,
On 2017-10-02 00:19:33 +0200, Vik Fearing wrote:
I recently had a sad because I noticed that checkpoint counts were
increasing in pg_stat_bgwriter, but weren't accounted for in my logs
with log_checkpoints enabled.
After some searching, I found that it was the idle checkpoints that
weren't being logged. I think this is a missed trick in 6ef2eba3f57.
Attached is a one-liner fix. I realize how imminent we are from
releasing v10 but I hope there is still time for such a minor issue as this.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dd028a12a4..75f6bd4cc1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8724,7 +8724,7 @@ CreateCheckPoint(int flags)
 		WALInsertLockRelease();
 		LWLockRelease(CheckpointLock);
 		END_CRIT_SECTION();
-		ereport(DEBUG1,
+		ereport(log_checkpoints ? LOG : DEBUG1,
 				(errmsg("checkpoint skipped because system is idle")));
 		return;
 	}
I'd be ok with applying this now, or in 10.1 - but I do think we should
fix this before 11. If nobody protests I'll push later today, so we can
get some bf cycles for the very remote case that this causes problems.
Greetings,
Andres Freund
On Mon, Oct 2, 2017 at 7:27 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-10-02 00:19:33 +0200, Vik Fearing wrote:
I'd be ok with applying this now, or in 10.1 - but I do think we should
fix this before 11. If nobody protests I'll push later today, so we can
get some bf cycles for the very remote case that this causes problems.
This point has been discussed during review and removed from the patch
(adding Stephen in the loop here):
/messages/by-id/CAOuzzgq8pHneMHy6JiNiG6Xm5V=cm+K2wCd2W-SCtpJDg7Xn3g@mail.gmail.com
Actually, shouldn't we make BgWriterStats a bit smarter? We could add
a counter for skipped checkpoints in v11 (too late for v10).
--
Michael
On 2017-10-02 07:39:18 +0900, Michael Paquier wrote:
On Mon, Oct 2, 2017 at 7:27 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-10-02 00:19:33 +0200, Vik Fearing wrote:
I'd be ok with applying this now, or in 10.1 - but I do think we should
fix this before 11. If nobody protests I'll push later today, so we can
get some bf cycles for the very remote case that this causes problems.
This point has been discussed during review and removed from the patch
(adding Stephen in the loop here):
/messages/by-id/CAOuzzgq8pHneMHy6JiNiG6Xm5V=cm+K2wCd2W-SCtpJDg7Xn3g@mail.gmail.com
I find that reasoning unconvincing. log_checkpoints is enabled after
all. And we're not talking about 10 log messages a second. There are
plenty of systems that analyze the logs that'd possibly be affected by
this.
Actually, shouldn't we make BgWriterStats a bit smarter? We could add
a counter for skipped checkpoints in v11 (too late for v10).
Wouldn't hurt, but seems orthogonal.
Greetings,
Andres Freund
On Mon, Oct 2, 2017 at 7:41 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-10-02 07:39:18 +0900, Michael Paquier wrote:
On Mon, Oct 2, 2017 at 7:27 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-10-02 00:19:33 +0200, Vik Fearing wrote:
I'd be ok with applying this now, or in 10.1 - but I do think we should
fix this before 11. If nobody protests I'll push later today, so we can
get some bf cycles for the very remote case that this causes problems.
This point has been discussed during review and removed from the patch
(adding Stephen in the loop here):
/messages/by-id/CAOuzzgq8pHneMHy6JiNiG6Xm5V=cm+K2wCd2W-SCtpJDg7Xn3g@mail.gmail.com
I find that reasoning unconvincing. log_checkpoints is enabled after
all. And we're not talking about 10 log messages a second. There's
plenty systems that analyze the logs that'd possibly be affected by
this.
No real objections from here, actually.
Actually, shouldn't we make BgWriterStats a bit smarter? We could add
a counter for skipped checkpoints in v11 (too late for v10).
Wouldn't hurt, but seems orthogonal.
Sure.
--
Michael
On 2017-10-02 07:43:31 +0900, Michael Paquier wrote:
On Mon, Oct 2, 2017 at 7:41 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-10-02 07:39:18 +0900, Michael Paquier wrote:
On Mon, Oct 2, 2017 at 7:27 AM, Andres Freund <andres@anarazel.de> wrote:
On 2017-10-02 00:19:33 +0200, Vik Fearing wrote:
I'd be ok with applying this now, or in 10.1 - but I do think we should
fix this before 11. If nobody protests I'll push later today, so we can
get some bf cycles for the very remote case that this causes problems.
This point has been discussed during review and removed from the patch
(adding Stephen in the loop here):
/messages/by-id/CAOuzzgq8pHneMHy6JiNiG6Xm5V=cm+K2wCd2W-SCtpJDg7Xn3g@mail.gmail.com
I find that reasoning unconvincing. log_checkpoints is enabled after
all. And we're not talking about 10 log messages a second. There's
plenty systems that analyze the logs that'd possibly be affected by
this.
No real objections from here, actually.
Vik, because there were some, even though mild, objections, I'd rather
not push this right now. Stephen deserves a chance to reply. So this'll
have to wait for 10.1, sorry :(
- Andres
Vik, all,
* Vik Fearing (vik.fearing@2ndquadrant.com) wrote:
I recently had a sad because I noticed that checkpoint counts were
increasing in pg_stat_bgwriter, but weren't accounted for in my logs
with log_checkpoints enabled.
After some searching, I found that it was the idle checkpoints that
weren't being logged. I think this is a missed trick in 6ef2eba3f57.
Attached is a one-liner fix. I realize how imminent we are from
releasing v10 but I hope there is still time for such a minor issue as this.
Idle checkpoints aren't, well, really checkpoints though. If anything,
seems like we shouldn't be including skipped checkpoints in the
pg_stat_bgwriter count because we aren't actually doing a checkpoint.
I certainly don't care for the idea of adding log messages saying we
aren't doing anything just to match a count that's incorrectly claiming
that checkpoints are happening when they aren't.
The down-thread suggestion of keeping track of skipped checkpoints might
be interesting, but I'm not entirely convinced it really is. We have
time to debate that, of course, but I don't really see how that's
helpful. At the moment, it seems like the suggestion to add that column
is based on the assumption that we're going to start logging skipped
checkpoints and having that column would allow us to match up the count
between the new column and the "skipped checkpoint" messages in the logs
and I can not help but feel that this is a ridiculous amount of effort
being put into the analysis of something that *didn't* happen.
Thanks!
Stephen
On Tue, Oct 3, 2017 at 12:01 AM, Stephen Frost <sfrost@snowman.net> wrote:
I certainly don't care for the idea of adding log messages saying we
aren't doing anything just to match a count that's incorrectly claiming
that checkpoints are happening when they aren't.
The down-thread suggestion of keeping track of skipped checkpoints might
be interesting, but I'm not entirely convinced it really is. We have
time to debate that, of course, but I don't really see how that's
helpful. At the moment, it seems like the suggestion to add that column
is based on the assumption that we're going to start logging skipped
checkpoints and having that column would allow us to match up the count
between the new column and the "skipped checkpoint" messages in the logs
and I can not help but feel that this is a ridiculous amount of effort
being put into the analysis of something that *didn't* happen.
Being able to look at how many checkpoints are skipped can be used as
a tuning indicator of max_wal_size and checkpoint_timeout, or in short
increase them if those remain idle. Since their introduction in
335feca4, m_timed_checkpoints and m_requested_checkpoints track the
number of checkpoint requests, not if a checkpoint has been actually
executed or not, I am not sure that this should be changed after 10
years. So, to put it in other words, wouldn't we want a way to track
checkpoints that are *executed*, meaning that we could increment a
counter after doing the skip checks in CreateRestartPoint() and
CreateCheckPoint()?
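For reference, the ratio between the two existing request counters is already a common tuning check; a rough sketch using only the standard pg_stat_bgwriter columns:

-- If most checkpoints are requested (WAL-volume driven) rather than timed,
-- max_wal_size is probably too small for the write load; if nearly all are
-- timed and the system is mostly idle, many of them may have been skipped.
SELECT checkpoints_timed,
       checkpoints_req,
       round(100.0 * checkpoints_req
             / nullif(checkpoints_timed + checkpoints_req, 0), 1)
           AS pct_requested
FROM pg_stat_bgwriter;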
--
Michael
At Tue, 3 Oct 2017 10:23:08 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQ3Q1J_wBC7yPXk39dO0RGvbM4-nYp2gMrCJ7pfPJXcYw@mail.gmail.com>
On Tue, Oct 3, 2017 at 12:01 AM, Stephen Frost <sfrost@snowman.net> wrote:
I certainly don't care for the idea of adding log messages saying we
aren't doing anything just to match a count that's incorrectly claiming
that checkpoints are happening when they aren't.
The down-thread suggestion of keeping track of skipped checkpoints might
be interesting, but I'm not entirely convinced it really is. We have
time to debate that, of course, but I don't really see how that's
helpful. At the moment, it seems like the suggestion to add that column
is based on the assumption that we're going to start logging skipped
checkpoints and having that column would allow us to match up the count
between the new column and the "skipped checkpoint" messages in the logs
and I can not help but feel that this is a ridiculous amount of effort
being put into the analysis of something that *didn't* happen.
Being able to look at how many checkpoints are skipped can be used as
a tuning indicator of max_wal_size and checkpoint_timeout, or in short
increase them if those remain idle.
We usually adjust the GUCs based on how often checkpoints are
*executed* and how many of the executed checkpoints have been
triggered by xlog progress (or with a shorter interval than the
timeout). That seems enough. Counting skipped checkpoints gives
just a rough estimate of how long the system was getting no
substantial updates. I doubt that users get something valuable by
counting skipped checkpoints.
Since their introduction in
335feca4, m_timed_checkpoints and m_requested_checkpoints track the
number of checkpoint requests, not if a checkpoint has been actually
executed or not, I am not sure that this should be changed after 10
years. So, to put it in other words, wouldn't we want a way to track
checkpoints that are *executed*, meaning that we could increment a
counter after doing the skip checks in CreateRestartPoint() and
CreateCheckPoint()?
This sounds reasonable to me.
CreateRestartPoint() already returns ckpt_performed, which is
used to let the checkpointer retry in 15 seconds rather than waiting for
the next checkpoint_timeout. Checkpoints might deserve the same
treatment on skipping.
By the way, CreateRestartPoint emits DEBUG2 messages on skipping.
Although restartpoints have different characteristics from
checkpoints, if we change the message level for CreateCheckPoint
(currently DEBUG1), CreateRestartPoint should probably get the same
change. (Otherwise, at least they ought to have the same message
level?)
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Greetings,
* Kyotaro HORIGUCHI (horiguchi.kyotaro@lab.ntt.co.jp) wrote:
At Tue, 3 Oct 2017 10:23:08 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQ3Q1J_wBC7yPXk39dO0RGvbM4-nYp2gMrCJ7pfPJXcYw@mail.gmail.com>
On Tue, Oct 3, 2017 at 12:01 AM, Stephen Frost <sfrost@snowman.net> wrote:
I certainly don't care for the idea of adding log messages saying we
aren't doing anything just to match a count that's incorrectly claiming
that checkpoints are happening when they aren't.
The down-thread suggestion of keeping track of skipped checkpoints might
be interesting, but I'm not entirely convinced it really is. We have
time to debate that, of course, but I don't really see how that's
helpful. At the moment, it seems like the suggestion to add that column
is based on the assumption that we're going to start logging skipped
checkpoints and having that column would allow us to match up the count
between the new column and the "skipped checkpoint" messages in the logs
and I can not help but feel that this is a ridiculous amount of effort
being put into the analysis of something that *didn't* happen.
Being able to look at how many checkpoints are skipped can be used as
a tuning indicator of max_wal_size and checkpoint_timeout, or in short
increase them if those remain idle.
We usually adjust the GUCs based on how often checkpoint is
*executed* and how many of the executed checkpoints have been
triggered by xlog progress (or with shorter interval than
timeout). It seems enough. Counting skipped checkpoints gives
just a rough estimate of how long the system was getting no
substantial updates. I doubt that users get something valuable by
counting skipped checkpoints.
Yeah, I tend to agree. I don't really see how counting skipped
checkpoints helps to size max_wal_size or even checkpoint_timeout. The
whole point here is that nothing is happening and if nothing is
happening then there's no real need to adjust max_wal_size or
checkpoint_timeout or, well, much of anything really..
Since their introduction in
335feca4, m_timed_checkpoints and m_requested_checkpoints track the
number of checkpoint requests, not if a checkpoint has been actually
executed or not, I am not sure that this should be changed after 10
years. So, to put it in other words, wouldn't we want a way to track
checkpoints that are *executed*, meaning that we could increment a
counter after doing the skip checks in CreateRestartPoint() and
CreateCheckPoint()?
This sounds reasonable to me.
I agree that tracking executed checkpoints is valuable, but, and perhaps
I'm missing something, isn't that the same as tracking non-skipped
checkpoints? I suppose we could have both, if we really feel the need,
provided that doesn't result in more work or effort being done than
simply keeping the count. I'd hate to end up in a situation where we're
writing things out unnecessarily just to keep track of checkpoints that
were requested but ultimately skipped because there wasn't anything to
do.
Thanks!
Stephen
At Tue, 3 Oct 2017 08:22:27 -0400, Stephen Frost <sfrost@snowman.net> wrote in <20171003122227.GJ4628@tamriel.snowman.net>
Greetings,
* Kyotaro HORIGUCHI (horiguchi.kyotaro@lab.ntt.co.jp) wrote:
At Tue, 3 Oct 2017 10:23:08 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQ3Q1J_wBC7yPXk39dO0RGvbM4-nYp2gMrCJ7pfPJXcYw@mail.gmail.com>
On Tue, Oct 3, 2017 at 12:01 AM, Stephen Frost <sfrost@snowman.net> wrote:
Since their introduction in
335feca4, m_timed_checkpoints and m_requested_checkpoints track the
number of checkpoint requests, not if a checkpoint has been actually
executed or not, I am not sure that this should be changed after 10
years. So, to put it in other words, wouldn't we want a way to track
checkpoints that are *executed*, meaning that we could increment a
counter after doing the skip checks in CreateRestartPoint() and
CreateCheckPoint()?
This sounds reasonable to me.
I agree that tracking executed checkpoints is valuable, but, and perhaps
I'm missing something, isn't that the same as tracking non-skipped
checkpoints? I suppose we could have both, if we really feel the need,
provided that doesn't result in more work or effort being done than
simply keeping the count. I'd hate to end up in a situation where we're
writing things out unnecessarily just to keep track of checkpoints that
were requested but ultimately skipped because there wasn't anything to
do.
I'm fine with counting both executed and skipped. But perhaps the
time of the latest checkpoint fits the concern better, like
vacuum. It is visible in the control file but not in system views. If we
count skipped checkpoints, I'd also like to see the time (or LSN)
of the last checkpoint in system views.
checkpoints_timed | bigint | | |
checkpoints_req | bigint | | |
+ checkpoints_skipped | bigint
+ last_checkpoint | timestamp with time zone or LSN?
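For what it's worth, the control file data mentioned here is already reachable from SQL (PostgreSQL 9.6 and later), just not from the statistics views; a quick sketch:

-- Time and LSN of the last completed checkpoint, read from the control file.
SELECT checkpoint_time, checkpoint_lsn, redo_lsn
FROM pg_control_checkpoint();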
# This reminded me of a concern. I'd like to count vacuums that
# are required but skipped due to lock failure, or killed by another
# backend.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro HORIGUCHI wrote:
# This reminded me of a concern. I'd like to count vacuums that
# are required but skipped by lock-failure, or killed by other
# backend.
We clearly need to improve the stats and logs related to vacuuming work
executed, both by autovacuum and manually invoked. One other item I
have in my head is to report numbers related to the truncation phase of
a vacuum run, since in some cases it causes horrible and hard to
diagnose problems. (Also, add an reloption to stop vacuum from doing
the truncation phase at all -- for some usage patterns that is a serious
problem.)
However, please do open a new thread about it.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Thu, 5 Oct 2017 13:41:42 +0200, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in <20171005114142.dupjeqe2cnplhgkx@alvherre.pgsql>
Kyotaro HORIGUCHI wrote:
# This reminded me of a concern. I'd like to count vacuums that
# are required but skipped by lock-failure, or killed by other
# backend.
We clearly need to improve the stats and logs related to vacuuming work
executed, both by autovacuum and manually invoked. One other item I
have in my head is to report numbers related to the truncation phase of
a vacuum run, since in some cases it causes horrible and hard to
diagnose problems. (Also, add an reloption to stop vacuum from doing
the truncation phase at all -- for some usage patterns that is a serious
problem.)
However, please do open a new thread about it.
Thanks! Will do after a bit of time to organize my thoughts.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
Once in a while I am asked about table bloat. In most cases the
cause is long-lasting transactions, and in some cases vacuum
cancellation. Whatever the case, users don't have enough clues as to why
they have bloated tables.
At the top of the annoyances list for users would be that they
cannot know whether autovacuum decided that a table needs vacuum
or not. I suppose that it could be shown in pg_stat_*_tables.
n_mod_since_analyze | 20000
+ vacuum_required | true
last_vacuum | 2017-10-10 17:21:54.380805+09
If vacuum_required remains true for a certain time, it means that
vacuuming stopped halfway or someone is killing it repeatedly.
That status could be shown in the same view.
n_mod_since_analyze | 20000
+ vacuum_required | true
last_vacuum | 2017-10-10 17:21:54.380805+09
last_autovacuum | 2017-10-10 17:21:54.380805+09
+ last_autovacuum_status | Killed by lock conflict
Where the "Killed by lock conflict" would be one of the followings.
- Completed (oldest xmin = 8023)
- May not be fully truncated (yielded at 1324 of 6447 expected)
- Truncation skipped
- Skipped by lock failure
- Killed by lock conflict
If we want more formal expression, we can show the values in the
following shape. And adding some more values could be useful.
n_mod_since_analyze | 20000
+ vacuum_required | true
+ last_vacuum_oldest_xid | 8023
+ last_vacuum_left_to_truncate | 5123
+ last_vacuum_truncated | 387
last_vacuum | 2017-10-10 17:21:54.380805+09
last_autovacuum | 2017-10-10 17:21:54.380805+09
+ last_autovacuum_status | Killed by lock conflict
...
autovacuum_count | 128
+ incomplete_autovacuum_count | 53
# The last one might be needless..
Where the "Killed by lock conflict" is one of the followings.
- Completed
- Truncation skipped
- Partially truncated
- Skipped
- Killed by lock conflict
This seems enough to find the cause of a table bloat. The same
discussion could be applied to analyze, but that might be
another issue.
There may be a better way to indicate the vacuum soundness. Any
opinions and suggestions are welcome.
I'm going to make a patch to do the 'formal' one for the time
being.
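As a usage sketch only (vacuum_required and last_autovacuum_status are the hypothetical columns proposed above, not existing ones), a DBA hunting for bloat could then run something like:

-- Tables that still need a vacuum but whose last autovacuum did not complete.
SELECT relname, n_mod_since_analyze, last_autovacuum, last_autovacuum_status
FROM pg_stat_user_tables
WHERE vacuum_required
  AND last_autovacuum_status IS DISTINCT FROM 'Completed';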
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Oct 10, 2017 at 7:26 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello.
Once in a while I am asked about table bloat. In most cases the
cause is long lasting transactions and vacuum canceling in some
cases. Whatever the case users don't have enough clues to why
they have bloated tables.
At the top of the annoyances list for users would be that they
cannot know whether autovacuum decided that a table needs vacuum
or not. I suppose that it could be shown in pg_stat_*_tables.
n_mod_since_analyze | 20000
+ vacuum_required | true
last_vacuum | 2017-10-10 17:21:54.380805+09
If vacuum_required remains true for a certain time, it means that
vacuuming stopped halfway or someone is killing it repeatedly.
That status could be shown in the same view.
Because the table statistics are updated at the end of the vacuum, I think
that autovacuum will process the table at the next cycle if it has
stopped halfway or has been killed. So you mean that vacuum_required is for
users who want to reclaim garbage without waiting for an autovacuum retry?
n_mod_since_analyze | 20000
+ vacuum_required | true
last_vacuum | 2017-10-10 17:21:54.380805+09
last_autovacuum | 2017-10-10 17:21:54.380805+09
+ last_autovacuum_status | Killed by lock conflict
Where the "Killed by lock conflict" would be one of the following.
- Completed (oldest xmin = 8023)
- May not be fully truncated (yielded at 1324 of 6447 expected)
- Truncation skipped
- Skipped by lock failure
- Killed by lock conflict
If we want more formal expression, we can show the values in the
following shape. And adding some more values could be useful.
n_mod_since_analyze | 20000
+ vacuum_required | true
+ last_vacuum_oldest_xid | 8023
+ last_vacuum_left_to_truncate | 5123
+ last_vacuum_truncated | 387
last_vacuum | 2017-10-10 17:21:54.380805+09
last_autovacuum | 2017-10-10 17:21:54.380805+09
+ last_autovacuum_status | Killed by lock conflict
...
autovacuum_count | 128
+ incomplete_autovacuum_count | 53
# The last one might be needless..
I'm not sure that the above information will help users or DBAs,
but personally I sometimes want to have the number of index scans of
the last autovacuum in the pg_stat_user_tables view. That value
indicates how efficiently vacuums performed and would be a signal to
increase the setting of autovacuum_work_mem for the user.
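As a rough rule of thumb (back-of-the-envelope arithmetic, not from any patch here): each dead tuple pointer takes 6 bytes, so the memory setting caps how many dead tuples a single index-scan pass can cover.

-- With autovacuum_work_mem at its default (-1, i.e. maintenance_work_mem,
-- default 64MB), roughly 11 million dead tuples fit per index-scan pass;
-- a vacuum reporting many index scans for fewer dead tuples than that
-- suggests the memory setting is too small.
SELECT pg_size_bytes('64MB') / 6 AS dead_tuples_per_pass;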
Where the "Killed by lock conflict" is one of the followings.
- Completed
- Truncation skipped
- Partially truncated
- Skipped
- Killed by lock conflict
This seems enough to find the cause of a table bloat. The same
discussion could be applied to analyze but it might be the
another issue.
There may be a better way to indicate the vacuum soundness. Any
opinions and suggestions are welcome.
I'm going to make a patch to do the 'formal' one for the time
being.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Mmm. I've failed to create a brand-new thread..
Thank you for the comment.
At Fri, 20 Oct 2017 19:15:16 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAkaw-u0feAVN_VrKZA5tvzp7jT=mQCQP-SvMegKXHHaw@mail.gmail.com>
On Tue, Oct 10, 2017 at 7:26 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello.
Once in a while I am asked about table bloat. In most cases the
cause is long lasting transactions and vacuum canceling in some
cases. Whatever the case users don't have enough clues to why
they have bloated tables.
At the top of the annoyances list for users would be that they
cannot know whether autovacuum decided that a table needs vacuum
or not. I suppose that it could be shown in pg_stat_*_tables.
n_mod_since_analyze | 20000
+ vacuum_required | true
last_vacuum | 2017-10-10 17:21:54.380805+09
If vacuum_required remains true for a certain time, it means that
vacuuming stopped halfway or someone is killing it repeatedly.
That status could be shown in the same view.
Because the table statistics is updated at end of the vacuum I think
that the autovacuum will process the table at the next cycle if it has
stopped halfway or has killed. So you mean that vacuum_required is for
uses who want to reclaim garbage without wait for autovacuum retry?
It could be used for that purpose, and also just for knowing that a
table has been left needing a vacuum for a long time, which would be a
trigger for users to take measures to deal with the situation.
n_mod_since_analyze | 20000
+ vacuum_required | true
last_vacuum | 2017-10-10 17:21:54.380805+09
last_autovacuum | 2017-10-10 17:21:54.380805+09
+ last_autovacuum_status | Killed by lock conflict
Where the "Killed by lock conflict" would be one of the following.
- Completed (oldest xmin = 8023)
- May not be fully truncated (yielded at 1324 of 6447 expected)
- Truncation skipped
- Skipped by lock failure
- Killed by lock conflict
If we want more formal expression, we can show the values in the
following shape. And adding some more values could be useful.
n_mod_since_analyze | 20000
+ vacuum_required | true
+ last_vacuum_oldest_xid | 8023
+ last_vacuum_left_to_truncate | 5123
+ last_vacuum_truncated | 387
last_vacuum | 2017-10-10 17:21:54.380805+09
last_autovacuum | 2017-10-10 17:21:54.380805+09
+ last_autovacuum_status | Killed by lock conflict
...
autovacuum_count | 128
+ incomplete_autovacuum_count | 53
# The last one might be needless..
I'm not sure that the above informations will help for users or DBA
but personally I sometimes want to have the number of index scans of
the last autovacuum in the pg_stat_user_tables view. That value
indicates how efficiently vacuums performed and would be a signal to
increase the setting of autovacuum_work_mem for user.
Btree and all existing index AMs (except brin) seem to visit all
pages in every index scan, so it would be valuable. Instead,
the number of visited index pages during a table scan might be
usable. It is more relevant to performance than the number of
scans; on the other hand, it is a bit difficult to get something
useful from that number at a glance. I'll show the number of scans
in the first cut.
Where the "Killed by lock conflict" is one of the followings.
- Completed
- Truncation skipped
- Partially truncated
- Skipped
- Killed by lock conflict
This seems enough to find the cause of a table bloat. The same
discussion could be applied to analyze but it might be the
another issue.
There may be a better way to indicate the vacuum soundness. Any
opinions and suggestions are welcome.
I'm going to make a patch to do the 'formal' one for the time
being.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 26 Oct 2017 15:06:30 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171026.150630.115694437.horiguchi.kyotaro@lab.ntt.co.jp>
At Fri, 20 Oct 2017 19:15:16 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAkaw-u0feAVN_VrKZA5tvzp7jT=mQCQP-SvMegKXHHaw@mail.gmail.com>
n_mod_since_analyze | 20000
+ vacuum_required | true
+ last_vacuum_oldest_xid | 8023
+ last_vacuum_left_to_truncate | 5123
+ last_vacuum_truncated | 387
last_vacuum | 2017-10-10 17:21:54.380805+09
last_autovacuum | 2017-10-10 17:21:54.380805+09
+ last_autovacuum_status | Killed by lock conflict
...
autovacuum_count | 128
+ incomplete_autovacuum_count | 53
# The last one might be needless..
I'm not sure that the above informations will help for users or DBA
but personally I sometimes want to have the number of index scans of
the last autovacuum in the pg_stat_user_tables view. That value
indicates how efficiently vacuums performed and would be a signal to
increase the setting of autovacuum_work_mem for user.
Btree and all existing index AMs (except brin) seem to visit the
all pages in every index scan so it would be valuable. Instead
the number of visited index pages during a table scan might be
usable. It is more relevant to performance than the number of
scans, on the other hand it is a bit difficult to get something
worth from the number in a moment. I'll show the number of scans
in the first cut.
Where the "Killed by lock conflict" is one of the following.
- Completed
- Truncation skipped
- Partially truncated
- Skipped
- Killed by lock conflict
This seems enough to find the cause of a table bloat. The same
discussion could be applied to analyze but it might be the
another issue.
There may be a better way to indicate the vacuum soundness. Any
opinions and suggestions are welcome.
I'm going to make a patch to do the 'formal' one for the time
being.
Done with small modifications. In the attached patch
pg_stat_all_tables has the following new columns. Documentation
is not provided at this stage.
-----
n_mod_since_analyze | 0
+ vacuum_required | not required
last_vacuum |
last_autovacuum | 2017-10-30 18:51:32.060551+09
last_analyze |
last_autoanalyze | 2017-10-30 18:48:33.414711+09
vacuum_count | 0
+ last_vacuum_truncated | 0
+ last_vacuum_untruncated | 0
+ last_vacuum_index_scans | 0
+ last_vacuum_oldest_xmin | 2134
+ last_vacuum_status | aggressive vacuum completed
+ autovacuum_fail_count | 0
autovacuum_count | 5
analyze_count | 0
autoanalyze_count | 1
-----
Where each column shows the following information.
+ vacuum_required | not requried
VACUUM requirement status. Takes the following values.
- partial
A partial (or normal) vacuum will be performed by the next autovacuum.
The word "partial" is taken from the comment for
vacuum_set_xid_limits.
- aggressive
Aggressive scan will be performed by the next autovacuum.
- required
Any type of autovacuum will be performed. The type of scan is
unknown because the view failed to take the required lock on
the table. (AutoVacuumRequirement())
- not required
Next autovacuum won't perform scan on this relation.
- not required (lock not acquired)
Autovacuum should be disabled and the distance to
freeze-limit is not known because required lock is not
available.
- close to freeze-limit xid
Shown while autovacuum is disabled. The table is in the
manual vacuum window to avoid anti-wraparound autovacuum.
+ last_vacuum_truncated | 0
The number of truncated pages in the last completed
(auto)vacuum.
+ last_vacuum_untruncated | 0
The number of pages the last completed (auto)vacuum tried to
truncate but could not for some reason.
+ last_vacuum_index_scans | 0
The number of index scans performed in the last completed
(auto)vacuum.
+ last_vacuum_oldest_xmin | 2134
The oldest xmin used in the last completed (auto)vacuum.
+ last_vacuum_status | aggressive vacuum completed
The finish status of the last vacuum. Takes the following
values. (pg_stat_get_last_vacuum_status())
- completed
The last partial (auto)vacuum is completed.
- vacuum full completed
The last VACUUM FULL is completed.
- aggressive vacuum completed
The last aggressive (auto)vacuum is completed.
- error while $progress
The last vacuum was stopped by an error while $progress.
The $progress is one of the vacuum progress phases.
- canceled while $progress
The last vacuum was canceled while $progress.
This is caused by user cancellation of a manual vacuum, or by
being killed by another backend that wants to acquire a lock on the
relation.
- skipped - lock unavailable
The last autovacuum on the relation was skipped because
required lock was not available.
- unvacuumable
A past autovacuum tried to vacuum the relation but it is not
vacuumable due to an ownership or accessibility problem.
(Such relations are not shown in pg_stat_all_tables..)
+ autovacuum_fail_count | 0
The number of successive failures of vacuum on the relation.
Reset to zero by a completed vacuum.
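As an illustration of how this might be used (all three columns referenced here come from this patch and do not exist in stock PostgreSQL):

-- Tables whose recent (auto)vacuums keep failing or being skipped.
SELECT relname, vacuum_required, last_vacuum_status, autovacuum_fail_count
FROM pg_stat_all_tables
WHERE autovacuum_fail_count > 0
ORDER BY autovacuum_fail_count DESC;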
======
In the patch, vacrelstats is pointed to from a static variable and
cancel reporting is performed in the PG_CATCH() section in vacuum().
Every unthrown error, like a lock acquisition failure, is reported by an
explicit pgstat_report_vacuum() call with the corresponding finish
code.
Vacuum requirement status is calculated in AutoVacuumRequirement()
and returned as a string. An access share lock on the target relation
is required, but it returns only the available values if the lock is
not available. I decided to return an incomplete (coarse-grained)
result rather than wait for a lock that isn't known to be released in
a short time just to get a perfect result.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Oct 30, 2017 at 8:57 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_vacuum_necessity(C.oid) AS vacuum_required,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze,
pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
+ pg_stat_get_last_vacuum_truncated(C.oid) AS last_vacuum_truncated,
+ pg_stat_get_last_vacuum_untruncated(C.oid) AS last_vacuum_untruncated,
+ pg_stat_get_last_vacuum_index_scans(C.oid) AS last_vacuum_index_scans,
+ pg_stat_get_last_vacuum_oldest_xmin(C.oid) AS last_vacuum_oldest_xmin,
+ pg_stat_get_last_vacuum_status(C.oid) AS last_vacuum_status,
+ pg_stat_get_autovacuum_fail_count(C.oid) AS autovacuum_fail_count,
Please use spaces instead of tabs. Indentation is not consistent.
+ case PGSTAT_VACUUM_CANCELED:
+ phase = tabentry->vacuum_last_phase;
+ /* number of elements of phasestr above */
+ if (phase >= 0 && phase <= 7)
+ result = psprintf("%s while %s",
+ status == PGSTAT_VACUUM_CANCELED ?
+ "canceled" : "error",
+ phasestr[phase]);
Such complication is not necessary. The phase parameter is updated by
individual calls of pgstat_progress_update_param(), so the information
shown here overlaps with the existing information in the "phase"
field.
@@ -210,7 +361,6 @@ pg_stat_get_blocks_fetched(PG_FUNCTION_ARGS)
PG_RETURN_INT64(result);
}
-
Datum
pg_stat_get_blocks_hit(PG_FUNCTION_ARGS)
Noise diff.
Thinking about trying to get something into core by the end of the
commit fest, this patch presents multiple concepts at once which could
be split into separate patches for simplicity:
1) Additional data fields to help in debugging completed vacuums.
2) Tracking of interrupted vacuum jobs in progress table.
3) Get state of vacuum job on error.
However, progress reports are here to allow users to make decisions
based on how things are working. This patch proposes
to add multiple new fields:
- oldest Xmin.
- number of index scans.
- number of pages truncated.
- number of pages that should have been truncated, but are not truncated.
Among all this information, as Sawada-san has already mentioned
upthread, the more index scans there are, the fewer dead tuples you can
store at once, so autovacuum_work_mem ought to be increased. This is useful for
tuning and should be documented properly if reported to give
indications about vacuum behavior. The rest though, could indicate how
aggressive autovacuum is able to remove tail blocks and do its work.
But what really matters for users to decide if autovacuum should be
more aggressive is tracking the number of dead tuples, something which
is already evaluated.
Tracking the number of failed vacuum attempts is also something
helpful to understand how much of the job is able to complete. As there
is already tracking of vacuum jobs that have completed, it could be
possible, instead of logging activity when a vacuum job has failed, to
track the number of *begun* jobs on a relation. Then it is possible to
guess how many have failed by taking the difference with those that
completed properly. Having counters per failure type could also be a
possibility.
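A sketch of that idea with a hypothetical counter (no such column exists today); failures would simply be the begun jobs minus the completed ones:

-- vacuum_begun_count is hypothetical; vacuum_count and autovacuum_count are
-- the existing completed-run counters in pg_stat_all_tables.
SELECT relname,
       vacuum_begun_count - (vacuum_count + autovacuum_count) AS failed_vacuums
FROM pg_stat_all_tables;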
For this commit fest, I would suggest a patch that simply adds
tracking for the number of index scans done, with documentation to
give recommendations about parameter tuning. I am switching the patch
to "waiting on author".
--
Michael
Thank you for reviewing this.
At Wed, 15 Nov 2017 16:13:01 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQm_WCKuUf5RD0CzeMuMO907ZPKP7mBh-3t2zSJ9jn+PA@mail.gmail.com>
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_vacuum_necessity(C.oid) AS vacuum_required,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze,
pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
Please use spaces instead of tabs. Indentation is not consistent.
Done. Thank you for pointing that out. (whitespace-mode showed me some
similar inconsistencies at other places in the file...)
+ case PGSTAT_VACUUM_CANCELED:
+ phase = tabentry->vacuum_last_phase;
+ /* number of elements of phasestr above */
+ if (phase >= 0 && phase <= 7)
+ result = psprintf("%s while %s",
+ status == PGSTAT_VACUUM_CANCELED ?
+ "canceled" : "error",
+ phasestr[phase]);
Such complication is not necessary. The phase parameter is updated by
individual calls of pgstat_progress_update_param(), so the information
shown here overlaps with the existing information in the "phase" field.
The "phase" is pg_stat_progress_vacuum's? If "complexy" means
phasestr[phase], the "phase" cannot be overlap with
last_vacuum_status since pg_stat_progress_vacuum's entry has
already gone when someone looks into pg_stat_all_tables and see a
failed vacuum status. Could you give a bit specific comment?
@@ -210,7 +361,6 @@ pg_stat_get_blocks_fetched(PG_FUNCTION_ARGS)
PG_RETURN_INT64(result);
}
-
Datum
pg_stat_get_blocks_hit(PG_FUNCTION_ARGS)
Noise diff.
Removed.
Thinking about trying to et something into core by the end of the
commit fest, this patch presents multiple concepts at once which could
be split into separate patches for simplicity:
1) Additional data fields to help in debugging completed vacuums.
2) Tracking of interrupted vacuum jobs in progress table.
3) Get state of vacuum job on error.
However, progress reports are here to allow users to make decisions
based on how things are working. This patch proposes
to add multiple new fields:
- oldest Xmin.
- number of index scans.
- number of pages truncated.
- number of pages that should have been truncated, but are not truncated.
Among all this information, as Sawada-san has already mentioned
upthread, the more index scans the less dead tuples you can store at
once, so autovacuum_work_mem ought to be increases. This is useful for
tuning and should be documented properly if reported to give
indications about vacuum behavior. The rest though, could indicate how
aggressive autovacuum is able to remove tail blocks and do its work.
But what really matters for users to decide if autovacuum should be
more aggressive is tracking the number of dead tuples, something which
is already evaluated.
Hmm. I tend to agree. Such numbers are better shown as an
average of the last n vacuums, or a maximum. I decided to show
last_vacuum_index_scans only, and I think that someone can record
it continuously elsewhere if they want.
Tracking the number of failed vacuum attempts is also something
helpful to understand how much the job is able to complete. As there
is already tracking vacuum jobs that have completed, it could be
possible, instead of logging activity when a vacuum job has failed, to
track the number of *begun* jobs on a relation. Then it is possible to
guess how many have failed by taking the difference between those that
completed properly. Having counters per failure types could also be a
possibility.
Maybe pg_stat_all_tables is not the place to hold so many kinds
of vacuum-specific information. pg_stat_vacuum_all_tables or
something like that?
For this commit fest, I would suggest a patch that simply adds
tracking for the number of index scans done, with documentation to
give recommendations about parameter tuning. i am switching the patch
as "waiting on author".
Ok, the patch has been split into the following four parts. (Not
split by function, but by the kind of information to add.)
The first one is that.
0001. Adds pg_stat_all_tables.last_vacuum_index_scans. Documentation is added.
0002. Adds pg_stat_all_tables.vacuum_required. And primitive documentation.
0003. Adds pg_stat_all_tables.last_vacuum_status/autovacuum_fail_count
plus primitive documentation.
0004. truncation information stuff.
One concern about the pg_stat_all_tables view is the number of
predefined functions it uses. It currently uses 20 functions, and this
patch adds seven more. I feel it would be better if at least the
functions this patch adds were merged into one function..
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Nov 16, 2017 at 7:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Wed, 15 Nov 2017 16:13:01 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQm_WCKuUf5RD0CzeMuMO907ZPKP7mBh-3t2zSJ9jn+PA@mail.gmail.com>
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_vacuum_necessity(C.oid) AS vacuum_required,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze,
pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
Please use spaces instead of tabs. Indentation is not consistent.
Done. Thank you for pointing that out. (whitespace-mode showed me some
similar inconsistencies at other places in the file...)
Yes, I am aware of those which get introduced here and there. Let's
not make things worse..
+ case PGSTAT_VACUUM_CANCELED:
+ phase = tabentry->vacuum_last_phase;
+ /* number of elements of phasestr above */
+ if (phase >= 0 && phase <= 7)
+ result = psprintf("%s while %s",
+ status == PGSTAT_VACUUM_CANCELED ?
+ "canceled" : "error",
+ phasestr[phase]);
Such complication is not necessary. The phase parameter is updated by
individual calls of pgstat_progress_update_param(), so the information
shown here overlaps with the existing information in the "phase" field.
The "phase" is pg_stat_progress_vacuum's? If the "complication" means
phasestr[phase], the "phase" cannot overlap with
last_vacuum_status, since pg_stat_progress_vacuum's entry has
already gone when someone looks into pg_stat_all_tables and sees a
failed vacuum status. Could you give a bit more specific comment?
I mean that if you tend to report this information, you should just
use a separate column for it. Having a single column report two pieces
of information, here the type of error and potentially the moment
where it appeared, is harder to parse.
However, progress reports are here to allow users to do decisions
based on the activity of how things are working. This patch proposes
to add multiple new fields:
- oldest Xmin.
- number of index scans.
- number of pages truncated.
- number of pages that should have been truncated, but are not truncated.
Among all this information, as Sawada-san has already mentioned
upthread, the more index scans the less dead tuples you can store at
once, so autovacuum_work_mem ought to be increases. This is useful for
tuning and should be documented properly if reported to give
indications about vacuum behavior. The rest though, could indicate how
aggressive autovacuum is able to remove tail blocks and do its work.
But what really matters for users to decide if autovacuum should be
more aggressive is tracking the number of dead tuples, something which
is already evaluated.
Hmm. I tend to agree. Such numbers would be better shown as an
average or maximum over the last n vacuums. I decided to show
last_vacuum_index_scans only, and I think that someone can record
it continuously elsewhere if they want.
As a user, what would you make of those numbers? How would they help
in tuning autovacuum for a relation? We need to clear up those
questions before thinking if there are cases where those are useful.
Tracking the number of failed vacuum attempts is also something
helpful to understand how much of the job is able to complete. As there
is already tracking of vacuum jobs that have completed, it could be
possible, instead of logging activity when a vacuum job has failed, to
track the number of *begun* jobs on a relation. Then it is possible to
guess how many have failed by taking the difference with those that
completed properly. Having counters per failure type could also be a
possibility.
Maybe pg_stat_all_tables is not the place to hold so many kinds
of vacuum-specific information. pg_stat_vacuum_all_tables or
something like that?
What do you have in mind? pg_stat_all_tables already includes counters
about the number of vacuums and analyze runs completed. I guess that
the number of failures, and the types of failures ought to be similar
counters at the same level.
For this commit fest, I would suggest a patch that simply adds
tracking for the number of index scans done, with documentation to
give recommendations about parameter tuning. I am switching the patch
as "waiting on author".
Ok, the patch has been split into the following four parts (not
split by function, but by the kind of information to add).
The first one is exactly that.
0001. Adds pg_stat_all_tables.last_vacuum_index_scans. Documentation is added.
0002. Adds pg_stat_all_tables.vacuum_required. And primitive documentation.
0003. Adds pg_stat_all_tables.last_vacuum_status/autovacuum_fail_count
plus primitive documentation.
0004. truncation information stuff.
One concern on pg_stat_all_tables view is the number of
predefined functions it uses. Currently 20 functions and this
patch adds more seven. I feel it's better that at least the
functions this patch adds are merged into one function..
For the scope of this commit fest, why not focus only on 0001 with
the time that remains? This at least is something I am sure will be
useful.
<para>
+ Vacuuming scans all index pages to remove index entries that pointed
+ the removed tuples. In order to finish vacuuming by as few index
+ scans as possible, the removed tuples are remembered in working
+ memory. If this setting is not large enough, vacuuming runs
+ additional index scans to vacate the memory and it might cause a
+ performance problem. That behavior can be monitored
+ in <xref linkend="pg-stat-all-tables-view">.
+ </para>
Why not make that the third paragraph, after autovacuum_work_mem has
been mentioned for the first time? This could be reworded as well.
Short idea:
Vacuum scans all index pages to remove index entries that pointed to
dead tuples. Finishing vacuum with a minimal number of index scans
reduces the time it takes to complete it, and a new scan is triggered
once the in-memory storage for dead tuple pointers gets full, whose
size is defined by autovacuum_work_mem. So increasing this parameter
can make the operation finish more quickly. This can be monitored with
pg_stat_all_tables.
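As an illustration of that paragraph (not part of the patch itself), the number of index scan passes of a *running* vacuum can already be watched through the existing pg_stat_progress_vacuum view; the proposed last_vacuum_index_scans column would simply retain that number after the vacuum has finished. A minimal sketch:

-- Watch a running vacuum; index_vacuum_count rising while num_dead_tuples
-- keeps reaching max_dead_tuples suggests that maintenance_work_mem or
-- autovacuum_work_mem is too small for this table.
SELECT p.datname,
       c.relname,
       p.phase,
       p.index_vacuum_count,    -- index scan passes completed so far
       p.num_dead_tuples,
       p.max_dead_tuples
FROM pg_stat_progress_vacuum AS p
LEFT JOIN pg_class AS c ON c.oid = p.relid;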
pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
+ pg_stat_get_last_vacuum_index_scans(C.oid) AS last_vacuum_index_scans,
pg_stat_get_autovacuum_count(C.oid) AS autovacuum_count,
Counters with counters, and last vacuum info with last vacuum info, no?
--
Michael
Thank you for the comments.
At Sat, 18 Nov 2017 22:23:20 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQV1Emkj=5VFzui250T6v+xcpRQ2RfHu_oQMbdXnZw3mA@mail.gmail.com>
On Thu, Nov 16, 2017 at 7:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:At Wed, 15 Nov 2017 16:13:01 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQm_WCKuUf5RD0CzeMuMO907ZPKP7mBh-3t2zSJ9jn+PA@mail.gmail.com>
Please use spaces instead of tabs. Indentation is not consistent.
Done. Thank you for pointing. (whitespace-mode showed me some
similar inconsistencies at the other places in the file...)Yes, I am aware of those which get introduced here and there. Let's
not make things worse..
Yeah, I agree with it.
+ case PGSTAT_VACUUM_CANCELED:
+     phase = tabentry->vacuum_last_phase;
+     /* number of elements of phasestr above */
+     if (phase >= 0 && phase <= 7)
+         result = psprintf("%s while %s",
+                           status == PGSTAT_VACUUM_CANCELED ?
+                           "canceled" : "error",
+                           phasestr[phase]);
Such complication is not necessary. The phase parameter is updated by
individual calls of pgstat_progress_update_param(), so the information
shown here overlaps with the existing information in the "phase" field.
The "phase" you mean is pg_stat_progress_vacuum's? If the "complication" means
phasestr[phase], the "phase" cannot overlap with
last_vacuum_status, since pg_stat_progress_vacuum's entry is
already gone by the time someone looks into pg_stat_all_tables and sees a
failed vacuum status. Could you give a slightly more specific comment?
I mean that if you intend to report this information, you should just
use a separate column for it. Having a single column report two
pieces of information, which are here the type of error and potentially the
moment where it appeared, is harder to parse.
Thanks for the explanation. Ok, now "last_vacuum_status" just
shows how the last vacuum or autovacuum finished, as one of "completed",
"error", "canceled" and "skipped". "last_vacuum_status_detail"
shows the phase at exit for "error" or "canceled". They still have
a somewhat complex relationship (pgstatfuncs.c). "error" and
"canceled" could be unified since the error code is already shown
in the log.
last_vac_status | last_vac_stat_detail
================+=======================
"completed"     | (null)/"aggressive"/"full" + "partially truncated" +
                | "not a target"
"skipped"       | "lock failure"
"error"         | <errcode> + <phase>
"canceled"      | <phase>
However, progress reports are here to allow users to do decisions
based on the activity of how things are working. This patch proposes
to add multiple new fields:
- oldest Xmin.
- number of index scans.
- number of pages truncated.
- number of pages that should have been truncated, but are not truncated.
Among all this information, as Sawada-san has already mentioned
upthread, the more index scans the less dead tuples you can store at
once, so autovacuum_work_mem ought to be increased. This is useful for
tuning and should be documented properly if reported to give
indications about vacuum behavior. The rest though, could indicate how
aggressive autovacuum is able to remove tail blocks and do its work.
But what really matters for users to decide if autovacuum should be
more aggressive is tracking the number of dead tuples, something which
is already evaluated.
Hmm. I tend to agree. Such numbers would be better shown as an
average or maximum over the last n vacuums. I decided to show
last_vacuum_index_scans only, and I think that someone can record
it continuously elsewhere if they want.
As a user, what would you make of those numbers? How would they help
in tuning autovacuum for a relation? We need to clear up those
questions before thinking if there are cases where those are useful.
Ah, I see what you meant. The criterion for choosing the numbers in
the previous patch was simply what is not logged yet is usable to
find whether something wrong is happening with vacuum. So the number of
index scans was not in the list. The objective here is to check
the health of vacuum just by looking into the stats views.
vacuum_required: apparently cannot be logged, and it is not so
        easy to calculate.
last_vacuum_index_scans: It is shown in the log, but I agree that it
        is useful for finding that maintenance_work_mem is too small.
last_vacuum_status: It is logged, but users are likely to
        examine it only after something bad has happened.
        "completed" in this column immediately shows that
        vacuum on the table is working perfectly.
last_vacuum_status_detail: The cause of a cancel or skip is not
        logged, and it is always hard to find out what is
        wrong. This narrows the area for users and/or support
        to investigate.
autovacuum_fail_count: When vacuum has not been executed for a long
        time, users cannot tell whether vacuum is not
        required at all or vacuum attempts have been
        skipped/canceled. This makes the distinction between
        the two cases.
last_vacuum_untruncated: This is not shown in a log entry. Users can
        find that trailing empty pages are left untruncated.
last_vacuum_truncated: This is shown in the log. It is just here in
        order to be compared with untruncated, since the number of
        untruncated pages alone is not meaningful.
        Or, conversely, users can find that relations are
        *unwantedly* truncated (as far as I understand the
        suggestion from Alvaro).
last_vacuum_oldest_xmin: A problem that happens very frequently is table
        bloat caused by long transactions (a monitoring sketch
        using existing views follows below).
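As an aside, the long-transaction case in the last item can already be narrowed down with existing views; the proposed last_vacuum_oldest_xmin column would additionally record, per table, the horizon the last vacuum actually used. A rough sketch using only existing columns:

-- Sessions whose snapshot may be holding back vacuum's cleanup horizon.
SELECT pid,
       datname,
       state,
       backend_xmin,
       age(backend_xmin) AS xmin_age,
       now() - xact_start AS xact_duration
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;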
Tracking the number of failed vacuum attempts is also something
helpful to understand how much the job is able to complete. As there
is already tracking vacuum jobs that have completed, it could be
possible, instead of logging activity when a vacuum job has failed, to
track the number of *begun* jobs on a relation. Then it is possible to
guess how many have failed by taking the difference between those that
completed properly. Having counters per failure types could also be a
possibility.
Maybe pg_stat_all_tables is not the place to hold so many kinds
of vacuum-specific information. pg_stat_vacuum_all_tables or
something like that?
What do you have in mind? pg_stat_all_tables already includes counters
Nothing specific in my mind.
about the number of vacuums and analyze runs completed. I guess that
the number of failures, and the types of failures ought to be similar
counters at the same level.
Yes, my concern here is how many columns we can allow in a stats
view. I think I'm a bit too worried about that.
For this commit fest, I would suggest a patch that simply adds
tracking for the number of index scans done, with documentation to
give recommendations about parameter tuning. I am switching the patch
as "waiting on author".
Ok, the patch has been split into the following four parts (not
split by function, but by the kind of information to add).
The first one is exactly that.
0001. Adds pg_stat_all_tables.last_vacuum_index_scans. Documentation is added.
0002. Adds pg_stat_all_tables.vacuum_required. And primitive documentation.
0003. Adds pg_stat_all_tables.last_vacuum_status/autovacuum_fail_count
plus primitive documentation.
0004. truncation information stuff.
One concern on pg_stat_all_tables view is the number of
predefined functions it uses. Currently 20 functions and this
patch adds more seven. I feel it's better that at least the
functions this patch adds are merged into one function.
For the scope of this commit fest, why not focus only on 0001 with
the time that remains? This at least is something I am sure will be
useful.
Yeah, so I separated out the 0001 patch, but it was not my main intention in
this thread. That was 0002 and 0003, so I'd like to *show* them with 0001,
and focusing on 0001 for this commit fest is fine with me.
<para>
+ Vacuuming scans all index pages to remove index entries that pointed
+ the removed tuples. In order to finish vacuuming by as few index
+ scans as possible, the removed tuples are remembered in working
+ memory. If this setting is not large enough, vacuuming runs
+ additional index scans to vacate the memory and it might cause a
+ performance problem. That behavior can be monitored
+ in <xref linkend="pg-stat-all-tables-view">.
+ </para>
Why not make that the third paragraph, after autovacuum_work_mem has
been mentioned for the first time? This could be reworded as well.
That was just to place the Note in the last paragraph. The Note is about the
multiplication of autovacuum_work_mem, not about the GUC
itself. Anyway, I swapped them in this version.
Short idea:
Vacuum scans all index pages to remove index entries that pointed to
dead tuples. Finishing vacuum with a minimal number of index scans
reduces the time it takes to complete it, and a new scan is triggered
once the in-memory storage for dead tuple pointers gets full, whose
size is defined by autovacuum_work_mem. So increasing this parameter
can make the operation finish more quickly. This can be monitored with
pg_stat_all_tables.
I thought that it *must* be reworded anyway (because of my poor
wording). Thanks for rewording. I find this perfect.
pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
+ pg_stat_get_last_vacuum_index_scans(C.oid) AS last_vacuum_index_scans,
pg_stat_get_autovacuum_count(C.oid) AS autovacuum_count,
Counters with counters, and last vacuum info with last vacuum info, no?
Moved it to just above vacuum_count.
By the way, I'm uneasy that 'last_vacuum_index_scans' (and
vacuum_fail_count in 0002 and others in 0003, 0004) covers
both the VACUUM command and autovacuum, while last_vacuum and
vacuum_count refer only to the command. Splitting it into
vacuum/autovacuum seems like nonsense, but the name is confusing. Do
you have any idea?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Tue, Nov 21, 2017 at 4:09 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
By the way I'm uneasy that the 'last_vacuum_index_scans' (and
vacuum_fail_count in 0002 and others in 0003, 0004) is mentioning
both VACUUM command and autovacuum, while last_vacuum and
vacuum_count is mentioning only the command. Splitting it into
vacuum/autovacuum seems nonsense but the name is confusing. Do
you have any idea?
Hm. I think that you should actually have two fields, one for manual
vacuum and one for autovacuum, because each is tied respectively to
maintenance_work_mem and autovacuum_work_mem. This way admins are able
to tune each of those parameters based on a look at
pg_stat_all_tables. So those should perhaps be named
last_vacuum_index_scans and last_autovacuum_index_scans?
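To illustrate the point (both columns are only proposed here, and the threshold is made up), an admin could then do something like:

-- Hypothetical query against the proposed columns: tables whose last manual
-- vacuum or last autovacuum needed more than one index scan pass, shown next
-- to the memory setting each column is tied to.
SELECT relname,
       last_vacuum_index_scans,           -- proposed column
       last_autovacuum_index_scans,       -- proposed column
       current_setting('maintenance_work_mem') AS maintenance_work_mem,
       current_setting('autovacuum_work_mem')  AS autovacuum_work_mem
FROM pg_stat_all_tables
WHERE greatest(last_vacuum_index_scans, last_autovacuum_index_scans) > 1;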
--
Michael
Hello,
At Wed, 22 Nov 2017 08:20:22 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQ03JrEwKqbc0fWJe9Lt1-fAQc961OWw+Upw9QmRXak0A@mail.gmail.com>
On Tue, Nov 21, 2017 at 4:09 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:By the way I'm uneasy that the 'last_vacuum_index_scans' (and
vacuum_fail_count in 0002 and others in 0003, 0004) is mentioning
both VACUUM command and autovacuum, while last_vacuum and
vacuum_count is mentioning only the command. Splitting it into
vacuum/autovacuum seems nonsense but the name is confusing. Do
you have any idea?Hm. I think that you should actually have two fields, one for manual
vacuum and one for autovacuum, because each is tied to respectively
maintenance_work_mem and autovacuum_work_mem. This way admins are able
That's very convincing to me. Thanks for the suggestion.
to tune each one of those parameters depending on a look at
pg_stat_all_tables. So those should be named perhaps
last_vacuum_index_scans and last_autovacuum_index_scans?
Agreed. I'll do so in the next version.
# I forgot to add the version to the patch files...
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Nov 22, 2017 at 1:08 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Wed, 22 Nov 2017 08:20:22 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQ03JrEwKqbc0fWJe9Lt1-fAQc961OWw+Upw9QmRXak0A@mail.gmail.com>
On Tue, Nov 21, 2017 at 4:09 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:By the way I'm uneasy that the 'last_vacuum_index_scans' (and
vacuum_fail_count in 0002 and others in 0003, 0004) is mentioning
both VACUUM command and autovacuum, while last_vacuum and
vacuum_count is mentioning only the command. Splitting it into
vacuum/autovacuum seems nonsense but the name is confusing. Do
you have any idea?Hm. I think that you should actually have two fields, one for manual
vacuum and one for autovacuum, because each is tied to respectively
maintenance_work_mem and autovacuum_work_mem. This way admins are able
That's very convincing to me. Thanks for the suggestion.
to tune each one of those parameters depending on a look at
pg_stat_all_tables. So those should be named perhaps
last_vacuum_index_scans and last_autovacuum_index_scans?
Agreed. I'll do so in the next version.
Thanks for considering the suggestion.
# I forgot to add the version to the patch files...
Don't worry about that. That's not a problem for me; I'll just keep
track of the last entry. With the room I have, I'll keep focused on
0001, by the way. Others are of course welcome to look at 0002 and
onwards.
--
Michael
On Tue, Nov 21, 2017 at 2:09 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Yes, my concern here is how many columns we can allow in a stats
view. I think I'm a bit too worried about that.
I think that's a good thing to worry about. In the past, Tom has
expressed reluctance to make stats tables that have a row per table
any wider at all, IIRC.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Nov 25, 2017 at 12:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 21, 2017 at 2:09 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Yes, my concern here is how many columns we can allow in a stats
view. I think I'm a bit too worried about that.
I think that's a good thing to worry about. In the past, Tom has
expressed reluctance to make stats tables that have a row per table
any wider at all, IIRC.
Tom, any opinions to offer here? pg_stat_all_tables is currently at 22
columns (this takes two full lines on my terminal with a font size of
13). With the first patch of what's proposed on this thread there
would be 24 columns. Perhaps it would be time to split the vacuum
statistics into a new view like pg_stat_tables_vacuum or similar?
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
On Sat, Nov 25, 2017 at 12:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I think that's a good thing to worry about. In the past, Tom has
expressed reluctance to make stats tables that have a row per table
any wider at all, IIRC.
Tom, any opinions to offer here? pg_stat_all_tables is currently at 22
columns (this takes two full lines on my terminal with a font size at
13). With the first patch of what's proposed on this thread there
would be 24 columns. Perhaps it would be time to split the vacuum
statistics into a new view like pg_stat_tables_vacuum or similar?
My concern is not with the width of any view --- you can always select a
subset of columns if a view is too wide for your screen. The issue is the
number of counters in the stats collector's state. We know, without any
doubt, that widening PgStat_StatTabEntry causes serious pain to people
with large numbers of tables; and they have no recourse when we do it.
So the bar to adding more counters is very high IMO. I won't say never,
but I do doubt that something like skipped vacuums should make the cut.
If we could get rid of the copy-to-a-temporary-file technology for
transferring the stats collector's data to backends, then this problem
would probably vanish or at least get a lot less severe. But that seems
like a nontrivial project. With the infrastructure we have today, we
could probably keep the stats tables in a DSM segment; but how would
a backend get a consistent snapshot of them?
regards, tom lane
On Sat, Nov 25, 2017 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
If we could get rid of the copy-to-a-temporary-file technology for
transferring the stats collector's data to backends, then this problem
would probably vanish or at least get a lot less severe. But that seems
like a nontrivial project. With the infrastructure we have today, we
could probably keep the stats tables in a DSM segment; but how would
a backend get a consistent snapshot of them?
I suppose the obvious approach is to have a big lock around the
statistics data proper; this could be taken in shared mode to take a
snapshot or in exclusive mode to update statistics. In addition,
create one or more queues where statistics messages can be enqueued in
lieu of updating the main statistics data directly. If that doesn't
perform well enough, you could keep two copies of the statistics, A
and B. At any given time, one copy is quiescent and the other copy is
being updated. Periodically, at a time when we know that nobody is
taking a snapshot of the statistics, they reverse roles.
Of course, the other obvious question is whether we really need a
consistent snapshot, because that's bound to be pretty expensive even
if you eliminate the I/O cost. Taking a consistent snapshot across
all 100,000 tables in the database even if we're only ever going to
access 5 of those tables doesn't seem like a good or scalable design.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Of course, the other obvious question is whether we really need a
consistent snapshot, because that's bound to be pretty expensive even
if you eliminate the I/O cost. Taking a consistent snapshot across
all 100,000 tables in the database even if we're only ever going to
access 5 of those tables doesn't seem like a good or scalable design.
Mumble. It's a property I'm pretty hesitant to give up, especially
since the stats views have worked like that since day one. It's
inevitable that weakening that guarantee would break peoples' queries,
probably subtly.
What we need is a way to have a consistent snapshot without implementing
it by copying 100,000 tables' worth of data for every query. Hmm, I heard
of a technique called MVCC once ...
regards, tom lane
On Sun, Nov 26, 2017 at 12:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
On Sat, Nov 25, 2017 at 12:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I think that's a good thing to worry about. In the past, Tom has
expressed reluctance to make stats tables that have a row per table
any wider at all, IIRC.Tom, any opinions to offer here? pg_stat_all_tables is currently at 22
columns (this takes two full lines on my terminal with a font size at
13). With the first patch of what's proposed on this thread there
would be 24 columns. Perhaps it would be time to split the vacuum
statistics into a new view like pg_stat_tables_vacuum or similar?My concern is not with the width of any view --- you can always select a
subset of columns if a view is too wide for your screen. The issue is the
number of counters in the stats collector's state. We know, without any
doubt, that widening PgStat_StatTabEntry causes serious pain to people
with large numbers of tables; and they have no recourse when we do it.
So the bar to adding more counters is very high IMO. I won't say never,
but I do doubt that something like skipped vacuums should make the cut.
I am not arguing about skipped vacuum data here (I don't think much of
it, by the way), but about the number of index scans done by the last
vacuum or autovacuum. This helps in tuning autovacuum_work_mem and
maintenance_work_mem. Is the bar still too high for that?
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
I am not arguing about skipped vacuum data here (I don't think much of
it, by the way), but about the number of index scans done by the last
vacuum or autovacuum. This helps in tuning autovacuum_work_mem and
maintenance_work_mem. Is the bar still too high for that?
I'd say so ... that's something the average user will never bother with,
and even if they knew to bother, it's far from obvious what to do with
the information. Besides, I don't think you could just save the number
of scans and nothing else. For it to be meaningful, you'd at least have
to know the prevailing work_mem setting and the number of dead tuples
removed ... and then you'd need some info about your historical average
and maximum number of dead tuples removed, so that you knew whether the
last vacuum operation was at all representative. So this is sounding
like quite a lot of new counters, in support of perhaps 0.1% of the
user population. Most people are just going to set maintenance_work_mem
as high as they can tolerate and call it good.
regards, tom lane
On Sun, Nov 26, 2017 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'd say so ... that's something the average user will never bother with,
and even if they knew to bother, it's far from obvious what to do with
the information. Besides, I don't think you could just save the number
of scans and nothing else. For it to be meaningful, you'd at least have
to know the prevailing work_mem setting and the number of dead tuples
removed ... and then you'd need some info about your historical average
and maximum number of dead tuples removed, so that you knew whether the
last vacuum operation was at all representative. So this is sounding
like quite a lot of new counters, in support of perhaps 0.1% of the
user population. Most people are just going to set maintenance_work_mem
as high as they can tolerate and call it good.
In all the PostgreSQL deployments I deal with, the database is
embedded with other things running in parallel and memory is something
that's shared between components, so being able to tune more precisely
any of the *_work_mem parameters has value (a couple of applications
are also doing autovacuum tuning at relation-level). Would you think
that it is acceptable to add the number of index scans that happened
with the verbose output then? Personally I could live with that
information. I also recall a thread about complaints that VACUUM
VERBOSE already shows too much information, though I cannot put my
finger on it specifically right now. With
log_autovacuum_min_duration, it is easy enough to trace a vacuum
pattern. The thing is that for now the tuning is not that evident, and
CPU cycles can be worth saving in some deployments while memory could
be extended more easily.
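For reference, a minimal sketch of that approach (my_table is just a placeholder): the per-table storage parameter makes every autovacuum of the table log its summary line, which already includes the number of index scans, so the pattern can be traced over time without any new counters.

-- Per-table: log every autovacuum of this table, whatever its duration.
ALTER TABLE my_table SET (log_autovacuum_min_duration = 0);
-- Or globally, in postgresql.conf:
--   log_autovacuum_min_duration = 0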
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
... Would you think
that it is acceptable to add the number of index scans that happened
with the verbose output then?
I don't have an objection to it, but can't you tell that from VACUUM
VERBOSE already? There should be an "INFO: scanned index" line for
each scan.
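For example (a sketch; table and index names are placeholders and the exact INFO wording can differ across versions), two "scanned index" lines for the same index in one run mean the dead-tuple array filled up and a second index pass was needed:

VACUUM VERBOSE my_table;
-- INFO:  vacuuming "public.my_table"
-- INFO:  scanned index "my_table_pkey" to remove <N> row versions
-- INFO:  scanned index "my_table_pkey" to remove <N> row versions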
regards, tom lane
On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Of course, the other obvious question is whether we really need a
consistent snapshot, because that's bound to be pretty expensive even
if you eliminate the I/O cost. Taking a consistent snapshot across
all 100,000 tables in the database even if we're only ever going to
access 5 of those tables doesn't seem like a good or scalable design.Mumble. It's a property I'm pretty hesitant to give up, especially
since the stats views have worked like that since day one. It's
inevitable that weakening that guarantee would break peoples' queries,
probably subtly.
You mean, queries against the stats views, or queries in general? If
the latter, by what mechanism would the breakage happen?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Mumble. It's a property I'm pretty hesitant to give up, especially
since the stats views have worked like that since day one. It's
inevitable that weakening that guarantee would break peoples' queries,
probably subtly.
You mean, queries against the stats views, or queries in general? If
the latter, by what mechanism would the breakage happen?
Queries against the stats views, of course.
regards, tom lane
On Mon, Nov 27, 2017 at 2:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
... Would you think
that it is acceptable to add the number of index scans that happened
with the verbose output then?I don't have an objection to it, but can't you tell that from VACUUM
VERBOSE already? There should be a "INFO: scanned index" line for
each scan.
Of course, sorry for the noise.
--
Michael
On Mon, Nov 27, 2017 at 5:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Mumble. It's a property I'm pretty hesitant to give up, especially
since the stats views have worked like that since day one. It's
inevitable that weakening that guarantee would break peoples' queries,
probably subtly.You mean, queries against the stats views, or queries in general? If
the latter, by what mechanism would the breakage happen?Queries against the stats views, of course.
There has been much discussion on this thread, and the set of patches
as presented may hurt performance for users with a large number of
tables, so I am marking it as returned with feedback.
--
Michael
At Mon, 27 Nov 2017 10:03:25 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTaVWd9vAjRzMOCKHP9k6ge-0u4w_7-YHKZ+gynGN8fpg@mail.gmail.com>
On Mon, Nov 27, 2017 at 5:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Mumble. It's a property I'm pretty hesitant to give up, especially
since the stats views have worked like that since day one. It's
inevitable that weakening that guarantee would break peoples' queries,
probably subtly.You mean, queries against the stats views, or queries in general? If
the latter, by what mechanism would the breakage happen?Queries against the stats views, of course.
There has been much discussion on this thread, and the set of patches
as presented may hurt performance for users with a large number of
tables, so I am marking it as returned with feedback.
--
Michael
Hmmm. Okay, we must make the stats collector more efficient if we
want to have additional counters of smaller significance in the
table stats. Currently sizeof(PgStat_StatTabEntry) is 168
bytes. The whole patchset increases it to 232 bytes. Thus
the size of a stats file for a database with 10000 tables
increases from about 1.7MB to 2.4MB. DSM and shared dynahash are
not dynamically expandable, so placing stats in a shared hash
doesn't seem effective. Stats as a regular table could work, but
that seems like too much.
Is it acceptable to add a new section containing these new
counters, which is just loaded as a byte sequence, with parsing
(and filling of the corresponding hash) postponed until a counter
in the section is actually requested? The new counters would need to be
shown in a separate stats view (maybe named pg_stat_vacuum).
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Nov 27, 2017 at 1:49 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hmmm. Okay, we must make the stats collector more efficient if we
want to have additional counters with smaller significance in the
table stats. Currently sizeof(PgStat_StatTabEntry) is 168
bytes. The whole of the patchset increases it to 232 bytes. Thus
the size of a stat file for a database with 10000 tables
increases from about 1.7MB to 2.4MB. DSM and shared dynahash is
not dynamically expandable so placing stats on shared hash
doesn't seem effective. Stats as a regular table could work but
it seems too-much.
dshash, which is already committed, is both DSM-based and dynamically
expandable.
Is it acceptable that adding a new section containing this new
counters, which is just loaded as a byte sequence and parsing
(and filling the corresponding hash) is postponed until a counter
in the section is really requested? The new counters need to be
shown in a separate stats view (maybe named pg_stat_vacuum).
Still makes the stats file bigger.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Nov 26, 2017 at 3:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Mumble. It's a property I'm pretty hesitant to give up, especially
since the stats views have worked like that since day one. It's
inevitable that weakening that guarantee would break peoples' queries,
probably subtly.You mean, queries against the stats views, or queries in general? If
the latter, by what mechanism would the breakage happen?Queries against the stats views, of course.
Hmm. Those are probably rare. If we only took a snapshot of the
statistics for the backends that explicitly access those views, that
probably wouldn't be too crazy.
Sorry if this is a stupid question, but how often and for what purpose
do regular backends need the stats collector data for purposes other
than querying the stats views? I thought that the data was only used
to decide whether to VACUUM/ANALYZE, and therefore would be accessed
mostly by autovacuum, and for that you'd actually want the most
up-to-date view of the stats for a particular table that is available,
not any older snapshot.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 27, 2017 at 7:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Nov 26, 2017 at 3:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Mumble. It's a property I'm pretty hesitant to give up, especially
since the stats views have worked like that since day one. It's
inevitable that weakening that guarantee would break peoples' queries,
probably subtly.You mean, queries against the stats views, or queries in general? If
the latter, by what mechanism would the breakage happen?Queries against the stats views, of course.
Hmm. Those are probably rare. If we only took a snapshot of the
statistics for the backends that explicitly access those views, that
probably wouldn't be too crazy.Sorry if this is a stupid question, but how often and for what purpose
do regular backends need the stats collector data for purposes other
than querying the stats views? I thought that the data was only used
to decide whether to VACUUM/ANALYZE, and therefore would be accessed
mostly by autovacuum, and for that you'd actually want the most
up-to-date view of the stats for a particular table that is available,
not any older snapshot.
Autovacuum resets the stats to make sure. Autovacuum in particular can
probably be made a lot more efficient, because it only ever looks at one
relation at a time, I think.
What I've been thinking about for that one before is if we could just
invent a protocol (shm_mq-based maybe) whereby autovacuum can ask the stats
collector for a single table or index stat. If autovacuum never needs to
see a consistent view between multiple tables, I would think that's going
to be a win in a lot of cases.
I don't think regular backends use them at all. But anybody looking at the
stats does, and it is pretty important there.
However, when it comes to the stats system, I'd say that on any busy system
(which would be the ones to care about), the stats structures are still
going to be *written* a lot more than they are read. We certainly don't
read them at the rate of once per transaction. A lot of the reads are also
limited to one database of course.
I wonder if we want to implement some sort of copy-on-read-snapshot in the
stats collector itself. So instead of unconditionally publishing
everything, have the backends ask for it. When a backend asks for it, it
gets a "snapshot counter" or something from the stats collector, and on the
next write after that we do a copy-write if the snapshot is still
available. (No, I have not thought this through in detail.)
Or -- if we keep a per-database hashtable in dynamic shared memory (which
we can now). Can we copy it into local memory in the backend fast enough
that we can hold a lock and just queue up the stats updates during the
copy? If we can copy the complete structure, that would fix one of the
bigger bottlenecks with it today which is that we dump and rebuild the
hashtables as we go through the tempfiles.
--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>
Magnus Hagander <magnus@hagander.net> writes:
What I've been thinking about for that one before is if we could just
invent a protocol (shmq based maybe) whereby autovacuum can ask the stats
collector for a single table or index stat. If autovacuum never needs to
see a consistent view between multiple tables, I would think that's going
to be a win in a lot of cases.
Perhaps. Autovac might run through quite a few tables before it finds
one in need of processing, though, so I'm not quite certain this would
yield such great benefits in isolation.
However, when it comes to the stats system, I'd say that on any busy system
(which would be the ones to care about), the stats structures are still
going to be *written* a lot more than they are read.
Uh, what? The stats collector doesn't write those files at all except
on-demand.
regards, tom lane
On Tue, Nov 28, 2017 at 12:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
What I've been thinking about for that one before is if we could just
invent a protocol (shmq based maybe) whereby autovacuum can ask the stats
collector for a single table or index stat. If autovacuum never needs to
see a consistent view between multiple tables, I would think that's going
to be a win in a lot of cases.Perhaps. Autovac might run through quite a few tables before it finds
one in need of processing, though, so I'm not quite certain this would
yield such great benefits in isolation.
Hmm. Good point.
However, when it comes to the stats system, I'd say that on any busy
system
(which would be the ones to care about), the stats structures are still
going to be *written* a lot more than they are read.Uh, what? The stats collector doesn't write those files at all except
on-demand.
Oops. Missing one important word. They're going to be *written to* a lot
more than they are read. Meaning that each individual value is likely to be
updated many times before it's ever read. In memory, in the stats
collector. So not talking about the files at all -- just the numbers,
independent of implementation.
--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>
At Mon, 27 Nov 2017 13:51:22 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmob2tuqvEZfHV2kLC-xobsZxDWGdc1WmjLg5+iOPLa0NHg@mail.gmail.com>
On Mon, Nov 27, 2017 at 1:49 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:Hmmm. Okay, we must make stats collector more effeicient if we
want to have additional counters with smaller significance in the
table stats. Currently sizeof(PgStat_StatTabEntry) is 168
bytes. The whole of the patchset increases it to 232 bytes. Thus
the size of a stat file for a database with 10000 tables
increases from about 1.7MB to 2.4MB. DSM and shared dynahash is
not dynamically expandable so placing stats on shared hash
doesn't seem effective. Stats as a regular table could work but
it seems too-much.dshash, which is already committed, is both DSM-based and dynamically
expandable.
Yes, I forgot about that. We can just copy memory blocks to take
a snapshot of stats.
Is it acceptable that adding a new section containing this new
counters, which is just loaded as a byte sequence and parsing
(and filling the corresponding hash) is postponed until a counter
in the section is really requested? The new counters need to be
shown in a separate stats view (maybe named pg_stat_vacuum).Still makes the stats file bigger.
I considered dshash for pgstat.c and the attached is a *PoC*
patch, which is not full-fledged and only works under a
not-so-concurrent situation.
- Made the stats collector an auxiliary process. A crash of the stats
collector leads to a whole-server restart.
- dshash lacks the capability of sequential scan, so I added it.
- Also added a snapshot function to dshash. It just copies the
underlying DSA segments into local memory, but currently it
doesn't acquire dshash-level locks at all. I tried the same
thing with resize, but it leads to very quick exhaustion of
LWLocks. An LWLock for the whole dshash would be required (and
it would also be useful for resize() and sequential scan).
- The current dshash doesn't shrink at all. Such a feature will also
be required. (A server restart causes a shrink of the hashes
in the same way as before, but a bloated dshash requires copying
more memory than necessary when taking a snapshot.)
The size of a DSA is about 1MB at minimum. Copying entry-by-entry
into a (non-ds) hash might be better than copying the underlying DSA as
a whole, and a DSA/dshash snapshot feels a bit dirty anyway.
Could anyone give me opinions or suggestions?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Mon, Dec 11, 2017 at 8:15 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Mon, 27 Nov 2017 13:51:22 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmob2tuqvEZfHV2kLC-xobsZxDWGdc1WmjLg5+iOPLa0NHg@mail.gmail.com>
On Mon, Nov 27, 2017 at 1:49 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:Hmmm. Okay, we must make stats collector more effeicient if we
want to have additional counters with smaller significance in the
table stats. Currently sizeof(PgStat_StatTabEntry) is 168
bytes. The whole of the patchset increases it to 232 bytes. Thus
the size of a stat file for a database with 10000 tables
increases from about 1.7MB to 2.4MB. DSM and shared dynahash is
not dynamically expandable so placing stats on shared hash
doesn't seem effective. Stats as a regular table could work but
it seems too-much.dshash, which is already committed, is both DSM-based and dynamically
expandable.Yes, I forgot about that. We can just copy memory blocks to take
a snapshot of stats.Is it acceptable that adding a new section containing this new
counters, which is just loaded as a byte sequence and parsing
(and filling the corresponding hash) is postponed until a counter
in the section is really requested? The new counters need to be
shown in a separate stats view (maybe named pg_stat_vacuum).Still makes the stats file bigger.
I considered dshash for pgstat.c and the attached is a *PoC*
patch, which is not full-fledged and just working under a not so
concurrent situation.- Made stats collector an auxiliary process. A crash of stats
collector leads to a whole-server restarting.- dshash lacks capability of sequential scan so added it.
- Also added snapshot function to dshash. It just copies
unerlying DSA segments into local memory but currently it
doesn't aquire dshash-level locks at all. I tried the same
thing with resize but it leads to very quick exhaustion of
LWLocks. An LWLock for the whole dshash would be required. (and
it is also useful to resize() and sequential scan.- The current dshash doesn't shrink at all. Such a feature also
will be required. (A server restart causes a shrink of hashes
in the same way as before but bloat dshash requires copying
more than necessary size of memory on takeing a snapshot.)The size of DSA is about 1MB at minimum. Copying entry-by-entry
into (non-ds) hash might be better than copying underlying DSA as
a whole, and DSA/DSHASH snapshot brings a kind of dirty..Does anyone give me opinions or suggestions?
The implementation of autovacuum work-item has been changed (by commit
31ae1638) because dynamic shared memory is not portable enough. IIUC
this patch is going to do a similar thing. Since the stats collector is
also a critical part of the server, should we consider whether we can
change it? Or is the portability problem not relevant to this patch?
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
At Tue, 6 Feb 2018 14:50:01 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCRn6Q0wGG7UwGVsQJZbocNsRaZByJomUy+-GRkVH-i9A@mail.gmail.com>
On Mon, Dec 11, 2017 at 8:15 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:I considered dshash for pgstat.c and the attached is a *PoC*
patch, which is not full-fledged and just working under a not so
concurrent situation.- Made stats collector an auxiliary process. A crash of stats
collector leads to a whole-server restarting.- dshash lacks capability of sequential scan so added it.
- Also added snapshot function to dshash. It just copies
unerlying DSA segments into local memory but currently it
doesn't aquire dshash-level locks at all. I tried the same
thing with resize but it leads to very quick exhaustion of
LWLocks. An LWLock for the whole dshash would be required. (and
it is also useful to resize() and sequential scan.- The current dshash doesn't shrink at all. Such a feature also
will be required. (A server restart causes a shrink of hashes
in the same way as before but bloat dshash requires copying
more than necessary size of memory on takeing a snapshot.)The size of DSA is about 1MB at minimum. Copying entry-by-entry
into (non-ds) hash might be better than copying underlying DSA as
a whole, and DSA/DSHASH snapshot brings a kind of dirty..Does anyone give me opinions or suggestions?
The implementation of autovacuum work-item has been changed (by commit
31ae1638) because dynamic shared memory is not portable enough. IIUC
this patch is going to do the similar thing. Since stats collector is
also a critical part of the server, should we consider whether we can
change it? Or the portability problem is not relevant with this patch?
Thank you for the pointer. I dug out the following thread from
this, and the patch seems to be the consequence of that
discussion. I'll study it and think about what to do here.
/messages/by-id/20170814005656.d5tvz464qkmz66tq@alap3.anarazel.de
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 06 Feb 2018 19:24:37 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180206.192437.229464841.horiguchi.kyotaro@lab.ntt.co.jp>
At Tue, 6 Feb 2018 14:50:01 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCRn6Q0wGG7UwGVsQJZbocNsRaZByJomUy+-GRkVH-i9A@mail.gmail.com>
The implementation of autovacuum work-item has been changed (by commit
31ae1638) because dynamic shared memory is not portable enough. IIUC
this patch is going to do the similar thing. Since stats collector is
also a critical part of the server, should we consider whether we can
change it? Or the portability problem is not relevant with this patch?Thank you for the pointer. I digged out the following thread from
this and the the patch seems to be the consequence of the
discussion. I'll study it and think what to do on this./messages/by-id/20170814005656.d5tvz464qkmz66tq@alap3.anarazel.de
Done. The dominant reason for ripping it out was that the
work-items array was allocated in a fixed-size DSA segment at
process startup time and wouldn't be resized.
For that reason, it fails to run when
dynamic_shared_memory_type = none, and it is accompanied by
several cleanup complexities. The decision there was that we should go
for just using static shared memory rather than adding complexity
for nothing. If it really needs to be expandable in the future,
that would be the time to use DSA. (But we would still maintain
fallback machinery.)
Thinking about this, I believe this patch has a reason to use DSA
but needs some additional work.
- Fallback mechanism when dynamic_shared_memory_type = none.
This means that the old file-based stuff lives alongside the
DSA stuff. It is used when '= none' and on failure of the DSA
mechanism.
- Out-of-transaction cleanup stuff.
Something like the following patch:
/messages/by-id/20170814231638.x6vgnzlr7eww4bui@alvherre.pgsql
And as known problems:
- Make it use fewer LWLocks.
- A dshash shrink mechanism. We may need to find a way to detect
when shrinking is necessary.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Feb 6, 2018 at 8:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Based on the reason, it fails to run when
dynamic_shared_memory_type = none and it is accompanied by
several cleanup complexities. The decision there is we should go
for just using static shared memory rather than adding complexity
for nothing. If it really needs to be expandable in the future,
it's the time to use DSA. (But would still maintain a fallback
stuff.)
It seems to me that there was a thread where Tom proposed removing
support for dynamic_shared_memory_type = none. The main reason that I
included that option initially was because it seemed silly to risk
causing problems for users whose dynamic shared memory facilities
didn't work for the sake of a feature that, at the time (9.4), had no
in-core users.
But things have shifted a bit since then. We have had few complaints
about dynamic shared memory causing portability problems (except for
performance: apparently some implementations perform better than
others on some systems, and we need support for huge pages, but
neither of those things are a reason to disable it) and we now have
in-core use that is enabled by default. I suggest we remove support
for dynamic_shared_memory_type = none first, and see if we get any
complaints. If we don't, then future patches can rely on it being
present.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
It seems to me that there was a thread where Tom proposed removing
support for dynamic_shared_memory_type = none.
I think you're recalling <32138.1502675970@sss.pgh.pa.us>, wherein
I pointed out that
Whether that's worth the trouble is debatable. The current code
in initdb believes that every platform has some type of DSM support
(see choose_dsm_implementation). Nobody's complained about that,
and it certainly works on every buildfarm animal. So for all we know,
dynamic_shared_memory_type = none is broken already.
(That was in fact in the same thread Kyotaro-san just linked to about
reimplementing the stats collector.)
It's still true that we've no reason to believe there are any supported
platforms that haven't got some sort of DSM. Performance might be a
different question, of course ... but it's hard to believe that
transferring stats through DSM wouldn't be better than writing them
out to files.
I suggest we remove support for dynamic_shared_memory_type = none first,
and see if we get any complaints. If we don't, then future patches can
rely on it being present.
If we remove it in v11, it'd still be maybe a year from now before we'd
have much confidence from that alone that nobody cares. I think the lack
of complaints about it in 9.6 and 10 is a more useful data point.
regards, tom lane
Hello,
At Wed, 07 Feb 2018 16:59:20 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <3246.1518040760@sss.pgh.pa.us>
Robert Haas <robertmhaas@gmail.com> writes:
It seems to me that there was a thread where Tom proposed removing
support for dynamic_shared_memory_type = none.I think you're recalling <32138.1502675970@sss.pgh.pa.us>, wherein
I pointed out thatWhether that's worth the trouble is debatable. The current code
in initdb believes that every platform has some type of DSM support
(see choose_dsm_implementation). Nobody's complained about that,
and it certainly works on every buildfarm animal. So for all we know,
dynamic_shared_memory_type = none is broken already.(That was in fact in the same thread Kyotaro-san just linked to about
reimplementing the stats collector.)It's still true that we've no reason to believe there are any supported
platforms that haven't got some sort of DSM. Performance might be a
different question, of course ... but it's hard to believe that
transferring stats through DSM wouldn't be better than writing them
out to files.
Good to hear. Thanks.
I suggest we remove support for dynamic_shared_memory_type = none first,
and see if we get any complaints. If we don't, then future patches can
rely on it being present.If we remove it in v11, it'd still be maybe a year from now before we'd
have much confidence from that alone that nobody cares. I think the lack
of complaints about it in 9.6 and 10 is a more useful data point.
So that means we are assumed to be able to rely on the
existence of DSM at present, since for over a year we have had no
complaints despite the fact that DSM is silently turned on? And,
apart from that, are we ready to remove 'none' from the options of
dynamic_shared_memory_type right now?
If I may rely on DSM, the fallback stuff would not be required.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 08 Feb 2018 18:04:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180208.180415.112312013.horiguchi.kyotaro@lab.ntt.co.jp>
I suggest we remove support for dynamic_shared_memory_type = none first,
and see if we get any complaints. If we don't, then future patches can
rely on it being present.If we remove it in v11, it'd still be maybe a year from now before we'd
have much confidence from that alone that nobody cares. I think the lack
of complaints about it in 9.6 and 10 is a more useful data point.So that means that we are assumed to be able to rely on the
existence of DSM at the present since over a year we had no
complain despite the fact that DSM is silently turned on? And
apart from that we are ready to remove 'none' from the options of
dynamic_shared_memory_type right now?
I found the following commit related to this.
| commit d41ab71712a4457ed39d5471b23949872ac91def
| Author: Robert Haas <rhaas@postgresql.org>
| Date: Wed Oct 16 09:41:03 2013 -0400
|
| initdb: Suppress dynamic shared memory when probing for max_connections.
|
| This might not be the right long-term solution here, but it will
| hopefully turn the buildfarm green again.
|
| Oversight noted by Andres Freund
The discussion is found here.
/messages/by-id/CA+TgmoYHiiGrcvSSJhmbSEBMoF2zX_9_9rWd75Cwvu99YrDxew@mail.gmail.com
I suppose that the problem has not been resolved yet.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
At Thu, 08 Feb 2018 18:21:56 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180208.182156.96551245.horiguchi.kyotaro@lab.ntt.co.jp>
I suppose that the problem has not been resolved yet..
I found several bugs while studying this, but my conclusion here
is that the decision required is whether we regard the
unavailability of DSM as a fatal error, as we do for out of
memory. Maybe we can go in that direction, but just doing it
would certainly put some buildfarm animals (at least anole? smew is not
found.) out of commission.
I've not found the exact cause of the problem where regtest on the
bf animals always succeeded using sysv shmem but "postgres --boot"
run by initdb alone can fail using the same mechanism. But regtest
seems to continue working if initdb sets max_connections to 20 or
more. At least it succeeds for me with the values max_connections
= 20 and shared_buffers = 50MB on CentOS.
Finally, I'd like to propose the following.
- Kill dynamic_shared_memory_type = none right now.
* The server stops at startup if DSM is not available.
- Let initdb set max_connections = 20 as the fallback value in
that case (another proposed patch). Regression tests should
succeed with that.
If we are agreed on this, I will be able to go forward.
I would like to have opinions on this from experienced people.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center