Track the amount of time waiting due to cost_delay
Hi hackers,
During the last pgconf.dev I attended Robert's presentation about autovacuum, and
it reminded me of an idea I had some time ago: $SUBJECT
Please find attached a patch doing so, adding a new field ("time_delayed")
to the pg_stat_progress_vacuum view.
Currently one can change [autovacuum_]vacuum_cost_delay and
[autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
the changes on the vacuum duration: one could observe the variation in vacuum
duration, but the correlation to the changes is not accurate, as many other
factors could impact the vacuum duration (load on the system, I/O latency,...).
This new field reports the time that the vacuum has had to sleep due to cost delay:
it could be useful to 1) measure the impact of the current cost_delay and
cost_limit settings and 2) experiment with new values (and then help with
decision making for those parameters).
The patch is relatively small thanks to the work that has been done in
f1889729dd (to allow parallel workers to report to the leader).
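As a rough sketch of how one could watch the new field (column name per v1;
the value is in milliseconds):

```sql
-- Sketch, assuming the patched pg_stat_progress_vacuum with the
-- proposed time_delayed column (cumulative cost-delay sleep, in ms).
SELECT pid, relid::regclass AS relation, phase, time_delayed
FROM pg_stat_progress_vacuum;
```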
Looking forward to your feedback,
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Report-the-total-amount-of-time-that-vacuum-has-b.patch (text/x-diff)
On Mon, Jun 10, 2024 at 06:05:13AM +0000, Bertrand Drouvot wrote:
During the last pgconf.dev I attended Robert's presentation about autovacuum, and
it reminded me of an idea I had some time ago: $SUBJECT
This sounds like useful information to me. I wonder if we should also
surface the effective cost limit for each autovacuum worker.
Currently one can change [autovacuum_]vacuum_cost_delay and
[autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
the changes on the vacuum duration: one could observe the variation in vacuum
duration, but the correlation to the changes is not accurate, as many other
factors could impact the vacuum duration (load on the system, I/O latency,...).
IIUC you'd need to get information from both pg_stat_progress_vacuum and
pg_stat_activity in order to know what percentage of time was being spent
in cost delay. Is that how you'd expect for this to be used in practice?
 		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
 		pg_usleep(msec * 1000);
 		pgstat_report_wait_end();
+
+		/* Report the amount of time we slept */
+		if (VacuumSharedCostBalance != NULL)
+			pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
+		else
+			pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
Hm. Should we measure the actual time spent sleeping, or is a rough
estimate good enough? I believe pg_usleep() might return early (e.g., if
the process is signaled) or late, so this field could end up being
inaccurate, although probably not by much. If we're okay with millisecond
granularity, my first instinct is that what you've proposed is fine, but I
figured I'd bring it up anyway.
--
nathan
Hi,
On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
On Mon, Jun 10, 2024 at 06:05:13AM +0000, Bertrand Drouvot wrote:
During the last pgconf.dev I attended Robert's presentation about autovacuum, and
it reminded me of an idea I had some time ago: $SUBJECT

This sounds like useful information to me.
Thanks for looking at it!
I wonder if we should also
surface the effective cost limit for each autovacuum worker.
I'm not sure about it as I think that it could be misleading: one could query
pg_stat_progress_vacuum and conclude that the time_delayed they are seeing is
due to _this_ cost_limit. But that's not necessarily true, as the cost_limit could
have changed multiple times since the vacuum started. So, unless there is
frequent sampling of pg_stat_progress_vacuum, displaying the time_delayed and
the cost_limit together could be misleading IMHO.
Currently one can change [autovacuum_]vacuum_cost_delay and
[autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
the changes on the vacuum duration: one could observe the variation in vacuum
duration, but the correlation to the changes is not accurate, as many other
factors could impact the vacuum duration (load on the system, I/O latency,...).

IIUC you'd need to get information from both pg_stat_progress_vacuum and
pg_stat_activity in order to know what percentage of time was being spent
in cost delay. Is that how you'd expect for this to be used in practice?
Yeah, one could use a query such as:
select p.*, now() - a.xact_start as duration
from pg_stat_progress_vacuum p
join pg_stat_activity a using (pid)

for example. Worth providing an example somewhere in the docs?
 		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
 		pg_usleep(msec * 1000);
 		pgstat_report_wait_end();
+
+		/* Report the amount of time we slept */
+		if (VacuumSharedCostBalance != NULL)
+			pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
+		else
+			pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);

Hm. Should we measure the actual time spent sleeping, or is a rough
estimate good enough? I believe pg_usleep() might return early (e.g., if
the process is signaled) or late, so this field could end up being
inaccurate, although probably not by much. If we're okay with millisecond
granularity, my first instinct is that what you've proposed is fine, but I
figured I'd bring it up anyway.
Thanks for bringing that up! I had the same thought when writing the code and
came to the same conclusion. I think that's a good enough estimate, especially
during a long-running vacuum (which is probably the case where users care the
most).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Jun 10, 2024 at 05:48:22PM +0000, Bertrand Drouvot wrote:
On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
I wonder if we should also
surface the effective cost limit for each autovacuum worker.

I'm not sure about it as I think that it could be misleading: one could query
pg_stat_progress_vacuum and conclude that the time_delayed they are seeing is
due to _this_ cost_limit. But that's not necessarily true, as the cost_limit could
have changed multiple times since the vacuum started. So, unless there is
frequent sampling of pg_stat_progress_vacuum, displaying the time_delayed and
the cost_limit together could be misleading IMHO.
Well, that's true for the delay, too, right (at least as of commit
7d71d3d)?
--
nathan
This sounds like useful information to me.
Thanks for looking at it!
The VacuumDelay wait event is the only visibility currently available to
gauge the cost_delay. Having this information
advertised by pg_stat_progress_vacuum as is being proposed
is much better. However, I also think that the
"number of times" the vacuum went into delay will be needed
as well. Both values will be useful to tune cost_delay and cost_limit.
It may also make sense to accumulate the total_time in delay
and the number of times delayed in a cumulative statistics [0]
view to allow a user to trend this information over time.
I don't think this info fits in any of the existing views, i.e.
pg_stat_database, so maybe a new view for cumulative
vacuum stats may be needed. This is likely a separate
discussion, but calling it out here.
IIUC you'd need to get information from both pg_stat_progress_vacuum and
pg_stat_activity in order to know what percentage of time was being spent
in cost delay. Is that how you'd expect for this to be used in practice?
Yeah, one could use a query such as:
select p.*, now() - a.xact_start as duration
from pg_stat_progress_vacuum p
join pg_stat_activity a using (pid)
Maybe all progress views should just expose "beentry->st_activity_start_timestamp"
to let the user know when the current operation began.
Regards,
Sami Imseih
Amazon Web Services (AWS)
[0]: https://www.postgresql.org/docs/current/monitoring-stats.html
On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
Hm. Should we measure the actual time spent sleeping, or is a rough
estimate good enough? I believe pg_usleep() might return early (e.g., if
the process is signaled) or late, so this field could end up being
inaccurate, although probably not by much. If we're okay with millisecond
granularity, my first instinct is that what you've proposed is fine, but I
figured I'd bring it up anyway.
I bet you could also sleep for longer than planned, throwing the
numbers off in the other direction.
I'm always suspicious of this sort of thing. I tend to find nothing
gives me the right answer unless I assume that the actual sleep times
are randomly and systematically different from the intended sleep
times by arbitrarily large amounts. I think we should at least do
some testing: if we measure both the intended sleep time and the
actual sleep time, how close are they? Does it change if the system is
under crushing load (which might elongate sleeps) or if we spam
SIGUSR1 against the vacuum process (which might shorten them)?
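As a crude first check from psql (not a substitute for instrumenting
pg_usleep() in the vacuum code itself), one could compare an intended
sleep against the wall-clock time psql reports:

```sql
-- Sketch: with \timing on, compare the reported "Time:" for a
-- 20 ms pg_sleep() against the intended 20 ms.
\timing on
SELECT pg_sleep(0.020);
```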
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Jun 10, 2024 at 02:20:16PM -0500, Nathan Bossart wrote:
On Mon, Jun 10, 2024 at 05:48:22PM +0000, Bertrand Drouvot wrote:
On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
I wonder if we should also
surface the effective cost limit for each autovacuum worker.

I'm not sure about it as I think that it could be misleading: one could query
pg_stat_progress_vacuum and conclude that the time_delayed they are seeing is
due to _this_ cost_limit. But that's not necessarily true, as the cost_limit could
have changed multiple times since the vacuum started. So, unless there is
frequent sampling of pg_stat_progress_vacuum, displaying the time_delayed and
the cost_limit together could be misleading IMHO.

Well, that's true for the delay, too, right (at least as of commit
7d71d3d)?
Yeah right, but the patch exposes the total amount of time the vacuum has
been delayed (not the cost_delay per se), which does not sound misleading to me.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Mon, Jun 10, 2024 at 08:12:46PM +0000, Imseih (AWS), Sami wrote:
This sounds like useful information to me.
Thanks for looking at it!
The VacuumDelay is the only visibility available to
gauge the cost_delay. Having this information
advertised by pg_stat_progress_vacuum as is being proposed
is much better.
Thanks for looking at it!
However, I also think that the
"number of times" the vacuum went into delay will be needed
as well. Both values will be useful to tune cost_delay and cost_limit.
Yeah, I think that's a good idea. With v1 one could figure out how many times
the delay has been triggered, but that does not work anymore if: 1) cost_delay
changed during the vacuum run or 2) the patch changes the way time_delayed
is measured/reported (i.e., reports the actual wait time and not the theoretical
time as v1 does).
It may also make sense to accumulate the total_time in delay
and the number of times delayed in a cumulative statistics [0]
view to allow a user to trend this information over time.
I don't think this info fits in any of the existing views, i.e.
pg_stat_database, so maybe a new view for cumulative
vacuum stats may be needed. This is likely a separate
discussion, but calling it out here.
+1
IIUC you'd need to get information from both pg_stat_progress_vacuum and
pg_stat_activity in order to know what percentage of time was being spent
in cost delay. Is that how you'd expect for this to be used in practice?

Yeah, one could use a query such as:

select p.*, now() - a.xact_start as duration
from pg_stat_progress_vacuum p
join pg_stat_activity a using (pid)
Maybe all progress views should just expose "beentry->st_activity_start_timestamp"
to let the user know when the current operation began.
Yeah maybe, I think this is likely a separate discussion too, thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Mon, Jun 10, 2024 at 3:05 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi hackers,
During the last pgconf.dev I attended Robert's presentation about autovacuum, and
it reminded me of an idea I had some time ago: $SUBJECT

Please find attached a patch doing so, adding a new field ("time_delayed")
to the pg_stat_progress_vacuum view.

Currently one can change [autovacuum_]vacuum_cost_delay and
[autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
the changes on the vacuum duration: one could observe the variation in vacuum
duration, but the correlation to the changes is not accurate, as many other
factors could impact the vacuum duration (load on the system, I/O latency,...).

This new field reports the time that the vacuum has had to sleep due to cost delay:
it could be useful to 1) measure the impact of the current cost_delay and
cost_limit settings and 2) experiment with new values (and then help with
decision making for those parameters).

The patch is relatively small thanks to the work that has been done in
f1889729dd (to allow parallel workers to report to the leader).
Thank you for the proposal and the patch. I understand the motivation
of this patch. Beside the point Nathan mentioned, I'm slightly worried
that massive parallel messages could be sent to the leader process
when the cost_limit value is low.
FWIW when I want to confirm the vacuum delay effect, I often use the
information from the DEBUG2 log message in VacuumUpdateCosts()
function. Exposing these data (per-worker dobalance, cost_limit,
cost_delay, active, and failsafe) somewhere in a view might also be
helpful for users for checking vacuum delay effects. It doesn't mean
to measure the impact of the changes on the vacuum duration, though.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi,
On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote:
On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

Hm. Should we measure the actual time spent sleeping, or is a rough
estimate good enough? I believe pg_usleep() might return early (e.g., if
the process is signaled) or late, so this field could end up being
inaccurate, although probably not by much. If we're okay with millisecond
granularity, my first instinct is that what you've proposed is fine, but I
figured I'd bring it up anyway.

I bet you could also sleep for longer than planned, throwing the
numbers off in the other direction.
Thanks for looking at it! Agree, that's how I read "or late" from Nathan's
comment above.
I'm always suspicious of this sort of thing. I tend to find nothing
gives me the right answer unless I assume that the actual sleep times
are randomly and systematically different from the intended sleep
times by arbitrarily large amounts. I think we should at least do
some testing: if we measure both the intended sleep time and the
actual sleep time, how close are they? Does it change if the system is
under crushing load (which might elongate sleeps) or if we spam
SIGUSR1 against the vacuum process (which might shorten them)?
OTOH Sami proposed in [1] to count the number of times the vacuum went into
delay. I think that's a good idea. His idea makes me think that (in addition to
the number of waits) it would make sense to measure the "actual" sleep time
(and not the intended one), so that one could measure the difference between
the intended wait time (number of waits * cost delay (if it does not change
during the vacuum run)) and the actual measured wait time.
So I think that in v2 we could: 1) measure the actual wait time instead, 2)
count the number of times the vacuum slept. We could also 3) report the
effective cost limit (as proposed by Nathan up-thread) (I think that 3) could
be misleading but I'll yield to majority opinion if people think it's not).
Thoughts?
[1]: /messages/by-id/A0935130-7C4B-4094-B6E4-C7D5086D50EF@amazon.com
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote:
Thank you for the proposal and the patch. I understand the motivation
of this patch.
Thanks for looking at it!
Beside the point Nathan mentioned, I'm slightly worried
that massive parallel messages could be sent to the leader process
when the cost_limit value is low.
I see, I can/will do some testing in this area and share the numbers.
FWIW when I want to confirm the vacuum delay effect, I often use the
information from the DEBUG2 log message in VacuumUpdateCosts()
function. Exposing these data (per-worker dobalance, cost_limit,
cost_delay, active, and failsafe) somewhere in a view might also be
helpful for users for checking vacuum delay effects.
Do you mean add time_delayed in pg_stat_progress_vacuum and cost_limit + the
other data you mentioned above in another dedicated view?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote:
I'm always suspicious of this sort of thing. I tend to find nothing
gives me the right answer unless I assume that the actual sleep times
are randomly and systematically different from the intended sleep
times but arbitrarily large amounts. I think we should at least do
some testing: if we measure both the intended sleep time and the
actual sleep time, how close are they? Does it change if the system is
under crushing load (which might elongate sleeps) or if we spam
SIGUSR1 against the vacuum process (which might shorten them)?
Though I (now) think that it would make sense to record the actual delay time
instead (see [1]), I think it's interesting to do some testing as you suggested.
With record_actual_time.txt (attached) applied on top of v1, we can see the
intended and actual wait time.
On my system, "no load at all" except the vacuum running, I see no diff:
Tue Jun 11 09:22:06 2024 (every 1s)
pid | relid | phase | time_delayed | actual_time_delayed | duration
-------+-------+---------------+--------------+---------------------+-----------------
54754 | 16385 | scanning heap | 41107 | 41107 | 00:00:42.301851
(1 row)
Tue Jun 11 09:22:07 2024 (every 1s)
pid | relid | phase | time_delayed | actual_time_delayed | duration
-------+-------+---------------+--------------+---------------------+-----------------
54754 | 16385 | scanning heap | 42076 | 42076 | 00:00:43.301848
(1 row)
Tue Jun 11 09:22:08 2024 (every 1s)
pid | relid | phase | time_delayed | actual_time_delayed | duration
-------+-------+---------------+--------------+---------------------+-----------------
54754 | 16385 | scanning heap | 43045 | 43045 | 00:00:44.301854
(1 row)
But if I launch pg_reload_conf() 10 times in a row, I can see:
Tue Jun 11 09:22:09 2024 (every 1s)
pid | relid | phase | time_delayed | actual_time_delayed | duration
-------+-------+---------------+--------------+---------------------+-----------------
54754 | 16385 | scanning heap | 44064 | 44034 | 00:00:45.302965
(1 row)
Tue Jun 11 09:22:10 2024 (every 1s)
pid | relid | phase | time_delayed | actual_time_delayed | duration
-------+-------+---------------+--------------+---------------------+-----------------
54754 | 16385 | scanning heap | 45033 | 45003 | 00:00:46.301858
As we can see the actual wait time is 30ms less than the intended wait time with
this simple test. So I still think we should go with 1) actual wait time and 2)
report the number of waits (as mentioned in [1]). Does that make sense to you?
[1]: /messages/by-id/Zmf712A5xcOM9Hlg@ip-10-97-1-34.eu-west-3.compute.internal
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
record_actual_time.txt (text/plain)
On Tue, Jun 11, 2024 at 07:25:11AM +0000, Bertrand Drouvot wrote:
So I think that in v2 we could: 1) measure the actual wait time instead, 2)
count the number of times the vacuum slept. We could also 3) report the
effective cost limit (as proposed by Nathan up-thread) (I think that 3) could
be misleading but I'll yield to majority opinion if people think it's not).
I still think the effective cost limit would be useful, if for no other
reason than to help reinforce that it is distributed among the autovacuum
workers. We could document that this value may change over the lifetime of
a worker to help avoid misleading folks.
--
nathan
On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
As we can see the actual wait time is 30ms less than the intended wait time with
this simple test. So I still think we should go with 1) actual wait time and 2)
report the number of waits (as mentioned in [1]). Does that make sense to you?
I like the idea of reporting the actual wait time better, provided
that we verify that doing so isn't too expensive. I think it probably
isn't, because in a long-running VACUUM there is likely to be disk
I/O, so the CPU overhead of a few extra gettimeofday() calls should be
fairly low by comparison. I wonder if there's a noticeable hit when
everything is in-memory. I guess probably not, because with any sort
of normal configuration, we shouldn't be delaying after every block we
process, so the cost of those gettimeofday() calls should still be
getting spread across quite a bit of real work.
That said, I'm not sure this experiment shows a real problem with the
idea of showing intended wait time. It does establish the concept that
repeated signals can throw our numbers off, but 30ms isn't much of a
discrepancy. I'm worried about being off by a factor of two, or an
order of magnitude. I think we still don't know if that can happen,
but if we're going to show actual wait time anyway, then we don't need
to explore the problems with other hypothetical systems too much.
I'm not convinced that reporting the number of waits is useful. If we
were going to report a possibly-inaccurate amount of actual waiting,
then also reporting the number of waits might make it easier to figure
out when the possibly-inaccurate number was in fact inaccurate. But I
think it's way better to report an accurate amount of actual waiting,
and then I'm not sure what we gain by also reporting the number of
waits.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 6/11/24 13:13, Robert Haas wrote:
On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:

As we can see the actual wait time is 30ms less than the intended wait time with
this simple test. So I still think we should go with 1) actual wait time and 2)
report the number of waits (as mentioned in [1]). Does that make sense to you?

I like the idea of reporting the actual wait time better, provided
that we verify that doing so isn't too expensive. I think it probably
isn't, because in a long-running VACUUM there is likely to be disk
I/O, so the CPU overhead of a few extra gettimeofday() calls should be
fairly low by comparison. I wonder if there's a noticeable hit when
everything is in-memory. I guess probably not, because with any sort
of normal configuration, we shouldn't be delaying after every block we
process, so the cost of those gettimeofday() calls should still be
getting spread across quite a bit of real work.
Does it even require a call to gettimeofday()? The code in vacuum
calculates an msec value and calls pg_usleep(msec * 1000). I don't think
it is necessary to measure how long that nap was.
Regards, Jan
I'm not convinced that reporting the number of waits is useful. If we
were going to report a possibly-inaccurate amount of actual waiting,
then also reporting the number of waits might make it easier to figure
out when the possibly-inaccurate number was in fact inaccurate. But I
think it's way better to report an accurate amount of actual waiting,
and then I'm not sure what we gain by also reporting the number of
waits.
I think including the number of times vacuum went into sleep
will help paint a full picture of the effect of tuning the vacuum_cost_delay
and vacuum_cost_limit for the user, even if we are reporting accurate
amounts of actual sleeping.
This is particularly true for autovacuum in which the cost limit is spread
across all autovacuum workers, and knowing how many times autovacuum
went to sleep will be useful along with the total time spent sleeping.
Regards,
Sami
On Tue, Jun 11, 2024 at 06:19:23PM +0000, Imseih (AWS), Sami wrote:
I'm not convinced that reporting the number of waits is useful. If we
were going to report a possibly-inaccurate amount of actual waiting,
then also reporting the number of waits might make it easier to figure
out when the possibly-inaccurate number was in fact inaccurate. But I
think it's way better to report an accurate amount of actual waiting,
and then I'm not sure what we gain by also reporting the number of
waits.

I think including the number of times vacuum went into sleep
will help paint a full picture of the effect of tuning the vacuum_cost_delay
and vacuum_cost_limit for the user, even if we are reporting accurate
amounts of actual sleeping.

This is particularly true for autovacuum in which the cost limit is spread
across all autovacuum workers, and knowing how many times autovacuum
went to sleep will be useful along with the total time spent sleeping.
I'm struggling to think of a scenario in which the number of waits would be
useful, assuming you already know the amount of time spent waiting. Even
if the number of waits is huge, it doesn't tell you much else AFAICT. I'd
be much more likely to adjust the cost settings based on the percentage of
time spent sleeping.
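For instance, something along these lines (joining the two views, and assuming
v1's time_delayed column in milliseconds) would give that percentage:

```sql
-- Sketch, assuming the proposed time_delayed column (milliseconds):
-- percentage of the vacuum's runtime spent sleeping in cost delay.
SELECT p.pid,
       p.time_delayed,
       now() - a.xact_start AS duration,
       round(p.time_delayed /
             (extract(epoch FROM (now() - a.xact_start)) * 1000) * 100, 2)
           AS pct_delayed
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a USING (pid);
```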
--
nathan
On Tue, Jun 11, 2024 at 2:47 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
I'm struggling to think of a scenario in which the number of waits would be
useful, assuming you already know the amount of time spent waiting. Even
if the number of waits is huge, it doesn't tell you much else AFAICT. I'd
be much more likely to adjust the cost settings based on the percentage of
time spent sleeping.
This is also how I see it.
--
Robert Haas
EDB: http://www.enterprisedb.com
I'm struggling to think of a scenario in which the number of waits would be
useful, assuming you already know the amount of time spent waiting. Even
if the number of waits is huge, it doesn't tell you much else AFAICT. I'd
be much more likely to adjust the cost settings based on the percentage of
time spent sleeping.
This is also how I see it.
I think it may be useful for a user to be able to answer the "average
sleep time" for a vacuum, especially because the vacuum cost
limit and delay can be adjusted on the fly for a running vacuum.
If we only show the total sleep time, the user could make wrong
assumptions about how long each sleep took and they might
assume that all sleep delays for a particular vacuum run have been
uniform in duration, when in fact they may not have been.
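Concretely, with both counters exposed (v1's time_delayed plus a hypothetical
num_delays counter of how many times the vacuum slept), the average is just
their ratio:

```sql
-- Sketch, assuming v1's time_delayed (ms) plus a hypothetical
-- num_delays counter; avg_sleep_ms is the average sleep duration.
SELECT pid,
       time_delayed,
       num_delays,
       time_delayed::numeric / nullif(num_delays, 0) AS avg_sleep_ms
FROM pg_stat_progress_vacuum;
```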
Regards,
Sami
Hi,
On Tue, Jun 11, 2024 at 11:40:36AM -0500, Nathan Bossart wrote:
On Tue, Jun 11, 2024 at 07:25:11AM +0000, Bertrand Drouvot wrote:
So I think that in v2 we could: 1) measure the actual wait time instead, 2)
count the number of times the vacuum slept. We could also 3) report the
effective cost limit (as proposed by Nathan up-thread) (I think that 3) could
be misleading but I'll yield to majority opinion if people think it's not).

I still think the effective cost limit would be useful, if for no other
reason than to help reinforce that it is distributed among the autovacuum
workers.
I also think it can be useful, my concern is more to put this information in
pg_stat_progress_vacuum. What about Sawada-san's proposal in [1]? (we could
create a new view that would contain those data: per-worker dobalance, cost_limit,
cost_delay, active, and failsafe).
[1]: /messages/by-id/CAD21AoDOu=DZcC+PemYmCNGSwbgL1s-5OZkZ1Spd5pSxofWNCw@mail.gmail.com
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com