Track the amount of time waiting due to cost_delay
Hi hackers,
During the last pgconf.dev I attended Robert's presentation about autovacuum and
it reminded me of an idea I had some time ago: $SUBJECT
Please find attached a patch doing so by adding a new field (named "time_delayed")
to the pg_stat_progress_vacuum view.
Currently one can change [autovacuum_]vacuum_cost_delay and
[autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
the changes on the vacuum duration: one could observe the vacuum duration
variation, but the correlation to the changes is not accurate (as many other
factors could impact the vacuum duration: load on the system, I/O latency,...).
This new field reports the time that the vacuum has to sleep due to cost delay.
It could be useful to 1) measure the impact of the current cost_delay and
cost_limit settings and 2) help when experimenting with new values (and thus aid
decision making for those parameters).
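For instance, once the patch is applied, the accumulated delay could be read
straight from the progress view; a minimal sketch (time_delayed being the column
the attached patch adds):

-- Sample the cumulative cost-delay sleep time (in milliseconds) of
-- currently running vacuums.
SELECT pid, relid, phase, time_delayed
FROM pg_stat_progress_vacuum;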
The patch is relatively small thanks to the work that has been done in
f1889729dd (to allow parallel workers to report to the leader).
Looking forward to your feedback,
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Report-the-total-amount-of-time-that-vacuum-has-b.patch (text/x-diff)
From 750dfc26cd6fcf5a5618c3fe35fc42d5b5c66f00 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Thu, 6 Jun 2024 12:35:57 +0000
Subject: [PATCH v1] Report the total amount of time that vacuum has been
delayed due to cost delay
This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.
This uses the new parallel message type for progress reporting added
by f1889729dd.
Bump catversion because this changes the definition of pg_stat_progress_vacuum.
---
doc/src/sgml/monitoring.sgml | 11 +++++++++++
src/backend/catalog/system_views.sql | 3 ++-
src/backend/commands/vacuum.c | 6 ++++++
src/include/catalog/catversion.h | 2 +-
src/include/commands/progress.h | 1 +
src/test/regress/expected/rules.out | 3 ++-
6 files changed, 23 insertions(+), 3 deletions(-)
47.6% doc/src/sgml/
3.7% src/backend/catalog/
26.8% src/backend/commands/
7.5% src/include/catalog/
4.1% src/include/commands/
10.0% src/test/regress/expected/
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 053da8d6e4..cdd0f0e533 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6290,6 +6290,17 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>time_delayed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total amount of time spent in milliseconds waiting due to <varname>vacuum_cost_delay</varname>
+ or <varname>autovacuum_vacuum_cost_delay</varname>. In case of parallel
+ vacuum the reported time is across all the workers and the leader.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 53047cab5f..1345e99dcb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1221,7 +1221,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
- S.param8 AS indexes_total, S.param9 AS indexes_processed
+ S.param8 AS indexes_total, S.param9 AS indexes_processed,
+ S.param10 AS time_delayed
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..2551408a86 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -2386,6 +2387,11 @@ vacuum_delay_point(void)
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
pg_usleep(msec * 1000);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ if (VacuumSharedCostBalance != NULL)
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
/*
* We don't want to ignore postmaster death during very long vacuums
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index f0809c0e58..40b4f1d1e4 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 202405161
+#define CATALOG_VERSION_NO 202406101
#endif
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 82a8fe6bd1..1fcefe9436 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -27,6 +27,7 @@
#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
#define PROGRESS_VACUUM_INDEXES_TOTAL 7
#define PROGRESS_VACUUM_INDEXES_PROCESSED 8
+#define PROGRESS_VACUUM_TIME_DELAYED 9
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ef658ad740..a499e44df1 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2053,7 +2053,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param6 AS max_dead_tuple_bytes,
s.param7 AS dead_tuple_bytes,
s.param8 AS indexes_total,
- s.param9 AS indexes_processed
+ s.param9 AS indexes_processed,
+ s.param10 AS time_delayed
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
On Mon, Jun 10, 2024 at 06:05:13AM +0000, Bertrand Drouvot wrote:
During the last pgconf.dev I attended Robert's presentation about autovacuum and
it reminded me of an idea I had some time ago: $SUBJECT
This sounds like useful information to me. I wonder if we should also
surface the effective cost limit for each autovacuum worker.
Currently one can change [autovacuum_]vacuum_cost_delay and
[autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
the changes on the vacuum duration: one could observe the vacuum duration
variation, but the correlation to the changes is not accurate (as many other
factors could impact the vacuum duration: load on the system, I/O latency,...).
IIUC you'd need to get information from both pg_stat_progress_vacuum and
pg_stat_activity in order to know what percentage of time was being spent
in cost delay. Is that how you'd expect this to be used in practice?
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
pg_usleep(msec * 1000);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ if (VacuumSharedCostBalance != NULL)
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
Hm. Should we measure the actual time spent sleeping, or is a rough
estimate good enough? I believe pg_usleep() might return early (e.g., if
the process is signaled) or late, so this field could end up being
inaccurate, although probably not by much. If we're okay with millisecond
granularity, my first instinct is that what you've proposed is fine, but I
figured I'd bring it up anyway.
--
nathan
Hi,
On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
On Mon, Jun 10, 2024 at 06:05:13AM +0000, Bertrand Drouvot wrote:
During the last pgconf.dev I attended Robert's presentation about autovacuum and
it reminded me of an idea I had some time ago: $SUBJECT
This sounds like useful information to me.
Thanks for looking at it!
I wonder if we should also
surface the effective cost limit for each autovacuum worker.
I'm not sure about it as I think that it could be misleading: one could query
pg_stat_progress_vacuum and conclude that the time_delayed they are seeing is
due to _this_ cost_limit. But that's not necessarily true, as the cost_limit could
have changed multiple times since the vacuum started. So, unless there is
frequent sampling of pg_stat_progress_vacuum, displaying the time_delayed and
the cost_limit could be misleading IMHO.
Currently one can change [autovacuum_]vacuum_cost_delay and
[autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
the changes on the vacuum duration: one could observe the vacuum duration
variation, but the correlation to the changes is not accurate (as many other
factors could impact the vacuum duration: load on the system, I/O latency,...).
IIUC you'd need to get information from both pg_stat_progress_vacuum and
pg_stat_activity in order to know what percentage of time was being spent
in cost delay. Is that how you'd expect this to be used in practice?
Yeah, one could use a query such as:
select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p JOIN pg_stat_activity a using (pid)
for example. Would it be worth providing an example somewhere in the docs?
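For instance, such a doc example could look like this (a sketch building on the
query above; it assumes time_delayed is in milliseconds, as in v1, and that the
vacuum runs in its own transaction so that xact_start reflects its start time):

-- Percentage of the vacuum's elapsed time spent sleeping due to cost delay.
SELECT p.pid,
       p.time_delayed,
       now() - a.xact_start AS duration,
       round((100 * p.time_delayed /
              (1000 * extract(epoch FROM now() - a.xact_start)))::numeric, 1)
         AS pct_delayed
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a USING (pid);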
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
pg_usleep(msec * 1000);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ if (VacuumSharedCostBalance != NULL)
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
Hm. Should we measure the actual time spent sleeping, or is a rough
estimate good enough? I believe pg_usleep() might return early (e.g., if
the process is signaled) or late, so this field could end up being
inaccurate, although probably not by much. If we're okay with millisecond
granularity, my first instinct is that what you've proposed is fine, but I
figured I'd bring it up anyway.
Thanks for bringing that up! I had the same thought when writing the code and
came to the same conclusion. I think that's a good enough estimation, especially
during a long-running vacuum (which is probably the case where users care the
most).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Jun 10, 2024 at 05:48:22PM +0000, Bertrand Drouvot wrote:
On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
I wonder if we should also
surface the effective cost limit for each autovacuum worker.
I'm not sure about it as I think that it could be misleading: one could query
pg_stat_progress_vacuum and conclude that the time_delayed they are seeing is
due to _this_ cost_limit. But that's not necessarily true, as the cost_limit could
have changed multiple times since the vacuum started. So, unless there is
frequent sampling of pg_stat_progress_vacuum, displaying the time_delayed and
the cost_limit could be misleading IMHO.
Well, that's true for the delay, too, right (at least as of commit
7d71d3d)?
--
nathan
This sounds like useful information to me.
Thanks for looking at it!
The VacuumDelay wait event is the only visibility currently available to
gauge the cost_delay. Having this information
advertised by pg_stat_progress_vacuum as is being proposed
is much better. However, I also think that the
"number of times" the vacuum went into delay will be needed
as well. Both values will be useful to tune cost_delay and cost_limit.
It may also make sense to accumulate the total_time in delay
and the number of times delayed in a cumulative statistics [0]
view to allow a user to trend this information over time.
I don't think this info fits in any of the existing views, i.e.
pg_stat_database, so maybe a new view for cumulative
vacuum stats may be needed. This is likely a separate
discussion, but calling it out here.
IIUC you'd need to get information from both pg_stat_progress_vacuum and
pg_stat_activity in order to know what percentage of time was being spent
in cost delay. Is that how you'd expect this to be used in practice?
Yeah, one could use a query such as:
select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p JOIN pg_stat_activity a using (pid)
Maybe all progress views should just expose "beentry->st_activity_start_timestamp"
to let the user know when the current operation began.
Regards,
Sami Imseih
Amazon Web Services (AWS)
[0]: https://www.postgresql.org/docs/current/monitoring-stats.html
On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
Hm. Should we measure the actual time spent sleeping, or is a rough
estimate good enough? I believe pg_usleep() might return early (e.g., if
the process is signaled) or late, so this field could end up being
inaccurate, although probably not by much. If we're okay with millisecond
granularity, my first instinct is that what you've proposed is fine, but I
figured I'd bring it up anyway.
I bet you could also sleep for longer than planned, throwing the
numbers off in the other direction.
I'm always suspicious of this sort of thing. I tend to find nothing
gives me the right answer unless I assume that the actual sleep times
are randomly and systematically different from the intended sleep
times by arbitrarily large amounts. I think we should at least do
some testing: if we measure both the intended sleep time and the
actual sleep time, how close are they? Does it change if the system is
under crushing load (which might elongate sleeps) or if we spam
SIGUSR1 against the vacuum process (which might shorten them)?
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Jun 10, 2024 at 02:20:16PM -0500, Nathan Bossart wrote:
On Mon, Jun 10, 2024 at 05:48:22PM +0000, Bertrand Drouvot wrote:
On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
I wonder if we should also
surface the effective cost limit for each autovacuum worker.
I'm not sure about it as I think that it could be misleading: one could query
pg_stat_progress_vacuum and conclude that the time_delayed they are seeing is
due to _this_ cost_limit. But that's not necessarily true, as the cost_limit could
have changed multiple times since the vacuum started. So, unless there is
frequent sampling of pg_stat_progress_vacuum, displaying the time_delayed and
the cost_limit could be misleading IMHO.
Well, that's true for the delay, too, right (at least as of commit
7d71d3d)?
Yeah right, but the patch exposes the total amount of time the vacuum has
been delayed (not the cost_delay per se), which does not sound misleading to me.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Mon, Jun 10, 2024 at 08:12:46PM +0000, Imseih (AWS), Sami wrote:
This sounds like useful information to me.
Thanks for looking at it!
The VacuumDelay wait event is the only visibility currently available to
gauge the cost_delay. Having this information
advertised by pg_stat_progress_vacuum as is being proposed
is much better.
Thanks for looking at it!
However, I also think that the
"number of times" the vacuum went into delay will be needed
as well. Both values will be useful to tune cost_delay and cost_limit.
Yeah, I think that's a good idea. With v1 one could figure out how many times
the delay has been triggered, but that does not work anymore if: 1) cost_delay
changed during the vacuum duration or 2) the patch changes the way time_delayed
is measured/reported (i.e., records the actual wait time and not the theoretical
time as v1 does).
It may also make sense to accumulate the total_time in delay
and the number of times delayed in a cumulative statistics [0]
view to allow a user to trend this information overtime.
I don't think this info fits in any of the existing views, i.e.
pg_stat_database, so maybe a new view for cumulative
vacuum stats may be needed. This is likely a separate
discussion, but calling it out here.
+1
IIUC you'd need to get information from both pg_stat_progress_vacuum and
pg_stat_activity in order to know what percentage of time was being spent
in cost delay. Is that how you'd expect this to be used in practice?
Yeah, one could use a query such as:
select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p JOIN pg_stat_activity a using (pid)
Maybe all progress views should just expose "beentry->st_activity_start_timestamp"
to let the user know when the current operation began.
Yeah maybe, I think this is likely a separate discussion too, thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Mon, Jun 10, 2024 at 3:05 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi hackers,
During the last pgconf.dev I attended Robert's presentation about autovacuum and
it reminded me of an idea I had some time ago: $SUBJECT
Please find attached a patch doing so by adding a new field (named "time_delayed")
to the pg_stat_progress_vacuum view.
Currently one can change [autovacuum_]vacuum_cost_delay and
[autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
the changes on the vacuum duration: one could observe the vacuum duration
variation, but the correlation to the changes is not accurate (as many other
factors could impact the vacuum duration: load on the system, I/O latency,...).
This new field reports the time that the vacuum has to sleep due to cost delay.
It could be useful to 1) measure the impact of the current cost_delay and
cost_limit settings and 2) help when experimenting with new values (and thus aid
decision making for those parameters).
The patch is relatively small thanks to the work that has been done in
f1889729dd (to allow parallel workers to report to the leader).
Thank you for the proposal and the patch. I understand the motivation
of this patch. Besides the point Nathan mentioned, I'm slightly worried
that massive parallel messages could be sent to the leader process
when the cost_limit value is low.
FWIW when I want to confirm the vacuum delay effect, I often use the
information from the DEBUG2 log message in the VacuumUpdateCosts()
function. Exposing these data (per-worker dobalance, cost_limit,
cost_delay, active, and failsafe) somewhere in a view might also be
helpful for users for checking vacuum delay effects. It wouldn't
measure the impact of the changes on the vacuum duration, though.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi,
On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote:
On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
Hm. Should we measure the actual time spent sleeping, or is a rough
estimate good enough? I believe pg_usleep() might return early (e.g., if
the process is signaled) or late, so this field could end up being
inaccurate, although probably not by much. If we're okay with millisecond
granularity, my first instinct is that what you've proposed is fine, but I
figured I'd bring it up anyway.
I bet you could also sleep for longer than planned, throwing the
numbers off in the other direction.
Thanks for looking at it! Agree, that's how I read "or late" from Nathan's
comment above.
I'm always suspicious of this sort of thing. I tend to find nothing
gives me the right answer unless I assume that the actual sleep times
are randomly and systematically different from the intended sleep
times by arbitrarily large amounts. I think we should at least do
some testing: if we measure both the intended sleep time and the
actual sleep time, how close are they? Does it change if the system is
under crushing load (which might elongate sleeps) or if we spam
SIGUSR1 against the vacuum process (which might shorten them)?
OTOH Sami proposed in [1] to count the number of times the vacuum went into
delay. I think that's a good idea. His idea makes me think that (in addition to
the number of waits) it would make sense to measure the "actual" sleep time
(and not the intended one), so that one could measure the difference between
the intended wait time (number of waits * cost_delay, should the cost_delay not
change during the vacuum duration) and the actual measured wait time.
So I think that in v2 we could: 1) measure the actual wait time instead, 2)
count the number of times the vacuum slept. We could also 3) report the
effective cost limit (as proposed by Nathan up-thread) (I think that 3) could
be misleading but I'll yield to majority opinion if people think it's not).
Thoughts?
[1]: /messages/by-id/A0935130-7C4B-4094-B6E4-C7D5086D50EF@amazon.com
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote:
Thank you for the proposal and the patch. I understand the motivation
of this patch.
Thanks for looking at it!
Besides the point Nathan mentioned, I'm slightly worried
that massive parallel messages could be sent to the leader process
when the cost_limit value is low.
I see, I can/will do some testing in this area and share the numbers.
FWIW when I want to confirm the vacuum delay effect, I often use the
information from the DEBUG2 log message in the VacuumUpdateCosts()
function. Exposing these data (per-worker dobalance, cost_limit,
cost_delay, active, and failsafe) somewhere in a view might also be
helpful for users for checking vacuum delay effects.
Do you mean add time_delayed in pg_stat_progress_vacuum and cost_limit + the
other data you mentioned above in another dedicated view?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote:
I'm always suspicious of this sort of thing. I tend to find nothing
gives me the right answer unless I assume that the actual sleep times
are randomly and systematically different from the intended sleep
times by arbitrarily large amounts. I think we should at least do
some testing: if we measure both the intended sleep time and the
actual sleep time, how close are they? Does it change if the system is
under crushing load (which might elongate sleeps) or if we spam
SIGUSR1 against the vacuum process (which might shorten them)?
Though I (now) think that it would make sense to record the actual delay time
instead (see [1]), I think it's interesting to do some testing as you suggested.
With record_actual_time.txt (attached) applied on top of v1, we can see the
intended and actual wait time.
On my system, "no load at all" except the vacuum running, I see no diff:
Tue Jun 11 09:22:06 2024 (every 1s)
pid | relid | phase | time_delayed | actual_time_delayed | duration
-------+-------+---------------+--------------+---------------------+-----------------
54754 | 16385 | scanning heap | 41107 | 41107 | 00:00:42.301851
(1 row)
Tue Jun 11 09:22:07 2024 (every 1s)
pid | relid | phase | time_delayed | actual_time_delayed | duration
-------+-------+---------------+--------------+---------------------+-----------------
54754 | 16385 | scanning heap | 42076 | 42076 | 00:00:43.301848
(1 row)
Tue Jun 11 09:22:08 2024 (every 1s)
pid | relid | phase | time_delayed | actual_time_delayed | duration
-------+-------+---------------+--------------+---------------------+-----------------
54754 | 16385 | scanning heap | 43045 | 43045 | 00:00:44.301854
(1 row)
But if I launch pg_reload_conf() 10 times in a row, I can see:
Tue Jun 11 09:22:09 2024 (every 1s)
pid | relid | phase | time_delayed | actual_time_delayed | duration
-------+-------+---------------+--------------+---------------------+-----------------
54754 | 16385 | scanning heap | 44064 | 44034 | 00:00:45.302965
(1 row)
Tue Jun 11 09:22:10 2024 (every 1s)
pid | relid | phase | time_delayed | actual_time_delayed | duration
-------+-------+---------------+--------------+---------------------+-----------------
54754 | 16385 | scanning heap | 45033 | 45003 | 00:00:46.301858
As we can see the actual wait time is 30ms less than the intended wait time with
this simple test. So I still think we should go with 1) actual wait time and 2)
report the number of waits (as mentioned in [1]). Does that make sense to you?
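With the record_actual_time.txt patch attached below, the drift is directly
observable, for instance with something like this (a sketch using the two
columns the experiment patch adds):

-- Intended minus actual sleep time; 30 ms in the test above.
SELECT pid, time_delayed - actual_time_delayed AS drift_ms
FROM pg_stat_progress_vacuum;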
[1]: /messages/by-id/Zmf712A5xcOM9Hlg@ip-10-97-1-34.eu-west-3.compute.internal
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
record_actual_time.txt (text/plain)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 1345e99dcb..e4ba8de00a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1222,7 +1222,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS indexes_total, S.param9 AS indexes_processed,
- S.param10 AS time_delayed
+ S.param10 AS time_delayed, S.param11 AS actual_time_delayed
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2551408a86..bbb5002efe 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2381,18 +2381,29 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+ INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+ INSTR_TIME_SET_CURRENT(delay_time);
pgstat_report_wait_end();
/* Report the amount of time we slept */
+ INSTR_TIME_SUBTRACT(delay_time, delay_start);
if (VacuumSharedCostBalance != NULL)
+ {
pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_ACTUAL_TIME_DELAYED, INSTR_TIME_GET_MILLISEC(delay_time));
+ }
else
+ {
pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
-
+ pgstat_progress_incr_param(PROGRESS_VACUUM_ACTUAL_TIME_DELAYED, INSTR_TIME_GET_MILLISEC(delay_time));
+ }
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 1fcefe9436..ec0efeec64 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_INDEXES_TOTAL 7
#define PROGRESS_VACUUM_INDEXES_PROCESSED 8
#define PROGRESS_VACUUM_TIME_DELAYED 9
+#define PROGRESS_VACUUM_ACTUAL_TIME_DELAYED 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a499e44df1..9dcc98e685 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2054,7 +2054,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS indexes_total,
s.param9 AS indexes_processed,
- s.param10 AS time_delayed
+ s.param10 AS time_delayed,
+ s.param11 AS actual_time_delayed
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
On Tue, Jun 11, 2024 at 07:25:11AM +0000, Bertrand Drouvot wrote:
So I think that in v2 we could: 1) measure the actual wait time instead, 2)
count the number of times the vacuum slept. We could also 3) report the
effective cost limit (as proposed by Nathan up-thread) (I think that 3) could
be misleading but I'll yield to majority opinion if people think it's not).
I still think the effective cost limit would be useful, if for no other
reason than to help reinforce that it is distributed among the autovacuum
workers. We could document that this value may change over the lifetime of
a worker to help avoid misleading folks.
--
nathan
On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
As we can see the actual wait time is 30ms less than the intended wait time with
this simple test. So I still think we should go with 1) actual wait time and 2)
report the number of waits (as mentioned in [1]). Does that make sense to you?
I like the idea of reporting the actual wait time better, provided
that we verify that doing so isn't too expensive. I think it probably
isn't, because in a long-running VACUUM there is likely to be disk
I/O, so the CPU overhead of a few extra gettimeofday() calls should be
fairly low by comparison. I wonder if there's a noticeable hit when
everything is in-memory. I guess probably not, because with any sort
of normal configuration, we shouldn't be delaying after every block we
process, so the cost of those gettimeofday() calls should still be
getting spread across quite a bit of real work.
That said, I'm not sure this experiment shows a real problem with the
idea of showing intended wait time. It does establish the concept that
repeated signals can throw our numbers off, but 30ms isn't much of a
discrepancy. I'm worried about being off by a factor of two, or an
order of magnitude. I think we still don't know if that can happen,
but if we're going to show actual wait time anyway, then we don't need
to explore the problems with other hypothetical systems too much.
I'm not convinced that reporting the number of waits is useful. If we
were going to report a possibly-inaccurate amount of actual waiting,
then also reporting the number of waits might make it easier to figure
out when the possibly-inaccurate number was in fact inaccurate. But I
think it's way better to report an accurate amount of actual waiting,
and then I'm not sure what we gain by also reporting the number of
waits.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 6/11/24 13:13, Robert Haas wrote:
On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
As we can see the actual wait time is 30ms less than the intended wait time with
this simple test. So I still think we should go with 1) actual wait time and 2)
report the number of waits (as mentioned in [1]). Does that make sense to you?
I like the idea of reporting the actual wait time better, provided
that we verify that doing so isn't too expensive. I think it probably
isn't, because in a long-running VACUUM there is likely to be disk
I/O, so the CPU overhead of a few extra gettimeofday() calls should be
fairly low by comparison. I wonder if there's a noticeable hit when
everything is in-memory. I guess probably not, because with any sort
of normal configuration, we shouldn't be delaying after every block we
process, so the cost of those gettimeofday() calls should still be
getting spread across quite a bit of real work.
Does it even require a call to gettimeofday()? The code in vacuum
calculates an msec value and calls pg_usleep(msec * 1000). I don't think
it is necessary to measure how long that nap was.
Regards, Jan
I'm not convinced that reporting the number of waits is useful. If we
were going to report a possibly-inaccurate amount of actual waiting,
then also reporting the number of waits might make it easier to figure
out when the possibly-inaccurate number was in fact inaccurate. But I
think it's way better to report an accurate amount of actual waiting,
and then I'm not sure what we gain by also reporting the number of
waits.
I think including the number of times vacuum went into sleep
will help paint a full picture of the effect of tuning the vacuum_cost_delay
and vacuum_cost_limit for the user, even if we are reporting accurate
amounts of actual sleeping.
This is particularly true for autovacuum in which the cost limit is spread
across all autovacuum workers, and knowing how many times autovacuum
went to sleep will be useful along with the total time spent sleeping.
Regards,
Sami
On Tue, Jun 11, 2024 at 06:19:23PM +0000, Imseih (AWS), Sami wrote:
I'm not convinced that reporting the number of waits is useful. If we
were going to report a possibly-inaccurate amount of actual waiting,
then also reporting the number of waits might make it easier to figure
out when the possibly-inaccurate number was in fact inaccurate. But I
think it's way better to report an accurate amount of actual waiting,
and then I'm not sure what we gain by also reporting the number of
waits.
I think including the number of times vacuum went into sleep
will help paint a full picture of the effect of tuning the vacuum_cost_delay
and vacuum_cost_limit for the user, even if we are reporting accurate
amounts of actual sleeping.
This is particularly true for autovacuum in which the cost limit is spread
across all autovacuum workers, and knowing how many times autovacuum
went to sleep will be useful along with the total time spent sleeping.
I'm struggling to think of a scenario in which the number of waits would be
useful, assuming you already know the amount of time spent waiting. Even
if the number of waits is huge, it doesn't tell you much else AFAICT. I'd
be much more likely to adjust the cost settings based on the percentage of
time spent sleeping.
--
nathan
On Tue, Jun 11, 2024 at 2:47 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
I'm struggling to think of a scenario in which the number of waits would be
useful, assuming you already know the amount of time spent waiting. Even
if the number of waits is huge, it doesn't tell you much else AFAICT. I'd
be much more likely to adjust the cost settings based on the percentage of
time spent sleeping.
This is also how I see it.
--
Robert Haas
EDB: http://www.enterprisedb.com
I'm struggling to think of a scenario in which the number of waits would be
useful, assuming you already know the amount of time spent waiting. Even
if the number of waits is huge, it doesn't tell you much else AFAICT. I'd
be much more likely to adjust the cost settings based on the percentage of
time spent sleeping.
This is also how I see it.
I think it may be useful for a user to be able to answer the "average
sleep time" question for a vacuum, especially because the vacuum cost
limit and delay can be adjusted on the fly for a running vacuum.
If we only show the total sleep time, the user could make wrong
assumptions about how long each sleep took: they might
assume that all sleep delays for a particular vacuum run have been
uniform in duration, when in fact they may not have been.
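As a sketch of the derived metric I have in mind (num_delays here is a
hypothetical counter column, not part of the attached patches):

-- Average duration of each cost-delay sleep; a shifting average would
-- reveal non-uniform sleeps (e.g. after an on-the-fly cost_delay change).
SELECT pid, time_delayed::numeric / NULLIF(num_delays, 0) AS avg_sleep_ms
FROM pg_stat_progress_vacuum;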
Regards,
Sami
Hi,
On Tue, Jun 11, 2024 at 11:40:36AM -0500, Nathan Bossart wrote:
On Tue, Jun 11, 2024 at 07:25:11AM +0000, Bertrand Drouvot wrote:
So I think that in v2 we could: 1) measure the actual wait time instead, 2)
count the number of times the vacuum slept. We could also 3) report the
effective cost limit (as proposed by Nathan up-thread) (I think that 3) could
be misleading but I'll yield to majority opinion if people think it's not).
I still think the effective cost limit would be useful, if for no other
reason than to help reinforce that it is distributed among the autovacuum
workers.
I also think it can be useful; my concern is more about putting this information in
pg_stat_progress_vacuum. What about Sawada-san's proposal in [1]? (we could
create a new view that would contain those data: per-worker dobalance, cost_limit,
cost_delay, active, and failsafe).
[1]: /messages/by-id/CAD21AoDOu=DZcC+PemYmCNGSwbgL1s-5OZkZ1Spd5pSxofWNCw@mail.gmail.com
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Tue, Jun 11, 2024 at 02:48:30PM -0400, Robert Haas wrote:
On Tue, Jun 11, 2024 at 2:47 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
I'm struggling to think of a scenario in which the number of waits would be
useful, assuming you already know the amount of time spent waiting.
If we provide the actual time spent waiting, providing the number of waits would
allow one to see whether there is a difference between the actual time and the
intended time (i.e., number of waits * cost_delay, should the cost_delay be the
same during the vacuum duration). That should trigger some thoughts if the
difference is large enough.
I think that what we are doing here is to somehow add instrumentation around the
"WAIT_EVENT_VACUUM_DELAY" wait event. If we were to add instrumentation for wait
events (generally speaking) we'd probably also expose the number of waits per
wait event (in addition to the time waited).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Tue, Jun 11, 2024 at 01:13:48PM -0400, Robert Haas wrote:
On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
As we can see the actual wait time is 30ms less than the intended wait time with
this simple test. So I still think we should go with 1) actual wait time and 2)
report the number of waits (as mentioned in [1]). Does that make sense to you?
I like the idea of reporting the actual wait time better,
+1
provided
that we verify that doing so isn't too expensive. I think it probably
isn't, because in a long-running VACUUM there is likely to be disk
I/O, so the CPU overhead of a few extra gettimeofday() calls should be
fairly low by comparison.
Agree.
I wonder if there's a noticeable hit when
everything is in-memory. I guess probably not, because with any sort
of normal configuration, we shouldn't be delaying after every block we
process, so the cost of those gettimeofday() calls should still be
getting spread across quite a bit of real work.
I did some testing, with:
shared_buffers = 12GB
vacuum_cost_delay = 1
autovacuum_vacuum_cost_delay = 1
max_parallel_maintenance_workers = 0
max_parallel_workers = 0
added to a default config file.
A table and all its indexes were fully in memory, the numbers are:
postgres=# SELECT n.nspname, c.relname, count(*) AS buffers
FROM pg_buffercache b JOIN pg_class c
ON b.relfilenode = pg_relation_filenode(c.oid) AND
b.reldatabase IN (0, (SELECT oid FROM pg_database
WHERE datname = current_database()))
JOIN pg_namespace n ON n.oid = c.relnamespace
GROUP BY n.nspname, c.relname
ORDER BY 3 DESC
LIMIT 11;
nspname | relname | buffers
---------+-------------------+---------
public | large_tbl | 222280
public | large_tbl_pkey | 5486
public | large_tbl_filler7 | 1859
public | large_tbl_filler4 | 1859
public | large_tbl_filler1 | 1859
public | large_tbl_filler6 | 1859
public | large_tbl_filler3 | 1859
public | large_tbl_filler2 | 1859
public | large_tbl_filler5 | 1859
public | large_tbl_filler8 | 1859
public | large_tbl_version | 1576
(11 rows)
The observed timings when vacuuming this table are:
On master:
vacuum phase: cumulative duration
---------------------------------
scanning heap: 00:00:37.808184
vacuuming indexes: 00:00:41.808176
vacuuming heap: 00:00:54.808156
On master patched with actual time delayed:
vacuum phase: cumulative duration
---------------------------------
scanning heap: 00:00:36.502104 (time_delayed: 22202)
vacuuming indexes: 00:00:41.002103 (time_delayed: 23769)
vacuuming heap: 00:00:54.302096 (time_delayed: 34886)
As we can see there is no noticeable degradation, while the vacuum entered this
instrumentation code path about 34886 times (cost_delay was set to 1).
That said, I'm not sure this experiment shows a real problem with the
idea of showing intended wait time. It does establish the concept that
repeated signals can throw our numbers off, but 30ms isn't much of a
discrepancy.
Yeah, the idea was just to show how easy it is to create a 30ms discrepancy.
I'm worried about being off by a factor of two, or an
order of magnitude. I think we still don't know if that can happen,
but if we're going to show actual wait time anyway, then we don't need
to explore the problems with other hypothetical systems too much.
Agree.
I'm not convinced that reporting the number of waits is useful. If we
were going to report a possibly-inaccurate amount of actual waiting,
then also reporting the number of waits might make it easier to figure
out when the possibly-inaccurate number was in fact inaccurate. But I
think it's way better to report an accurate amount of actual waiting,
and then I'm not sure what we gain by also reporting the number of
waits.
Sami shared his thoughts in [1] and [2] and so did I in [3]. If some of us still
don't think that reporting the number of waits is useful then we can probably
start without it.
[1]: /messages/by-id/0EA474B6-BF88-49AE-82CA-C1A9A3C17727@amazon.com
[2]: /messages/by-id/E12435E2-5FCA-49B0-9ADB-0E7153F95E2D@amazon.com
[3]: /messages/by-id/ZmmGG4e+qTBD2kfn@ip-10-97-1-34.eu-west-3.compute.internal
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Tue, Jun 11, 2024 at 08:26:23AM +0000, Bertrand Drouvot wrote:
Hi,
On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote:
Thank you for the proposal and the patch. I understand the motivation
of this patch.
Thanks for looking at it!
Besides the point Nathan mentioned, I'm slightly worried
that massive parallel messages could be sent to the leader process
when the cost_limit value is low.
I see, I can/will do some testing in this area and share the numbers.
Here is the result of the test. It has been launched several times and it
produced the same (surprising) result each time.
====================== Context ================================================
The testing has been done with this relation (large_tbl) and its indexes:
postgres=# SELECT n.nspname, c.relname, count(*) AS buffers
FROM pg_buffercache b JOIN pg_class c
ON b.relfilenode = pg_relation_filenode(c.oid) AND
b.reldatabase IN (0, (SELECT oid FROM pg_database
WHERE datname = current_database()))
JOIN pg_namespace n ON n.oid = c.relnamespace
GROUP BY n.nspname, c.relname
ORDER BY 3 DESC
LIMIT 22;
nspname | relname | buffers
---------+--------------------+---------
public | large_tbl | 222280
public | large_tbl_filler13 | 125000
public | large_tbl_filler6 | 125000
public | large_tbl_filler5 | 125000
public | large_tbl_filler3 | 125000
public | large_tbl_filler15 | 125000
public | large_tbl_filler4 | 125000
public | large_tbl_filler20 | 125000
public | large_tbl_filler18 | 125000
public | large_tbl_filler14 | 125000
public | large_tbl_filler8 | 125000
public | large_tbl_filler11 | 125000
public | large_tbl_filler19 | 125000
public | large_tbl_filler7 | 125000
public | large_tbl_filler1 | 125000
public | large_tbl_filler12 | 125000
public | large_tbl_filler9 | 125000
public | large_tbl_filler17 | 125000
public | large_tbl_filler16 | 125000
public | large_tbl_filler10 | 125000
public | large_tbl_filler2 | 125000
public | large_tbl_pkey | 5486
(22 rows)
All of them completely fit in memory (to avoid I/O read latency during the vacuum).
The config, outside of default is:
max_wal_size = 4GB
shared_buffers = 30GB
vacuum_cost_delay = 1
autovacuum_vacuum_cost_delay = 1
max_parallel_maintenance_workers = 8
max_parallel_workers = 10
vacuum_cost_limit = 10
autovacuum_vacuum_cost_limit = 10
My system is not overloaded, has enough resources to run this test and only this
test is running.
====================== Results ================================================
========== With v2 (attached) applied on master
postgres=# VACUUM (PARALLEL 8) large_tbl;
VACUUM
Time: 1146873.016 ms (19:06.873)
The duration is split as follows:
Vacuum phase: cumulative time (cumulative time delayed)
=======================================================
scanning heap: 00:08:16.414628 (time_delayed: 444370)
vacuuming indexes: 00:14:55.314699 (time_delayed: 2545293)
vacuuming heap: 00:19:06.814617 (time_delayed: 2767540)
I sampled active sessions from pg_stat_activity (one second interval), here is
the summary during the vacuuming indexes phase (ordered by count):
leader_pid | pid | wait_event | count
------------+--------+----------------+-------
452996 | 453225 | VacuumDelay | 366
452996 | 453223 | VacuumDelay | 363
452996 | 453226 | VacuumDelay | 362
452996 | 453224 | VacuumDelay | 361
452996 | 453222 | VacuumDelay | 359
452996 | 453221 | VacuumDelay | 359
| 452996 | VacuumDelay | 331
| 452996 | CPU | 30
452996 | 453224 | WALWriteLock | 23
452996 | 453222 | WALWriteLock | 20
452996 | 453226 | WALWriteLock | 20
452996 | 453221 | WALWriteLock | 19
| 452996 | WalSync | 18
452996 | 453225 | WALWriteLock | 18
452996 | 453223 | WALWriteLock | 16
| 452996 | WALWriteLock | 15
452996 | 453221 | CPU | 14
452996 | 453222 | CPU | 14
452996 | 453223 | CPU | 12
452996 | 453224 | CPU | 10
452996 | 453226 | CPU | 10
452996 | 453225 | CPU | 8
452996 | 453223 | WalSync | 4
452996 | 453221 | WalSync | 2
452996 | 453226 | WalWrite | 2
452996 | 453221 | WalWrite | 1
| 452996 | ParallelFinish | 1
452996 | 453224 | WalSync | 1
452996 | 453225 | WalSync | 1
452996 | 453222 | WalWrite | 1
452996 | 453225 | WalWrite | 1
452996 | 453222 | WalSync | 1
452996 | 453226 | WalSync | 1
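For reference, one-second-interval sampling like the above can be done with a
query along these lines (a sketch; aggregating the samples into the counts
shown is assumed to happen client-side):

-- One sample of the vacuum leader and its parallel workers; a NULL
-- wait_event is shown as CPU, matching the summaries here.
SELECT leader_pid, pid, coalesce(wait_event, 'CPU') AS wait_event
FROM pg_stat_activity
WHERE state = 'active'
  AND (query ILIKE 'VACUUM%' OR backend_type = 'parallel worker');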
========== On master (v2 not applied)
postgres=# VACUUM (PARALLEL 8) large_tbl;
VACUUM
Time: 1322598.087 ms (22:02.598)
Surprisingly it has been longer on master by about 3 minutes.
Let's see how the time is split:
Vacuum phase: cumulative time
=============================
scanning heap: 00:08:07.061196
vacuuming indexes: 00:17:50.961228
vacuuming heap: 00:22:02.561199
I sampled active sessions from pg_stat_activity (one second interval), here is
the summary during the vacuuming indexes phase (ordered by count):
leader_pid | pid | wait_event | count
------------+--------+-------------------+-------
468682 | 468858 | VacuumDelay | 548
468682 | 468862 | VacuumDelay | 547
468682 | 468859 | VacuumDelay | 547
468682 | 468860 | VacuumDelay | 545
468682 | 468857 | VacuumDelay | 543
468682 | 468861 | VacuumDelay | 542
| 468682 | VacuumDelay | 378
| 468682 | ParallelFinish | 182
468682 | 468861 | WALWriteLock | 19
468682 | 468857 | WALWriteLock | 19
468682 | 468859 | WALWriteLock | 18
468682 | 468858 | WALWriteLock | 16
468682 | 468860 | WALWriteLock | 15
468682 | 468862 | WALWriteLock | 15
468682 | 468862 | CPU | 12
468682 | 468857 | CPU | 10
468682 | 468859 | CPU | 10
468682 | 468861 | CPU | 10
| 468682 | CPU | 9
468682 | 468860 | CPU | 9
468682 | 468860 | WalSync | 8
| 468682 | WALWriteLock | 7
468682 | 468858 | WalSync | 6
468682 | 468858 | CPU | 6
468682 | 468862 | WalSync | 3
468682 | 468857 | WalSync | 3
468682 | 468861 | WalSync | 3
468682 | 468859 | WalSync | 2
468682 | 468861 | WalWrite | 2
468682 | 468857 | WalWrite | 1
468682 | 468858 | WalWrite | 1
468682 | 468861 | WALBufMappingLock | 1
468682 | 468857 | WALBufMappingLock | 1
| 468682 | WALBufMappingLock | 1
====================== Observations ===========================================
As compared to v2:
1. scanning heap time is about the same
2. vacuuming indexes time is about 3 minutes longer on master
3. vacuuming heap time is about the same
One difference we can see in the sampling is that on master the "ParallelFinish"
wait event has been sampled about 182 times for the leader (which could account
for _the_ 3 minutes of interest).
On master the vacuum indexes phase has been running between 2024-06-13 10:11:34
and 2024-06-13 10:21:15. If I extract the exact minutes and the counts for the
"ParallelFinish" wait event I get:
minute | wait_event | count
--------+----------------+-------
18 | ParallelFinish | 48
19 | ParallelFinish | 60
20 | ParallelFinish | 60
21 | ParallelFinish | 14
So it's likely that the leader waited on ParallelFinish during about 3 minutes
at the end of the vacuuming indexes phase (as this wait appeared during
consecutives samples).
====================== Conclusion =============================================
1. During the scanning heap and vacuuming heap phases no noticeable performance
degradation has been observed with v2 applied (as compared to master) (cc'ing
Robert as it's also related to his question about a noticeable hit when everything
is in-memory in [1]).
2. During the vacuuming indexes phase, v2 has been faster (as compare to master).
The reason is that on master the leader has been waiting during about 3 minutes
on "ParallelFinish" at the end.
====================== Remarks ================================================
As v2 is attached, please find below a summary about the current state of this
thread:
1. v2 implements time_delayed as the actual wait time (and not the intended wait
time as proposed in v1).
2. some measurements have been done to check the impact of this new
instrumentation (see this email and [2]): no noticeable performance degradation
has been observed (and surprisingly it's the opposite, as mentioned above).
3. there is an ongoing discussion about exposing the number of waits [2].
4. there is an ongoing discussion about exposing the effective cost limit [3].
5. it could be interesting to have a closer look at why the leader is waiting
for about 3 minutes on "ParallelFinish" on master and not with v2 applied (but
that's probably out of scope for this thread).
[1]: /messages/by-id/CA+TgmoZiC=zeCDYuMpB+Gb2yK=rTQCGMu0VoxehocKyHxr9Erg@mail.gmail.com
[2]: /messages/by-id/ZmmOOPwMFIltkdsN@ip-10-97-1-34.eu-west-3.compute.internal
[3]: /messages/by-id/Zml9+u37iS7DFkJL@ip-10-97-1-34.eu-west-3.compute.internal
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Report-the-total-amount-of-time-that-vacuum-has-b.patch (text/x-diff)
From 21eeab61c125a7ca4afccd3bc5961a1f060f0b9a Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Thu, 6 Jun 2024 12:35:57 +0000
Subject: [PATCH v2] Report the total amount of time that vacuum has been
delayed due to cost delay
This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.
This uses the new parallel message type for progress reporting added
by f1889729dd.
Bump catversion because this changes the definition of pg_stat_progress_vacuum.
---
doc/src/sgml/monitoring.sgml | 11 +++++++++++
src/backend/catalog/system_views.sql | 3 ++-
src/backend/commands/vacuum.c | 15 +++++++++++++++
src/include/catalog/catversion.h | 2 +-
src/include/commands/progress.h | 1 +
src/test/regress/expected/rules.out | 3 ++-
6 files changed, 32 insertions(+), 3 deletions(-)
37.5% doc/src/sgml/
42.3% src/backend/commands/
5.9% src/include/catalog/
3.2% src/include/commands/
7.9% src/test/regress/expected/
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 053da8d6e4..cdd0f0e533 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6290,6 +6290,17 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>time_delayed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total amount of time spent in milliseconds waiting due to <varname>vacuum_cost_delay</varname>
+ or <varname>autovacuum_vacuum_cost_delay</varname>. In case of parallel
+ vacuum the reported time is across all the workers and the leader.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 53047cab5f..1345e99dcb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1221,7 +1221,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
- S.param8 AS indexes_total, S.param9 AS indexes_processed
+ S.param8 AS indexes_total, S.param9 AS indexes_processed,
+ S.param10 AS time_delayed
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..5c40ee6e2c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -2380,13 +2381,27 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+ INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+ INSTR_TIME_SET_CURRENT(delay_time);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ INSTR_TIME_SUBTRACT(delay_time, delay_start);
+ if (VacuumSharedCostBalance != NULL)
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ INSTR_TIME_GET_MILLISEC(delay_time));
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ INSTR_TIME_GET_MILLISEC(delay_time));
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index f0809c0e58..40b4f1d1e4 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 202405161
+#define CATALOG_VERSION_NO 202406101
#endif
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 82a8fe6bd1..1fcefe9436 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -27,6 +27,7 @@
#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
#define PROGRESS_VACUUM_INDEXES_TOTAL 7
#define PROGRESS_VACUUM_INDEXES_PROCESSED 8
+#define PROGRESS_VACUUM_TIME_DELAYED 9
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ef658ad740..a499e44df1 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2053,7 +2053,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param6 AS max_dead_tuple_bytes,
s.param7 AS dead_tuple_bytes,
s.param8 AS indexes_total,
- s.param9 AS indexes_processed
+ s.param9 AS indexes_processed,
+ s.param10 AS time_delayed
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
Hi,
On Thu, Jun 13, 2024 at 11:56:26AM +0000, Bertrand Drouvot wrote:
====================== Observations ===========================================
As compared to v2:
1. scanning heap time is about the same
2. vacuuming indexes time is about 3 minutes longer on master
3. vacuuming heap time is about the same
I had a closer look to understand why the vacuuming indexes time was about
3 minutes longer on master.
During the vacuuming indexes phase, the leader helps vacuum the indexes
until it reaches WaitForParallelWorkersToFinish() (meaning that once all the
remaining indexes are being handled by the parallel workers, the leader has
nothing more to do and simply waits for the parallel workers to finish).
While the leader process is involved in index vacuuming, it is also subject
to waits due to cost delay.
But with v2 applied, the leader may be interrupted by the parallel workers while
it is waiting (due to the new pgstat_progress_parallel_incr_param() calls that
the patch adds).
So, with v2 applied, the leader waits less (as it is interrupted while waiting)
when involved in index vacuuming, and that's why v2 is "faster" than
master.
To put some numbers on this, I counted the number of times the leader's
pg_usleep() was interrupted (by counting the number of times nanosleep()
returned a value < 0 in pg_usleep()). Here they are:
v2: the leader has been interrupted about 342605 times
master: the leader has been interrupted about 36 times
The ones on master mainly come from the pgstat_progress_parallel_incr_param()
calls in parallel_vacuum_process_one_index().
The additional ones on v2 come from the pgstat_progress_parallel_incr_param()
calls added in vacuum_delay_point().
======== Conclusion ======
1. the vacuuming indexes time was longer on master because, with v2, the leader
was interrupted 342605 times while waiting, making v2 "faster".
2. the leader being interrupted while waiting already happens on master
due to the pgstat_progress_parallel_incr_param() calls in
parallel_vacuum_process_one_index() (added in 46ebdfe164). It happened "only"
36 times during my test case.
I think that 2. is less of a concern, but 1. is something that needs to be
addressed, because the leader process is not honouring its cost delay wait
time in a noticeable way (at least during my test case).
I have not thought of a proposal yet; I am just sharing my investigation as to
why v2 was faster than master during the vacuuming indexes phase.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Sat, Jun 22, 2024 at 12:48:33PM +0000, Bertrand Drouvot wrote:
1. the vacuuming indexes time was longer on master because, with v2, the leader
was interrupted 342605 times while waiting, making v2 "faster".
2. the leader being interrupted while waiting already happens on master
due to the pgstat_progress_parallel_incr_param() calls in
parallel_vacuum_process_one_index() (added in 46ebdfe164). It happened "only"
36 times during my test case.
I think that 2. is less of a concern, but 1. is something that needs to be
addressed, because the leader process is not honouring its cost delay wait
time in a noticeable way (at least during my test case).
I have not thought of a proposal yet; I am just sharing my investigation as to
why v2 was faster than master during the vacuuming indexes phase.
I think that a reasonable approach is to make the reporting from the parallel
workers to the leader less aggressive (i.e., occur less frequently).
Please find attached v3, that:
- ensures that there is at least 1 second between 2 reports, per parallel worker,
to the leader.
- ensures that the reported delayed time is still correct (keeps track of the
delayed time between 2 reports).
- does not add any extra pg_clock_gettime_ns() calls (as compared to v2).
Remarks:
1. Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable. I don't think that the number of parallel workers has
to come into play, as:
1.1) the more parallel workers are used, the smaller the impact of the leader on
the vacuum index phase duration/workload (because the repartition is done
on more processes).
1.2) the fewer parallel workers there are, the less the leader will be interrupted
(fewer parallel workers would report their delayed time).
2. The throttling is not based on the cost limit as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).
3. The throttling is not based on the actual cost delay value because the leader
could be interrupted at the beginning, the middle or any other part of the wait,
and we are more interested in the frequency of the interrupts.
4. A 1 second reporting "throttling" looks like a reasonable threshold as:
4.1 the idea is to have a significant impact when the leader could have been
interrupted say hundreds/thousands of times per second.
4.2 it does not make that much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok; see
the sampling sketch below).
With this approach in place (v3 attached applied), during my test case:
- the leader has been interrupted about 2500 times (instead of about 345000
times with v2)
- the vacuum index phase duration is very close to master's (it has been
4 seconds faster (over an 8 minutes 40 seconds duration), instead of 3
minutes faster with v2).
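As an illustration of that sampling granularity, here is a minimal sketch one
could run from psql while a vacuum is in progress (it assumes the patched
time_delayed column is in place):

SELECT pid, relid::regclass, phase, time_delayed
FROM pg_stat_progress_vacuum;
\watch 1

Successive samples then show, second by second, how much additional time the
vacuum has spent sleeping due to cost delay.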
Thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0001-Report-the-total-amount-of-time-that-vacuum-has-b.patchtext/x-diff; charset=us-asciiDownload
From 99f417c0bcd7c29e126fdccdd6214ea37db67379 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Jun 2024 08:43:26 +0000
Subject: [PATCH v3] Report the total amount of time that vacuum has been
delayed due to cost delay
This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.
This uses the new parallel message type for progress reporting added
by f1889729dd.
In case of parallel workers, to avoid the leader being interrupted too frequently
(while it might be sleeping for cost delay), the report is done only if the last
report was done more than 1 second ago.
Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.
Indeed, when deciding about the throttling:
1. The number of parallel workers should not come into play:
1.1) the more parallel workers are used, the smaller the impact of the leader on
the vacuum index phase duration/workload (because the repartition is done
on more processes).
1.2) the fewer parallel workers there are, the less the leader will be interrupted
(fewer parallel workers would report their delayed time).
2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).
3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the middle or any other part of the wait, and we are more
interested in the frequency of the interrupts.
4. A 1 second reporting "throttling" looks like a reasonable threshold as:
4.1 the idea is to have a significant impact when the leader could have been
interrupted say hundreds/thousands of times per second.
4.2 it does not make that much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).
Bump catversion because this changes the definition of pg_stat_progress_vacuum.
---
doc/src/sgml/monitoring.sgml | 11 +++++++
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 49 ++++++++++++++++++++++++++++
src/include/catalog/catversion.h | 2 +-
src/include/commands/progress.h | 1 +
src/test/regress/expected/rules.out | 3 +-
6 files changed, 65 insertions(+), 3 deletions(-)
19.7% doc/src/sgml/
4.4% src/backend/catalog/
66.6% src/backend/commands/
3.1% src/include/catalog/
4.2% src/test/regress/expected/
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b2ad9b446f..e9608fb6fe 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6299,6 +6299,17 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>time_delayed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total amount of time spent in milliseconds waiting due to <varname>vacuum_cost_delay</varname>
+ or <varname>autovacuum_vacuum_cost_delay</varname>. In case of parallel
+ vacuum the reported time is across all the workers and the leader.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index efb29adeb3..74b2ef12af 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1222,7 +1222,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed, S.param11 AS time_delayed
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..03470a450f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -60,6 +61,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum amount of time (in ms) between two reports of the delayed time from a
+ * parallel worker to the leader. The goal is to avoid the leader to be
+ * interrupted too frequently while it might be sleeping for cost delay.
+ */
+#define WORKER_REPORT_DELAY_INTERVAL 1000
/*
* GUC parameters
@@ -103,6 +110,16 @@ pg_atomic_uint32 *VacuumSharedCostBalance = NULL;
pg_atomic_uint32 *VacuumActiveNWorkers = NULL;
int VacuumCostBalanceLocal = 0;
+/*
+ * In case of parallel workers, the last time the delay has been reported to
+ * the leader.
+ * We assume this initializes to zero.
+ */
+static instr_time last_report_time;
+
+/* total nap time between two reports */
+double nap_time_since_last_report = 0;
+
/* non-export function prototypes */
static List *expand_vacuum_rel(VacuumRelation *vrel,
MemoryContext vac_context, int options);
@@ -2380,13 +2397,45 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_end;
+ instr_time delayed_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+ INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+ INSTR_TIME_SET_CURRENT(delay_end);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ INSTR_TIME_SET_ZERO(delayed_time);
+ INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start);
+
+ /* Parallel worker */
+ if (VacuumSharedCostBalance != NULL)
+ {
+ instr_time time_since_last_report;
+
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+ last_report_time);
+ nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+ if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ nap_time_since_last_report);
+ nap_time_since_last_report = 0;
+ last_report_time = delay_end;
+ }
+ }
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ INSTR_TIME_GET_MILLISEC(delayed_time));
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index b3322e8d67..752473a44e 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 202406171
+#define CATALOG_VERSION_NO 202406241
#endif
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..9a0c2358c6 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_TIME_DELAYED 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 13178e2b3d..54c8d9d042 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2054,7 +2054,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ s.param11 AS time_delayed
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
2. the leader being interrupted while waiting already happens on master
due to the pgstat_progress_parallel_incr_param() calls in
parallel_vacuum_process_one_index() (added in 46ebdfe164). It happened "only"
36 times during my test case.
46ebdfe164 will interrupt the leader's sleep every time a parallel worker reports
progress, and we currently don't handle interrupts by restarting the sleep with
the remaining time. nanosleep does provide the ability to restart with the
remaining time [1], but I don't think it's worth the effort to ensure more
accurate vacuum delays for the leader process.
[1]: https://man7.org/linux/man-pages/man2/nanosleep.2.html
1. Having a time based only approach to throttle
I do agree with a time based approach overall.
1.1) the more parallel workers is used, the less the impact of the leader on
the vacuum index phase duration/workload is (because the repartition is done
on more processes).
Did you mean " because the vacuum is done on more processes"?
When a leader is operating on a large index (or indexes) during the entirety
of the vacuum operation, wouldn't more parallel workers end up
interrupting the leader more often? This is why I think reporting even less
often than once per second (more below) would be better.
4. A 1 second reporting "throttling" looks like a reasonable threshold as:
4.1 the idea is to have a significant impact when the leader could have been
interrupted say hundreds/thousands of times per second.
4.2 it does not make that much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).
I feel 1 second may still be too frequent.
What about 10 seconds (or 30 seconds)?
I think this metric in particular will be mainly useful for vacuum runs that are
running for minutes or more, making reporting every 10 or 30 seconds
still useful.
It also just occurred to me that pgstat_progress_parallel_incr_param
should have a code comment noting that it will interrupt the leader process
and cause activity such as a sleep to end early.
Regards,
Sami Imseih
Amazon Web Services (AWS)
Hi,
On Tue, Jun 25, 2024 at 01:12:16AM +0000, Imseih (AWS), Sami wrote:
Thanks for the feedback!
2. the leader being interrupted while waiting already happens on master
due to the pgstat_progress_parallel_incr_param() calls in
parallel_vacuum_process_one_index() (added in 46ebdfe164). It happened "only"
36 times during my test case.
46ebdfe164 will interrupt the leader's sleep every time a parallel worker reports
progress, and we currently don't handle interrupts by restarting the sleep with
the remaining time. nanosleep does provide the ability to restart with the
remaining time [1], but I don't think it's worth the effort to ensure more
accurate vacuum delays for the leader process.
+1. I don't think it's necessary to have a 100% accurate delay for all the
times the delay is involved. I think that's a heuristic parameter (along with
the cost limit). What matters in the end is by how much you've been able to
pause the whole vacuum (not each sleep taken individually).
1. Having a time based only approach to throttle
I do agree with a time based approach overall.
1.1) the more parallel workers are used, the smaller the impact of the leader on
the vacuum index phase duration/workload (because the repartition is done
on more processes).
Did you mean "because the vacuum is done on more processes"?
Yes.
When a leader is operating on a large index (or indexes) during the entirety
of the vacuum operation, wouldn't more parallel workers end up
interrupting the leader more often?
That's right, but my point was about the impact on the "whole" duration time and
"whole" workload (leader + workers included) and not about the number of times the
leader is interrupted. If there are, say, 100 workers, then interrupting the leader
(1 process out of 101) is probably less of an issue, as it means that there is a
lot of work to be done to keep those 100 workers busy. I don't think the size of
the index the leader is vacuuming has an impact. I think that having the leader
vacuum one 100 GB index or 100 x 1GB indexes is the same (as long as all the
other workers are active during all that time).
4. A 1 second reporting "throttling" looks like a reasonable threshold as:
4.1 the idea is to have a significant impact when the leader could have been
interrupted say hundreds/thousands of times per second.
4.2 it does not make that much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).
I feel 1 second may still be too frequent.
Maybe we'll need more measurements, but this is what my test case is made of:
vacuum_cost_delay = 1
vacuum_cost_limit = 10
8 parallel workers, 1 leader
21 indexes (about 1GB each, one 40MB), all in memory
leading to:
With a 1 second reporting frequency, the leader has been interrupted about 2500
times over 8m39s, leading to about the same time as on master (8m43s).
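For reference, this test case could be approximated with something like the
following sketch (test_table is a hypothetical table assumed to carry the 21
indexes mentioned above):

SET vacuum_cost_delay = 1;
SET vacuum_cost_limit = 10;
SET max_parallel_maintenance_workers = 8;
VACUUM (PARALLEL 8, VERBOSE) test_table;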
What about 10 seconds (or 30 seconds)?
I'm not sure (we may need more measurements), but it would probably complicate
the reporting a bit (as with the current v3 we'd miss reporting the indexes that
take less time than the threshold to complete).
I think this metric in particular will be mainly useful for vacuum runs that are
running for minutes or more, making reporting every 10 or 30 seconds
still useful.
Agree. OTOH, one could be interested in diagnosing what happened during, say, a 5
second peak in I/O resource consumption/latency. Sampling pg_stat_progress_vacuum
at a 1 second interval and seeing by how much the vacuum has been paused during
that time could help too (especially if it is made of a lot of parallel workers
that could lead to a lot of I/O). But it would miss data if we were reporting at
a coarser rate.
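As a sketch of that kind of diagnosis (again assuming the patched time_delayed
column), one could collect 1 second samples and compute per-interval deltas:

CREATE TEMP TABLE vac_samples AS
SELECT clock_timestamp() AS ts, pid, time_delayed
FROM pg_stat_progress_vacuum;
-- repeat once per second (e.g. via \watch 1) while the vacuum runs
INSERT INTO vac_samples
SELECT clock_timestamp(), pid, time_delayed
FROM pg_stat_progress_vacuum;
-- milliseconds spent sleeping during each interval, per vacuum backend
SELECT pid, ts,
time_delayed - lag(time_delayed) OVER (PARTITION BY pid ORDER BY ts) AS ms_delayed
FROM vac_samples
ORDER BY pid, ts;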
It also just occurred to me that pgstat_progress_parallel_incr_param
should have a code comment noting that it will interrupt the leader process
and cause activity such as a sleep to end early.
Good point, I'll add a comment for it.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Jun 24, 2024 at 7:50 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
On Sat, Jun 22, 2024 at 12:48:33PM +0000, Bertrand Drouvot wrote:
1. the vacuuming indexes time was longer on master because, with v2, the leader
was interrupted 342605 times while waiting, making v2 "faster".
2. the leader being interrupted while waiting already happens on master
due to the pgstat_progress_parallel_incr_param() calls in
parallel_vacuum_process_one_index() (added in 46ebdfe164). It happened "only"
36 times during my test case.
I think that 2. is less of a concern, but 1. is something that needs to be
addressed, because the leader process is not honouring its cost delay wait
time in a noticeable way (at least during my test case).
I have not thought of a proposal yet; I am just sharing my investigation as to
why v2 was faster than master during the vacuuming indexes phase.
Thank you for the benchmarking and for analyzing the results! I agree with
your analysis and was surprised by the fact that the more times the
workers go to sleep, the more times the leader wakes up.
I think that a reasonable approach is to make the reporting from the parallel
workers to the leader less aggressive (i.e., occur less frequently).
Please find attached v3, that:
- ensures that there is at least 1 second between 2 reports, per parallel worker,
to the leader.
- ensures that the reported delayed time is still correct (keeps track of the
delayed time between 2 reports).
- does not add any extra pg_clock_gettime_ns() calls (as compared to v2).
Sounds good to me. I think it's better to keep the logic for
throttling the reporting of the delay message simple. It's an important
consideration, but executing parallel vacuum with delays is
less likely to be used in practice.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
46ebdfe164 will interrupt the leader's sleep every time a parallel worker reports
progress, and we currently don't handle interrupts by restarting the sleep with
the remaining time. nanosleep does provide the ability to restart with the
remaining time [1], but I don't think it's worth the effort to ensure more
accurate vacuum delays for the leader process.
After discussing offline with Bertrand, it may be better to have
a solution that deals with the interrupts and allows the sleep to continue to
completion. This would simplify this patch and would be useful
for other cases in which parallel workers need to send a message
to the leader. This is the thread [1] for that discussion.
[1]: /messages/by-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000@email.amazonses.com
Regards,
Sami
Hi,
On Fri, Jun 28, 2024 at 08:07:39PM +0000, Imseih (AWS), Sami wrote:
46ebdfe164 will interrupt the leader's sleep every time a parallel worker reports
progress, and we currently don't handle interrupts by restarting the sleep with
the remaining time. nanosleep does provide the ability to restart with the
remaining time [1], but I don't think it's worth the effort to ensure more
accurate vacuum delays for the leader process.
After discussing offline with Bertrand, it may be better to have
a solution that deals with the interrupts and allows the sleep to continue to
completion. This would simplify this patch and would be useful
for other cases in which parallel workers need to send a message
to the leader. This is the thread [1] for that discussion.
[1]: /messages/by-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000@email.amazonses.com
Yeah, I think it would make sense to put this thread on hold until we know more
about the outcome of [1] (mentioned above).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Mon, Jul 01, 2024 at 04:59:25AM +0000, Bertrand Drouvot wrote:
Hi,
On Fri, Jun 28, 2024 at 08:07:39PM +0000, Imseih (AWS), Sami wrote:
46ebdfe164 will interrupt the leader's sleep every time a parallel worker reports
progress, and we currently don't handle interrupts by restarting the sleep with
the remaining time. nanosleep does provide the ability to restart with the
remaining time [1], but I don't think it's worth the effort to ensure more
accurate vacuum delays for the leader process.
After discussing offline with Bertrand, it may be better to have
a solution that deals with the interrupts and allows the sleep to continue to
completion. This would simplify this patch and would be useful
for other cases in which parallel workers need to send a message
to the leader. This is the thread [1] for that discussion.
[1]: /messages/by-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000@email.amazonses.com
Yeah, I think it would make sense to put this thread on hold until we know more
about the outcome of [1] (mentioned above).
As it looks like we have a consensus not to wait on [0] (as reducing the number
of interrupts makes sense on its own), please find attached v4, a rebased
version (that also makes clear in the doc that the new field might show slightly
old values, as mentioned in [1]).
[0]: /messages/by-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000@email.amazonses.com
[1]: /messages/by-id/ZruMe-ppopQX4uP8@nathan
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v4-0001-Report-the-total-amount-of-time-that-vacuum-has-b.patchtext/x-diff; charset=us-asciiDownload
From 90196125d1262095d02f0df74bb6cab0d03c75ff Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Jun 2024 08:43:26 +0000
Subject: [PATCH v4] Report the total amount of time that vacuum has been
delayed due to cost delay
This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.
This uses the new parallel message type for progress reporting added
by f1889729dd.
In case of parallel workers, to avoid the leader being interrupted too frequently
(while it might be sleeping for cost delay), the report is done only if the last
report was done more than 1 second ago.
Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.
Indeed, when deciding about the throttling:
1. The number of parallel workers should not come into play:
1.1) the more parallel workers are used, the smaller the impact of the leader on
the vacuum index phase duration/workload (because the repartition is done
on more processes).
1.2) the fewer parallel workers there are, the less the leader will be interrupted
(fewer parallel workers would report their delayed time).
2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).
3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the middle or any other part of the wait, and we are more
interested in the frequency of the interrupts.
4. A 1 second reporting "throttling" looks like a reasonable threshold as:
4.1 the idea is to have a significant impact when the leader could have been
interrupted say hundreds/thousands of times per second.
4.2 it does not make that much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).
Bump catversion because this changes the definition of pg_stat_progress_vacuum.
---
doc/src/sgml/monitoring.sgml | 13 ++++++++
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 49 ++++++++++++++++++++++++++++
src/include/catalog/catversion.h | 2 +-
src/include/commands/progress.h | 1 +
src/test/regress/expected/rules.out | 3 +-
6 files changed, 67 insertions(+), 3 deletions(-)
23.5% doc/src/sgml/
4.2% src/backend/catalog/
63.4% src/backend/commands/
4.6% src/include/
4.0% src/test/regress/expected/
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 55417a6fa9..d87604331a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6307,6 +6307,19 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>time_delayed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total amount of time spent in milliseconds waiting due to <varname>vacuum_cost_delay</varname>
+ or <varname>autovacuum_vacuum_cost_delay</varname>. In case of parallel
+ vacuum the reported time is across all the workers and the leader. This
+ column is updated at a 1 Hz frequency (one time per second) so could show
+ slightly old values.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 19cabc9a47..875df7d0e4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1218,7 +1218,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed, S.param11 AS time_delayed
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7d8e9d2045..5bf2e37d3f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -60,6 +61,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum amount of time (in ms) between two reports of the delayed time from a
+ * parallel worker to the leader. The goal is to avoid the leader to be
+ * interrupted too frequently while it might be sleeping for cost delay.
+ */
+#define WORKER_REPORT_DELAY_INTERVAL 1000
/*
* GUC parameters
@@ -103,6 +110,16 @@ pg_atomic_uint32 *VacuumSharedCostBalance = NULL;
pg_atomic_uint32 *VacuumActiveNWorkers = NULL;
int VacuumCostBalanceLocal = 0;
+/*
+ * In case of parallel workers, the last time the delay has been reported to
+ * the leader.
+ * We assume this initializes to zero.
+ */
+static instr_time last_report_time;
+
+/* total nap time between two reports */
+double nap_time_since_last_report = 0;
+
/* non-export function prototypes */
static List *expand_vacuum_rel(VacuumRelation *vrel,
MemoryContext vac_context, int options);
@@ -2377,13 +2394,45 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_end;
+ instr_time delayed_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+ INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+ INSTR_TIME_SET_CURRENT(delay_end);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ INSTR_TIME_SET_ZERO(delayed_time);
+ INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start);
+
+ /* Parallel worker */
+ if (IsParallelWorker())
+ {
+ instr_time time_since_last_report;
+
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+ last_report_time);
+ nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+ if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ nap_time_since_last_report);
+ nap_time_since_last_report = 0;
+ last_report_time = delay_end;
+ }
+ }
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ INSTR_TIME_GET_MILLISEC(delayed_time));
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 9a0ae27823..ec1f13748f 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 202408122
+#define CATALOG_VERSION_NO 202408201
#endif
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..9a0c2358c6 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_TIME_DELAYED 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 862433ee52..2bef31a66d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2052,7 +2052,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ s.param11 AS time_delayed
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
Hi,
On Tue, Aug 20, 2024 at 12:48:29PM +0000, Bertrand Drouvot wrote:
As it looks like we have a consensus not to wait on [0] (as reducing the number
of interrupts makes sense on its own), please find attached v4, a rebased
version (that also makes clear in the doc that the new field might show slightly
old values, as mentioned in [1]).
Please find attached v5, a mandatory rebase.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v5-0001-Report-the-total-amount-of-time-that-vacuum-has-b.patchtext/x-diff; charset=us-asciiDownload
From 1a14b708e0ee74c2f38835968d828c54022a5526 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Jun 2024 08:43:26 +0000
Subject: [PATCH v5] Report the total amount of time that vacuum has been
delayed due to cost delay
This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.
This uses the new parallel message type for progress reporting added
by f1889729dd.
In case of parallel workers, to avoid the leader being interrupted too frequently
(while it might be sleeping for cost delay), the report is done only if the last
report was done more than 1 second ago.
Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.
Indeed, when deciding about the throttling:
1. The number of parallel workers should not come into play:
1.1) the more parallel workers are used, the smaller the impact of the leader on
the vacuum index phase duration/workload (because the repartition is done
on more processes).
1.2) the fewer parallel workers there are, the less the leader will be interrupted
(fewer parallel workers would report their delayed time).
2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).
3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the middle or any other part of the wait, and we are more
interested in the frequency of the interrupts.
4. A 1 second reporting "throttling" looks like a reasonable threshold as:
4.1 the idea is to have a significant impact when the leader could have been
interrupted say hundreds/thousands of times per second.
4.2 it does not make that much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).
Bump catversion because this changes the definition of pg_stat_progress_vacuum.
---
doc/src/sgml/monitoring.sgml | 13 ++++++++
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 49 ++++++++++++++++++++++++++++
src/include/catalog/catversion.h | 2 +-
src/include/commands/progress.h | 1 +
src/test/regress/expected/rules.out | 3 +-
6 files changed, 67 insertions(+), 3 deletions(-)
23.5% doc/src/sgml/
4.2% src/backend/catalog/
63.4% src/backend/commands/
4.6% src/include/
4.0% src/test/regress/expected/
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 55417a6fa9..d87604331a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6307,6 +6307,19 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>time_delayed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total amount of time spent in milliseconds waiting due to <varname>vacuum_cost_delay</varname>
+ or <varname>autovacuum_vacuum_cost_delay</varname>. In case of parallel
+ vacuum the reported time is across all the workers and the leader. This
+ column is updated at a 1 Hz frequency (one time per second) so could show
+ slightly old values.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 19cabc9a47..875df7d0e4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1218,7 +1218,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed, S.param11 AS time_delayed
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7d8e9d2045..5bf2e37d3f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -60,6 +61,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum amount of time (in ms) between two reports of the delayed time from a
+ * parallel worker to the leader. The goal is to avoid the leader to be
+ * interrupted too frequently while it might be sleeping for cost delay.
+ */
+#define WORKER_REPORT_DELAY_INTERVAL 1000
/*
* GUC parameters
@@ -103,6 +110,16 @@ pg_atomic_uint32 *VacuumSharedCostBalance = NULL;
pg_atomic_uint32 *VacuumActiveNWorkers = NULL;
int VacuumCostBalanceLocal = 0;
+/*
+ * In case of parallel workers, the last time the delay has been reported to
+ * the leader.
+ * We assume this initializes to zero.
+ */
+static instr_time last_report_time;
+
+/* total nap time between two reports */
+double nap_time_since_last_report = 0;
+
/* non-export function prototypes */
static List *expand_vacuum_rel(VacuumRelation *vrel,
MemoryContext vac_context, int options);
@@ -2377,13 +2394,45 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_end;
+ instr_time delayed_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+ INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+ INSTR_TIME_SET_CURRENT(delay_end);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ INSTR_TIME_SET_ZERO(delayed_time);
+ INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start);
+
+ /* Parallel worker */
+ if (IsParallelWorker())
+ {
+ instr_time time_since_last_report;
+
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+ last_report_time);
+ nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+ if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ nap_time_since_last_report);
+ nap_time_since_last_report = 0;
+ last_report_time = delay_end;
+ }
+ }
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ INSTR_TIME_GET_MILLISEC(delayed_time));
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 1980d492c3..fbee0db2eb 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 202408301
+#define CATALOG_VERSION_NO 202409021
#endif
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..9a0c2358c6 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_TIME_DELAYED 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 862433ee52..2bef31a66d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2052,7 +2052,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ s.param11 AS time_delayed
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
Hi,
On Mon, Sep 02, 2024 at 05:11:36AM +0000, Bertrand Drouvot wrote:
Hi,
On Tue, Aug 20, 2024 at 12:48:29PM +0000, Bertrand Drouvot wrote:
As it looks like we have a consensus not to wait on [0] (as reducing the number
of interrupts makes sense on its own), please find attached v4, a rebased
version (that also makes clear in the doc that the new field might show slightly
old values, as mentioned in [1]).
Please find attached v5, a mandatory rebase.
Please find attached v6, a mandatory rebase due to a catversion bump conflict.
I'm removing the catversion bump from the patch as it generates conflicts too
frequently (the commit message just notes that it needs to be done).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v6-0001-Report-the-total-amount-of-time-that-vacuum-has-b.patchtext/x-diff; charset=us-asciiDownload
From 45be7dfd86948415962696128a17a68e49c9a773 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Jun 2024 08:43:26 +0000
Subject: [PATCH v6] Report the total amount of time that vacuum has been
delayed due to cost delay
This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.
This uses the new parallel message type for progress reporting added
by f1889729dd.
In case of parallel workers, to avoid the leader being interrupted too frequently
(while it might be sleeping for cost delay), the report is done only if the last
report was done more than 1 second ago.
Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.
Indeed, when deciding about the throttling:
1. The number of parallel workers should not come into play:
1.1) the more parallel workers are used, the smaller the impact of the leader on
the vacuum index phase duration/workload (because the repartition is done
on more processes).
1.2) the fewer parallel workers there are, the less the leader will be interrupted
(fewer parallel workers would report their delayed time).
2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).
3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the middle or any other part of the wait, and we are more
interested in the frequency of the interrupts.
4. A 1 second reporting "throttling" looks like a reasonable threshold as:
4.1 the idea is to have a significant impact when the leader could have been
interrupted say hundreds/thousands of times per second.
4.2 it does not make that much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).
Would need to bump catversion because this changes the definition of
pg_stat_progress_vacuum.
---
doc/src/sgml/monitoring.sgml | 13 ++++++++
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 49 ++++++++++++++++++++++++++++
src/include/commands/progress.h | 1 +
src/test/regress/expected/rules.out | 3 +-
5 files changed, 66 insertions(+), 2 deletions(-)
24.2% doc/src/sgml/
4.3% src/backend/catalog/
65.4% src/backend/commands/
4.2% src/test/regress/expected/
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 933de6fe07..64b0604e04 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6380,6 +6380,19 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>time_delayed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total amount of time spent in milliseconds waiting due to <varname>vacuum_cost_delay</varname>
+ or <varname>autovacuum_vacuum_cost_delay</varname>. In case of parallel
+ vacuum the reported time is across all the workers and the leader. This
+ column is updated at a 1 Hz frequency (one time per second) so could show
+ slightly old values.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7fd5d256a1..a40888ef2a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1218,7 +1218,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed, S.param11 AS time_delayed
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7d8e9d2045..5bf2e37d3f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -60,6 +61,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum amount of time (in ms) between two reports of the delayed time from a
+ * parallel worker to the leader. The goal is to avoid the leader to be
+ * interrupted too frequently while it might be sleeping for cost delay.
+ */
+#define WORKER_REPORT_DELAY_INTERVAL 1000
/*
* GUC parameters
@@ -103,6 +110,16 @@ pg_atomic_uint32 *VacuumSharedCostBalance = NULL;
pg_atomic_uint32 *VacuumActiveNWorkers = NULL;
int VacuumCostBalanceLocal = 0;
+/*
+ * In case of parallel workers, the last time the delay has been reported to
+ * the leader.
+ * We assume this initializes to zero.
+ */
+static instr_time last_report_time;
+
+/* total nap time between two reports */
+double nap_time_since_last_report = 0;
+
/* non-export function prototypes */
static List *expand_vacuum_rel(VacuumRelation *vrel,
MemoryContext vac_context, int options);
@@ -2377,13 +2394,45 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_end;
+ instr_time delayed_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+ INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+ INSTR_TIME_SET_CURRENT(delay_end);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ INSTR_TIME_SET_ZERO(delayed_time);
+ INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start);
+
+ /* Parallel worker */
+ if (IsParallelWorker())
+ {
+ instr_time time_since_last_report;
+
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+ last_report_time);
+ nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+ if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ nap_time_since_last_report);
+ nap_time_since_last_report = 0;
+ last_report_time = delay_end;
+ }
+ }
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ INSTR_TIME_GET_MILLISEC(delayed_time));
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..9a0c2358c6 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_TIME_DELAYED 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a1626f3fae..af3a92f882 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2052,7 +2052,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ s.param11 AS time_delayed
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
On Thu, Sep 05, 2024 at 04:59:54AM +0000, Bertrand Drouvot wrote:
Please find attached v6, a mandatory rebase due to a catversion bump conflict.
I'm removing the catversion bump from the patch as it generates conflicts too
frequently (the commit message just notes that it needs to be done).
v6 looks generally reasonable to me. I think the
nap_time_since_last_report variable needs to be marked static, though.
One thing that occurs to me is that this information may not be
particularly useful when parallel workers are used. Without parallelism,
it's easy enough to figure out the percentage of time that your VACUUM is
spending asleep, but when there are parallel workers, it may be hard to
deduce much of anything from the value. I'm not sure that this is a
deal-breaker for the patch, though, if for no other reason than it'll most
likely be used for autovacuum, which doesn't use parallel vacuum yet.
If there are no other concerns, I'll plan on committing this one soon after
a bit of editorialization.
--
nathan
Hi,
On Wed, Sep 18, 2024 at 04:04:53PM -0500, Nathan Bossart wrote:
On Thu, Sep 05, 2024 at 04:59:54AM +0000, Bertrand Drouvot wrote:
Please find attached v6, a mandatory rebase due to a catversion bump conflict.
I'm removing the catversion bump from the patch as it generates conflicts too
frequently (the commit message just notes that it needs to be done).
v6 looks generally reasonable to me.
Thanks for looking at it!
I think the
nap_time_since_last_report variable needs to be marked static, though.
Agree.
One thing that occurs to me is that this information may not be
particularly useful when parallel workers are used. Without parallelism,
it's easy enough to figure out the percentage of time that your VACUUM is
spending asleep, but when there are parallel workers, it may be hard to
deduce much of anything from the value.
I think that if the number of parallel workers being used is the same across
runs, then one can measure "accurately" the impact of some changes (setting
vacuum_cost_delay = ... for example) on the delay. Without the patch one can
only guess, as many other factors could impact the vacuum duration (load on the
system, I/O latency, ...).
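To illustrate, for a non-parallel run (e.g. plain autovacuum), the percentage
of elapsed time spent sleeping could be estimated with something like this
sketch (assuming the patched time_delayed column, and using
pg_stat_activity.xact_start as an approximation of the vacuum start time):

SELECT p.pid, p.time_delayed,
round((100.0 * p.time_delayed /
(1000.0 * extract(epoch FROM clock_timestamp() - a.xact_start)))::numeric,
1) AS pct_delayed
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a USING (pid);

Comparing pct_delayed across runs with different cost_delay/cost_limit settings
is the kind of measurement the new column is meant to enable.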
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Thu, Sep 19, 2024 at 07:54:21AM +0000, Bertrand Drouvot wrote:
Hi,
On Wed, Sep 18, 2024 at 04:04:53PM -0500, Nathan Bossart wrote:
On Thu, Sep 05, 2024 at 04:59:54AM +0000, Bertrand Drouvot wrote:
Please find attached v6, a mandatory rebase due to a catversion bump conflict.
I'm removing the catversion bump from the patch as it generates conflicts too
frequently (the commit message just notes that it needs to be done).
v6 looks generally reasonable to me.
Thanks for looking at it!
I think the nap_time_since_last_report variable needs to be marked static, though.
Agree.
Please find attached v7 where nap_time_since_last_report is declared as static.
That's the only change as compared to v6.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v7-0001-Report-the-total-amount-of-time-that-vacuum-has-b.patchtext/x-diff; charset=us-asciiDownload
From 7470ac76d5f3a9165d6d0e5b8b20a0fe16ce4b6a Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Jun 2024 08:43:26 +0000
Subject: [PATCH v7] Report the total amount of time that vacuum has been
delayed due to cost delay
This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.
This uses the new parallel message type for progress reporting added
by f1889729dd.
In case of parallel workers, to avoid the leader being interrupted too frequently
(while it might be sleeping for cost delay), the report is done only if the last
report was done more than 1 second ago.
Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.
Indeed, when deciding about the throttling:
1. The number of parallel workers should not come into play:
1.1) the more parallel workers are used, the smaller the impact of the leader on
the vacuum index phase duration/workload (because the repartition is done
on more processes).
1.2) the fewer parallel workers there are, the less the leader will be interrupted
(fewer parallel workers would report their delayed time).
2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).
3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the middle or any other part of the wait, and we are more
interested in the frequency of the interrupts.
4. A 1 second reporting "throttling" looks like a reasonable threshold as:
4.1 the idea is to have a significant impact when the leader could have been
interrupted say hundreds/thousands of times per second.
4.2 it does not make that much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).
Would need to bump catversion because this changes the definition of
pg_stat_progress_vacuum.
---
doc/src/sgml/monitoring.sgml | 13 ++++++++
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 49 ++++++++++++++++++++++++++++
src/include/commands/progress.h | 1 +
src/test/regress/expected/rules.out | 3 +-
5 files changed, 66 insertions(+), 2 deletions(-)
24.1% doc/src/sgml/
4.3% src/backend/catalog/
65.4% src/backend/commands/
4.2% src/test/regress/expected/
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 331315f8d3..8b6330830b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6410,6 +6410,19 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>time_delayed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total amount of time spent in milliseconds waiting due to <varname>vacuum_cost_delay</varname>
+ or <varname>autovacuum_vacuum_cost_delay</varname>. In case of parallel
+ vacuum the reported time is across all the workers and the leader. This
+ column is updated at a 1 Hz frequency (one time per second) so could show
+ slightly old values.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3456b821bc..9ed8cfce70 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1220,7 +1220,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed, S.param11 AS time_delayed
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ac8f5d9c25..d56ab66d50 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -60,6 +61,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum amount of time (in ms) between two reports of the delayed time from a
+ * parallel worker to the leader. The goal is to avoid the leader to be
+ * interrupted too frequently while it might be sleeping for cost delay.
+ */
+#define WORKER_REPORT_DELAY_INTERVAL 1000
/*
* GUC parameters
@@ -103,6 +110,16 @@ pg_atomic_uint32 *VacuumSharedCostBalance = NULL;
pg_atomic_uint32 *VacuumActiveNWorkers = NULL;
int VacuumCostBalanceLocal = 0;
+/*
+ * In case of parallel workers, the last time the delay has been reported to
+ * the leader.
+ * We assume this initializes to zero.
+ */
+static instr_time last_report_time;
+
+/* total nap time between two reports */
+static double nap_time_since_last_report = 0;
+
/* non-export function prototypes */
static List *expand_vacuum_rel(VacuumRelation *vrel,
MemoryContext vac_context, int options);
@@ -2402,13 +2419,45 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_end;
+ instr_time delayed_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+ INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+ INSTR_TIME_SET_CURRENT(delay_end);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ INSTR_TIME_SET_ZERO(delayed_time);
+ INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start);
+
+ /* Parallel worker */
+ if (IsParallelWorker())
+ {
+ instr_time time_since_last_report;
+
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+ last_report_time);
+ nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+ if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ nap_time_since_last_report);
+ nap_time_since_last_report = 0;
+ last_report_time = delay_end;
+ }
+ }
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ INSTR_TIME_GET_MILLISEC(delayed_time));
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..9a0c2358c6 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_TIME_DELAYED 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2b47013f11..a7baa04441 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2054,7 +2054,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ s.param11 AS time_delayed
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
Hi,
I recently encountered a case where having this feature would have been
very helpful.
Thank you for developing it! I have a few questions and comments.
Here are questions:
After this patch is merged, are you considering adding delayed_time
information to the logs output by log_autovacuum_min_duration?
In the case I experienced, it would have been great to easily understand
how much of the total execution time was spent on timed delays from the
already executed VACUUM logs.
Recently, this thread has not been active. Is the reason that it is waiting
for the vacuum statistics work [1]?
[1]: Vacuum statistics: https://commitfest.postgresql.org/50/5012/
Here are minor comments on the v7 patch:
+ Total amount of time spent in milliseconds waiting due to <xref linkend="guc-vacuum-cost-delay"/>
+ or <xref linkend="guc-autovacuum-vacuum-cost-delay"/>. In case of parallel
Why not use the <xref> element, for example, <xref
linkend="guc-autovacuum-vacuum-cost-delay"/>,
as in the max_dead_tuple_bytes column?
+ vacuum the reported time is across all the workers and the leader. This
+ column is updated at a 1 Hz frequency (one time per second) so could show
+ slightly old values.
I wonder if "Hz frequency" is the best term for the context, as I couldn’t
find similar usage in other documents, though I’m not a native English speaker.
FWIW, the document contains a similar description.
* not more frequently than once per PGSTAT_MIN_INTERVAL milliseconds
IIUC, only the workers update the column at a 1 Hz frequency. Would it be
better to rephrase it as follows?
* The workers update the column no more frequently than once per second,
so it could show slightly old values.
+ if (INSTR_TIME_GET_MILLISEC(time_since_last_report) >
WORKER_REPORT_DELAY_INTERVAL)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ nap_time_since_last_report);
+ nap_time_since_last_report = 0;
+ last_report_time = delay_end;
+ }
IIUC, unsent delayed_time will disappear when the parallel workers exit
because they are not considered in parallel_vacuum_end(). I assume this
is intentional behavior, as it is an acceptable error for the use cases.
I didn't see any comments regarding this, so I wanted to confirm.
Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
Hi,
On Thu, Dec 05, 2024 at 05:51:11PM +0900, Masahiro Ikeda wrote:
Hi,
I recently encountered a case where having this feature would have been very
helpful.
Oh great, thanks for the feedback!
Thank you for developing it! I have a few questions and comments.
Here are questions:
After this patch is merged, are you considering adding delayed_time
information to the logs output by log_autovacuum_min_duration?
In the case I experienced, it would have been great to easily understand
how much of the total execution time was spent on timed delays from the
already executed VACUUM logs.
That's a good point. We already discussed adding some information in a dedicated
view ([1]) (and that's an idea I keep in mind). I also think your idea is worth
it and that it would make sense to start a dedicated thread once this one is
merged.
Recently, this thread has not been active.
I think that Nathan wants to give time to others to interact on it, like you
do ;-) (Nathan, please correct me if I'm wrong).
Here are minor comments on the v7 patch:
Thanks!
Why not use the <xref> element, for example, <xref
linkend="guc-autovacuum-vacuum-cost-delay"/>,
as in the max_dead_tuple_bytes column?
There are multiple places where "<varname>vacuum_cost_delay</varname>" is
being used, but I agree it's better to be consistent with how it is done for
this view. Done in v8 attached.
IIUC, only the worker updates the column at a 1 Hz frequency. Would it be
better to rephrase the following?"
* The workers update the column no more frequently than once per second,
so it could show slightly old values.
Yeah I like the re-wording, done that way in v8.
+ if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+ {
+     pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+                                         nap_time_since_last_report);
+     nap_time_since_last_report = 0;
+     last_report_time = delay_end;
+ }
IIUC, unsent delayed_time will disappear when the parallel workers exit
because they are not considered in parallel_vacuum_end(). I assume this
is intentional behavior, as it is an acceptable error for the use cases.
Yeah, people would likely use this new field to monitor long running vacuum.
Long enough that this error should be acceptable. Do you agree?
I didn't see any comments regarding this, so I wanted to confirm.
Added a comment to make it clear, thanks!
[1]: /messages/by-id/CAD21AoDOu=DZcC+PemYmCNGSwbgL1s-5OZkZ1Spd5pSxofWNCw@mail.gmail.com
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v8-0001-Report-the-total-amount-of-time-that-vacuum-has-b.patchtext/x-diff; charset=us-asciiDownload
From fc4b761a917804e6e7de46868d388a41735d8cca Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Jun 2024 08:43:26 +0000
Subject: [PATCH v8] Report the total amount of time that vacuum has been
delayed due to cost delay
This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.
This uses the new parallel message type for progress reporting added
by f1889729dd.
In the case of parallel workers, to avoid interrupting the leader too frequently
(while it might be sleeping for cost delay), the report is sent only if the last
report was made more than 1 second ago.
A purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.
Indeed, when deciding about the throttling:
1. The number of parallel workers should not come into play:
1.1) the more parallel workers are used, the smaller the leader's impact on
the vacuum index phase duration/workload (because the work is spread
across more processes).
1.2) the fewer parallel workers there are, the less often the leader is
interrupted (fewer parallel workers would report their delayed time).
2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).
3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the middle or any other part of the wait, and we are more
interested in the frequency of the interrupts.
4. A 1 second reporting "throttle" looks like a reasonable threshold as:
4.1 the idea is to have a significant impact when the leader could otherwise have
been interrupted, say, hundreds or thousands of times per second.
4.2 it does not make much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).
Would need to bump catversion because this changes the definition of
pg_stat_progress_vacuum.
---
doc/src/sgml/monitoring.sgml | 13 +++++++
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 53 ++++++++++++++++++++++++++++
src/include/commands/progress.h | 1 +
src/test/regress/expected/rules.out | 3 +-
5 files changed, 70 insertions(+), 2 deletions(-)
22.8% doc/src/sgml/
4.0% src/backend/catalog/
67.6% src/backend/commands/
3.8% src/test/regress/expected/
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 840d7f8161..7386f7333d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6428,6 +6428,19 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>time_delayed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total amount of time spent in milliseconds waiting due to <xref linkend="guc-vacuum-cost-delay"/>
+ or <xref linkend="guc-autovacuum-vacuum-cost-delay"/>. In case of parallel
+ vacuum the reported time is across all the workers and the leader. The
+ workers update the column no more frequently than once per second, so it
+ could show slightly old values.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index da9a8fe99f..013bd06222 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1222,7 +1222,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed, S.param11 AS time_delayed
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb639ef51f..6f9e515f56 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -59,6 +60,16 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum amount of time (in ms) between two reports of the delayed time from a
+ * parallel worker to the leader. The goal is to avoid the leader to be
+ * interrupted too frequently while it might be sleeping for cost delay.
+ *
+ * Note that unsent delayed_time will disappear when the parallel workers exit
+ * because they are not considered in parallel_vacuum_end(). That's an acceptable
+ * error for the use cases.
+ */
+#define WORKER_REPORT_DELAY_INTERVAL 1000
/*
* GUC parameters
@@ -102,6 +113,16 @@ pg_atomic_uint32 *VacuumSharedCostBalance = NULL;
pg_atomic_uint32 *VacuumActiveNWorkers = NULL;
int VacuumCostBalanceLocal = 0;
+/*
+ * In case of parallel workers, the last time the delay has been reported to
+ * the leader.
+ * We assume this initializes to zero.
+ */
+static instr_time last_report_time;
+
+/* total nap time between two reports */
+static double nap_time_since_last_report = 0;
+
/* non-export function prototypes */
static List *expand_vacuum_rel(VacuumRelation *vrel,
MemoryContext vac_context, int options);
@@ -2402,13 +2423,45 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_end;
+ instr_time delayed_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+ INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+ INSTR_TIME_SET_CURRENT(delay_end);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ INSTR_TIME_SET_ZERO(delayed_time);
+ INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start);
+
+ /* Parallel worker */
+ if (IsParallelWorker())
+ {
+ instr_time time_since_last_report;
+
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+ last_report_time);
+ nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+ if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ nap_time_since_last_report);
+ nap_time_since_last_report = 0;
+ last_report_time = delay_end;
+ }
+ }
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ INSTR_TIME_GET_MILLISEC(delayed_time));
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..9a0c2358c6 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_TIME_DELAYED 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fe..8b1154efac 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2056,7 +2056,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ s.param11 AS time_delayed
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
Hi,
On Thu, Dec 05, 2024 at 10:43:51AM +0000, Bertrand Drouvot wrote:
Yeah, people would likely use this new field to monitor long running vacuum.
Long enough that this error should be acceptable. Do you agree?
OTOH, adding the 100% accuracy looks as simple as v9-0002 attached (0001 is
same as for v8), so I think we should provide it.
Thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v9-0001-Report-the-total-amount-of-time-that-vacuum-has-b.patchtext/x-diff; charset=us-asciiDownload
From 8be8a71eb3c010d51bd6749dce33794f763c8572 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Jun 2024 08:43:26 +0000
Subject: [PATCH v9 1/2] Report the total amount of time that vacuum has been
delayed due to cost delay
This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.
This uses the new parallel message type for progress reporting added
by f1889729dd.
In the case of parallel workers, to avoid interrupting the leader too frequently
(while it might be sleeping for cost delay), the report is sent only if the last
report was made more than 1 second ago.
A purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.
Indeed, when deciding about the throttling:
1. The number of parallel workers should not come into play:
1.1) the more parallel workers are used, the smaller the leader's impact on
the vacuum index phase duration/workload (because the work is spread
across more processes).
1.2) the fewer parallel workers there are, the less often the leader is
interrupted (fewer parallel workers would report their delayed time).
2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).
3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the middle or any other part of the wait, and we are more
interested in the frequency of the interrupts.
4. A 1 second reporting "throttle" looks like a reasonable threshold as:
4.1 the idea is to have a significant impact when the leader could otherwise have
been interrupted, say, hundreds or thousands of times per second.
4.2 it does not make much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).
Would need to bump catversion because this changes the definition of
pg_stat_progress_vacuum.
---
doc/src/sgml/monitoring.sgml | 13 +++++++
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 53 ++++++++++++++++++++++++++++
src/include/commands/progress.h | 1 +
src/test/regress/expected/rules.out | 3 +-
5 files changed, 70 insertions(+), 2 deletions(-)
22.8% doc/src/sgml/
4.0% src/backend/catalog/
67.6% src/backend/commands/
3.8% src/test/regress/expected/
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 840d7f8161..7386f7333d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6428,6 +6428,19 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>time_delayed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total amount of time spent in milliseconds waiting due to <xref linkend="guc-vacuum-cost-delay"/>
+ or <xref linkend="guc-autovacuum-vacuum-cost-delay"/>. In case of parallel
+ vacuum the reported time is across all the workers and the leader. The
+ workers update the column no more frequently than once per second, so it
+ could show slightly old values.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index da9a8fe99f..013bd06222 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1222,7 +1222,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed, S.param11 AS time_delayed
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb639ef51f..6f9e515f56 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -59,6 +60,16 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum amount of time (in ms) between two reports of the delayed time from a
+ * parallel worker to the leader. The goal is to avoid the leader to be
+ * interrupted too frequently while it might be sleeping for cost delay.
+ *
+ * Note that unsent delayed_time will disappear when the parallel workers exit
+ * because they are not considered in parallel_vacuum_end(). That's an acceptable
+ * error for the use cases.
+ */
+#define WORKER_REPORT_DELAY_INTERVAL 1000
/*
* GUC parameters
@@ -102,6 +113,16 @@ pg_atomic_uint32 *VacuumSharedCostBalance = NULL;
pg_atomic_uint32 *VacuumActiveNWorkers = NULL;
int VacuumCostBalanceLocal = 0;
+/*
+ * In case of parallel workers, the last time the delay has been reported to
+ * the leader.
+ * We assume this initializes to zero.
+ */
+static instr_time last_report_time;
+
+/* total nap time between two reports */
+static double nap_time_since_last_report = 0;
+
/* non-export function prototypes */
static List *expand_vacuum_rel(VacuumRelation *vrel,
MemoryContext vac_context, int options);
@@ -2402,13 +2423,45 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_end;
+ instr_time delayed_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+ INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+ INSTR_TIME_SET_CURRENT(delay_end);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ INSTR_TIME_SET_ZERO(delayed_time);
+ INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start);
+
+ /* Parallel worker */
+ if (IsParallelWorker())
+ {
+ instr_time time_since_last_report;
+
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+ last_report_time);
+ nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+ if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ nap_time_since_last_report);
+ nap_time_since_last_report = 0;
+ last_report_time = delay_end;
+ }
+ }
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ INSTR_TIME_GET_MILLISEC(delayed_time));
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..9a0c2358c6 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_TIME_DELAYED 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fe..8b1154efac 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2056,7 +2056,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ s.param11 AS time_delayed
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
v9-0002-Report-the-amount-of-time-we-slept-before-exiting.patchtext/x-diff; charset=us-asciiDownload
From 5b382029eb8330bf9daef50093bb483d8e48e6c2 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Fri, 6 Dec 2024 07:39:19 +0000
Subject: [PATCH v9 2/2] Report the amount of time we slept before exiting
parallel workers
or we might get incomplete data due to WORKER_REPORT_DELAY_INTERVAL
---
src/backend/commands/vacuum.c | 6 +-----
src/backend/commands/vacuumparallel.c | 7 +++++++
src/include/commands/vacuum.h | 1 +
3 files changed, 9 insertions(+), 5 deletions(-)
90.3% src/backend/commands/
9.6% src/include/commands/
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6f9e515f56..7ad42a9507 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -64,10 +64,6 @@
* Minimum amount of time (in ms) between two reports of the delayed time from a
* parallel worker to the leader. The goal is to avoid the leader to be
* interrupted too frequently while it might be sleeping for cost delay.
- *
- * Note that unsent delayed_time will disappear when the parallel workers exit
- * because they are not considered in parallel_vacuum_end(). That's an acceptable
- * error for the use cases.
*/
#define WORKER_REPORT_DELAY_INTERVAL 1000
@@ -121,7 +117,7 @@ int VacuumCostBalanceLocal = 0;
static instr_time last_report_time;
/* total nap time between two reports */
-static double nap_time_since_last_report = 0;
+double nap_time_since_last_report = 0;
/* non-export function prototypes */
static List *expand_vacuum_rel(VacuumRelation *vrel,
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a56..a09b655a13 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1087,6 +1087,13 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ /*
+ * Report the amount of time we slept (or we might get incomplete data due
+ * to WORKER_REPORT_DELAY_INTERVAL).
+ */
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ nap_time_since_last_report);
+
TidStoreDetach(dead_items);
/* Pop the error context stack */
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..7a3ff07ec0 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -312,6 +312,7 @@ extern PGDLLIMPORT int VacuumCostBalanceLocal;
extern PGDLLIMPORT bool VacuumFailsafeActive;
extern PGDLLIMPORT double vacuum_cost_delay;
extern PGDLLIMPORT int vacuum_cost_limit;
+extern PGDLLIMPORT double nap_time_since_last_report;
/* in commands/vacuum.c */
extern void ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel);
--
2.34.1
On 2024-12-06 18:31, Bertrand Drouvot wrote:
Hi,
On Thu, Dec 05, 2024 at 10:43:51AM +0000, Bertrand Drouvot wrote:
Yeah, people would likely use this new field to monitor long running
vacuum.
Long enough that this error should be acceptable. Do you agree?
OTOH, adding the 100% accuracy looks as simple as v9-0002 attached (0001 is
the same as for v8), so I think we should provide it.
Thanks! The patch looks good to me.
Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On Mon, Dec 9, 2024 at 2:51 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
On 2024-12-06 18:31, Bertrand Drouvot wrote:
Hi,
On Thu, Dec 05, 2024 at 10:43:51AM +0000, Bertrand Drouvot wrote:
Yeah, people would likely use this new field to monitor long running
vacuum.
Long enough that this error should be acceptable. Do you agree?
OTOH, adding the 100% accuracy looks as simple as v9-0002 attached (0001 is
the same as for v8), so I think we should provide it.
This idea looks good to me. Here are some comments:
1.
+ Total amount of time spent in milliseconds waiting due to <xref linkend="guc-vacuum-cost-delay"/>
+ or <xref linkend="guc-autovacuum-vacuum-cost-delay"/>. In case of parallel
+ vacuum the reported time is across all the workers and the leader. The
+ workers update the column no more frequently than once per second, so it
+ could show slightly old values.
+ </para></entry>
I think this waiting is influenced by the cost limit GUC as well as the cost
delay, because here we are counting total wait time, and how frequently we
wait is completely driven by the cost limit. Isn't it?
2.
+ if (IsParallelWorker())
+ {
+ instr_time time_since_last_report;
+
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+ last_report_time);
+ nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+ if (INSTR_TIME_GET_MILLISEC(time_since_last_report) >
WORKER_REPORT_DELAY_INTERVAL)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ nap_time_since_last_report);
+ nap_time_since_last_report = 0;
+ last_report_time = delay_end;
+ }
+ }
Does it make sense to track this "nap_time_since_last_report" in a
shared variable instead of locally in individual workers? Whenever
the shared variable crosses WORKER_REPORT_DELAY_INTERVAL we could report it,
which would avoid individual reporting from different workers. We are
already computing the cost balance in a shared variable, i.e.
VacuumSharedCostBalance. Or do you think it will complicate the code?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
On Mon, Dec 09, 2024 at 05:18:30PM +0530, Dilip Kumar wrote:
On Mon, Dec 9, 2024 at 2:51 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
On 2024-12-06 18:31, Bertrand Drouvot wrote:
Hi,
On Thu, Dec 05, 2024 at 10:43:51AM +0000, Bertrand Drouvot wrote:
Yeah, people would likely use this new field to monitor long running
vacuum.
Long enough that this error should be acceptable. Do you agree?
OTOH, adding the 100% accuracy looks as simple as v9-0002 attached (0001 is
the same as for v8), so I think we should provide it.
This idea looks good to me.
Thanks for looking at it!
1.
+ Total amount of time spent in milliseconds waiting due to <xref linkend="guc-vacuum-cost-delay"/>
+ or <xref linkend="guc-autovacuum-vacuum-cost-delay"/>. In case of parallel
+ vacuum the reported time is across all the workers and the leader. The
+ workers update the column no more frequently than once per second, so it
+ could show slightly old values.
+ </para></entry>
I think this waiting is influenced by the cost limit GUC as well as the cost
delay, because here we are counting total wait time, and how frequently we
wait is completely driven by the cost limit. Isn't it?
Yeah. I think we could change the wording that way:
s/waiting due to/waiting during/
Does that make sense? I don't think we need to mention cost limit here.
2.
+ if (IsParallelWorker())
+ {
+     instr_time time_since_last_report;
+
+     INSTR_TIME_SET_ZERO(time_since_last_report);
+     INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+                           last_report_time);
+     nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+     if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+     {
+         pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+                                             nap_time_since_last_report);
+         nap_time_since_last_report = 0;
+         last_report_time = delay_end;
+     }
+ }
Does it make sense to track this "nap_time_since_last_report" in a
shared variable instead of locally in individual workers? Whenever
the shared variable crosses WORKER_REPORT_DELAY_INTERVAL we could report it,
which would avoid individual reporting from different workers. We are
already computing the cost balance in a shared variable, i.e.
VacuumSharedCostBalance. Or do you think it will complicate the code?
I'm not sure I follow. nap_time_since_last_report is the time a worker had to
wait since its last report. There is no direct relationship with
WORKER_REPORT_DELAY_INTERVAL (which is compared to time_since_last_report and
not nap_time_since_last_report).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Dec 9, 2024 at 6:55 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
On Mon, Dec 09, 2024 at 05:18:30PM +0530, Dilip Kumar wrote:
On Mon, Dec 9, 2024 at 2:51 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
On 2024-12-06 18:31, Bertrand Drouvot wrote:
Hi,
On Thu, Dec 05, 2024 at 10:43:51AM +0000, Bertrand Drouvot wrote:
Yeah, people would likely use this new field to monitor long running
vacuum.
Long enough that this error should be acceptable. Do you agree?OTOH, adding the 100% accuracy looks as simple as v9-0002 attached
(0001 is
same as for v8), so I think we should provide it.This Idea looks good to me.
Thanks for looking at it!
1.
+ Total amount of time spent in milliseconds waiting due to <xref linkend="guc-vacuum-cost-delay"/>
+ or <xref linkend="guc-autovacuum-vacuum-cost-delay"/>. In case of parallel
+ vacuum the reported time is across all the workers and the leader. The
+ workers update the column no more frequently than once per second, so it
+ could show slightly old values.
+ </para></entry>
I think this waiting is influenced by the cost limit GUC as well as the cost
delay, because here we are counting total wait time, and how frequently we
wait is completely driven by the cost limit. Isn't it?
Yeah. I think we could change the wording that way:
s/waiting due to/waiting during/
Does that make sense? I don't think we need to mention cost limit here.
Yeah that should be fine.
2.
+ if (IsParallelWorker())
+ {
+     instr_time time_since_last_report;
+
+     INSTR_TIME_SET_ZERO(time_since_last_report);
+     INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+                           last_report_time);
+     nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+     if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+     {
+         pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+                                             nap_time_since_last_report);
+         nap_time_since_last_report = 0;
+         last_report_time = delay_end;
+     }
+ }
Does it make sense to track this "nap_time_since_last_report" in a
shared variable instead of locally in individual workers? Whenever
the shared variable crosses WORKER_REPORT_DELAY_INTERVAL we could report it,
which would avoid individual reporting from different workers. We are
already computing the cost balance in a shared variable, i.e.
VacuumSharedCostBalance. Or do you think it will complicate the code?
I'm not sure I follow. nap_time_since_last_report is the time a worker had to
wait since its last report. There is no direct relationship with
WORKER_REPORT_DELAY_INTERVAL (which is compared to time_since_last_report and
not nap_time_since_last_report).
I mean currently we are tracking "time_since_last_report" and
accumulating the delayed_time in "nap_time_since_last_report" for each
worker. So my question was: does it make sense to do this in a shared
variable across workers so that we can reduce the number of reports to the
leader?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
On Mon, Dec 09, 2024 at 08:34:13PM +0530, Dilip Kumar wrote:
On Mon, Dec 9, 2024 at 6:55 PM Bertrand Drouvot
Yeah. I think we could change the wording that way:
s/waiting due to/waiting during/
Does that make sense? I don't think we need to mention cost limit here.
Yeah that should be fine.
Thanks! Done in v10 attached. BTW, 0001 and 0002 have been merged (thanks
Masahiro-san for having confirmed that v9-0002 made sense to you too!).
I mean currently we are tracking "time_since_last_report" and
accumulating the delayed_time in "nap_time_since_last_report" for each
worker. So my question was does it make sense to do this in a shared
variable across workers so that we can reduce the number of reports to the
leader?
I see. We've seen up-thread that the more we interrupt the leader, the faster the
vacuum is (because the leader could be interrupted while it is waiting).
OTOH, if we make use of a shared variable then we'd need to add some "synchronization"
(pg_atomic_xxx) overhead. So we'd reduce the number of reports but add overhead.
So I think that it might be possible to see performance degradation in some cases,
and so I think it's safer to keep the "per worker" implementation.
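Just to illustrate the trade-off, a shared-variable version could look roughly
like this (a hypothetical sketch only, not part of the patch: SharedNapTimeMs
and SHARED_REPORT_THRESHOLD_MS are invented names, the counter would have to
live in the parallel vacuum DSM segment, and the threshold semantics would
change from wall-clock time to accumulated nap time):
static pg_atomic_uint64 *SharedNapTimeMs;   /* hypothetical, in DSM */
static void
report_delay_shared(int64 delayed_ms)
{
    uint64      total;
    /* every nap now pays an atomic read-modify-write on shared memory */
    total = pg_atomic_add_fetch_u64(SharedNapTimeMs, (uint64) delayed_ms);
    /* whoever pushes the shared total past the threshold reports it */
    if (total >= SHARED_REPORT_THRESHOLD_MS)
    {
        uint64      to_report = pg_atomic_exchange_u64(SharedNapTimeMs, 0);
        pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
                                            (int64) to_report);
    }
}
Every wait would then pay the atomic cost even when no report is sent, which is
the overhead mentioned above.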
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v10-0001-Report-the-total-amount-of-time-that-vacuum-has-.patchtext/x-diff; charset=us-asciiDownload
From da69d66e20fbf00e92f57595b725e84cd1276fc3 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Jun 2024 08:43:26 +0000
Subject: [PATCH v10] Report the total amount of time that vacuum has been
delayed due to cost delay
This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.
This uses the new parallel message type for progress reporting added
by f1889729dd.
In the case of parallel workers, to avoid interrupting the leader too frequently
(while it might be sleeping for cost delay), the report is sent only if the last
report was made more than 1 second ago.
A purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.
Indeed, when deciding about the throttling:
1. The number of parallel workers should not come into play:
1.1) the more parallel workers are used, the smaller the leader's impact on
the vacuum index phase duration/workload (because the work is spread
across more processes).
1.2) the fewer parallel workers there are, the less often the leader is
interrupted (fewer parallel workers would report their delayed time).
2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).
3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the middle or any other part of the wait, and we are more
interested in the frequency of the interrupts.
4. A 1 second reporting "throttle" looks like a reasonable threshold as:
4.1 the idea is to have a significant impact when the leader could otherwise have
been interrupted, say, hundreds or thousands of times per second.
4.2 it does not make much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).
Would need to bump catversion because this changes the definition of
pg_stat_progress_vacuum.
---
doc/src/sgml/monitoring.sgml | 13 +++++++
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 49 +++++++++++++++++++++++++++
src/backend/commands/vacuumparallel.c | 7 ++++
src/include/commands/progress.h | 1 +
src/include/commands/vacuum.h | 1 +
src/test/regress/expected/rules.out | 3 +-
7 files changed, 74 insertions(+), 2 deletions(-)
22.1% doc/src/sgml/
3.8% src/backend/catalog/
66.6% src/backend/commands/
3.5% src/include/commands/
3.7% src/test/regress/expected/
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 840d7f8161..f2aab9974c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6428,6 +6428,19 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>time_delayed</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total amount of time spent in milliseconds waiting during <xref linkend="guc-vacuum-cost-delay"/>
+ or <xref linkend="guc-autovacuum-vacuum-cost-delay"/>. In case of parallel
+ vacuum the reported time is across all the workers and the leader. The
+ workers update the column no more frequently than once per second, so it
+ could show slightly old values.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index da9a8fe99f..013bd06222 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1222,7 +1222,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed, S.param11 AS time_delayed
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb639ef51f..7ad42a9507 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -59,6 +60,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum amount of time (in ms) between two reports of the delayed time from a
+ * parallel worker to the leader. The goal is to avoid the leader to be
+ * interrupted too frequently while it might be sleeping for cost delay.
+ */
+#define WORKER_REPORT_DELAY_INTERVAL 1000
/*
* GUC parameters
@@ -102,6 +109,16 @@ pg_atomic_uint32 *VacuumSharedCostBalance = NULL;
pg_atomic_uint32 *VacuumActiveNWorkers = NULL;
int VacuumCostBalanceLocal = 0;
+/*
+ * In case of parallel workers, the last time the delay has been reported to
+ * the leader.
+ * We assume this initializes to zero.
+ */
+static instr_time last_report_time;
+
+/* total nap time between two reports */
+double nap_time_since_last_report = 0;
+
/* non-export function prototypes */
static List *expand_vacuum_rel(VacuumRelation *vrel,
MemoryContext vac_context, int options);
@@ -2402,13 +2419,45 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_end;
+ instr_time delayed_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+ INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+ INSTR_TIME_SET_CURRENT(delay_end);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ INSTR_TIME_SET_ZERO(delayed_time);
+ INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start);
+
+ /* Parallel worker */
+ if (IsParallelWorker())
+ {
+ instr_time time_since_last_report;
+
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+ last_report_time);
+ nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+ if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ nap_time_since_last_report);
+ nap_time_since_last_report = 0;
+ last_report_time = delay_end;
+ }
+ }
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ INSTR_TIME_GET_MILLISEC(delayed_time));
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a56..a09b655a13 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1087,6 +1087,13 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ /*
+ * Report the amount of time we slept (or we might get incomplete data due
+ * to WORKER_REPORT_DELAY_INTERVAL).
+ */
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+ nap_time_since_last_report);
+
TidStoreDetach(dead_items);
/* Pop the error context stack */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..9a0c2358c6 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_TIME_DELAYED 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..7a3ff07ec0 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -312,6 +312,7 @@ extern PGDLLIMPORT int VacuumCostBalanceLocal;
extern PGDLLIMPORT bool VacuumFailsafeActive;
extern PGDLLIMPORT double vacuum_cost_delay;
extern PGDLLIMPORT int vacuum_cost_limit;
+extern PGDLLIMPORT double nap_time_since_last_report;
/* in commands/vacuum.c */
extern void ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fe..8b1154efac 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2056,7 +2056,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ s.param11 AS time_delayed
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
On Mon, Dec 9, 2024 at 10:11 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
Hi,
On Mon, Dec 09, 2024 at 08:34:13PM +0530, Dilip Kumar wrote:
On Mon, Dec 9, 2024 at 6:55 PM Bertrand Drouvot
Yeah. I think we could change the wording that way:
s/waiting due to/waiting during/
Does that make sense? I don't think we need to mention cost limit here.
Yeah that should be fine.
Thanks! Done in v10 attached. BTW, 0001 and 0002 have been merged (thanks
Masahiro-san for having confirmed that v9-0002 made sense to you too!).
I mean currently we are tracking "time_since_last_report" and
accumulating the delayed_time in "nap_time_since_last_report" for each
worker. So my question was: does it make sense to do this in a shared
variable across workers so that we can reduce the number of reports to the
leader?
I see. We've seen up-thread that the more we interrupt the leader, the faster the
vacuum is (because the leader could be interrupted while it is waiting).
OTOH, if we make use of a shared variable then we'd need to add some "synchronization"
(pg_atomic_xxx) overhead. So we'd reduce the number of reports but add overhead.
So I think that it might be possible to see performance degradation in some cases,
and so I think it's safer to keep the "per worker" implementation.
Okay, that makes sense. Thanks.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 09, 2024 at 04:41:03PM +0000, Bertrand Drouvot wrote:
+ <structfield>time_delayed</structfield> <type>bigint</type>
I think it's also worth considering names like total_delay and
cumulative_delay.
+ Total amount of time spent in milliseconds waiting during <xref linkend="guc-vacuum-cost-delay"/>
+ or <xref linkend="guc-autovacuum-vacuum-cost-delay"/>. In case of parallel
+ vacuum the reported time is across all the workers and the leader. The
+ workers update the column no more frequently than once per second, so it
+ could show slightly old values.
I wonder if it makes sense to provide this value as an interval instead of
the number of milliseconds to make it more human-readable. I might also
suggest some changes to the description:
Total accumulated time spent sleeping due to the cost-based vacuum
delay settings (e.g., vacuum_cost_delay, vacuum_cost_limit). This
includes the time that any associated parallel workers have slept, too.
However, parallel workers report their sleep time no more frequently
than once per second, so the reported value may be slightly stale.
--
nathan
On Tue, Dec 10, 2024 at 11:25 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
On Mon, Dec 09, 2024 at 04:41:03PM +0000, Bertrand Drouvot wrote:
+ <structfield>time_delayed</structfield> <type>bigint</type>
I think it's also worth considering names like total_delay and
cumulative_delay.
+1, I vote for total_delay
+ Total amount of time spent in milliseconds waiting during <xref linkend="guc-vacuum-cost-delay"/>
+ or <xref linkend="guc-autovacuum-vacuum-cost-delay"/>. In case of parallel
+ vacuum the reported time is across all the workers and the leader. The
+ workers update the column no more frequently than once per second, so it
+ could show slightly old values.
I wonder if it makes sense to provide this value as an interval instead of
the number of milliseconds to make it more human-readable. I might also
suggest some changes to the description:
Total accumulated time spent sleeping due to the cost-based vacuum
delay settings (e.g., vacuum_cost_delay, vacuum_cost_limit). This
includes the time that any associated parallel workers have slept, too.
However, parallel workers report their sleep time no more frequently
than once per second, so the reported value may be slightly stale.
This description looks good.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
On Tue, Dec 10, 2024 at 11:55:41AM -0600, Nathan Bossart wrote:
On Mon, Dec 09, 2024 at 04:41:03PM +0000, Bertrand Drouvot wrote:
+ <structfield>time_delayed</structfield> <type>bigint</type>
I think it's also worth considering names like total_delay and
cumulative_delay.
That's fine by me. Then I think that total_delay is the way to go (I don't see
any existing "cumulative_").
+ Total amount of time spent in milliseconds waiting during <xref linkend="guc-vacuum-cost-delay"/>
+ or <xref linkend="guc-autovacuum-vacuum-cost-delay"/>. In case of parallel
+ vacuum the reported time is across all the workers and the leader. The
+ workers update the column no more frequently than once per second, so it
+ could show slightly old values.
I wonder if it makes sense to provide this value as an interval instead of
the number of milliseconds to make it more human-readable.
Yeah we could do so, but that would mean:
1. Write a dedicated "pg_stat_get_progress_info()" function for VACUUM. Indeed,
the current pg_stat_get_progress_info() is shared across multiple "commands" and
then we wouldn't be able to change its output types in pg_proc.dat.
Or
2. Make use of make_interval() in the pg_stat_progress_vacuum view creation.
I don't like 1. that much, and given that it would be as simple as:
"
select make_interval(secs => time_delayed / 1000) from pg_stat_progress_vacuum;
"
for an end user to display an interval, I'm not sure we should provide an interval
by default.
That said, I agree that milliseconds is not really human-readable and
does not provide that much added value (except flexibility), so I'd vote for 2.
if you feel we should provide an interval by default.
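FWIW, for 2. the change in the view creation could be as simple as something
like this (a sketch only; param11 is the slot used by the current patch and
total_delay the name discussed above):
"
CREATE VIEW pg_stat_progress_vacuum AS
    SELECT ...,
           make_interval(secs => S.param11 / 1000.0) AS total_delay
    FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
"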
I might also
suggest some changes to the description:
Total accumulated time spent sleeping due to the cost-based vacuum
delay settings (e.g., vacuum_cost_delay, vacuum_cost_limit). This
includes the time that any associated parallel workers have slept, too.
However, parallel workers report their sleep time no more frequently
than once per second, so the reported value may be slightly stale.
Yeah I like it, thanks! Now, I'm wondering if we should not also add something
like this:
"
Since multiple workers can sleep simultaneously, the total sleep time can exceed
the actual duration of the vacuum operation.
"
As it could be surprising to see this behavior in action.
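(For instance, if the leader and three parallel workers each slept 30 seconds
during a 60-second vacuum, the column would report 120 seconds of total sleep
time.)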
Thoughts?
I'll provide an updated patch version once we agree on the above points.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Dec 11, 2024 at 07:00:50AM +0000, Bertrand Drouvot wrote:
On Tue, Dec 10, 2024 at 11:55:41AM -0600, Nathan Bossart wrote:
I wonder if it makes sense to provide this value as an interval instead of
the number of milliseconds to make it more human-readable.
Yeah we could do so, but that would mean:
1. Write a dedicated "pg_stat_get_progress_info()" function for VACUUM. Indeed,
the current pg_stat_get_progress_info() is shared across multiple "commands" and
then we wouldn't be able to change its output types in pg_proc.dat.
Or
2. Make use of make_interval() in the pg_stat_progress_vacuum view creation.
I don't like 1. that much, and given that it would be as simple as:
"
select make_interval(secs => time_delayed / 1000) from pg_stat_progress_vacuum;
"
for an end user to display an interval, I'm not sure we should provide an interval
by default.
That said, I agree that milliseconds is not really human-readable and
does not provide that much added value (except flexibility), so I'd vote for 2.
if you feel we should provide an interval by default.
That's roughly what I had in mind.
Yeah I like it, thanks! Now, I'm wondering whether we should also add something
like this:
"
Since multiple workers can sleep simultaneously, the total sleep time can exceed
the actual duration of the vacuum operation.
"As that could be surprising to see this behavior in action.
I'd vote for leaving that out, if for no other reason than it can be
deduced from the rest of the description.
--
nathan
Hi,
On Wed, Dec 11, 2024 at 10:40:04AM -0600, Nathan Bossart wrote:
That's roughly what I had in mind.
Thanks for confirming, done that way in v11 attached.
I'd vote for leaving that out, if for no other reason than it can be
deduced from the rest of the description.
Yeah, fair enough.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v11-0001-Report-the-total-amount-of-time-that-vacuum-has-.patchtext/x-diff; charset=us-asciiDownload
From 35e6075791000498ed05a7eb62fd34616957c4ce Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Jun 2024 08:43:26 +0000
Subject: [PATCH v11] Report the total amount of time that vacuum has been
delayed due to cost delay
This commit adds one column: total_delay to the pg_stat_progress_vacuum system
view to show the total accumulated time spent sleeping due to the cost-based
vacuum delay settings.
This uses the new parallel message type for progress reporting added
by f1889729dd.
In the case of a parallel worker, to avoid interrupting the leader too frequently
(while it might be sleeping for cost delay), a report is sent only if the last
report was made more than 1 second ago.
Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.
Indeed, when deciding about the throttling:
1. The number of parallel workers should not come into play:
1.1) the more parallel workers are used, the less the leader's impact on
the vacuum index phase duration/workload (because the work is distributed
across more processes).
1.2) the fewer parallel workers there are, the less the leader will be
interrupted (fewer parallel workers would report their delayed time).
2. The cost limit should not come into play, as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).
3. The cost delay does not come into play, as the leader could be interrupted at
the beginning, the middle, or any other part of the wait, and we are more
interested in the frequency of the interrupts.
4. A 1 second reporting "throttling" looks like a reasonable threshold as:
4.1 the idea is to have a significant impact when the leader could have been
interrupted, say, hundreds or thousands of times per second.
4.2 it does not make that much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).
4.3 the total_delay column is an interval data type
XXX: Would need to bump catversion because this changes the definition of
pg_stat_progress_vacuum.
---
doc/src/sgml/monitoring.sgml | 14 ++++++++
src/backend/catalog/system_views.sql | 3 +-
src/backend/commands/vacuum.c | 49 +++++++++++++++++++++++++++
src/backend/commands/vacuumparallel.c | 7 ++++
src/include/commands/progress.h | 1 +
src/include/commands/vacuum.h | 1 +
src/test/regress/expected/rules.out | 3 +-
7 files changed, 76 insertions(+), 2 deletions(-)
22.6% doc/src/sgml/
5.0% src/backend/catalog/
63.5% src/backend/commands/
3.3% src/include/commands/
5.3% src/test/regress/expected/
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 840d7f8161..995a35618d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6428,6 +6428,20 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>total_delay</structfield> <type>interval</type>
+ </para>
+ <para>
+ Total accumulated time spent sleeping due to the cost-based vacuum
+ delay settings (e.g., <xref linkend="guc-vacuum-cost-delay"/>,
+ <xref linkend="guc-vacuum-cost-limit"/>). This includes the time that
+ any associated parallel workers have slept, too. However, parallel workers
+ report their sleep time no more frequently than once per second, so the
+ reported value may be slightly stale.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index da9a8fe99f..494b2e348d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1222,7 +1222,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed,
+ make_interval(secs => S.param11 / 1000) AS total_delay
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb639ef51f..e01444b417 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -59,6 +60,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum amount of time (in ms) between two reports of the delayed time from a
+ * parallel worker to the leader. The goal is to avoid the leader to be
+ * interrupted too frequently while it might be sleeping for cost delay.
+ */
+#define WORKER_REPORT_DELAY_INTERVAL 1000
/*
* GUC parameters
@@ -102,6 +109,16 @@ pg_atomic_uint32 *VacuumSharedCostBalance = NULL;
pg_atomic_uint32 *VacuumActiveNWorkers = NULL;
int VacuumCostBalanceLocal = 0;
+/*
+ * In case of parallel workers, the last time the delay has been reported to
+ * the leader.
+ * We assume this initializes to zero.
+ */
+static instr_time last_report_time;
+
+/* total nap time between two reports */
+double nap_time_since_last_report = 0;
+
/* non-export function prototypes */
static List *expand_vacuum_rel(VacuumRelation *vrel,
MemoryContext vac_context, int options);
@@ -2402,13 +2419,45 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_end;
+ instr_time delayed_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+ INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+ INSTR_TIME_SET_CURRENT(delay_end);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ INSTR_TIME_SET_ZERO(delayed_time);
+ INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start);
+
+ /* Parallel worker */
+ if (IsParallelWorker())
+ {
+ instr_time time_since_last_report;
+
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+ last_report_time);
+ nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+ if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TOTAL_DELAY,
+ nap_time_since_last_report);
+ nap_time_since_last_report = 0;
+ last_report_time = delay_end;
+ }
+ }
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TOTAL_DELAY,
+ INSTR_TIME_GET_MILLISEC(delayed_time));
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a56..5efb546844 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1087,6 +1087,13 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ /*
+ * Report the amount of time we slept (or we might get incomplete data due
+ * to WORKER_REPORT_DELAY_INTERVAL).
+ */
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TOTAL_DELAY,
+ nap_time_since_last_report);
+
TidStoreDetach(dead_items);
/* Pop the error context stack */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..28b5e16b5b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_TOTAL_DELAY 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..7a3ff07ec0 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -312,6 +312,7 @@ extern PGDLLIMPORT int VacuumCostBalanceLocal;
extern PGDLLIMPORT bool VacuumFailsafeActive;
extern PGDLLIMPORT double vacuum_cost_delay;
extern PGDLLIMPORT int vacuum_cost_limit;
+extern PGDLLIMPORT double nap_time_since_last_report;
/* in commands/vacuum.c */
extern void ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fe..0329812a34 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2056,7 +2056,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ make_interval(secs => ((s.param11 / 1000))::double precision) AS total_delay
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
On Thu, Dec 12, 2024 at 04:36:21AM +0000, Bertrand Drouvot wrote:
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1222,7 +1222,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed,
+ make_interval(secs => S.param11 / 1000) AS total_delay
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
I think we need to cast one of the operands to "double precision" to avoid
chopping off the fractional part of the result of the division, which seems
important for this case since we are dealing with lots of small sleeps.
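To illustrate the truncation (invented values):
"
-- Integer division chops off the fractional milliseconds:
select make_interval(secs => 12345 / 1000);                    -- 00:00:12
-- Casting one operand to double precision keeps them:
select make_interval(secs => 12345 / 1000::double precision);  -- 00:00:12.345
"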
Otherwise, at a glance, I think this one is just about ready for commit.
--
nathan
Hi,
On Thu, Dec 12, 2024 at 10:15:18AM -0600, Nathan Bossart wrote:
On Thu, Dec 12, 2024 at 04:36:21AM +0000, Bertrand Drouvot wrote:
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1222,7 +1222,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed,
+ make_interval(secs => S.param11 / 1000) AS total_delay
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
I think we need to cast one of the operands to "double precision" to avoid
chopping off the fractional part of the result of the division, which seems
important for this case since we are dealing with lots of small sleeps.
Makes sense, done in the attached.
Otherwise, at a glance, I think this one is just about ready for commit.
Thanks for looking at it!
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v12-0001-Report-the-total-amount-of-time-that-vacuum-has-.patchtext/x-diff; charset=us-asciiDownload
From 1c9223247b4b89ecc4f50300a2b5fcfa3c806dcd Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 24 Jun 2024 08:43:26 +0000
Subject: [PATCH v12] Report the total amount of time that vacuum has been
delayed due to cost delay
This commit adds one column: total_delay to the pg_stat_progress_vacuum system
view to show the total accumulated time spent sleeping due to the cost-based
vacuum delay settings.
This uses the new parallel message type for progress reporting added
by f1889729dd.
In the case of a parallel worker, to avoid interrupting the leader too frequently
(while it might be sleeping for cost delay), a report is sent only if the last
report was made more than 1 second ago.
Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.
Indeed, when deciding about the throttling:
1. The number of parallel workers should not come into play:
1.1) the more parallel workers are used, the less the leader's impact on
the vacuum index phase duration/workload (because the work is distributed
across more processes).
1.2) the fewer parallel workers there are, the less the leader will be
interrupted (fewer parallel workers would report their delayed time).
2. The cost limit should not come into play, as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).
3. The cost delay does not come into play, as the leader could be interrupted at
the beginning, the middle, or any other part of the wait, and we are more
interested in the frequency of the interrupts.
4. A 1 second reporting "throttling" looks like a reasonable threshold as:
4.1 the idea is to have a significant impact when the leader could have been
interrupted, say, hundreds or thousands of times per second.
4.2 it does not make that much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).
XXX: Would need to bump catversion because this changes the definition of
pg_stat_progress_vacuum.
---
doc/src/sgml/monitoring.sgml | 14 ++++++++
src/backend/catalog/system_views.sql | 3 +-
src/backend/commands/vacuum.c | 49 +++++++++++++++++++++++++++
src/backend/commands/vacuumparallel.c | 7 ++++
src/include/commands/progress.h | 1 +
src/include/commands/vacuum.h | 1 +
src/test/regress/expected/rules.out | 3 +-
7 files changed, 76 insertions(+), 2 deletions(-)
22.3% doc/src/sgml/
5.5% src/backend/catalog/
62.6% src/backend/commands/
3.3% src/include/commands/
6.0% src/test/regress/expected/
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 840d7f8161..995a35618d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6428,6 +6428,20 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>total_delay</structfield> <type>interval</type>
+ </para>
+ <para>
+ Total accumulated time spent sleeping due to the cost-based vacuum
+ delay settings (e.g., <xref linkend="guc-vacuum-cost-delay"/>,
+ <xref linkend="guc-vacuum-cost-limit"/>). This includes the time that
+ any associated parallel workers have slept, too. However, parallel workers
+ report their sleep time no more frequently than once per second, so the
+ reported value may be slightly stale.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index da9a8fe99f..bd97d70393 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1222,7 +1222,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed,
+ make_interval(secs => S.param11 / 1000::double precision) AS total_delay
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb639ef51f..e01444b417 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -59,6 +60,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum amount of time (in ms) between two reports of the delayed time from a
+ * parallel worker to the leader. The goal is to avoid the leader to be
+ * interrupted too frequently while it might be sleeping for cost delay.
+ */
+#define WORKER_REPORT_DELAY_INTERVAL 1000
/*
* GUC parameters
@@ -102,6 +109,16 @@ pg_atomic_uint32 *VacuumSharedCostBalance = NULL;
pg_atomic_uint32 *VacuumActiveNWorkers = NULL;
int VacuumCostBalanceLocal = 0;
+/*
+ * In case of parallel workers, the last time the delay has been reported to
+ * the leader.
+ * We assume this initializes to zero.
+ */
+static instr_time last_report_time;
+
+/* total nap time between two reports */
+double nap_time_since_last_report = 0;
+
/* non-export function prototypes */
static List *expand_vacuum_rel(VacuumRelation *vrel,
MemoryContext vac_context, int options);
@@ -2402,13 +2419,45 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_end;
+ instr_time delayed_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+ INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+ INSTR_TIME_SET_CURRENT(delay_end);
pgstat_report_wait_end();
+ /* Report the amount of time we slept */
+ INSTR_TIME_SET_ZERO(delayed_time);
+ INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start);
+
+ /* Parallel worker */
+ if (IsParallelWorker())
+ {
+ instr_time time_since_last_report;
+
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+ last_report_time);
+ nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+ if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TOTAL_DELAY,
+ nap_time_since_last_report);
+ nap_time_since_last_report = 0;
+ last_report_time = delay_end;
+ }
+ }
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_TOTAL_DELAY,
+ INSTR_TIME_GET_MILLISEC(delayed_time));
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a56..5efb546844 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1087,6 +1087,13 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ /*
+ * Report the amount of time we slept (or we might get incomplete data due
+ * to WORKER_REPORT_DELAY_INTERVAL).
+ */
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TOTAL_DELAY,
+ nap_time_since_last_report);
+
TidStoreDetach(dead_items);
/* Pop the error context stack */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..28b5e16b5b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_TOTAL_DELAY 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..7a3ff07ec0 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -312,6 +312,7 @@ extern PGDLLIMPORT int VacuumCostBalanceLocal;
extern PGDLLIMPORT bool VacuumFailsafeActive;
extern PGDLLIMPORT double vacuum_cost_delay;
extern PGDLLIMPORT int vacuum_cost_limit;
+extern PGDLLIMPORT double nap_time_since_last_report;
/* in commands/vacuum.c */
extern void ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fe..2e48ccb024 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2056,7 +2056,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ make_interval(secs => ((s.param11)::double precision / (1000)::double precision)) AS total_delay
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
Hello!
+/*
+ * In case of parallel workers, the last time the delay has been reported to
+ * the leader.
+ * We assume this initializes to zero.
+ */
+static instr_time last_report_time;
Maybe last_report_time would be better named worker_last_report_time? (It is not clear to me from the comment that the variable is not used by the leader or autovacuum worker at all)
+ /* Parallel worker */
+ if (IsParallelWorker())
I think this comment doesn't add value (just repeats the code), maybe delete it?
I was surprised that the patch does not add reporting for log_autovacuum_min_duration. But I see it was discussed earlier, great. (postponed for another topic&patch)
- code looks good
- docs pass (not a native English speaker)
- check-world pass
regards, Sergei
I spent some time preparing v12 for commit and made the following larger
changes:
* I renamed the column to delay_time and changed it back to reporting
milliseconds to match other stats views like pg_stat_io.
* I optimized the code in vacuum_delay_point a bit. Notably, we're now
just storing the nanoseconds value in the pgstat param, so we now have to
divide by 1,000,000 in the views.
* I added a track_cost_delay_timing parameter that is off by default. The
new timing code is only used when this parameter is turned on. This is
meant to match parameters like track_io_timing. I felt that this was
important since this is relatively hot code.
* I also added delay_time to pg_stat_progress_analyze. It seems to use the
same vacuum_delay_point() function, so we actually need to do a bit of
refactoring to make sure the right pgstat param is incremented.
I think this has been discussed in the thread a bit already, but I do think
we should consider also adding this information to the vacuum/analyze log
messages and to the output of VACUUM/ANALYZE (VERBOSE). That needn't hold
up this patch, though.
Finally, I can't help but feel that the way we are adding this information
is a bit weird, both in how we are doing it and where we are presenting the
results. I don't see any reason that pgstat_progress_incr_param() and
friends can't handle this information, but I don't see any existing uses
for timing information. Plus, IMHO it's debatable whether the delay time
is really "progress" information, although I haven't thought of a better
place (existing or new) for it.
Thoughts?
--
nathan
Attachments:
v13-0001-Add-cost-based-delay-time-to-progress-views.patchtext/plain; charset=us-asciiDownload
From e1b91d3a1ac8700f2a29acd206f7c35e429dd2e4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 13 Dec 2024 21:37:39 -0600
Subject: [PATCH v13 1/1] Add cost-based delay time to progress views.
XXX: NEEDS CATVERSION BUMP
Author: Bertrand Drouvot
Reviewed-by: Sami Imseih, Robert Haas, Masahiko Sawada, Masahiro Ikeda, Dilip Kumar, Sergei Kornilov
Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
---
contrib/file_fdw/file_fdw.c | 2 +-
doc/src/sgml/config.sgml | 24 ++++++
doc/src/sgml/monitoring.sgml | 27 ++++++
src/backend/catalog/system_views.sql | 6 +-
src/backend/commands/analyze.c | 10 +--
src/backend/commands/vacuum.c | 82 ++++++++++++++++++-
src/backend/commands/vacuumparallel.c | 5 ++
src/backend/tsearch/ts_typanalyze.c | 2 +-
src/backend/utils/adt/array_typanalyze.c | 2 +-
src/backend/utils/adt/rangetypes_typanalyze.c | 2 +-
src/backend/utils/misc/guc_tables.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/progress.h | 2 +
src/include/commands/vacuum.h | 4 +
src/test/regress/expected/rules.out | 6 +-
15 files changed, 169 insertions(+), 15 deletions(-)
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 1c81a7c073..088961b61b 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -1237,7 +1237,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
for (;;)
{
/* Check for user-requested abort or sleep */
- vacuum_delay_point();
+ analyze_delay_point();
/* Fetch next row */
MemoryContextReset(tupcontext);
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e0c8325a39..9c5d44b24d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8386,6 +8386,30 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</listitem>
</varlistentry>
+ <varlistentry id="guc-track-cost-delay-timing" xreflabel="track_cost_delay_timing">
+ <term><varname>track_cost_delay_timing</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>track_cost_delay_timing</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables timing of cost-based vacuum delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>). This parameter
+ is off by default, as it will repeatedly query the operating system for
+ the current time, which may cause significant overhead on some
+ platforms. You can use the <xref linkend="pgtesttiming"/> tool to
+ measure the overhead of timing on your system. Cost-based vacuum delay
+ timing information is displayed in
+ <link linkend="vacuum-progress-reporting"><structname>pg_stat_progress_vacuum</structname></link>
+ and
+ <link linkend="analyze-progress-reporting"><structname>pg_stat_progress_analyze</structname></link>.
+ Only superusers and users with the appropriate <literal>SET</literal>
+ privilege and change this setting.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
<term><varname>track_io_timing</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 840d7f8161..4cf8486225 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5503,6 +5503,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>acquiring inherited sample rows</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>delay_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total time spent sleeping due to cost-based delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>), in milliseconds
+ (if <xref linkend="guc-track-cost-delay-timing"/> is enabled, otherwise
+ zero).
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -6428,6 +6440,21 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>delay_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total time spent sleeping due to cost-based delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>), in milliseconds
+ (if <xref linkend="guc-track-cost-delay-timing"/> is enabled, otherwise
+ zero). This includes the time that any associated parallel workers have
+ slept. However, parallel workers report their sleep time no more
+ frequently than once per second, so the reported value may be slightly
+ stale.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index da9a8fe99f..d969e2ac40 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1202,7 +1202,8 @@ CREATE VIEW pg_stat_progress_analyze AS
S.param5 AS ext_stats_computed,
S.param6 AS child_tables_total,
S.param7 AS child_tables_done,
- CAST(S.param8 AS oid) AS current_child_table_relid
+ CAST(S.param8 AS oid) AS current_child_table_relid,
+ S.param9 / 1000000::double precision AS delay_time
FROM pg_stat_get_progress_info('ANALYZE') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1222,7 +1223,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed,
+ S.param11 / 1000000::double precision AS delay_time
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9a56de2282..7656f92bee 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -913,7 +913,7 @@ compute_index_stats(Relation onerel, double totalrows,
{
HeapTuple heapTuple = rows[rowno];
- vacuum_delay_point();
+ analyze_delay_point();
/*
* Reset the per-tuple context each time, to reclaim any cruft
@@ -1232,7 +1232,7 @@ acquire_sample_rows(Relation onerel, int elevel,
/* Outer loop over blocks to sample */
while (table_scan_analyze_next_block(scan, stream))
{
- vacuum_delay_point();
+ analyze_delay_point();
while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
{
@@ -1964,7 +1964,7 @@ compute_trivial_stats(VacAttrStatsP stats,
Datum value;
bool isnull;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, i, &isnull);
@@ -2080,7 +2080,7 @@ compute_distinct_stats(VacAttrStatsP stats,
int firstcount1,
j;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, i, &isnull);
@@ -2427,7 +2427,7 @@ compute_scalar_stats(VacAttrStatsP stats,
Datum value;
bool isnull;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, i, &isnull);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb639ef51f..1c2b0c5932 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -59,6 +60,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum interval for cost-based vacuum delay reports from a parallel worker.
+ * This aims to avoid sending too many messages and waking up the leader too
+ * frequently.
+ */
+#define PARALLEL_VACUUM_WORKER_DELAY_REPORT_INTERVAL_NS (NS_PER_S)
/*
* GUC parameters
@@ -69,6 +76,7 @@ int vacuum_multixact_freeze_min_age;
int vacuum_multixact_freeze_table_age;
int vacuum_failsafe_age;
int vacuum_multixact_failsafe_age;
+bool track_cost_delay_timing;
/*
* Variables for cost-based vacuum delay. The defaults differ between
@@ -79,6 +87,11 @@ int vacuum_multixact_failsafe_age;
double vacuum_cost_delay = 0;
int vacuum_cost_limit = 200;
+/*
+ * Variable for reporting cost-based vacuum delay from parallel workers.
+ */
+int64 parallel_vacuum_worker_delay_ns = 0;
+
/*
* VacuumFailsafeActive is a defined as a global so that we can determine
* whether or not to re-enable cost-based vacuum delay when vacuuming a table.
@@ -2358,8 +2371,8 @@ vac_close_indexes(int nindexes, Relation *Irel, LOCKMODE lockmode)
* This should be called in each major loop of VACUUM processing,
* typically once per page processed.
*/
-void
-vacuum_delay_point(void)
+static void
+vacuum_delay_point_internal(bool is_analyze)
{
double msec = 0;
@@ -2402,13 +2415,66 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
+ if (track_cost_delay_timing)
+ INSTR_TIME_SET_CURRENT(delay_start);
+
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
pg_usleep(msec * 1000);
pgstat_report_wait_end();
+ if (track_cost_delay_timing)
+ {
+ instr_time delay_end;
+ instr_time delay;
+
+ INSTR_TIME_SET_CURRENT(delay_end);
+ INSTR_TIME_SET_ZERO(delay);
+ INSTR_TIME_ACCUM_DIFF(delay, delay_end, delay_start);
+
+ /*
+ * For parallel workers, we only report the delay time every once
+ * in a while to avoid overloading the leader with messages and
+ * interrupts.
+ */
+ if (IsParallelWorker())
+ {
+ static instr_time last_report_time;
+ instr_time time_since_last_report;
+
+ Assert(!is_analyze);
+
+ /* accumulate the delay time */
+ parallel_vacuum_worker_delay_ns += INSTR_TIME_GET_NANOSEC(delay);
+
+ /* calculate interval since last report */
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end, last_report_time);
+
+ /* if we haven't reported in a while, do so now */
+ if (INSTR_TIME_GET_NANOSEC(time_since_last_report) >=
+ PARALLEL_VACUUM_WORKER_DELAY_REPORT_INTERVAL_NS)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ parallel_vacuum_worker_delay_ns);
+
+ /* reset variables */
+ last_report_time = delay_end;
+ parallel_vacuum_worker_delay_ns = 0;
+ }
+ }
+ else if (is_analyze)
+ pgstat_progress_incr_param(PROGRESS_ANALYZE_DELAY_TIME,
+ INSTR_TIME_GET_NANOSEC(delay));
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ INSTR_TIME_GET_NANOSEC(delay));
+ }
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
@@ -2435,6 +2501,18 @@ vacuum_delay_point(void)
}
}
+void
+vacuum_delay_point(void)
+{
+ vacuum_delay_point_internal(false);
+}
+
+void
+analyze_delay_point(void)
+{
+ vacuum_delay_point_internal(true);
+}
+
/*
* Computes the vacuum delay for parallel workers.
*
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a56..ea5940e299 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1087,6 +1087,11 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ /* Report any remaining cost-based vacuum delay time */
+ if (track_cost_delay_timing)
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ parallel_vacuum_worker_delay_ns);
+
TidStoreDetach(dead_items);
/* Pop the error context stack */
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index ccafe42729..eff80980ab 100644
--- a/src/backend/tsearch/ts_typanalyze.c
+++ b/src/backend/tsearch/ts_typanalyze.c
@@ -204,7 +204,7 @@ compute_tsvector_stats(VacAttrStats *stats,
char *lexemesptr;
int j;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, vector_no, &isnull);
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 2c633bee6b..6491be8b3b 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -314,7 +314,7 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
int distinct_count;
bool count_item_found;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, array_no, &isnull);
if (isnull)
diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c
index 3773f98115..1567ceba23 100644
--- a/src/backend/utils/adt/rangetypes_typanalyze.c
+++ b/src/backend/utils/adt/rangetypes_typanalyze.c
@@ -167,7 +167,7 @@ compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
upper;
float8 length;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, range_no, &isnull);
if (isnull)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8cf1afbad2..031fcc43e3 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1469,6 +1469,15 @@ struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"track_cost_delay_timing", PGC_SUSET, STATS_CUMULATIVE,
+ gettext_noop("Collects timing statistics for cost-based vacuum delay."),
+ NULL
+ },
+ &track_cost_delay_timing,
+ false,
+ NULL, NULL, NULL
+ },
{
{"track_io_timing", PGC_SUSET, STATS_CUMULATIVE,
gettext_noop("Collects timing statistics for database I/O activity."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a2ac7575ca..c7fd4f26f0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -640,6 +640,7 @@
#track_activities = on
#track_activity_query_size = 1024 # (change requires restart)
#track_counts = on
+#track_cost_delay_timing = off
#track_io_timing = off
#track_wal_io_timing = off
#track_functions = none # none, pl, all
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..df862192a6 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_DELAY_TIME 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
@@ -46,6 +47,7 @@
#define PROGRESS_ANALYZE_CHILD_TABLES_TOTAL 5
#define PROGRESS_ANALYZE_CHILD_TABLES_DONE 6
#define PROGRESS_ANALYZE_CURRENT_CHILD_TABLE_RELID 7
+#define PROGRESS_ANALYZE_DELAY_TIME 8
/* Phases of analyze (as advertised via PROGRESS_ANALYZE_PHASE) */
#define PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..f3f0abc87f 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -296,6 +296,7 @@ extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
extern PGDLLIMPORT int vacuum_failsafe_age;
extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
+extern PGDLLIMPORT bool track_cost_delay_timing;
/*
* Maximum value for default_statistics_target and per-column statistics
@@ -313,6 +314,8 @@ extern PGDLLIMPORT bool VacuumFailsafeActive;
extern PGDLLIMPORT double vacuum_cost_delay;
extern PGDLLIMPORT int vacuum_cost_limit;
+extern PGDLLIMPORT int64 parallel_vacuum_worker_delay_ns;
+
/* in commands/vacuum.c */
extern void ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel);
extern void vacuum(List *relations, VacuumParams *params,
@@ -340,6 +343,7 @@ extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
extern void vac_update_datfrozenxid(void);
extern void vacuum_delay_point(void);
+extern void analyze_delay_point(void);
extern bool vacuum_is_permitted_for_relation(Oid relid, Form_pg_class reltuple,
bits32 options);
extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fe..4796bbd01c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1926,7 +1926,8 @@ pg_stat_progress_analyze| SELECT s.pid,
s.param5 AS ext_stats_computed,
s.param6 AS child_tables_total,
s.param7 AS child_tables_done,
- (s.param8)::oid AS current_child_table_relid
+ (s.param8)::oid AS current_child_table_relid,
+ ((s.param9)::double precision / (1000000)::double precision) AS delay_time
FROM (pg_stat_get_progress_info('ANALYZE'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_basebackup| SELECT pid,
@@ -2056,7 +2057,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ ((s.param11)::double precision / (1000000)::double precision) AS delay_time
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.39.5 (Apple Git-154)
Hi
+ Only superusers and users with the appropriate <literal>SET</literal>
+ privilege and change this setting.
a typo? should be "can change"?
I like the separation of vacuum_delay_point and analyze_delay_point; it improves the readability of the analyze code. Looks good. I would like to enable track_cost_delay_timing by default, but the analogy with track_io_timing is good... I agree that it is better to have it off by default.
regards, Sergei
Hi,
On Fri, Dec 13, 2024 at 10:06:08PM -0600, Nathan Bossart wrote:
I spent some time preparing v12 for commit and made the following larger
changes:
Thanks!
* I renamed the column to delay_time and changed it back to reporting
milliseconds to match other stats views like pg_stat_io.
Okay better to be consistent.
* I optimized the code in vacuum_delay_point a bit. Notably, we're now
just storing the nanoseconds value in the pgstat param,
Right, using nanoseconds means fewer computations/conversions in the C code.
so we now have to divide by 1,000,000 in the views.
So, reading the output, one can effectively get nanosecond precision for the
cost delay wait. I'm not sure it's needed to give this level of precision for the
delay time (while I fully agree it has to be done for the I/O related timing).
OTOH, it does not hurt to give this level of precision, so I'm fine with it.
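For reference, this is what the view's expression does with an invented raw value:
"
-- param11 in nanoseconds, exposed as fractional milliseconds:
select 167100142 / 1000000::double precision as delay_time;  -- 167.100142
"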
* I added a track_cost_delay_timing parameter that is off by default. The
new timing code is only used when this parameter is turned on. This is
meant to match parameters like track_io_timing. I felt that this was
important since this is relatively hot code.
Fully agree.
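As a side note, here is a minimal way to observe it once the patch is in (a
sketch; "my_table" is made up, and the progress row only exists while the
VACUUM runs, so the view has to be queried from a second session):
"
-- Session 1 (superuser, or a role with SET privilege on the GUC):
SET track_cost_delay_timing = on;
SET vacuum_cost_delay = '2ms';   -- manual VACUUM defaults to no cost delay
VACUUM my_table;

-- Session 2, while the VACUUM above is running:
SELECT pid, relid::regclass, phase, delay_time
FROM pg_stat_progress_vacuum;
"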
* I also added delay_time to pg_stat_progress_analyze. It seems to use the
same vacuum_delay_point() function, so we actually need to do a bit of
refactoring to make sure the right pgstat param is incremented.
Good idea!
I think this has been discussed in the thread a bit already, but I do think
we should consider also adding this information to the vacuum/analyze log
messages and to the output of VACUUM/ANALYZE (VERBOSE). That needn't hold
up this patch, though.
Yes, that would be a nice next step to do.
Finally, I can't help but feel that the way we are adding this information
is a bit weird, both in how we are doing it and where we are presenting the
results. I don't see any reason that pgstat_progress_incr_param() and
friends can't handle this information, but I don't see any existing uses
for timing information. Plus, IMHO it's debatable whether the delay time
is really "progress" information, although I haven't thought of a better
place (existing or new) for it.
I agree that pgstat_progress_incr_param() was originally designed for progress
counters rather than timing data. I also agree that it's not "real" progress
information. I see it more like "let's look at it while checking the progress".
I think what we have is a pragmatic approach: use the existing progress reporting
system even though it's not a perfect conceptual fit, rather than creating new
"infrastructure" just for this timing data.
I think that's fine as we could still change our mind should new "timing data"
be added in the future.
A few comments about the patch:
=== 1
+ /* accumulate the delay time */
s/accumulate/Accumulate/ to be consistent with the surrounding code. Did it that
way in v14 attached, and in the other places too.
=== 2
+ S.param10 AS indexes_processed,
+ S.param11 / 1000000::double precision AS delay_time
The output looks like "167.100142". As said above, I'm not sure it's needed to
give this level of precision for the delay time. But that does not hurt.
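And if the extra digits are distracting, they are easy to trim on read (a sketch):
"
-- Round the millisecond value to one decimal place for display:
select round(delay_time::numeric, 1) as delay_ms from pg_stat_progress_vacuum;
"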
=== 3
+#define PARALLEL_VACUUM_WORKER_DELAY_REPORT_INTERVAL_NS (NS_PER_S)
Did not change it in v14, but "PARALLEL_VACUUM_REPORT_INTERVAL_NS" could be
an option as well. I think it keeps the key concepts while being more concise
(WORKER is somewhat implicit in the context).
=== 4
-vacuum_delay_point(void)
+static void
+vacuum_delay_point_internal(bool is_analyze)
Updated the comment on top of it accordingly.
=== 5
+ if (INSTR_TIME_GET_NANOSEC(time_since_last_report) >=
+ PARALLEL_VACUUM_WORKER_DELAY_REPORT_INTERVAL_NS)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
Added a comment to mention that PROGRESS_ANALYZE_DELAY_TIME is not of interest
here.
v14 also fixes the typo mentioned by Sergei in [1].
[1]: /messages/by-id/1983281734169163@sjg23nxaikj7vz54.iva.yp-c.yandex.net
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v14-0001-Add-cost-based-delay-time-to-progress-views.patchtext/x-diff; charset=us-asciiDownload
From 7b23ec295cf5d74acba13e72e3142af7b2ffe423 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 13 Dec 2024 21:37:39 -0600
Subject: [PATCH v14] Add cost-based delay time to progress views.
XXX: NEEDS CATVERSION BUMP
Author: Bertrand Drouvot
Reviewed-by: Sami Imseih, Robert Haas, Masahiko Sawada, Masahiro Ikeda, Dilip Kumar, Sergei Kornilov
Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
---
contrib/file_fdw/file_fdw.c | 2 +-
doc/src/sgml/config.sgml | 24 ++++++
doc/src/sgml/monitoring.sgml | 27 ++++++
src/backend/catalog/system_views.sql | 6 +-
src/backend/commands/analyze.c | 10 +--
src/backend/commands/vacuum.c | 85 ++++++++++++++++++-
src/backend/commands/vacuumparallel.c | 5 ++
src/backend/tsearch/ts_typanalyze.c | 2 +-
src/backend/utils/adt/array_typanalyze.c | 2 +-
src/backend/utils/adt/rangetypes_typanalyze.c | 2 +-
src/backend/utils/misc/guc_tables.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/progress.h | 2 +
src/include/commands/vacuum.h | 4 +
src/test/regress/expected/rules.out | 6 +-
15 files changed, 171 insertions(+), 16 deletions(-)
36.4% doc/src/sgml/
4.8% src/backend/catalog/
43.2% src/backend/commands/
3.7% src/backend/utils/misc/
3.5% src/include/commands/
5.0% src/test/regress/expected/
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 1c81a7c073..088961b61b 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -1237,7 +1237,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
for (;;)
{
/* Check for user-requested abort or sleep */
- vacuum_delay_point();
+ analyze_delay_point();
/* Fetch next row */
MemoryContextReset(tupcontext);
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e0c8325a39..c1a34b3870 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8386,6 +8386,30 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</listitem>
</varlistentry>
+ <varlistentry id="guc-track-cost-delay-timing" xreflabel="track_cost_delay_timing">
+ <term><varname>track_cost_delay_timing</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>track_cost_delay_timing</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables timing of cost-based vacuum delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>). This parameter
+ is off by default, as it will repeatedly query the operating system for
+ the current time, which may cause significant overhead on some
+ platforms. You can use the <xref linkend="pgtesttiming"/> tool to
+ measure the overhead of timing on your system. Cost-based vacuum delay
+ timing information is displayed in
+ <link linkend="vacuum-progress-reporting"><structname>pg_stat_progress_vacuum</structname></link>
+ and
+ <link linkend="analyze-progress-reporting"><structname>pg_stat_progress_analyze</structname></link>.
+ Only superusers and users with the appropriate <literal>SET</literal>
+ privilege can change this setting.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
<term><varname>track_io_timing</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 840d7f8161..4cf8486225 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5503,6 +5503,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>acquiring inherited sample rows</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>delay_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total time spent sleeping due to cost-based delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>), in milliseconds
+ (if <xref linkend="guc-track-cost-delay-timing"/> is enabled, otherwise
+ zero).
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -6428,6 +6440,21 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>delay_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total time spent sleeping due to cost-based delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>), in milliseconds
+ (if <xref linkend="guc-track-cost-delay-timing"/> is enabled, otherwise
+ zero). This includes the time that any associated parallel workers have
+ slept. However, parallel workers report their sleep time no more
+ frequently than once per second, so the reported value may be slightly
+ stale.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index da9a8fe99f..d969e2ac40 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1202,7 +1202,8 @@ CREATE VIEW pg_stat_progress_analyze AS
S.param5 AS ext_stats_computed,
S.param6 AS child_tables_total,
S.param7 AS child_tables_done,
- CAST(S.param8 AS oid) AS current_child_table_relid
+ CAST(S.param8 AS oid) AS current_child_table_relid,
+ S.param9 / 1000000::double precision AS delay_time
FROM pg_stat_get_progress_info('ANALYZE') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1222,7 +1223,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed,
+ S.param11 / 1000000::double precision AS delay_time
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9a56de2282..7656f92bee 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -913,7 +913,7 @@ compute_index_stats(Relation onerel, double totalrows,
{
HeapTuple heapTuple = rows[rowno];
- vacuum_delay_point();
+ analyze_delay_point();
/*
* Reset the per-tuple context each time, to reclaim any cruft
@@ -1232,7 +1232,7 @@ acquire_sample_rows(Relation onerel, int elevel,
/* Outer loop over blocks to sample */
while (table_scan_analyze_next_block(scan, stream))
{
- vacuum_delay_point();
+ analyze_delay_point();
while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
{
@@ -1964,7 +1964,7 @@ compute_trivial_stats(VacAttrStatsP stats,
Datum value;
bool isnull;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, i, &isnull);
@@ -2080,7 +2080,7 @@ compute_distinct_stats(VacAttrStatsP stats,
int firstcount1,
j;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, i, &isnull);
@@ -2427,7 +2427,7 @@ compute_scalar_stats(VacAttrStatsP stats,
Datum value;
bool isnull;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, i, &isnull);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb639ef51f..4af32bfebf 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -59,6 +60,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum interval for cost-based vacuum delay reports from a parallel worker.
+ * This aims to avoid sending too many messages and waking up the leader too
+ * frequently.
+ */
+#define PARALLEL_VACUUM_WORKER_DELAY_REPORT_INTERVAL_NS (NS_PER_S)
/*
* GUC parameters
@@ -69,6 +76,7 @@ int vacuum_multixact_freeze_min_age;
int vacuum_multixact_freeze_table_age;
int vacuum_failsafe_age;
int vacuum_multixact_failsafe_age;
+bool track_cost_delay_timing;
/*
* Variables for cost-based vacuum delay. The defaults differ between
@@ -79,6 +87,11 @@ int vacuum_multixact_failsafe_age;
double vacuum_cost_delay = 0;
int vacuum_cost_limit = 200;
+/*
+ * Variable for reporting cost-based vacuum delay from parallel workers.
+ */
+int64 parallel_vacuum_worker_delay_ns = 0;
+
/*
* VacuumFailsafeActive is a defined as a global so that we can determine
* whether or not to re-enable cost-based vacuum delay when vacuuming a table.
@@ -2353,13 +2366,13 @@ vac_close_indexes(int nindexes, Relation *Irel, LOCKMODE lockmode)
}
/*
- * vacuum_delay_point --- check for interrupts and cost-based delay.
+ * vacuum_delay_point_internal --- check for interrupts and cost-based delay.
*
* This should be called in each major loop of VACUUM processing,
* typically once per page processed.
*/
-void
-vacuum_delay_point(void)
+static void
+vacuum_delay_point_internal(bool is_analyze)
{
double msec = 0;
@@ -2402,13 +2415,67 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
+ if (track_cost_delay_timing)
+ INSTR_TIME_SET_CURRENT(delay_start);
+
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
pg_usleep(msec * 1000);
pgstat_report_wait_end();
+ if (track_cost_delay_timing)
+ {
+ instr_time delay_end;
+ instr_time delay;
+
+ INSTR_TIME_SET_CURRENT(delay_end);
+ INSTR_TIME_SET_ZERO(delay);
+ INSTR_TIME_ACCUM_DIFF(delay, delay_end, delay_start);
+
+ /*
+ * For parallel workers, we only report the delay time every once
+ * in a while to avoid overloading the leader with messages and
+ * interrupts.
+ */
+ if (IsParallelWorker())
+ {
+ static instr_time last_report_time;
+ instr_time time_since_last_report;
+
+ Assert(!is_analyze);
+
+ /* Accumulate the delay time */
+ parallel_vacuum_worker_delay_ns += INSTR_TIME_GET_NANOSEC(delay);
+
+ /* Calculate interval since last report */
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end, last_report_time);
+
+ /* If we haven't reported in a while, do so now */
+ if (INSTR_TIME_GET_NANOSEC(time_since_last_report) >=
+ PARALLEL_VACUUM_WORKER_DELAY_REPORT_INTERVAL_NS)
+ {
+ /* PROGRESS_ANALYZE_DELAY_TIME can't be of interest */
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ parallel_vacuum_worker_delay_ns);
+
+ /* Reset variables */
+ last_report_time = delay_end;
+ parallel_vacuum_worker_delay_ns = 0;
+ }
+ }
+ else if (is_analyze)
+ pgstat_progress_incr_param(PROGRESS_ANALYZE_DELAY_TIME,
+ INSTR_TIME_GET_NANOSEC(delay));
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ INSTR_TIME_GET_NANOSEC(delay));
+ }
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
@@ -2435,6 +2502,18 @@ vacuum_delay_point(void)
}
}
+void
+vacuum_delay_point(void)
+{
+ vacuum_delay_point_internal(false);
+}
+
+void
+analyze_delay_point(void)
+{
+ vacuum_delay_point_internal(true);
+}
+
/*
* Computes the vacuum delay for parallel workers.
*
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a56..ea5940e299 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1087,6 +1087,11 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ /* Report any remaining cost-based vacuum delay time */
+ if (track_cost_delay_timing)
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ parallel_vacuum_worker_delay_ns);
+
TidStoreDetach(dead_items);
/* Pop the error context stack */
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index ccafe42729..eff80980ab 100644
--- a/src/backend/tsearch/ts_typanalyze.c
+++ b/src/backend/tsearch/ts_typanalyze.c
@@ -204,7 +204,7 @@ compute_tsvector_stats(VacAttrStats *stats,
char *lexemesptr;
int j;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, vector_no, &isnull);
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 2c633bee6b..6491be8b3b 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -314,7 +314,7 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
int distinct_count;
bool count_item_found;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, array_no, &isnull);
if (isnull)
diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c
index 3773f98115..1567ceba23 100644
--- a/src/backend/utils/adt/rangetypes_typanalyze.c
+++ b/src/backend/utils/adt/rangetypes_typanalyze.c
@@ -167,7 +167,7 @@ compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
upper;
float8 length;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, range_no, &isnull);
if (isnull)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8cf1afbad2..031fcc43e3 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1469,6 +1469,15 @@ struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"track_cost_delay_timing", PGC_SUSET, STATS_CUMULATIVE,
+ gettext_noop("Collects timing statistics for cost-based vacuum delay."),
+ NULL
+ },
+ &track_cost_delay_timing,
+ false,
+ NULL, NULL, NULL
+ },
{
{"track_io_timing", PGC_SUSET, STATS_CUMULATIVE,
gettext_noop("Collects timing statistics for database I/O activity."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a2ac7575ca..c7fd4f26f0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -640,6 +640,7 @@
#track_activities = on
#track_activity_query_size = 1024 # (change requires restart)
#track_counts = on
+#track_cost_delay_timing = off
#track_io_timing = off
#track_wal_io_timing = off
#track_functions = none # none, pl, all
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..df862192a6 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_DELAY_TIME 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
@@ -46,6 +47,7 @@
#define PROGRESS_ANALYZE_CHILD_TABLES_TOTAL 5
#define PROGRESS_ANALYZE_CHILD_TABLES_DONE 6
#define PROGRESS_ANALYZE_CURRENT_CHILD_TABLE_RELID 7
+#define PROGRESS_ANALYZE_DELAY_TIME 8
/* Phases of analyze (as advertised via PROGRESS_ANALYZE_PHASE) */
#define PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..f3f0abc87f 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -296,6 +296,7 @@ extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
extern PGDLLIMPORT int vacuum_failsafe_age;
extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
+extern PGDLLIMPORT bool track_cost_delay_timing;
/*
* Maximum value for default_statistics_target and per-column statistics
@@ -313,6 +314,8 @@ extern PGDLLIMPORT bool VacuumFailsafeActive;
extern PGDLLIMPORT double vacuum_cost_delay;
extern PGDLLIMPORT int vacuum_cost_limit;
+extern PGDLLIMPORT int64 parallel_vacuum_worker_delay_ns;
+
/* in commands/vacuum.c */
extern void ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel);
extern void vacuum(List *relations, VacuumParams *params,
@@ -340,6 +343,7 @@ extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
extern void vac_update_datfrozenxid(void);
extern void vacuum_delay_point(void);
+extern void analyze_delay_point(void);
extern bool vacuum_is_permitted_for_relation(Oid relid, Form_pg_class reltuple,
bits32 options);
extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fe..4796bbd01c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1926,7 +1926,8 @@ pg_stat_progress_analyze| SELECT s.pid,
s.param5 AS ext_stats_computed,
s.param6 AS child_tables_total,
s.param7 AS child_tables_done,
- (s.param8)::oid AS current_child_table_relid
+ (s.param8)::oid AS current_child_table_relid,
+ ((s.param9)::double precision / (1000000)::double precision) AS delay_time
FROM (pg_stat_get_progress_info('ANALYZE'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_basebackup| SELECT pid,
@@ -2056,7 +2057,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ ((s.param11)::double precision / (1000000)::double precision) AS delay_time
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
On Mon, Dec 16, 2024 at 10:11:23AM +0000, Bertrand Drouvot wrote:
> +#define PARALLEL_VACUUM_WORKER_DELAY_REPORT_INTERVAL_NS (NS_PER_S)
>
> Did not change in v14, but "PARALLEL_VACUUM_REPORT_INTERVAL_NS" could be
> an option as well. I think it keeps the key concepts while being more
> concise (WORKER is somewhat implicit in the context).

I think it's important to keep "delay" somewhere in the name, so how about
PARALLEL_VACUUM_DELAY_REPORT_INTERVAL_NS?

> -vacuum_delay_point(void)
> +static void
> +vacuum_delay_point_internal(bool is_analyze)
>
> Updated the comment on top of it accordingly.

Thanks.  I think we need to do some additional adjustments to this
commentary since external callers should now use
vacuum/analyze_delay_point().
--
nathan
Hi,

On Mon, Dec 16, 2024 at 04:02:56PM -0600, Nathan Bossart wrote:
> On Mon, Dec 16, 2024 at 10:11:23AM +0000, Bertrand Drouvot wrote:
> > +#define PARALLEL_VACUUM_WORKER_DELAY_REPORT_INTERVAL_NS (NS_PER_S)
> >
> > Did not change in v14, but "PARALLEL_VACUUM_REPORT_INTERVAL_NS" could be
> > an option as well. I think it keeps the key concepts while being more
> > concise (WORKER is somewhat implicit in the context).
>
> I think it's important to keep "delay" somewhere in the name, so how about
> PARALLEL_VACUUM_DELAY_REPORT_INTERVAL_NS?

Yeah, sounds good to me (done in the attached).

> > -vacuum_delay_point(void)
> > +static void
> > +vacuum_delay_point_internal(bool is_analyze)
> >
> > Updated the comment on top of it accordingly.
>
> Thanks.  I think we need to do some additional adjustments to this
> commentary since external callers should now use
> vacuum/analyze_delay_point().

Agree, I gave it a try in the attached.
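As a quick illustration of what the attached exposes (a hypothetical
monitoring query, not part of the patch): with track_cost_delay_timing = on,
one can watch the sleep time accumulate while a VACUUM is running:

    -- delay_time is reported in milliseconds and, for the leader,
    -- includes the (up to one second stale) parallel worker sleep time
    SELECT pid, relid::regclass AS relation, phase, delay_time
    FROM pg_stat_progress_vacuum;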
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v15-0001-Add-cost-based-delay-time-to-progress-views.patch (text/x-diff; charset=us-ascii)
From ec98d08d407a89a52391921b00cd4267b5d07411 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 13 Dec 2024 21:37:39 -0600
Subject: [PATCH v15] Add cost-based delay time to progress views.
XXX: NEEDS CATVERSION BUMP
Author: Bertrand Drouvot
Reviewed-by: Sami Imseih, Robert Haas, Masahiko Sawada, Masahiro Ikeda, Dilip Kumar, Sergei Kornilov
Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
---
contrib/file_fdw/file_fdw.c | 2 +-
doc/src/sgml/config.sgml | 24 ++++
doc/src/sgml/monitoring.sgml | 27 +++++
src/backend/catalog/system_views.sql | 6 +-
src/backend/commands/analyze.c | 10 +-
src/backend/commands/vacuum.c | 105 +++++++++++++++++-
src/backend/commands/vacuumparallel.c | 5 +
src/backend/tsearch/ts_typanalyze.c | 2 +-
src/backend/utils/adt/array_typanalyze.c | 2 +-
src/backend/utils/adt/rangetypes_typanalyze.c | 2 +-
src/backend/utils/misc/guc_tables.c | 9 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/progress.h | 2 +
src/include/commands/vacuum.h | 4 +
src/test/regress/expected/rules.out | 6 +-
15 files changed, 189 insertions(+), 18 deletions(-)
31.9% doc/src/sgml/
4.2% src/backend/catalog/
50.2% src/backend/commands/
3.2% src/backend/utils/misc/
3.1% src/include/commands/
4.4% src/test/regress/expected/
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 1c81a7c073..088961b61b 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -1237,7 +1237,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
for (;;)
{
/* Check for user-requested abort or sleep */
- vacuum_delay_point();
+ analyze_delay_point();
/* Fetch next row */
MemoryContextReset(tupcontext);
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e0c8325a39..c1a34b3870 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8386,6 +8386,30 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</listitem>
</varlistentry>
+ <varlistentry id="guc-track-cost-delay-timing" xreflabel="track_cost_delay_timing">
+ <term><varname>track_cost_delay_timing</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>track_cost_delay_timing</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables timing of cost-based vacuum delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>). This parameter
+ is off by default, as it will repeatedly query the operating system for
+ the current time, which may cause significant overhead on some
+ platforms. You can use the <xref linkend="pgtesttiming"/> tool to
+ measure the overhead of timing on your system. Cost-based vacuum delay
+ timing information is displayed in
+ <link linkend="vacuum-progress-reporting"><structname>pg_stat_progress_vacuum</structname></link>
+ and
+ <link linkend="analyze-progress-reporting"><structname>pg_stat_progress_analyze</structname></link>.
+ Only superusers and users with the appropriate <literal>SET</literal>
+ privilege can change this setting.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
<term><varname>track_io_timing</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 840d7f8161..4cf8486225 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5503,6 +5503,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>acquiring inherited sample rows</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>delay_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total time spent sleeping due to cost-based delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>), in milliseconds
+ (if <xref linkend="guc-track-cost-delay-timing"/> is enabled, otherwise
+ zero).
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -6428,6 +6440,21 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>delay_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total time spent sleeping due to cost-based delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>), in milliseconds
+ (if <xref linkend="guc-track-cost-delay-timing"/> is enabled, otherwise
+ zero). This includes the time that any associated parallel workers have
+ slept. However, parallel workers report their sleep time no more
+ frequently than once per second, so the reported value may be slightly
+ stale.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index da9a8fe99f..d969e2ac40 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1202,7 +1202,8 @@ CREATE VIEW pg_stat_progress_analyze AS
S.param5 AS ext_stats_computed,
S.param6 AS child_tables_total,
S.param7 AS child_tables_done,
- CAST(S.param8 AS oid) AS current_child_table_relid
+ CAST(S.param8 AS oid) AS current_child_table_relid,
+ S.param9 / 1000000::double precision AS delay_time
FROM pg_stat_get_progress_info('ANALYZE') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1222,7 +1223,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed,
+ S.param11 / 1000000::double precision AS delay_time
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9a56de2282..7656f92bee 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -913,7 +913,7 @@ compute_index_stats(Relation onerel, double totalrows,
{
HeapTuple heapTuple = rows[rowno];
- vacuum_delay_point();
+ analyze_delay_point();
/*
* Reset the per-tuple context each time, to reclaim any cruft
@@ -1232,7 +1232,7 @@ acquire_sample_rows(Relation onerel, int elevel,
/* Outer loop over blocks to sample */
while (table_scan_analyze_next_block(scan, stream))
{
- vacuum_delay_point();
+ analyze_delay_point();
while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
{
@@ -1964,7 +1964,7 @@ compute_trivial_stats(VacAttrStatsP stats,
Datum value;
bool isnull;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, i, &isnull);
@@ -2080,7 +2080,7 @@ compute_distinct_stats(VacAttrStatsP stats,
int firstcount1,
j;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, i, &isnull);
@@ -2427,7 +2427,7 @@ compute_scalar_stats(VacAttrStatsP stats,
Datum value;
bool isnull;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, i, &isnull);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb639ef51f..2f3b88f4d7 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -59,6 +60,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum interval for cost-based vacuum delay reports from a parallel worker.
+ * This aims to avoid sending too many messages and waking up the leader too
+ * frequently.
+ */
+#define PARALLEL_VACUUM_DELAY_REPORT_INTERVAL_NS (NS_PER_S)
/*
* GUC parameters
@@ -69,6 +76,7 @@ int vacuum_multixact_freeze_min_age;
int vacuum_multixact_freeze_table_age;
int vacuum_failsafe_age;
int vacuum_multixact_failsafe_age;
+bool track_cost_delay_timing;
/*
* Variables for cost-based vacuum delay. The defaults differ between
@@ -79,6 +87,11 @@ int vacuum_multixact_failsafe_age;
double vacuum_cost_delay = 0;
int vacuum_cost_limit = 200;
+/*
+ * Variable for reporting cost-based vacuum delay from parallel workers.
+ */
+int64 parallel_vacuum_worker_delay_ns = 0;
+
/*
* VacuumFailsafeActive is a defined as a global so that we can determine
* whether or not to re-enable cost-based vacuum delay when vacuuming a table.
@@ -2353,13 +2366,23 @@ vac_close_indexes(int nindexes, Relation *Irel, LOCKMODE lockmode)
}
/*
- * vacuum_delay_point --- check for interrupts and cost-based delay.
+ * vacuum_delay_point_internal --- check for interrupts and cost-based delay.
+ *
+ * This should be called (through the vacuum_delay_point() or the analyze_delay_point()
+ * helpers) in each major loop of VACUUM/ANALYZE processing, typically once per
+ * page processed.
*
- * This should be called in each major loop of VACUUM processing,
- * typically once per page processed.
+ * If the track_cost_delay_timing GUC is set to on, the function tracks
+ * cumulative delay times and reports them as progress. In parallel vacuum workers,
+ * these times are accumulated and reported periodically to avoid overloading
+ * the leader with messages and interrupts. For non-parallel workers, each delay
+ * is reported immediately.
+ *
+ * The "is_analyze" parameter determines whether delays are attributed to vacuum
+ * or analyze operations in the progress reporting system.
*/
-void
-vacuum_delay_point(void)
+static void
+vacuum_delay_point_internal(bool is_analyze)
{
double msec = 0;
@@ -2402,13 +2425,67 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
+ if (track_cost_delay_timing)
+ INSTR_TIME_SET_CURRENT(delay_start);
+
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
pg_usleep(msec * 1000);
pgstat_report_wait_end();
+ if (track_cost_delay_timing)
+ {
+ instr_time delay_end;
+ instr_time delay;
+
+ INSTR_TIME_SET_CURRENT(delay_end);
+ INSTR_TIME_SET_ZERO(delay);
+ INSTR_TIME_ACCUM_DIFF(delay, delay_end, delay_start);
+
+ /*
+ * For parallel workers, we only report the delay time every once
+ * in a while to avoid overloading the leader with messages and
+ * interrupts.
+ */
+ if (IsParallelWorker())
+ {
+ static instr_time last_report_time;
+ instr_time time_since_last_report;
+
+ Assert(!is_analyze);
+
+ /* Accumulate the delay time */
+ parallel_vacuum_worker_delay_ns += INSTR_TIME_GET_NANOSEC(delay);
+
+ /* Calculate interval since last report */
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end, last_report_time);
+
+ /* If we haven't reported in a while, do so now */
+ if (INSTR_TIME_GET_NANOSEC(time_since_last_report) >=
+ PARALLEL_VACUUM_DELAY_REPORT_INTERVAL_NS)
+ {
+ /* PROGRESS_ANALYZE_DELAY_TIME can't be of interest */
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ parallel_vacuum_worker_delay_ns);
+
+ /* Reset variables */
+ last_report_time = delay_end;
+ parallel_vacuum_worker_delay_ns = 0;
+ }
+ }
+ else if (is_analyze)
+ pgstat_progress_incr_param(PROGRESS_ANALYZE_DELAY_TIME,
+ INSTR_TIME_GET_NANOSEC(delay));
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ INSTR_TIME_GET_NANOSEC(delay));
+ }
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
@@ -2435,6 +2512,24 @@ vacuum_delay_point(void)
}
}
+/*
+ * Helper function to implement delay points in non-analyze operations.
+ */
+void
+vacuum_delay_point(void)
+{
+ vacuum_delay_point_internal(false);
+}
+
+/*
+ * Helper function to implement delay points in analyze operations.
+ */
+void
+analyze_delay_point(void)
+{
+ vacuum_delay_point_internal(true);
+}
+
/*
* Computes the vacuum delay for parallel workers.
*
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a56..ea5940e299 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1087,6 +1087,11 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ /* Report any remaining cost-based vacuum delay time */
+ if (track_cost_delay_timing)
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ parallel_vacuum_worker_delay_ns);
+
TidStoreDetach(dead_items);
/* Pop the error context stack */
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index ccafe42729..eff80980ab 100644
--- a/src/backend/tsearch/ts_typanalyze.c
+++ b/src/backend/tsearch/ts_typanalyze.c
@@ -204,7 +204,7 @@ compute_tsvector_stats(VacAttrStats *stats,
char *lexemesptr;
int j;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, vector_no, &isnull);
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 2c633bee6b..6491be8b3b 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -314,7 +314,7 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
int distinct_count;
bool count_item_found;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, array_no, &isnull);
if (isnull)
diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c
index 3773f98115..1567ceba23 100644
--- a/src/backend/utils/adt/rangetypes_typanalyze.c
+++ b/src/backend/utils/adt/rangetypes_typanalyze.c
@@ -167,7 +167,7 @@ compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
upper;
float8 length;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, range_no, &isnull);
if (isnull)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8cf1afbad2..031fcc43e3 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1469,6 +1469,15 @@ struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"track_cost_delay_timing", PGC_SUSET, STATS_CUMULATIVE,
+ gettext_noop("Collects timing statistics for cost-based vacuum delay."),
+ NULL
+ },
+ &track_cost_delay_timing,
+ false,
+ NULL, NULL, NULL
+ },
{
{"track_io_timing", PGC_SUSET, STATS_CUMULATIVE,
gettext_noop("Collects timing statistics for database I/O activity."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a2ac7575ca..c7fd4f26f0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -640,6 +640,7 @@
#track_activities = on
#track_activity_query_size = 1024 # (change requires restart)
#track_counts = on
+#track_cost_delay_timing = off
#track_io_timing = off
#track_wal_io_timing = off
#track_functions = none # none, pl, all
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..df862192a6 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_DELAY_TIME 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
@@ -46,6 +47,7 @@
#define PROGRESS_ANALYZE_CHILD_TABLES_TOTAL 5
#define PROGRESS_ANALYZE_CHILD_TABLES_DONE 6
#define PROGRESS_ANALYZE_CURRENT_CHILD_TABLE_RELID 7
+#define PROGRESS_ANALYZE_DELAY_TIME 8
/* Phases of analyze (as advertised via PROGRESS_ANALYZE_PHASE) */
#define PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..f3f0abc87f 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -296,6 +296,7 @@ extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
extern PGDLLIMPORT int vacuum_failsafe_age;
extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
+extern PGDLLIMPORT bool track_cost_delay_timing;
/*
* Maximum value for default_statistics_target and per-column statistics
@@ -313,6 +314,8 @@ extern PGDLLIMPORT bool VacuumFailsafeActive;
extern PGDLLIMPORT double vacuum_cost_delay;
extern PGDLLIMPORT int vacuum_cost_limit;
+extern PGDLLIMPORT int64 parallel_vacuum_worker_delay_ns;
+
/* in commands/vacuum.c */
extern void ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel);
extern void vacuum(List *relations, VacuumParams *params,
@@ -340,6 +343,7 @@ extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
extern void vac_update_datfrozenxid(void);
extern void vacuum_delay_point(void);
+extern void analyze_delay_point(void);
extern bool vacuum_is_permitted_for_relation(Oid relid, Form_pg_class reltuple,
bits32 options);
extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fe..4796bbd01c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1926,7 +1926,8 @@ pg_stat_progress_analyze| SELECT s.pid,
s.param5 AS ext_stats_computed,
s.param6 AS child_tables_total,
s.param7 AS child_tables_done,
- (s.param8)::oid AS current_child_table_relid
+ (s.param8)::oid AS current_child_table_relid,
+ ((s.param9)::double precision / (1000000)::double precision) AS delay_time
FROM (pg_stat_get_progress_info('ANALYZE'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_basebackup| SELECT pid,
@@ -2056,7 +2057,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ ((s.param11)::double precision / (1000000)::double precision) AS delay_time
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
Barring objections, I am planning to commit this one soon. I might move
the addition of analyze_delay_point() to its own patch, but otherwise I
think it looks good to go.
--
nathan
Hi,

On Mon, Feb 03, 2025 at 02:05:51PM -0600, Nathan Bossart wrote:
> Barring objections, I am planning to commit this one soon.  I might move
> the addition of analyze_delay_point() to its own patch, but otherwise I
> think it looks good to go.

Yeah, I think that having analyze_delay_point() in its own patch makes sense.
It's done that way in the attached and allows 0002 to focus on the main
feature.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v16-0001-Introduce-analyze_delay_point.patch (text/x-diff; charset=us-ascii)
From 0ba4f51ae63e9b8b4d6987d52465f2a284820d4d Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 4 Feb 2025 09:05:33 +0000
Subject: [PATCH v16 1/2] Introduce analyze_delay_point()
Currently vacuum_delay_point() is being used in analyze code paths.
This commit introduces analyze_delay_point() to make the analyze/vacuum split
clear. The "is_analyze" bool passed as a parameter to the new
vacuum_delay_point_internal is not being used (but will be used in a following
commit tracking the timing of cost-based vacuum/analyze delay).
---
contrib/file_fdw/file_fdw.c | 2 +-
src/backend/commands/analyze.c | 10 +++----
src/backend/commands/vacuum.c | 29 +++++++++++++++----
src/backend/tsearch/ts_typanalyze.c | 2 +-
src/backend/utils/adt/array_typanalyze.c | 2 +-
src/backend/utils/adt/rangetypes_typanalyze.c | 2 +-
src/include/commands/vacuum.h | 1 +
7 files changed, 34 insertions(+), 14 deletions(-)
3.9% contrib/file_fdw/
81.1% src/backend/commands/
3.9% src/backend/tsearch/
7.8% src/backend/utils/adt/
3.1% src/include/commands/
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 678e754b2b9..44dfb5c5a54 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -1237,7 +1237,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
for (;;)
{
/* Check for user-requested abort or sleep */
- vacuum_delay_point();
+ analyze_delay_point();
/* Fetch next row */
MemoryContextReset(tupcontext);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index f8da32e9aef..e177c6c3da5 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -915,7 +915,7 @@ compute_index_stats(Relation onerel, double totalrows,
{
HeapTuple heapTuple = rows[rowno];
- vacuum_delay_point();
+ analyze_delay_point();
/*
* Reset the per-tuple context each time, to reclaim any cruft
@@ -1234,7 +1234,7 @@ acquire_sample_rows(Relation onerel, int elevel,
/* Outer loop over blocks to sample */
while (table_scan_analyze_next_block(scan, stream))
{
- vacuum_delay_point();
+ analyze_delay_point();
while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
{
@@ -1966,7 +1966,7 @@ compute_trivial_stats(VacAttrStatsP stats,
Datum value;
bool isnull;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, i, &isnull);
@@ -2082,7 +2082,7 @@ compute_distinct_stats(VacAttrStatsP stats,
int firstcount1,
j;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, i, &isnull);
@@ -2429,7 +2429,7 @@ compute_scalar_stats(VacAttrStatsP stats,
Datum value;
bool isnull;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, i, &isnull);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e6745e6145c..4563e617505 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2352,13 +2352,14 @@ vac_close_indexes(int nindexes, Relation *Irel, LOCKMODE lockmode)
}
/*
- * vacuum_delay_point --- check for interrupts and cost-based delay.
+ * vacuum_delay_point_internal --- check for interrupts and cost-based delay.
*
- * This should be called in each major loop of VACUUM processing,
- * typically once per page processed.
+ * This should be called (through the vacuum_delay_point() or the analyze_delay_point()
+ * helpers) in each major loop of VACUUM/ANALYZE processing, typically once per
+ * page processed.
*/
-void
-vacuum_delay_point(void)
+static void
+vacuum_delay_point_internal(bool is_analyze)
{
double msec = 0;
@@ -2434,6 +2435,24 @@ vacuum_delay_point(void)
}
}
+/*
+ * Helper function to implement delay points in non-analyze operations.
+ */
+void
+vacuum_delay_point(void)
+{
+ vacuum_delay_point_internal(false);
+}
+
+/*
+ * Helper function to implement delay points in analyze operations.
+ */
+void
+analyze_delay_point(void)
+{
+ vacuum_delay_point_internal(true);
+}
+
/*
* Computes the vacuum delay for parallel workers.
*
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index 1494da1c9d3..133ec743495 100644
--- a/src/backend/tsearch/ts_typanalyze.c
+++ b/src/backend/tsearch/ts_typanalyze.c
@@ -204,7 +204,7 @@ compute_tsvector_stats(VacAttrStats *stats,
char *lexemesptr;
int j;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, vector_no, &isnull);
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 44a6eb5dad0..0d1e0c7a582 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -314,7 +314,7 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
int distinct_count;
bool count_item_found;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, array_no, &isnull);
if (isnull)
diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c
index 9dc73af1992..81e72a29d28 100644
--- a/src/backend/utils/adt/rangetypes_typanalyze.c
+++ b/src/backend/utils/adt/rangetypes_typanalyze.c
@@ -167,7 +167,7 @@ compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
upper;
float8 length;
- vacuum_delay_point();
+ analyze_delay_point();
value = fetchfunc(stats, range_no, &isnull);
if (isnull)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 12d0b61950d..0d60bed0be4 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -340,6 +340,7 @@ extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
extern void vac_update_datfrozenxid(void);
extern void vacuum_delay_point(void);
+extern void analyze_delay_point(void);
extern bool vacuum_is_permitted_for_relation(Oid relid, Form_pg_class reltuple,
bits32 options);
extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
--
2.34.1
v16-0002-Add-cost-based-delay-time-to-progress-views.patch (text/x-diff; charset=us-ascii)
From 61592deef3218328a0accdb9c80f659d80e6060d Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 4 Feb 2025 09:52:10 +0000
Subject: [PATCH v16 2/2] Add cost-based delay time to progress views.
Author: Bertrand Drouvot
Reviewed-by: Sami Imseih, Robert Haas, Masahiko Sawada, Masahiro Ikeda, Dilip Kumar, Sergei Kornilov
Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
---
doc/src/sgml/config.sgml | 24 ++++++
doc/src/sgml/monitoring.sgml | 27 +++++++
src/backend/catalog/system_views.sql | 6 +-
src/backend/commands/vacuum.c | 76 +++++++++++++++++++
src/backend/commands/vacuumparallel.c | 5 ++
src/backend/utils/misc/guc_tables.c | 9 +++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/progress.h | 2 +
src/include/commands/vacuum.h | 3 +
src/test/regress/expected/rules.out | 6 +-
10 files changed, 155 insertions(+), 4 deletions(-)
38.4% doc/src/sgml/
5.1% src/backend/catalog/
44.0% src/backend/commands/
3.9% src/backend/utils/misc/
3.0% src/include/commands/
5.3% src/test/regress/expected/
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a782f109982..eb96f2852e7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8246,6 +8246,30 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</listitem>
</varlistentry>
+ <varlistentry id="guc-track-cost-delay-timing" xreflabel="track_cost_delay_timing">
+ <term><varname>track_cost_delay_timing</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>track_cost_delay_timing</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables timing of cost-based vacuum delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>). This parameter
+ is off by default, as it will repeatedly query the operating system for
+ the current time, which may cause significant overhead on some
+ platforms. You can use the <xref linkend="pgtesttiming"/> tool to
+ measure the overhead of timing on your system. Cost-based vacuum delay
+ timing information is displayed in
+ <link linkend="vacuum-progress-reporting"><structname>pg_stat_progress_vacuum</structname></link>
+ and
+ <link linkend="analyze-progress-reporting"><structname>pg_stat_progress_analyze</structname></link>.
+ Only superusers and users with the appropriate <literal>SET</literal>
+ privilege can change this setting.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
<term><varname>track_io_timing</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index edc2470bcf9..aa7e0bb677b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5606,6 +5606,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>acquiring inherited sample rows</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>delay_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total time spent sleeping due to cost-based delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>), in milliseconds
+ (if <xref linkend="guc-track-cost-delay-timing"/> is enabled, otherwise
+ zero).
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -6531,6 +6543,21 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>delay_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total time spent sleeping due to cost-based delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>), in milliseconds
+ (if <xref linkend="guc-track-cost-delay-timing"/> is enabled, otherwise
+ zero). This includes the time that any associated parallel workers have
+ slept. However, parallel workers report their sleep time no more
+ frequently than once per second, so the reported value may be slightly
+ stale.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index cddc3ea9b53..eff0990957e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1213,7 +1213,8 @@ CREATE VIEW pg_stat_progress_analyze AS
S.param5 AS ext_stats_computed,
S.param6 AS child_tables_total,
S.param7 AS child_tables_done,
- CAST(S.param8 AS oid) AS current_child_table_relid
+ CAST(S.param8 AS oid) AS current_child_table_relid,
+ S.param9 / 1000000::double precision AS delay_time
FROM pg_stat_get_progress_info('ANALYZE') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1233,7 +1234,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed,
+ S.param11 / 1000000::double precision AS delay_time
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 4563e617505..6bf9fc7161f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -59,6 +60,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum interval for cost-based vacuum delay reports from a parallel worker.
+ * This aims to avoid sending too many messages and waking up the leader too
+ * frequently.
+ */
+#define PARALLEL_VACUUM_DELAY_REPORT_INTERVAL_NS (NS_PER_S)
/*
* GUC parameters
@@ -69,6 +76,7 @@ int vacuum_multixact_freeze_min_age;
int vacuum_multixact_freeze_table_age;
int vacuum_failsafe_age;
int vacuum_multixact_failsafe_age;
+bool track_cost_delay_timing;
/*
* Variables for cost-based vacuum delay. The defaults differ between
@@ -79,6 +87,11 @@ int vacuum_multixact_failsafe_age;
double vacuum_cost_delay = 0;
int vacuum_cost_limit = 200;
+/*
+ * Variable for reporting cost-based vacuum delay from parallel workers.
+ */
+int64 parallel_vacuum_worker_delay_ns = 0;
+
/*
* VacuumFailsafeActive is a defined as a global so that we can determine
* whether or not to re-enable cost-based vacuum delay when vacuuming a table.
@@ -2357,6 +2370,15 @@ vac_close_indexes(int nindexes, Relation *Irel, LOCKMODE lockmode)
* This should be called (through the vacuum_delay_point() or the analyze_delay_point()
* helpers) in each major loop of VACUUM/ANALYZE processing, typically once per
* page processed.
+ *
+ * If the track_cost_delay_timing GUC is set to on, the function tracks
+ * cumulative delay times and reports them as progress. In parallel vacuum workers,
+ * these times are accumulated and reported periodically to avoid overloading
+ * the leader with messages and interrupts. For non-parallel workers, each delay
+ * is reported immediately.
+ *
+ * The "is_analyze" parameter determines whether delays are attributed to vacuum
+ * or analyze operations in the progress reporting system.
*/
static void
vacuum_delay_point_internal(bool is_analyze)
@@ -2402,13 +2424,67 @@ vacuum_delay_point_internal(bool is_analyze)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
+ if (track_cost_delay_timing)
+ INSTR_TIME_SET_CURRENT(delay_start);
+
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
pg_usleep(msec * 1000);
pgstat_report_wait_end();
+ if (track_cost_delay_timing)
+ {
+ instr_time delay_end;
+ instr_time delay;
+
+ INSTR_TIME_SET_CURRENT(delay_end);
+ INSTR_TIME_SET_ZERO(delay);
+ INSTR_TIME_ACCUM_DIFF(delay, delay_end, delay_start);
+
+ /*
+ * For parallel workers, we only report the delay time every once
+ * in a while to avoid overloading the leader with messages and
+ * interrupts.
+ */
+ if (IsParallelWorker())
+ {
+ static instr_time last_report_time;
+ instr_time time_since_last_report;
+
+ Assert(!is_analyze);
+
+ /* Accumulate the delay time */
+ parallel_vacuum_worker_delay_ns += INSTR_TIME_GET_NANOSEC(delay);
+
+ /* Calculate interval since last report */
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end, last_report_time);
+
+ /* If we haven't reported in a while, do so now */
+ if (INSTR_TIME_GET_NANOSEC(time_since_last_report) >=
+ PARALLEL_VACUUM_DELAY_REPORT_INTERVAL_NS)
+ {
+ /* PROGRESS_ANALYZE_DELAY_TIME can't be of interest */
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ parallel_vacuum_worker_delay_ns);
+
+ /* Reset variables */
+ last_report_time = delay_end;
+ parallel_vacuum_worker_delay_ns = 0;
+ }
+ }
+ else if (is_analyze)
+ pgstat_progress_incr_param(PROGRESS_ANALYZE_DELAY_TIME,
+ INSTR_TIME_GET_NANOSEC(delay));
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ INSTR_TIME_GET_NANOSEC(delay));
+ }
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index dc3322c256b..2b9d548cdeb 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1094,6 +1094,11 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ /* Report any remaining cost-based vacuum delay time */
+ if (track_cost_delay_timing)
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ parallel_vacuum_worker_delay_ns);
+
TidStoreDetach(dead_items);
/* Pop the error context stack */
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 71448bb4fdd..4179c260e4d 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1470,6 +1470,15 @@ struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"track_cost_delay_timing", PGC_SUSET, STATS_CUMULATIVE,
+ gettext_noop("Collects timing statistics for cost-based vacuum delay."),
+ NULL
+ },
+ &track_cost_delay_timing,
+ false,
+ NULL, NULL, NULL
+ },
{
{"track_io_timing", PGC_SUSET, STATS_CUMULATIVE,
gettext_noop("Collects timing statistics for database I/O activity."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 079efa1baa7..18e2d5d7608 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -632,6 +632,7 @@
#track_activities = on
#track_activity_query_size = 1024 # (change requires restart)
#track_counts = on
+#track_cost_delay_timing = off
#track_io_timing = off
#track_wal_io_timing = off
#track_functions = none # none, pl, all
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 18e3179ef63..7c736e7b03b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_DELAY_TIME 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
@@ -46,6 +47,7 @@
#define PROGRESS_ANALYZE_CHILD_TABLES_TOTAL 5
#define PROGRESS_ANALYZE_CHILD_TABLES_DONE 6
#define PROGRESS_ANALYZE_CURRENT_CHILD_TABLE_RELID 7
+#define PROGRESS_ANALYZE_DELAY_TIME 8
/* Phases of analyze (as advertised via PROGRESS_ANALYZE_PHASE) */
#define PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 0d60bed0be4..ae0edbbeaf1 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -296,6 +296,7 @@ extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
extern PGDLLIMPORT int vacuum_failsafe_age;
extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
+extern PGDLLIMPORT bool track_cost_delay_timing;
/*
* Maximum value for default_statistics_target and per-column statistics
@@ -313,6 +314,8 @@ extern PGDLLIMPORT bool VacuumFailsafeActive;
extern PGDLLIMPORT double vacuum_cost_delay;
extern PGDLLIMPORT int vacuum_cost_limit;
+extern PGDLLIMPORT int64 parallel_vacuum_worker_delay_ns;
+
/* in commands/vacuum.c */
extern void ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel);
extern void vacuum(List *relations, VacuumParams *params,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3361f6a69c9..5baba8d39ff 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1932,7 +1932,8 @@ pg_stat_progress_analyze| SELECT s.pid,
s.param5 AS ext_stats_computed,
s.param6 AS child_tables_total,
s.param7 AS child_tables_done,
- (s.param8)::oid AS current_child_table_relid
+ (s.param8)::oid AS current_child_table_relid,
+ ((s.param9)::double precision / (1000000)::double precision) AS delay_time
FROM (pg_stat_get_progress_info('ANALYZE'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_basebackup| SELECT pid,
@@ -2062,7 +2063,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ ((s.param11)::double precision / (1000000)::double precision) AS delay_time
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
On Tue, Feb 04, 2025 at 10:14:48AM +0000, Bertrand Drouvot wrote:
> On Mon, Feb 03, 2025 at 02:05:51PM -0600, Nathan Bossart wrote:
> > Barring objections, I am planning to commit this one soon.  I might move
> > the addition of analyze_delay_point() to its own patch, but otherwise I
> > think it looks good to go.
>
> Yeah, I think that having analyze_delay_point() in its own patch makes sense.
> It's done that way in the attached and allows 0002 to focus on the main
> feature.
Here is what I have prepared for commit.  Other than expanding the commit
messages, I've modified 0001 to just add a parameter to
vacuum_delay_point() to indicate whether this is a vacuum or analyze. I
was worried that adding an analyze_delay_point() could cause third-party
code to miss this change. We want such code to correctly indicate the type
of operation so that the progress views work for them, too.
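For illustration only (a hypothetical extension loop, not code from the
patch), call sites now look like this; passing false attributes any
cost-based sleep to VACUUM, while an ANALYZE-side sampling routine would
pass true:

    #include "postgres.h"
    #include "storage/block.h"
    #include "commands/vacuum.h"

    /*
     * Sketch of a third-party index AM's bulk-delete loop.  With the
     * 0001 API change this no longer compiles without an argument, so
     * extension authors are forced to state which operation they are
     * part of, and the right progress view gets the delay time once
     * 0002 is in.
     */
    static void
    example_bulkdelete_loop(BlockNumber npages)
    {
        BlockNumber blkno;

        for (blkno = 0; blkno < npages; blkno++)
        {
            /* check for interrupts and cost-based delay (vacuum side) */
            vacuum_delay_point(false);

            /* ... read and process page blkno ... */
        }
    }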
Off-list, I've asked Bertrand to gauge the feasibility of adding this
information to the autovacuum logs and to VACUUM/ANALYZE (VERBOSE). IMHO
those are natural places to surface this information, and I want to ensure
that we're not painting ourselves into a corner with the approach we're
using for the progress views.
--
nathan
Attachments:
v17-0001-Add-is_analyze-parameter-to-vacuum_delay_point.patch (text/plain; charset=us-ascii)
From d1323d698b235a56514c9af0afbebed2db3032ef Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 10 Feb 2025 11:41:22 -0600
Subject: [PATCH v17 1/2] Add is_analyze parameter to vacuum_delay_point().
This function is used in both vacuum and analyze code paths, and a
follow-up commit will require distinguishing between the two. This
commit forces callers to declare whether they are being used for
vacuum or analyze, but it does not use that information for
anything yet.
Author: Nathan Bossart <nathandbossart@gmail.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
---
contrib/bloom/blvacuum.c | 4 ++--
contrib/file_fdw/file_fdw.c | 2 +-
src/backend/access/gin/ginfast.c | 6 +++---
src/backend/access/gin/ginvacuum.c | 6 +++---
src/backend/access/gist/gistvacuum.c | 2 +-
src/backend/access/hash/hash.c | 2 +-
src/backend/access/heap/vacuumlazy.c | 4 ++--
src/backend/access/nbtree/nbtree.c | 2 +-
src/backend/access/spgist/spgvacuum.c | 4 ++--
src/backend/commands/analyze.c | 10 +++++-----
src/backend/commands/vacuum.c | 2 +-
src/backend/tsearch/ts_typanalyze.c | 2 +-
src/backend/utils/adt/array_typanalyze.c | 2 +-
src/backend/utils/adt/rangetypes_typanalyze.c | 2 +-
src/include/commands/vacuum.h | 2 +-
15 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index 7e1db0b52fc..86b15a75f6f 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -57,7 +57,7 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*itupPtr,
*itupEnd;
- vacuum_delay_point();
+ vacuum_delay_point(false);
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, info->strategy);
@@ -187,7 +187,7 @@ blvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
Buffer buffer;
Page page;
- vacuum_delay_point();
+ vacuum_delay_point(false);
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, info->strategy);
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 678e754b2b9..0655bf532a0 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -1237,7 +1237,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
for (;;)
{
/* Check for user-requested abort or sleep */
- vacuum_delay_point();
+ vacuum_delay_point(true);
/* Fetch next row */
MemoryContextReset(tupcontext);
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 4ab815fefe0..cc5d046c4b0 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -892,7 +892,7 @@ ginInsertCleanup(GinState *ginstate, bool full_clean,
*/
processPendingPage(&accum, &datums, page, FirstOffsetNumber);
- vacuum_delay_point();
+ vacuum_delay_point(false);
/*
* Is it time to flush memory to disk? Flush if we are at the end of
@@ -929,7 +929,7 @@ ginInsertCleanup(GinState *ginstate, bool full_clean,
{
ginEntryInsert(ginstate, attnum, key, category,
list, nlist, NULL);
- vacuum_delay_point();
+ vacuum_delay_point(false);
}
/*
@@ -1002,7 +1002,7 @@ ginInsertCleanup(GinState *ginstate, bool full_clean,
/*
* Read next page in pending list
*/
- vacuum_delay_point();
+ vacuum_delay_point(false);
buffer = ReadBuffer(index, blkno);
LockBuffer(buffer, GIN_SHARE);
page = BufferGetPage(buffer);
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index d98c54b7cf7..533c37b3c5f 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -662,12 +662,12 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
UnlockReleaseBuffer(buffer);
}
- vacuum_delay_point();
+ vacuum_delay_point(false);
for (i = 0; i < nRoot; i++)
{
ginVacuumPostingTree(&gvs, rootOfPostingTree[i]);
- vacuum_delay_point();
+ vacuum_delay_point(false);
}
if (blkno == InvalidBlockNumber) /* rightmost page */
@@ -748,7 +748,7 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
Buffer buffer;
Page page;
- vacuum_delay_point();
+ vacuum_delay_point(false);
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, info->strategy);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index fe0bfb781ca..dd0d9d5006c 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -283,7 +283,7 @@ restart:
recurse_to = InvalidBlockNumber;
/* call vacuum_delay_point while not holding any buffer lock */
- vacuum_delay_point();
+ vacuum_delay_point(false);
buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
info->strategy);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 63b568e7f24..4167b33e683 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -716,7 +716,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
bool retain_pin = false;
bool clear_dead_marking = false;
- vacuum_delay_point();
+ vacuum_delay_point(false);
page = BufferGetPage(buf);
opaque = HashPageGetOpaque(page);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 075af385cd1..e4d6d654c0a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -946,7 +946,7 @@ lazy_scan_heap(LVRelState *vacrel)
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
blkno, InvalidOffsetNumber);
- vacuum_delay_point();
+ vacuum_delay_point(false);
/*
* Regularly check if wraparound failsafe should trigger.
@@ -2275,7 +2275,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
OffsetNumber offsets[MaxOffsetNumber];
int num_offsets;
- vacuum_delay_point();
+ vacuum_delay_point(false);
blkno = iter_result->blkno;
vacrel->blkno = blkno;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 971405e89af..dc244ae24c7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1137,7 +1137,7 @@ backtrack:
backtrack_to = P_NONE;
/* call vacuum_delay_point while not holding any buffer lock */
- vacuum_delay_point();
+ vacuum_delay_point(false);
/*
* We can't use _bt_getbuf() here because it always applies
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 894aefa19e1..1c52f6528ad 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -625,7 +625,7 @@ spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
Page page;
/* call vacuum_delay_point while not holding any buffer lock */
- vacuum_delay_point();
+ vacuum_delay_point(false);
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, bds->info->strategy);
@@ -704,7 +704,7 @@ spgprocesspending(spgBulkDeleteState *bds)
continue; /* ignore already-done items */
/* call vacuum_delay_point while not holding any buffer lock */
- vacuum_delay_point();
+ vacuum_delay_point(false);
/* examine the referenced page */
blkno = ItemPointerGetBlockNumber(&pitem->tid);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index e5ab207d2ec..e4302f4cdb2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -915,7 +915,7 @@ compute_index_stats(Relation onerel, double totalrows,
{
HeapTuple heapTuple = rows[rowno];
- vacuum_delay_point();
+ vacuum_delay_point(true);
/*
* Reset the per-tuple context each time, to reclaim any cruft
@@ -1238,7 +1238,7 @@ acquire_sample_rows(Relation onerel, int elevel,
/* Outer loop over blocks to sample */
while (table_scan_analyze_next_block(scan, stream))
{
- vacuum_delay_point();
+ vacuum_delay_point(true);
while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
{
@@ -1970,7 +1970,7 @@ compute_trivial_stats(VacAttrStatsP stats,
Datum value;
bool isnull;
- vacuum_delay_point();
+ vacuum_delay_point(true);
value = fetchfunc(stats, i, &isnull);
@@ -2086,7 +2086,7 @@ compute_distinct_stats(VacAttrStatsP stats,
int firstcount1,
j;
- vacuum_delay_point();
+ vacuum_delay_point(true);
value = fetchfunc(stats, i, &isnull);
@@ -2433,7 +2433,7 @@ compute_scalar_stats(VacAttrStatsP stats,
Datum value;
bool isnull;
- vacuum_delay_point();
+ vacuum_delay_point(true);
value = fetchfunc(stats, i, &isnull);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e6745e6145c..5e394c151c9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2358,7 +2358,7 @@ vac_close_indexes(int nindexes, Relation *Irel, LOCKMODE lockmode)
* typically once per page processed.
*/
void
-vacuum_delay_point(void)
+vacuum_delay_point(bool is_analyze)
{
double msec = 0;
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index 1494da1c9d3..c5a71331ce8 100644
--- a/src/backend/tsearch/ts_typanalyze.c
+++ b/src/backend/tsearch/ts_typanalyze.c
@@ -204,7 +204,7 @@ compute_tsvector_stats(VacAttrStats *stats,
char *lexemesptr;
int j;
- vacuum_delay_point();
+ vacuum_delay_point(true);
value = fetchfunc(stats, vector_no, &isnull);
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 44a6eb5dad0..6f61629b977 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -314,7 +314,7 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
int distinct_count;
bool count_item_found;
- vacuum_delay_point();
+ vacuum_delay_point(true);
value = fetchfunc(stats, array_no, &isnull);
if (isnull)
diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c
index 9dc73af1992..a18196d8a34 100644
--- a/src/backend/utils/adt/rangetypes_typanalyze.c
+++ b/src/backend/utils/adt/rangetypes_typanalyze.c
@@ -167,7 +167,7 @@ compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
upper;
float8 length;
- vacuum_delay_point();
+ vacuum_delay_point(true);
value = fetchfunc(stats, range_no, &isnull);
if (isnull)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 12d0b61950d..b884304dfe7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -339,7 +339,7 @@ extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
struct VacuumCutoffs *cutoffs);
extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
extern void vac_update_datfrozenxid(void);
-extern void vacuum_delay_point(void);
+extern void vacuum_delay_point(bool is_analyze);
extern bool vacuum_is_permitted_for_relation(Oid relid, Form_pg_class reltuple,
bits32 options);
extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
--
2.39.5 (Apple Git-154)
v17-0002-Add-cost-based-delay-time-to-progress-views.patch (text/plain; charset=us-ascii)
From b0789841d0353e57deb7054f42f0d401a3233bd3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 10 Feb 2025 14:25:02 -0600
Subject: [PATCH v17 2/2] Add cost-based delay time to progress views.
This commit adds the amount of time spent sleeping due to
cost-based delay to the pg_stat_progress_vacuum and
pg_stat_progress_analyze system views. A new configuration
parameter named track_cost_delay_timing, which is off by default,
controls whether this information is gathered. For vacuum, the
reported value includes the time that any associated parallel
workers have slept. However, parallel workers report their sleep
time no more frequently than once per second to avoid
overloading the leader process.
XXX: NEEDS CATVERSION BUMP
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Sergei Kornilov <sk@zsrv.org>
Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
---
doc/src/sgml/config.sgml | 24 +++++++
doc/src/sgml/monitoring.sgml | 27 ++++++++
src/backend/catalog/system_views.sql | 6 +-
src/backend/commands/vacuum.c | 64 +++++++++++++++++++
src/backend/commands/vacuumparallel.c | 5 ++
src/backend/utils/misc/guc_tables.c | 9 +++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/progress.h | 2 +
src/include/commands/vacuum.h | 3 +
src/test/regress/expected/rules.out | 6 +-
10 files changed, 143 insertions(+), 4 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 38244409e3c..79a66ba7181 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8246,6 +8246,30 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</listitem>
</varlistentry>
+ <varlistentry id="guc-track-cost-delay-timing" xreflabel="track_cost_delay_timing">
+ <term><varname>track_cost_delay_timing</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>track_cost_delay_timing</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables timing of cost-based vacuum delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>). This parameter
+ is off by default, as it will repeatedly query the operating system for
+ the current time, which may cause significant overhead on some
+ platforms. You can use the <xref linkend="pgtesttiming"/> tool to
+ measure the overhead of timing on your system. Cost-based vacuum delay
+ timing information is displayed in
+ <link linkend="vacuum-progress-reporting"><structname>pg_stat_progress_vacuum</structname></link>
+ and
+ <link linkend="analyze-progress-reporting"><structname>pg_stat_progress_analyze</structname></link>.
+ Only superusers and users with the appropriate <literal>SET</literal>
+ privilege can change this setting.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
<term><varname>track_io_timing</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index edc2470bcf9..928a6eb64b0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5606,6 +5606,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>acquiring inherited sample rows</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>delay_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total time spent sleeping due to cost-based delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>), in milliseconds
+ (if <xref linkend="guc-track-cost-delay-timing"/> is enabled, otherwise
+ zero).
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -6531,6 +6543,21 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>delay_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total time spent sleeping due to cost-based delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>), in milliseconds
+ (if <xref linkend="guc-track-cost-delay-timing"/> is enabled, otherwise
+ zero). This includes the time that any associated parallel workers have
+ slept. However, parallel workers report their sleep time no more
+ frequently than once per second, so the reported value may be slightly
+ stale.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index cddc3ea9b53..eff0990957e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1213,7 +1213,8 @@ CREATE VIEW pg_stat_progress_analyze AS
S.param5 AS ext_stats_computed,
S.param6 AS child_tables_total,
S.param7 AS child_tables_done,
- CAST(S.param8 AS oid) AS current_child_table_relid
+ CAST(S.param8 AS oid) AS current_child_table_relid,
+ S.param9 / 1000000::double precision AS delay_time
FROM pg_stat_get_progress_info('ANALYZE') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1233,7 +1234,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed,
+ S.param11 / 1000000::double precision AS delay_time
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5e394c151c9..4cba2e0a260 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -59,6 +60,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum interval for cost-based vacuum delay reports from a parallel worker.
+ * This aims to avoid sending too many messages and waking up the leader too
+ * frequently
+ */
+#define PARALLEL_VACUUM_DELAY_REPORT_INTERVAL_NS (NS_PER_S)
/*
* GUC parameters
@@ -69,6 +76,7 @@ int vacuum_multixact_freeze_min_age;
int vacuum_multixact_freeze_table_age;
int vacuum_failsafe_age;
int vacuum_multixact_failsafe_age;
+bool track_cost_delay_timing;
/*
* Variables for cost-based vacuum delay. The defaults differ between
@@ -79,6 +87,9 @@ int vacuum_multixact_failsafe_age;
double vacuum_cost_delay = 0;
int vacuum_cost_limit = 200;
+/* Variable for reporting cost-based vacuum delay from parallel workers. */
+int64 parallel_vacuum_worker_delay_ns = 0;
+
/*
* VacuumFailsafeActive is defined as a global so that we can determine
* whether or not to re-enable cost-based vacuum delay when vacuuming a table.
@@ -2401,13 +2412,66 @@ vacuum_delay_point(bool is_analyze)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
+ if (track_cost_delay_timing)
+ INSTR_TIME_SET_CURRENT(delay_start);
+
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
pg_usleep(msec * 1000);
pgstat_report_wait_end();
+ if (track_cost_delay_timing)
+ {
+ instr_time delay_end;
+ instr_time delay;
+
+ INSTR_TIME_SET_CURRENT(delay_end);
+ INSTR_TIME_SET_ZERO(delay);
+ INSTR_TIME_ACCUM_DIFF(delay, delay_end, delay_start);
+
+ /*
+ * For parallel workers, we only report the delay time every once
+ * in a while to avoid overloading the leader with messages and
+ * interrupts.
+ */
+ if (IsParallelWorker())
+ {
+ static instr_time last_report_time;
+ instr_time time_since_last_report;
+
+ Assert(!is_analyze);
+
+ /* Accumulate the delay time. */
+ parallel_vacuum_worker_delay_ns += INSTR_TIME_GET_NANOSEC(delay);
+
+ /* Calculate interval since last report. */
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end, last_report_time);
+
+ /* If we haven't reported in a while, do so now. */
+ if (INSTR_TIME_GET_NANOSEC(time_since_last_report) >=
+ PARALLEL_VACUUM_DELAY_REPORT_INTERVAL_NS)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ parallel_vacuum_worker_delay_ns);
+
+ /* Reset variables. */
+ last_report_time = delay_end;
+ parallel_vacuum_worker_delay_ns = 0;
+ }
+ }
+ else if (is_analyze)
+ pgstat_progress_incr_param(PROGRESS_ANALYZE_DELAY_TIME,
+ INSTR_TIME_GET_NANOSEC(delay));
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ INSTR_TIME_GET_NANOSEC(delay));
+ }
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index dc3322c256b..2b9d548cdeb 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1094,6 +1094,11 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ /* Report any remaining cost-based vacuum delay time */
+ if (track_cost_delay_timing)
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ parallel_vacuum_worker_delay_ns);
+
TidStoreDetach(dead_items);
/* Pop the error context stack */
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ce7534d4d23..1efee7af176 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1470,6 +1470,15 @@ struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"track_cost_delay_timing", PGC_SUSET, STATS_CUMULATIVE,
+ gettext_noop("Collects timing statistics for cost-based vacuum delay."),
+ NULL
+ },
+ &track_cost_delay_timing,
+ false,
+ NULL, NULL, NULL
+ },
{
{"track_io_timing", PGC_SUSET, STATS_CUMULATIVE,
gettext_noop("Collects timing statistics for database I/O activity."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c40b7a3121e..6f77e5f8b26 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -632,6 +632,7 @@
#track_activities = on
#track_activity_query_size = 1024 # (change requires restart)
#track_counts = on
+#track_cost_delay_timing = off
#track_io_timing = off
#track_wal_io_timing = off
#track_functions = none # none, pl, all
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 18e3179ef63..7c736e7b03b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_DELAY_TIME 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
@@ -46,6 +47,7 @@
#define PROGRESS_ANALYZE_CHILD_TABLES_TOTAL 5
#define PROGRESS_ANALYZE_CHILD_TABLES_DONE 6
#define PROGRESS_ANALYZE_CURRENT_CHILD_TABLE_RELID 7
+#define PROGRESS_ANALYZE_DELAY_TIME 8
/* Phases of analyze (as advertised via PROGRESS_ANALYZE_PHASE) */
#define PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index b884304dfe7..b3eedf699af 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -296,6 +296,7 @@ extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
extern PGDLLIMPORT int vacuum_failsafe_age;
extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
+extern PGDLLIMPORT bool track_cost_delay_timing;
/*
* Maximum value for default_statistics_target and per-column statistics
@@ -313,6 +314,8 @@ extern PGDLLIMPORT bool VacuumFailsafeActive;
extern PGDLLIMPORT double vacuum_cost_delay;
extern PGDLLIMPORT int vacuum_cost_limit;
+extern PGDLLIMPORT int64 parallel_vacuum_worker_delay_ns;
+
/* in commands/vacuum.c */
extern void ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel);
extern void vacuum(List *relations, VacuumParams *params,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3361f6a69c9..5baba8d39ff 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1932,7 +1932,8 @@ pg_stat_progress_analyze| SELECT s.pid,
s.param5 AS ext_stats_computed,
s.param6 AS child_tables_total,
s.param7 AS child_tables_done,
- (s.param8)::oid AS current_child_table_relid
+ (s.param8)::oid AS current_child_table_relid,
+ ((s.param9)::double precision / (1000000)::double precision) AS delay_time
FROM (pg_stat_get_progress_info('ANALYZE'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_basebackup| SELECT pid,
@@ -2062,7 +2063,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ ((s.param11)::double precision / (1000000)::double precision) AS delay_time
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.39.5 (Apple Git-154)
Hi,
On Mon, Feb 10, 2025 at 02:52:46PM -0600, Nathan Bossart wrote:
Here is what I have prepared for commit. Other than expanding the commit
messages, I've modified 0001 to just add a parameter to
vacuum_delay_point() to indicate whether this is a vacuum or analyze. I
was worried that adding an analyze_delay_point() could cause third-party
code to miss this change. We want such code to correctly indicate the type
of operation so that the progress views work for them, too.
Good point, that fully makes sense. v17 LGTM.
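To make the impact on extension authors concrete: with 0001 applied, an
out-of-core index AM has to state at each delay point which operation it is
running under. A minimal sketch (the myam_* names are hypothetical; only
vacuum_delay_point() and its is_analyze parameter come from the patch):

/* Hypothetical third-party index AM: a cost-limited vacuum scan. */
#include "postgres.h"
#include "access/genam.h"
#include "commands/vacuum.h"
#include "storage/bufmgr.h"

static void
myam_vacuum_pages(IndexVacuumInfo *info)
{
	BlockNumber nblocks = RelationGetNumberOfBlocks(info->index);

	for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
	{
		/* This is a vacuum (not analyze) code path, so pass false. */
		vacuum_delay_point(false);

		/* ... read block blkno with info->strategy and clean it up ... */
	}
}

A typanalyze or acquire-sample-rows callback would pass true instead, so that
its sleeps get charged to pg_stat_progress_analyze rather than
pg_stat_progress_vacuum.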
Off-list, I've asked Bertrand to gauge the feasibility of adding this
information to the autovacuum logs and to VACUUM/ANALYZE (VERBOSE). IMHO
those are natural places to surface this information, and I want to ensure
that we're not painting ourselves into a corner with the approach we're
using for the progress views.
Yeah, I looked at it, and it looks as simple as the attached 0003 (since it's
the leader that does the reporting when parallel workers are used).
0001 and 0002 remain unchanged.
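To spell out why: 0002 already folds the workers' sleep time into the
leader's progress row (via pgstat_progress_parallel_incr_param()), so
surfacing it in the logs only requires the leader to read its own entry. The
heart of the attached 0003 boils down to this (abridged from the patch):

if (track_cost_delay_timing)
{
	double		delayed_ms = (double) MyBEEntry->st_progress_param[PROGRESS_VACUUM_DELAY_TIME] / 1000000.0;

	appendStringInfo(&buf, _("delay time: %.3f ms\n"), delayed_ms);
}

The division by 1000000.0 converts the nanoseconds accumulated in the
progress parameter to milliseconds, matching the delay_time columns in the
views.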
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v17-0002-Add-cost-based-delay-time-to-progress-views.patch (text/x-diff; charset=us-ascii)
From 6b7966d2ca736a82345b87edb5fa3bf457d4d9bf Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 10 Feb 2025 14:25:02 -0600
Subject: [PATCH v17 2/3] Add cost-based delay time to progress views.
This commit adds the amount of time spent sleeping due to
cost-based delay to the pg_stat_progress_vacuum and
pg_stat_progress_analyze system views. A new configuration
parameter named track_cost_delay_timing, which is off by default,
controls whether this information is gathered. For vacuum, the
reported value includes the time that any associated parallel
workers have slept. However, parallel workers report their sleep
time no more frequently than once per second to avoid
overloading the leader process.
XXX: NEEDS CATVERSION BUMP
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Sergei Kornilov <sk@zsrv.org>
Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
---
doc/src/sgml/config.sgml | 24 +++++++
doc/src/sgml/monitoring.sgml | 27 ++++++++
src/backend/catalog/system_views.sql | 6 +-
src/backend/commands/vacuum.c | 64 +++++++++++++++++++
src/backend/commands/vacuumparallel.c | 5 ++
src/backend/utils/misc/guc_tables.c | 9 +++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/progress.h | 2 +
src/include/commands/vacuum.h | 3 +
src/test/regress/expected/rules.out | 6 +-
10 files changed, 143 insertions(+), 4 deletions(-)
42.1% doc/src/sgml/
5.6% src/backend/catalog/
38.5% src/backend/commands/
4.3% src/backend/utils/misc/
3.4% src/include/commands/
5.8% src/test/regress/expected/
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 38244409e3c..79a66ba7181 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8246,6 +8246,30 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</listitem>
</varlistentry>
+ <varlistentry id="guc-track-cost-delay-timing" xreflabel="track_cost_delay_timing">
+ <term><varname>track_cost_delay_timing</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>track_cost_delay_timing</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables timing of cost-based vacuum delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>). This parameter
+ is off by default, as it will repeatedly query the operating system for
+ the current time, which may cause significant overhead on some
+ platforms. You can use the <xref linkend="pgtesttiming"/> tool to
+ measure the overhead of timing on your system. Cost-based vacuum delay
+ timing information is displayed in
+ <link linkend="vacuum-progress-reporting"><structname>pg_stat_progress_vacuum</structname></link>
+ and
+ <link linkend="analyze-progress-reporting"><structname>pg_stat_progress_analyze</structname></link>.
+ Only superusers and users with the appropriate <literal>SET</literal>
+ privilege can change this setting.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
<term><varname>track_io_timing</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index edc2470bcf9..928a6eb64b0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5606,6 +5606,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>acquiring inherited sample rows</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>delay_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total time spent sleeping due to cost-based delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>), in milliseconds
+ (if <xref linkend="guc-track-cost-delay-timing"/> is enabled, otherwise
+ zero).
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -6531,6 +6543,21 @@ FROM pg_stat_get_backend_idset() AS backendid;
<literal>cleaning up indexes</literal>.
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>delay_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total time spent sleeping due to cost-based delay (see
+ <xref linkend="runtime-config-resource-vacuum-cost"/>), in milliseconds
+ (if <xref linkend="guc-track-cost-delay-timing"/> is enabled, otherwise
+ zero). This includes the time that any associated parallel workers have
+ slept. However, parallel workers report their sleep time no more
+ frequently than once per second, so the reported value may be slightly
+ stale.
+ </para></entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index cddc3ea9b53..eff0990957e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1213,7 +1213,8 @@ CREATE VIEW pg_stat_progress_analyze AS
S.param5 AS ext_stats_computed,
S.param6 AS child_tables_total,
S.param7 AS child_tables_done,
- CAST(S.param8 AS oid) AS current_child_table_relid
+ CAST(S.param8 AS oid) AS current_child_table_relid,
+ S.param9 / 1000000::double precision AS delay_time
FROM pg_stat_get_progress_info('ANALYZE') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1233,7 +1234,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
- S.param10 AS indexes_processed
+ S.param10 AS indexes_processed,
+ S.param11 / 1000000::double precision AS delay_time
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5e394c151c9..4cba2e0a260 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -59,6 +60,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+/*
+ * Minimum interval for cost-based vacuum delay reports from a parallel worker.
+ * This aims to avoid sending too many messages and waking up the leader too
+ * frequently
+ */
+#define PARALLEL_VACUUM_DELAY_REPORT_INTERVAL_NS (NS_PER_S)
/*
* GUC parameters
@@ -69,6 +76,7 @@ int vacuum_multixact_freeze_min_age;
int vacuum_multixact_freeze_table_age;
int vacuum_failsafe_age;
int vacuum_multixact_failsafe_age;
+bool track_cost_delay_timing;
/*
* Variables for cost-based vacuum delay. The defaults differ between
@@ -79,6 +87,9 @@ int vacuum_multixact_failsafe_age;
double vacuum_cost_delay = 0;
int vacuum_cost_limit = 200;
+/* Variable for reporting cost-based vacuum delay from parallel workers. */
+int64 parallel_vacuum_worker_delay_ns = 0;
+
/*
* VacuumFailsafeActive is defined as a global so that we can determine
* whether or not to re-enable cost-based vacuum delay when vacuuming a table.
@@ -2401,13 +2412,66 @@ vacuum_delay_point(bool is_analyze)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
+ if (track_cost_delay_timing)
+ INSTR_TIME_SET_CURRENT(delay_start);
+
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
pg_usleep(msec * 1000);
pgstat_report_wait_end();
+ if (track_cost_delay_timing)
+ {
+ instr_time delay_end;
+ instr_time delay;
+
+ INSTR_TIME_SET_CURRENT(delay_end);
+ INSTR_TIME_SET_ZERO(delay);
+ INSTR_TIME_ACCUM_DIFF(delay, delay_end, delay_start);
+
+ /*
+ * For parallel workers, we only report the delay time every once
+ * in a while to avoid overloading the leader with messages and
+ * interrupts.
+ */
+ if (IsParallelWorker())
+ {
+ static instr_time last_report_time;
+ instr_time time_since_last_report;
+
+ Assert(!is_analyze);
+
+ /* Accumulate the delay time. */
+ parallel_vacuum_worker_delay_ns += INSTR_TIME_GET_NANOSEC(delay);
+
+ /* Calculate interval since last report. */
+ INSTR_TIME_SET_ZERO(time_since_last_report);
+ INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end, last_report_time);
+
+ /* If we haven't reported in a while, do so now. */
+ if (INSTR_TIME_GET_NANOSEC(time_since_last_report) >=
+ PARALLEL_VACUUM_DELAY_REPORT_INTERVAL_NS)
+ {
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ parallel_vacuum_worker_delay_ns);
+
+ /* Reset variables. */
+ last_report_time = delay_end;
+ parallel_vacuum_worker_delay_ns = 0;
+ }
+ }
+ else if (is_analyze)
+ pgstat_progress_incr_param(PROGRESS_ANALYZE_DELAY_TIME,
+ INSTR_TIME_GET_NANOSEC(delay));
+ else
+ pgstat_progress_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ INSTR_TIME_GET_NANOSEC(delay));
+ }
+
/*
* We don't want to ignore postmaster death during very long vacuums
* with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index dc3322c256b..2b9d548cdeb 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1094,6 +1094,11 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ /* Report any remaining cost-based vacuum delay time */
+ if (track_cost_delay_timing)
+ pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_DELAY_TIME,
+ parallel_vacuum_worker_delay_ns);
+
TidStoreDetach(dead_items);
/* Pop the error context stack */
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ce7534d4d23..1efee7af176 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1470,6 +1470,15 @@ struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"track_cost_delay_timing", PGC_SUSET, STATS_CUMULATIVE,
+ gettext_noop("Collects timing statistics for cost-based vacuum delay."),
+ NULL
+ },
+ &track_cost_delay_timing,
+ false,
+ NULL, NULL, NULL
+ },
{
{"track_io_timing", PGC_SUSET, STATS_CUMULATIVE,
gettext_noop("Collects timing statistics for database I/O activity."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c40b7a3121e..6f77e5f8b26 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -632,6 +632,7 @@
#track_activities = on
#track_activity_query_size = 1024 # (change requires restart)
#track_counts = on
+#track_cost_delay_timing = off
#track_io_timing = off
#track_wal_io_timing = off
#track_functions = none # none, pl, all
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 18e3179ef63..7c736e7b03b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
#define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7
#define PROGRESS_VACUUM_INDEXES_TOTAL 8
#define PROGRESS_VACUUM_INDEXES_PROCESSED 9
+#define PROGRESS_VACUUM_DELAY_TIME 10
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
@@ -46,6 +47,7 @@
#define PROGRESS_ANALYZE_CHILD_TABLES_TOTAL 5
#define PROGRESS_ANALYZE_CHILD_TABLES_DONE 6
#define PROGRESS_ANALYZE_CURRENT_CHILD_TABLE_RELID 7
+#define PROGRESS_ANALYZE_DELAY_TIME 8
/* Phases of analyze (as advertised via PROGRESS_ANALYZE_PHASE) */
#define PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index b884304dfe7..b3eedf699af 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -296,6 +296,7 @@ extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
extern PGDLLIMPORT int vacuum_failsafe_age;
extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
+extern PGDLLIMPORT bool track_cost_delay_timing;
/*
* Maximum value for default_statistics_target and per-column statistics
@@ -313,6 +314,8 @@ extern PGDLLIMPORT bool VacuumFailsafeActive;
extern PGDLLIMPORT double vacuum_cost_delay;
extern PGDLLIMPORT int vacuum_cost_limit;
+extern PGDLLIMPORT int64 parallel_vacuum_worker_delay_ns;
+
/* in commands/vacuum.c */
extern void ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel);
extern void vacuum(List *relations, VacuumParams *params,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3361f6a69c9..5baba8d39ff 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1932,7 +1932,8 @@ pg_stat_progress_analyze| SELECT s.pid,
s.param5 AS ext_stats_computed,
s.param6 AS child_tables_total,
s.param7 AS child_tables_done,
- (s.param8)::oid AS current_child_table_relid
+ (s.param8)::oid AS current_child_table_relid,
+ ((s.param9)::double precision / (1000000)::double precision) AS delay_time
FROM (pg_stat_get_progress_info('ANALYZE'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_basebackup| SELECT pid,
@@ -2062,7 +2063,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param7 AS dead_tuple_bytes,
s.param8 AS num_dead_item_ids,
s.param9 AS indexes_total,
- s.param10 AS indexes_processed
+ s.param10 AS indexes_processed,
+ ((s.param11)::double precision / (1000000)::double precision) AS delay_time
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
v17-0003-Add-cost-based-delay-time-to-VACUUM-ANALYZE-VERB.patch (text/x-diff; charset=us-ascii)
From e270ddb7e599e22b90b4d6702596770f67569f07 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Tue, 11 Feb 2025 07:37:05 +0000
Subject: [PATCH v17 3/3] Add cost-based delay time to VACUUM/ANALYZE (VERBOSE)
This commit adds cost-based delay time to VACUUM/ANALYZE (VERBOSE) and to
autovacuum logs (if track_cost_delay_timing is enabled).
---
doc/src/sgml/config.sgml | 8 +++++---
src/backend/access/heap/vacuumlazy.c | 6 ++++++
src/backend/commands/analyze.c | 6 ++++++
3 files changed, 17 insertions(+), 3 deletions(-)
49.4% doc/src/sgml/
25.2% src/backend/access/heap/
25.3% src/backend/commands/
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 79a66ba7181..15001c01046 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8261,9 +8261,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
platforms. You can use the <xref linkend="pgtesttiming"/> tool to
measure the overhead of timing on your system. Cost-based vacuum delay
timing information is displayed in
- <link linkend="vacuum-progress-reporting"><structname>pg_stat_progress_vacuum</structname></link>
- and
- <link linkend="analyze-progress-reporting"><structname>pg_stat_progress_analyze</structname></link>.
+ <link linkend="vacuum-progress-reporting"><structname>pg_stat_progress_vacuum</structname></link>,
+ <link linkend="analyze-progress-reporting"><structname>pg_stat_progress_analyze</structname></link>,
+ in the output of <xref linkend="sql-vacuum"/> when the <literal>VERBOSE</literal>
+ option is used, by autovacuum for auto-vacuums and auto-analyzes,
+ when <xref linkend="guc-log-autovacuum-min-duration"/> is set.
Only superusers and users with the appropriate <literal>SET</literal>
privilege can change this setting.
</para>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index e4d6d654c0a..735a60b3d9a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -830,6 +830,12 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
appendStringInfo(&buf, _("I/O timings: read: %.3f ms, write: %.3f ms\n"),
read_ms, write_ms);
}
+ if (track_cost_delay_timing)
+ {
+ double delayed_ms = (double) MyBEEntry->st_progress_param[PROGRESS_VACUUM_DELAY_TIME] / 1000000.0;
+
+ appendStringInfo(&buf, _("delay time: %.3f ms\n"), delayed_ms);
+ }
if (secs_dur > 0 || usecs_dur > 0)
{
read_rate = (double) BLCKSZ * total_blks_read /
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index e4302f4cdb2..91751d15e66 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -816,6 +816,12 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
appendStringInfo(&buf, _("I/O timings: read: %.3f ms, write: %.3f ms\n"),
read_ms, write_ms);
}
+ if (track_cost_delay_timing)
+ {
+ double delayed_ms = (double) MyBEEntry->st_progress_param[PROGRESS_ANALYZE_DELAY_TIME] / 1000000.0;
+
+ appendStringInfo(&buf, _("delay time: %.3f ms\n"), delayed_ms);
+ }
appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
read_rate, write_rate);
appendStringInfo(&buf, _("buffer usage: %lld hits, %lld reads, %lld dirtied\n"),
--
2.34.1
v17-0001-Add-is_analyze-parameter-to-vacuum_delay_point.patch (text/x-diff; charset=us-ascii)
From f4640855bbeb0b2fb63e8ba5afff61c04302fa40 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 10 Feb 2025 11:41:22 -0600
Subject: [PATCH v17 1/3] Add is_analyze parameter to vacuum_delay_point().
This function is used in both vacuum and analyze code paths, and a
follow-up commit will require distinguishing between the two. This
commit forces callers to declare whether they are being used for
vacuum or analyze, but it does not use that information for
anything yet.
Author: Nathan Bossart <nathandbossart@gmail.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
---
contrib/bloom/blvacuum.c | 4 ++--
contrib/file_fdw/file_fdw.c | 2 +-
src/backend/access/gin/ginfast.c | 6 +++---
src/backend/access/gin/ginvacuum.c | 6 +++---
src/backend/access/gist/gistvacuum.c | 2 +-
src/backend/access/hash/hash.c | 2 +-
src/backend/access/heap/vacuumlazy.c | 4 ++--
src/backend/access/nbtree/nbtree.c | 2 +-
src/backend/access/spgist/spgvacuum.c | 4 ++--
src/backend/commands/analyze.c | 10 +++++-----
src/backend/commands/vacuum.c | 2 +-
src/backend/tsearch/ts_typanalyze.c | 2 +-
src/backend/utils/adt/array_typanalyze.c | 2 +-
src/backend/utils/adt/rangetypes_typanalyze.c | 2 +-
src/include/commands/vacuum.h | 2 +-
15 files changed, 26 insertions(+), 26 deletions(-)
7.5% contrib/bloom/
3.6% contrib/file_fdw/
22.9% src/backend/access/gin/
3.6% src/backend/access/gist/
3.7% src/backend/access/hash/
7.5% src/backend/access/heap/
3.6% src/backend/access/nbtree/
7.3% src/backend/access/spgist/
22.8% src/backend/commands/
3.6% src/backend/tsearch/
7.3% src/backend/utils/adt/
6.1% src/include/commands/
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index 7e1db0b52fc..86b15a75f6f 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -57,7 +57,7 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*itupPtr,
*itupEnd;
- vacuum_delay_point();
+ vacuum_delay_point(false);
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, info->strategy);
@@ -187,7 +187,7 @@ blvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
Buffer buffer;
Page page;
- vacuum_delay_point();
+ vacuum_delay_point(false);
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, info->strategy);
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 678e754b2b9..0655bf532a0 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -1237,7 +1237,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
for (;;)
{
/* Check for user-requested abort or sleep */
- vacuum_delay_point();
+ vacuum_delay_point(true);
/* Fetch next row */
MemoryContextReset(tupcontext);
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 4ab815fefe0..cc5d046c4b0 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -892,7 +892,7 @@ ginInsertCleanup(GinState *ginstate, bool full_clean,
*/
processPendingPage(&accum, &datums, page, FirstOffsetNumber);
- vacuum_delay_point();
+ vacuum_delay_point(false);
/*
* Is it time to flush memory to disk? Flush if we are at the end of
@@ -929,7 +929,7 @@ ginInsertCleanup(GinState *ginstate, bool full_clean,
{
ginEntryInsert(ginstate, attnum, key, category,
list, nlist, NULL);
- vacuum_delay_point();
+ vacuum_delay_point(false);
}
/*
@@ -1002,7 +1002,7 @@ ginInsertCleanup(GinState *ginstate, bool full_clean,
/*
* Read next page in pending list
*/
- vacuum_delay_point();
+ vacuum_delay_point(false);
buffer = ReadBuffer(index, blkno);
LockBuffer(buffer, GIN_SHARE);
page = BufferGetPage(buffer);
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index d98c54b7cf7..533c37b3c5f 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -662,12 +662,12 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
UnlockReleaseBuffer(buffer);
}
- vacuum_delay_point();
+ vacuum_delay_point(false);
for (i = 0; i < nRoot; i++)
{
ginVacuumPostingTree(&gvs, rootOfPostingTree[i]);
- vacuum_delay_point();
+ vacuum_delay_point(false);
}
if (blkno == InvalidBlockNumber) /* rightmost page */
@@ -748,7 +748,7 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
Buffer buffer;
Page page;
- vacuum_delay_point();
+ vacuum_delay_point(false);
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, info->strategy);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index fe0bfb781ca..dd0d9d5006c 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -283,7 +283,7 @@ restart:
recurse_to = InvalidBlockNumber;
/* call vacuum_delay_point while not holding any buffer lock */
- vacuum_delay_point();
+ vacuum_delay_point(false);
buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
info->strategy);
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 63b568e7f24..4167b33e683 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -716,7 +716,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
bool retain_pin = false;
bool clear_dead_marking = false;
- vacuum_delay_point();
+ vacuum_delay_point(false);
page = BufferGetPage(buf);
opaque = HashPageGetOpaque(page);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 075af385cd1..e4d6d654c0a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -946,7 +946,7 @@ lazy_scan_heap(LVRelState *vacrel)
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
blkno, InvalidOffsetNumber);
- vacuum_delay_point();
+ vacuum_delay_point(false);
/*
* Regularly check if wraparound failsafe should trigger.
@@ -2275,7 +2275,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
OffsetNumber offsets[MaxOffsetNumber];
int num_offsets;
- vacuum_delay_point();
+ vacuum_delay_point(false);
blkno = iter_result->blkno;
vacrel->blkno = blkno;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 971405e89af..dc244ae24c7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1137,7 +1137,7 @@ backtrack:
backtrack_to = P_NONE;
/* call vacuum_delay_point while not holding any buffer lock */
- vacuum_delay_point();
+ vacuum_delay_point(false);
/*
* We can't use _bt_getbuf() here because it always applies
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 894aefa19e1..1c52f6528ad 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -625,7 +625,7 @@ spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
Page page;
/* call vacuum_delay_point while not holding any buffer lock */
- vacuum_delay_point();
+ vacuum_delay_point(false);
buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
RBM_NORMAL, bds->info->strategy);
@@ -704,7 +704,7 @@ spgprocesspending(spgBulkDeleteState *bds)
continue; /* ignore already-done items */
/* call vacuum_delay_point while not holding any buffer lock */
- vacuum_delay_point();
+ vacuum_delay_point(false);
/* examine the referenced page */
blkno = ItemPointerGetBlockNumber(&pitem->tid);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index e5ab207d2ec..e4302f4cdb2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -915,7 +915,7 @@ compute_index_stats(Relation onerel, double totalrows,
{
HeapTuple heapTuple = rows[rowno];
- vacuum_delay_point();
+ vacuum_delay_point(true);
/*
* Reset the per-tuple context each time, to reclaim any cruft
@@ -1238,7 +1238,7 @@ acquire_sample_rows(Relation onerel, int elevel,
/* Outer loop over blocks to sample */
while (table_scan_analyze_next_block(scan, stream))
{
- vacuum_delay_point();
+ vacuum_delay_point(true);
while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
{
@@ -1970,7 +1970,7 @@ compute_trivial_stats(VacAttrStatsP stats,
Datum value;
bool isnull;
- vacuum_delay_point();
+ vacuum_delay_point(true);
value = fetchfunc(stats, i, &isnull);
@@ -2086,7 +2086,7 @@ compute_distinct_stats(VacAttrStatsP stats,
int firstcount1,
j;
- vacuum_delay_point();
+ vacuum_delay_point(true);
value = fetchfunc(stats, i, &isnull);
@@ -2433,7 +2433,7 @@ compute_scalar_stats(VacAttrStatsP stats,
Datum value;
bool isnull;
- vacuum_delay_point();
+ vacuum_delay_point(true);
value = fetchfunc(stats, i, &isnull);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e6745e6145c..5e394c151c9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2358,7 +2358,7 @@ vac_close_indexes(int nindexes, Relation *Irel, LOCKMODE lockmode)
* typically once per page processed.
*/
void
-vacuum_delay_point(void)
+vacuum_delay_point(bool is_analyze)
{
double msec = 0;
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index 1494da1c9d3..c5a71331ce8 100644
--- a/src/backend/tsearch/ts_typanalyze.c
+++ b/src/backend/tsearch/ts_typanalyze.c
@@ -204,7 +204,7 @@ compute_tsvector_stats(VacAttrStats *stats,
char *lexemesptr;
int j;
- vacuum_delay_point();
+ vacuum_delay_point(true);
value = fetchfunc(stats, vector_no, &isnull);
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 44a6eb5dad0..6f61629b977 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -314,7 +314,7 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
int distinct_count;
bool count_item_found;
- vacuum_delay_point();
+ vacuum_delay_point(true);
value = fetchfunc(stats, array_no, &isnull);
if (isnull)
diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c
index 9dc73af1992..a18196d8a34 100644
--- a/src/backend/utils/adt/rangetypes_typanalyze.c
+++ b/src/backend/utils/adt/rangetypes_typanalyze.c
@@ -167,7 +167,7 @@ compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
upper;
float8 length;
- vacuum_delay_point();
+ vacuum_delay_point(true);
value = fetchfunc(stats, range_no, &isnull);
if (isnull)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 12d0b61950d..b884304dfe7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -339,7 +339,7 @@ extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
struct VacuumCutoffs *cutoffs);
extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
extern void vac_update_datfrozenxid(void);
-extern void vacuum_delay_point(void);
+extern void vacuum_delay_point(bool is_analyze);
extern bool vacuum_is_permitted_for_relation(Oid relid, Form_pg_class reltuple,
bits32 options);
extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
--
2.34.1
On Tue, Feb 11, 2025 at 08:51:15AM +0000, Bertrand Drouvot wrote:
On Mon, Feb 10, 2025 at 02:52:46PM -0600, Nathan Bossart wrote:
Off-list, I've asked Bertrand to gauge the feasibility of adding this
information to the autovacuum logs and to VACUUM/ANALYZE (VERBOSE). IMHO
those are natural places to surface this information, and I want to ensure
that we're not painting ourselves into a corner with the approach we're
using for the progress views.
Yeah, I looked at it, and it looks as simple as the attached 0003 (since it's
the leader that does the reporting when parallel workers are used).
0001 and 0002 remain unchanged.
Thanks. I've committed 0001 and 0002.
--
nathan
Hi,
On Tue, Feb 11, 2025 at 04:42:26PM -0600, Nathan Bossart wrote:
On Tue, Feb 11, 2025 at 08:51:15AM +0000, Bertrand Drouvot wrote:
On Mon, Feb 10, 2025 at 02:52:46PM -0600, Nathan Bossart wrote:
Off-list, I've asked Bertrand to gauge the feasibility of adding this
information to the autovacuum logs and to VACUUM/ANALYZE (VERBOSE). IMHO
those are natural places to surface this information, and I want to ensure
that we're not painting ourselves into a corner with the approach we're
using for the progress views.
Yeah, I looked at it, and it looks as simple as the attached 0003 (since it's
the leader that does the reporting when parallel workers are used).
0001 and 0002 remain unchanged.
Thanks. I've committed 0001 and 0002.
Thanks! Regarding 0003, I think it's ok to keep it in this thread (and not
create a dedicated one), as it still fits well with $SUBJECT (and the folks
interested in it are probably already part of this thread). Sounds good to you?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Feb 12, 2025 at 06:19:12AM +0000, Bertrand Drouvot wrote:
Thanks! Regarding 0003, I think it's ok to keep it in this thread (and not
create a dedicated one), as it still fits well with $SUBJECT (and the folks
interested in it are probably already part of this thread). Sounds good to you?
Yup. Here is what I have staged for commit. I'll create a commitfest entry
for this one and give it a couple days for review, but this seems pretty
straightforward.
--
nathan
Attachments:
v18-0001-Add-delay-time-to-VACUUM-ANALYZE-VERBOSE-and-aut.patch (text/plain; charset=us-ascii)
From 3853d2e72a2f3d2de43080482450922ebb715c37 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 12 Feb 2025 11:03:56 -0600
Subject: [PATCH v18 1/1] Add delay time to VACUUM/ANALYZE (VERBOSE) and
autovacuum logs.
Commit bb8dff9995 added this information to the
pg_stat_progress_vacuum and pg_stat_progress_analyze system views.
This commit adds the same information to the output of VACUUM and
ANALYZE with the VERBOSE option and to the autovacuum logs.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
---
doc/src/sgml/config.sgml | 9 ++++++---
src/backend/access/heap/vacuumlazy.c | 3 +++
src/backend/commands/analyze.c | 3 +++
3 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5e4f201e099..60829b79d83 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8267,9 +8267,12 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
platforms. You can use the <xref linkend="pgtesttiming"/> tool to
measure the overhead of timing on your system. Cost-based vacuum delay
timing information is displayed in
- <link linkend="vacuum-progress-reporting"><structname>pg_stat_progress_vacuum</structname></link>
- and
- <link linkend="analyze-progress-reporting"><structname>pg_stat_progress_analyze</structname></link>.
+ <link linkend="vacuum-progress-reporting"><structname>pg_stat_progress_vacuum</structname></link>,
+ <link linkend="analyze-progress-reporting"><structname>pg_stat_progress_analyze</structname></link>,
+ in the output of <xref linkend="sql-vacuum"/> when the
+ <literal>VERBOSE</literal> option is used, and by autovacuum for
+ auto-vacuums and auto-analyzes when
+ <xref linkend="guc-log-autovacuum-min-duration"/> is set.
Only superusers and users with the appropriate <literal>SET</literal>
privilege can change this setting.
</para>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3df5b92afb8..a7016185476 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1083,6 +1083,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
istat->pages_deleted,
istat->pages_free);
}
+ if (track_cost_delay_timing)
+ appendStringInfo(&buf, _("cost delay time: %.3f ms\n"),
+ (double) MyBEEntry->st_progress_param[PROGRESS_VACUUM_DELAY_TIME] / 1000000.0);
if (track_io_timing)
{
double read_ms = (double) (pgStatBlockReadTime - startreadtime) / 1000;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index e4302f4cdb2..1a94eac6381 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -808,6 +808,9 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
get_database_name(MyDatabaseId),
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel));
+ if (track_cost_delay_timing)
+ appendStringInfo(&buf, _("cost delay time: %.3f ms\n"),
+ (double) MyBEEntry->st_progress_param[PROGRESS_ANALYZE_DELAY_TIME] / 1000000.0);
if (track_io_timing)
{
double read_ms = (double) (pgStatBlockReadTime - startreadtime) / 1000;
--
2.39.5 (Apple Git-154)
Hi,
On Wed, Feb 12, 2025 at 11:13:13AM -0600, Nathan Bossart wrote:
On Wed, Feb 12, 2025 at 06:19:12AM +0000, Bertrand Drouvot wrote:
Thanks! Regarding 0003, I think it's ok to keep it in this thread (and not
create a dedicated one), as it still fits well with $SUBJECT (and the folks
interested in it are probably already part of this thread). Sounds good to you?
Yup. Here is what I have staged for commit.
Thanks! I can see that those changes have been made:
1. The order in the output (it's now displayed before "track_io_timing")
2. The "and" addition in the doc
All of the above is fine by me.
I'll create a commitfest entry
for this one and give it a couple days for review, but this seems pretty
straightforward.
Yup, sounds like a plan, thanks!
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Committed.
I noticed two things that I felt I should note here:
* For the vacuum path, we call pgstat_progress_end_command() prior to
accessing the value. This works because pgstat_progress_end_command()
doesn't clear the st_progress_param array (that is done in
pgstat_progress_start_command()). AFAICT it's worked this way since this
stuff was introduced ~9 years ago, and I have no reason to believe it
will change anytime soon.
* We are bypassing the changecount mechanism when accessing the value. I
believe this is okay because the calling process is the only one that
updates these values. Even in the parallel worker case, the worker sends
a message to the leader to increment the value. Perhaps this could break
in the future if we switched to using atomics or something, but that
approach was already considered and abandoned once before [0], and the
worst-case scenario would likely be compilation errors or bogus delay
values.
So, I chose to just add comments about this stuff for now; the access
pattern in question is sketched below. If someone feels strongly that we
should do pro forma changecount checks before
pgstat_progress_end_command(), I'm happy to draft the patch.
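Concretely, this is how it ends up in heap_vacuum_rel() (a condensed sketch
covering both notes above, not a verbatim excerpt; the surrounding logging
code is elided):

/*
 * Ending progress reporting here is fine: pgstat_progress_end_command()
 * does not clear st_progress_param; only pgstat_progress_start_command()
 * resets the array, so the value below remains readable.
 */
pgstat_progress_end_command();

/*
 * Read our own backend entry directly, bypassing the changecount
 * protocol: only this process updates these values (parallel workers
 * increment them via messages to the leader).
 */
if (track_cost_delay_timing)
	appendStringInfo(&buf, _("cost delay time: %.3f ms\n"),
					 (double) MyBEEntry->st_progress_param[PROGRESS_VACUUM_DELAY_TIME] / 1000000.0);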
[0]: /messages/by-id/72CD33F6-C2B5-45E4-A78F-85EC923DCF0F@amazon.com
--
nathan