Restart pg_usleep when interrupted
Attachment protected by Amazon:
[0001-Handle-Sleep-interrupts.patch]
https://us-east-1.secure-attach.amazon.com/fcdc82ce-7887-4aa1-af9e-c1161a6b1d6f/bc81fa24-41de-48f9-a767-a6d15801754b
Amazon has replaced attachment with download link. Downloads will be available until July 28, 2024, 19:59 (UTC+00:00).
[Tell us what you think] https://amazonexteu.qualtrics.com/jfe/form/SV_ehuz6zGo8YnsRKK
[For more information click here] https://docs.secure-attach.amazon.com/guide
Hi,
In the proposal by Bertrand [1]/messages/by-id/ZnbIIUQpFJKAJx2W@ip-10-97-1-34.eu-west-3.compute.internal to implement vacuum cost delay tracking
in pg_stat_progress_vacuum, it was discovered that the vacuum cost delay
ends early on the leader process of a parallel vacuum due to parallel workers
reporting progress on index vacuuming, which was introduced in 17
with commit [2]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=46ebdfe164c61fbac961d1eb7f40e9a684289ae6. With this patch, everytime a parallel worker
completes a vacuum index, it will send a completion message to the leader.
The facility that allows a parallel worker to report progress to the leader was
introduced in commit [3]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=f1889729dd3ab0352dc0ccc2ffcc1b1901f8e39f.
In the case of the patch being proposed by Bertrand, the number of interrupts
will be much more frequent as parallel workers would send a message to the leader
to update the vacuum delay counters every vacuum_delay_point call.
Looking into this, the ideal solution appears to provide the ability for a pg_usleep
call to restart after being interrupted. Thanks to the usage of nanosleep as
of commit[4]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a948e49e2ef11815be0b211723bfc5b53b7f75a8, this should be trivial to do as nanosleep
provides a remaining time, which can be used to restart the sleep. This
idea was also mentioned in [5]/messages/by-id/24848.1473607575@sss.pgh.pa.us.
I am attaching a POC patch that does the $SUBJECT. Rather than changing the
existing pg_usleep, the patch introduces a function, pg_usleep_handle_interrupt,
that takes in the sleep duration and a boolean flag to force handling of the
interrupt, or not.
This function can replace pg_usleep inside vacuum_delay_point, and could be
useful in other cases in which it’s important to handle interrupts.
For Windows, the “force” = “true” flag of the new function uses SleepEx which
from what I can tell based on the code comments is a non-interruptible sleep.
Thoughts?
[1]: /messages/by-id/ZnbIIUQpFJKAJx2W@ip-10-97-1-34.eu-west-3.compute.internal
[2]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=46ebdfe164c61fbac961d1eb7f40e9a684289ae6
[3]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=f1889729dd3ab0352dc0ccc2ffcc1b1901f8e39f
[4]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a948e49e2ef11815be0b211723bfc5b53b7f75a8
[5]: /messages/by-id/24848.1473607575@sss.pgh.pa.us
Regards,
Sami Imseih
Amazon Web Services (AWS)
Import Notes
Reference msg id not found: 152895F3-6B22-45ED-8038-9E14A72AAC88@amazon.com
Hi,
I think you need to find a way to disable this "Attachment protected
by Amazon" thing:
/messages/by-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000@email.amazonses.com
We want patches to be in the pgsql-hackers archives, not temporarily
accessible via some link.
...Robert
I think you need to find a way to disable this "Attachment protected
by Amazon" thing:
Yes, I am looking into that. I only noticed after sending the email.
Sorry about that.
Sami
We want patches to be in the pgsql-hackers archives, not temporarily
accessible via some link.…Robert
Moving to another email going forward.
Reattaching the patch.
Attachments:
0001-Handle-Sleep-interrupts.patchapplication/octet-stream; name=0001-Handle-Sleep-interrupts.patch; x-unix-mode=0644Download
From 0a22878b8a79347e77f82449a84e93c7ca78ecb9 Mon Sep 17 00:00:00 2001
From: Sami Imseih <simseih@amazon.com>
Date: Fri, 28 Jun 2024 14:40:12 -0500
Subject: [PATCH 1/1] Handle Sleep interrupts
---
src/backend/commands/vacuum.c | 2 +-
src/backend/port/win32/signal.c | 8 +++++++-
src/include/port.h | 1 +
src/port/pgsleep.c | 14 +++++++++++++-
4 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..b53d137bf5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2384,7 +2384,7 @@ vacuum_delay_point(void)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
+ pg_usleep_handle_interrupt(msec * 1000, true);
pgstat_report_wait_end();
/*
diff --git a/src/backend/port/win32/signal.c b/src/backend/port/win32/signal.c
index 285cb611b4..b7a26b5eb5 100644
--- a/src/backend/port/win32/signal.c
+++ b/src/backend/port/win32/signal.c
@@ -52,7 +52,13 @@ static BOOL WINAPI pg_console_handler(DWORD dwCtrlType);
void
pg_usleep(long microsec)
{
- if (unlikely(pgwin32_signal_event == NULL))
+ pg_usleep_handle_interrupt(microsec, false);
+}
+
+void
+pg_usleep_handle_interrupt(long microsec, bool force)
+{
+ if (unlikely(pgwin32_signal_event == NULL) || force)
{
/*
* If we're reached by pgwin32_open_handle() early in startup before
diff --git a/src/include/port.h b/src/include/port.h
index ae115d2d97..b6b74d05e8 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -162,6 +162,7 @@ extern int pg_disable_aslr(void);
/* Portable delay handling */
extern void pg_usleep(long microsec);
+extern void pg_usleep_handle_interrupt(long microsec, bool force);
/* Portable SQL-like case-independent comparisons and conversions */
extern int pg_strcasecmp(const char *s1, const char *s2);
diff --git a/src/port/pgsleep.c b/src/port/pgsleep.c
index 1284458bfc..b45729e219 100644
--- a/src/port/pgsleep.c
+++ b/src/port/pgsleep.c
@@ -39,15 +39,27 @@
*/
void
pg_usleep(long microsec)
+{
+ pg_usleep_handle_interrupt(microsec, false);
+}
+
+void
+pg_usleep_handle_interrupt(long microsec, bool force)
{
if (microsec > 0)
{
#ifndef WIN32
struct timespec delay;
+ struct timespec remaining;
delay.tv_sec = microsec / 1000000L;
delay.tv_nsec = (microsec % 1000000L) * 1000;
- (void) nanosleep(&delay, NULL);
+
+ if (force)
+ while (nanosleep(&delay, &remaining) == -1 && errno == EINTR)
+ delay = remaining;
+ else
+ (void) nanosleep(&delay, NULL);
#else
SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
#endif
--
2.39.3 (Apple Git-146)
Sami Imseih <samimseih@gmail.com> writes:
Reattaching the patch.
I feel like this is fundamentally a wrong solution, for the reasons
cited in the comment for pg_usleep: long sleeps are a bad idea
because of the resulting uncertainty about whether we'll respond to
interrupts and such promptly. An example here is that if we get
a query cancel interrupt, we should probably not insist on finishing
out the current sleep before responding.
Therefore, rather than "improving" pg_usleep (and uglifying its API),
the right answer is to fix parallel vacuum leaders to not depend on
pg_usleep in the first place. A better idea might be to use
pg_sleep() or equivalent code.
regards, tom lane
On Sat, Jun 29, 2024 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Therefore, rather than "improving" pg_usleep (and uglifying its API),
the right answer is to fix parallel vacuum leaders to not depend on
pg_usleep in the first place. A better idea might be to use
pg_sleep() or equivalent code.
In case it's useful for someone looking into that, in earlier
discussions we figured out that it is possible to have high resolution
timeouts AND support latch multiplexing on Linux, FreeBSD, macOS:
/messages/by-id/CA+hUKG+hC9mFx8tEcBsyo7-cAfWgtbRy1eDizeFuff2K7T=4bA@mail.gmail.com
Thanks for the feedback!
On Jun 28, 2024, at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Sami Imseih <samimseih@gmail.com> writes:
Reattaching the patch.
I feel like this is fundamentally a wrong solution, for the reasons
cited in the comment for pg_usleep: long sleeps are a bad idea
because of the resulting uncertainty about whether we'll respond to
interrupts and such promptly. An example here is that if we get
a query cancel interrupt, we should probably not insist on finishing
out the current sleep before responding.
The case which brought up this discussion is the pg_usleep that
is called within the vacuum_delay_point being interrupted.
When I read the same code comment you cited, it sounded to me
that “long sleeps” are those that are in seconds or minutes. The longest
vacuum delay allowed is 100ms.
Therefore, rather than "improving" pg_usleep (and uglifying its API),
the right answer is to fix parallel vacuum leaders to not depend on
pg_usleep in the first place. A better idea might be to use
pg_sleep() or equivalent code.
Yes, that is a good idea to explore and it will not require introducing
an awkward new API. I will look into using something similar to
pg_sleep.
Regards,
Sami
Therefore, rather than "improving" pg_usleep (and uglifying its API),
the right answer is to fix parallel vacuum leaders to not depend on
pg_usleep in the first place. A better idea might be to use
pg_sleep() or equivalent code.Yes, that is a good idea to explore and it will not require introducing
an awkward new API. I will look into using something similar to
pg_sleep.
Looking through the history of the sleep in vacuum_delay_point, commit
720de00af49 replaced WaitLatch with pg_usleep to allow for microsecond
sleep precision [1]/messages/by-id/CAAKRu_b-q0hXCBUCAATh0Z4Zi6UkiC0k2DFgoD3nC-r3SkR3tg@mail.gmail.com.
Thomas has proposed a WaitLatchUs implementation in [2]/messages/by-id/CA+hUKGKVbJE59JkwnUj5XMY+-rzcTFciV9vVC7i=LUfWPds8Xw@mail.gmail.com, but I have not
yet tried it.
So I see there are 2 possible options here to deal with the interrupt of a
parallel vacuum leader when a message is sent by a parallel vacuum worker.
Option 1/ something like my initial proposal which is
to create a function similar to pg_usleep that is able to deal with
interrupts in a sleep. This could be a function scoped only to vacuum.c,
so it can only be used for vacuum delay purposes.
——
Option 2/ to explore the WaitLatchUs implementation by
Thomas which will give both a latch implementation for a sleep with
the microsecond precision.
It is worth mentioning that if we do end up using WaitLatch(Us) inside
vacuum_delay_point, it will need to set only WL_TIMEOUT and
WL_EXIT_ON_PM_DEATH.
i.e.
(void) WaitLatch(MyLatch, WL_TIMEOUT| WL_EXIT_ON_PM_DEATH,
msec
WAIT_EVENT_VACUUM_DELAY);
This way it is not interrupted by a WL_LATCH_SET when a message
is set by a parallel worker.
——
Ultimately, I think option 2 may be worth a closer look as it is a cleaner
and safer approach, to detect a postmaster death.
Thoughts?
[1]: /messages/by-id/CAAKRu_b-q0hXCBUCAATh0Z4Zi6UkiC0k2DFgoD3nC-r3SkR3tg@mail.gmail.com
[2]: /messages/by-id/CA+hUKGKVbJE59JkwnUj5XMY+-rzcTFciV9vVC7i=LUfWPds8Xw@mail.gmail.com
Hi,
On Fri, Jun 28, 2024 at 05:50:20PM -0500, Sami Imseih wrote:
Thanks for the feedback!
On Jun 28, 2024, at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Sami Imseih <samimseih@gmail.com> writes:
Reattaching the patch.
I feel like this is fundamentally a wrong solution, for the reasons
cited in the comment for pg_usleep: long sleeps are a bad idea
because of the resulting uncertainty about whether we'll respond to
interrupts and such promptly. An example here is that if we get
a query cancel interrupt, we should probably not insist on finishing
out the current sleep before responding.The case which brought up this discussion is the pg_usleep that
is called within the vacuum_delay_point being interrupted.When I read the same code comment you cited, it sounded to me
that “long sleeps” are those that are in seconds or minutes. The longest
vacuum delay allowed is 100ms.
I think that with the proposed patch the actual wait time can be "long".
Indeed, the time between the interruptions and restarts of the nanosleep() call
will lead to drift (as mentioned in the nanosleep() man page). So, with a large
number of interruptions, the actual wait time could be way longer than the
expected wait time.
To put numbers, I did a few tests with the patch (and with v2 shared in [1]/messages/by-id/ZmaXmWDL829fzAVX@ip-10-97-1-34.eu-west-3.compute.internal):
cost delay is 1ms and cost limit is 10.
With 50 indexes and 10 parallel workers I can see things like:
2024-07-02 08:22:23.789 UTC [2189616] LOG: expected 1.000000, actual 239.378368
2024-07-02 08:22:24.575 UTC [2189616] LOG: expected 0.100000, actual 224.331737
2024-07-02 08:22:25.363 UTC [2189616] LOG: expected 1.300000, actual 230.462793
2024-07-02 08:22:26.154 UTC [2189616] LOG: expected 1.000000, actual 225.980803
Means we waited more than the max allowed cost delay (100ms).
With 49 parallel workers, it's worst as I can see things like:
2024-07-02 08:26:36.069 UTC [2189807] LOG: expected 1.000000, actual 1106.790136
2024-07-02 08:26:36.298 UTC [2189807] LOG: expected 1.000000, actual 218.148985
The first actual wait time is about 1 second (it has been interrupted about
16300 times during this second).
To avoid this drift, the nanosleep() man page suggests to use clock_nanosleep()
with an absolute time value, that might be another idea to explore.
[1]: /messages/by-id/ZmaXmWDL829fzAVX@ip-10-97-1-34.eu-west-3.compute.internal
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
With 50 indexes and 10 parallel workers I can see things like:
2024-07-02 08:22:23.789 UTC [2189616] LOG: expected 1.000000, actual 239.378368
2024-07-02 08:22:24.575 UTC [2189616] LOG: expected 0.100000, actual 224.331737
2024-07-02 08:22:25.363 UTC [2189616] LOG: expected 1.300000, actual 230.462793
2024-07-02 08:22:26.154 UTC [2189616] LOG: expected 1.000000, actual 225.980803Means we waited more than the max allowed cost delay (100ms).
With 49 parallel workers, it's worst as I can see things like:
2024-07-02 08:26:36.069 UTC [2189807] LOG: expected 1.000000, actual 1106.790136
2024-07-02 08:26:36.298 UTC [2189807] LOG: expected 1.000000, actual 218.148985The first actual wait time is about 1 second (it has been interrupted about
16300 times during this second).To avoid this drift, the nanosleep() man page suggests to use clock_nanosleep()
with an absolute time value, that might be another idea to explore.[1]: /messages/by-id/ZmaXmWDL829fzAVX@ip-10-97-1-34.eu-west-3.compute.internal
I could not reproduce the same time you drift you observed on my
machine, so I am guessing the time drift could be worse on certain
platforms than others.
I also looked into the WaitLatchUs patch proposed by Thomas in [1]/messages/by-id/CA+hUKGKVbJE59JkwnUj5XMY+-rzcTFciV9vVC7i=LUfWPds8Xw@mail.gmail.com
and since my system does have epoll_pwait(2) available, I could not
achieve the sub-millisecond wait times.
A more portable approach which could be to continue using nanosleep and
add checks to ensure that nanosleep exists whenever
it goes past an absolute time. This was suggested by Bertrand in an offline
conversation. I am not yet fully convinced of this idea, but posting the patch
that implements this idea for anyone interested in looking.
Since sub-millisecond sleep times are not guaranteed as suggested by
the vacuum_cost_delay docs ( see below ), an alternative idea
is to use clock_nanosleep for vacuum delay when it’s available, else
fallback to WaitLatch.
"While vacuum_cost_delay can be set to fractional-millisecond values,
such delays may not be measured accurately on older platforms”
[1]: /messages/by-id/CA+hUKGKVbJE59JkwnUj5XMY+-rzcTFciV9vVC7i=LUfWPds8Xw@mail.gmail.com
Regards,
Sami

Attachments:
0001-vaccum_delay-with-absolute-time-nanosleep.patchapplication/octet-stream; name=0001-vaccum_delay-with-absolute-time-nanosleep.patch; x-unix-mode=0644Download
From 942de12a49f53e20ab460d68191a6dc576f0750b Mon Sep 17 00:00:00 2001
From: EC2 Default User <ec2-user@ip-172-31-18-99.ec2.internal>
Date: Fri, 5 Jul 2024 16:47:21 +0000
Subject: [PATCH 1/1] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 74 +++++++++++++++++++++++++++++------
1 file changed, 63 insertions(+), 11 deletions(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..bfa024f583 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -116,6 +117,58 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
+static void vacuum_sleep(double msec);
+
+static
+void vacuum_sleep(double msec)
+{
+ long microsec = msec * 1000;
+
+ if (microsec > 0)
+ {
+#ifndef WIN32
+ /*
+ * We allow nanosleep to handle interrupts and retry with the remaining time.
+ * However, since nanosleep is susceptible to time drift when interrupted
+ * frequently, we add a safeguard to break out of the nanosleep whenever the
+ * current time is past the absolute time. The absolute time for sleeping is
+ * set before the nanosleep loop starts and the current time is checked
+ * within the loop whenever nanosleep encounters an interrupt.
+ */
+ struct timespec delay;
+ struct timespec remain;
+ struct timespec absolute;
+
+ clock_gettime(PG_INSTR_CLOCK, &absolute);
+
+ absolute.tv_sec += microsec / 1000000L;
+ absolute.tv_nsec += (microsec % 1000000L) * 1000;
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while(nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ struct timespec current;
+ float time_diff;
+
+ clock_gettime(PG_INSTR_CLOCK, ¤t);
+
+ time_diff = (absolute.tv_sec - current.tv_sec) + (absolute.tv_nsec - current.tv_nsec) / 1000000000.0;
+
+ if (time_diff <= 0)
+ break;
+
+ delay = remain;
+ }
+#else
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+#endif
+ }
+
+ if (IsUnderPostmaster && !PostmasterIsAlive())
+ exit(1);
+}
/*
* GUC check function to ensure GUC value specified is within the allowable
@@ -2380,21 +2433,20 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_end;
+ instr_time delayed_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
- pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
- pgstat_report_wait_end();
+ INSTR_TIME_SET_CURRENT(delay_start);
+ vacuum_sleep(msec);
+ INSTR_TIME_SET_CURRENT(delay_end);
- /*
- * We don't want to ignore postmaster death during very long vacuums
- * with vacuum_cost_delay configured. We can't use the usual
- * WaitLatch() approach here because we want microsecond-based sleep
- * durations above.
- */
- if (IsUnderPostmaster && !PostmasterIsAlive())
- exit(1);
+ INSTR_TIME_SET_ZERO(delayed_time);
+ INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start);
+ elog(LOG, "msec = %lf, delayed_time = %lf", msec, INSTR_TIME_GET_MILLISEC(delayed_time));
VacuumCostBalance = 0;
--
2.40.1
A more portable approach which could be to continue using nanosleep and
add checks to ensure that nanosleep exists whenever
it goes past an absolute time. This was suggested by Bertrand in an offline
conversation. I am not yet fully convinced of this idea, but posting the patch
that implements this idea for anyone interested in looking.
oops, forgot to attach the patch. Here it is.
Attachments:
0001-vaccum_delay-with-absolute-time-nanosleep.patchapplication/octet-stream; name=0001-vaccum_delay-with-absolute-time-nanosleep.patch; x-unix-mode=0644Download
From 942de12a49f53e20ab460d68191a6dc576f0750b Mon Sep 17 00:00:00 2001
From: EC2 Default User <ec2-user@ip-172-31-18-99.ec2.internal>
Date: Fri, 5 Jul 2024 16:47:21 +0000
Subject: [PATCH 1/1] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 74 +++++++++++++++++++++++++++++------
1 file changed, 63 insertions(+), 11 deletions(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..bfa024f583 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
+#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -116,6 +117,58 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
+static void vacuum_sleep(double msec);
+
+static
+void vacuum_sleep(double msec)
+{
+ long microsec = msec * 1000;
+
+ if (microsec > 0)
+ {
+#ifndef WIN32
+ /*
+ * We allow nanosleep to handle interrupts and retry with the remaining time.
+ * However, since nanosleep is susceptible to time drift when interrupted
+ * frequently, we add a safeguard to break out of the nanosleep whenever the
+ * current time is past the absolute time. The absolute time for sleeping is
+ * set before the nanosleep loop starts and the current time is checked
+ * within the loop whenever nanosleep encounters an interrupt.
+ */
+ struct timespec delay;
+ struct timespec remain;
+ struct timespec absolute;
+
+ clock_gettime(PG_INSTR_CLOCK, &absolute);
+
+ absolute.tv_sec += microsec / 1000000L;
+ absolute.tv_nsec += (microsec % 1000000L) * 1000;
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while(nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ struct timespec current;
+ float time_diff;
+
+ clock_gettime(PG_INSTR_CLOCK, ¤t);
+
+ time_diff = (absolute.tv_sec - current.tv_sec) + (absolute.tv_nsec - current.tv_nsec) / 1000000000.0;
+
+ if (time_diff <= 0)
+ break;
+
+ delay = remain;
+ }
+#else
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+#endif
+ }
+
+ if (IsUnderPostmaster && !PostmasterIsAlive())
+ exit(1);
+}
/*
* GUC check function to ensure GUC value specified is within the allowable
@@ -2380,21 +2433,20 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+ instr_time delay_start;
+ instr_time delay_end;
+ instr_time delayed_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
- pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
- pgstat_report_wait_end();
+ INSTR_TIME_SET_CURRENT(delay_start);
+ vacuum_sleep(msec);
+ INSTR_TIME_SET_CURRENT(delay_end);
- /*
- * We don't want to ignore postmaster death during very long vacuums
- * with vacuum_cost_delay configured. We can't use the usual
- * WaitLatch() approach here because we want microsecond-based sleep
- * durations above.
- */
- if (IsUnderPostmaster && !PostmasterIsAlive())
- exit(1);
+ INSTR_TIME_SET_ZERO(delayed_time);
+ INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start);
+ elog(LOG, "msec = %lf, delayed_time = %lf", msec, INSTR_TIME_GET_MILLISEC(delayed_time));
VacuumCostBalance = 0;
--
2.40.1
Hi,
On Fri, Jul 05, 2024 at 11:49:45AM -0500, Sami Imseih wrote:
With 50 indexes and 10 parallel workers I can see things like:
2024-07-02 08:22:23.789 UTC [2189616] LOG: expected 1.000000, actual 239.378368
2024-07-02 08:22:24.575 UTC [2189616] LOG: expected 0.100000, actual 224.331737
2024-07-02 08:22:25.363 UTC [2189616] LOG: expected 1.300000, actual 230.462793
2024-07-02 08:22:26.154 UTC [2189616] LOG: expected 1.000000, actual 225.980803Means we waited more than the max allowed cost delay (100ms).
With 49 parallel workers, it's worst as I can see things like:
2024-07-02 08:26:36.069 UTC [2189807] LOG: expected 1.000000, actual 1106.790136
2024-07-02 08:26:36.298 UTC [2189807] LOG: expected 1.000000, actual 218.148985The first actual wait time is about 1 second (it has been interrupted about
16300 times during this second).To avoid this drift, the nanosleep() man page suggests to use clock_nanosleep()
with an absolute time value, that might be another idea to explore.[1]: /messages/by-id/ZmaXmWDL829fzAVX@ip-10-97-1-34.eu-west-3.compute.internal
A more portable approach which could be to continue using nanosleep and
add checks to ensure that nanosleep exists whenever
it goes past an absolute time. This was suggested by Bertrand in an offline
conversation. I am not yet fully convinced of this idea, but posting the patch
that implements this idea for anyone interested in looking.
Thanks!
I did a few tests with the patch and did not see any "large" drifts like the
ones observed above.
As far the patch, not thoroughly review (as it's still one option among others
being discussed)):
+ struct timespec current;
+ float time_diff;
+
+ clock_gettime(PG_INSTR_CLOCK, ¤t);
+
+ time_diff = (absolute.tv_sec - current.tv_sec) + (absolute.tv_nsec - current.tv_nsec) / 1000000000.0;
I think it could be "simplified" by making use of instr_time instead of timespec
for current and absolute. Then I think it would be enough to compare their
ticks.
Since sub-millisecond sleep times are not guaranteed as suggested by
the vacuum_cost_delay docs ( see below ), an alternative idea
is to use clock_nanosleep for vacuum delay when it’s available, else
fallback to WaitLatch.
Wouldn't that increase even more the cases where sub-millisecond won't be
guaranteed?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
I did a few tests with the patch and did not see any "large" drifts like the
ones observed above.
Thanks for testing.
I think it could be "simplified" by making use of instr_time instead of timespec
for current and absolute. Then I think it would be enough to compare their
ticks.
Correct I attached a v2 of this patch that uses instr_time to check the elapsed
time and break out of the loop. It needs some more benchmarking.
Since sub-millisecond sleep times are not guaranteed as suggested by
the vacuum_cost_delay docs ( see below ), an alternative idea
is to use clock_nanosleep for vacuum delay when it’s available, else
fallback to WaitLatch.Wouldn't that increase even more the cases where sub-millisecond won't be
guaranteed?
Yes, nanosleep is going to provide the most coverage as it’s widely available.
Regards,
Sami

Attachments:
v2-0001-vaccum_delay-with-absolute-time-nanosleep.patchapplication/octet-stream; name=v2-0001-vaccum_delay-with-absolute-time-nanosleep.patch; x-unix-mode=0644Download
From f0b21250875726bd429bb9657b4037be41f745e4 Mon Sep 17 00:00:00 2001
From: Ubuntu <ubuntu@ip-172-31-89-33.ec2.internal>
Date: Thu, 11 Jul 2024 15:05:37 +0000
Subject: [PATCH 1/1] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 49 ++++++++++++++++++++++++++++++++++-
1 file changed, 48 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..a65295b2ff 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -116,6 +116,53 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
+static void vacuum_sleep(double msec);
+
+static
+void vacuum_sleep(double msec)
+{
+ long microsec = msec * 1000;
+
+ if (microsec > 0)
+ {
+#ifndef WIN32
+ /*
+ * We allow nanosleep to handle interrupts and retry with the remaining time.
+ * However, since nanosleep is susceptible to time drift when interrupted
+ * frequently, we add a safeguard to break out of the nanosleep whenever the
+ * total time of the sleep exceeds the requested sleep time. Using nanosleep
+ * is a more portable approach than clock_nanosleep.
+ */
+ struct timespec delay;
+ struct timespec remain;
+ struct instr_time start_time;
+
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while(nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ struct instr_time current_time;
+ struct instr_time elapsed_time;
+
+ INSTR_TIME_SET_CURRENT(current_time);
+
+ INSTR_TIME_SET_ZERO(elapsed_time);
+ INSTR_TIME_ACCUM_DIFF(elapsed_time, current_time, start_time);
+
+ if (INSTR_TIME_GET_MICROSEC(elapsed_time) >= microsec)
+ break;
+
+ delay = remain;
+ }
+
+#else
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+#endif
+ }
+}
/*
* GUC check function to ensure GUC value specified is within the allowable
@@ -2384,7 +2431,7 @@ vacuum_delay_point(void)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
+ vacuum_sleep(msec);
pgstat_report_wait_end();
/*
--
2.43.0
Correct I attached a v2 of this patch that uses instr_time to check the elapsed
time and break out of the loop. It needs some more benchmarking.
My email client has an issue sending attachments it seems. Reattaching
Attachments:
v2-0001-vaccum_delay-with-absolute-time-nanosleep.patchapplication/octet-stream; name=v2-0001-vaccum_delay-with-absolute-time-nanosleep.patch; x-unix-mode=0644Download
From f0b21250875726bd429bb9657b4037be41f745e4 Mon Sep 17 00:00:00 2001
From: Ubuntu <ubuntu@ip-172-31-89-33.ec2.internal>
Date: Thu, 11 Jul 2024 15:05:37 +0000
Subject: [PATCH 1/1] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 49 ++++++++++++++++++++++++++++++++++-
1 file changed, 48 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..a65295b2ff 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -116,6 +116,53 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
+static void vacuum_sleep(double msec);
+
+static
+void vacuum_sleep(double msec)
+{
+ long microsec = msec * 1000;
+
+ if (microsec > 0)
+ {
+#ifndef WIN32
+ /*
+ * We allow nanosleep to handle interrupts and retry with the remaining time.
+ * However, since nanosleep is susceptible to time drift when interrupted
+ * frequently, we add a safeguard to break out of the nanosleep whenever the
+ * total time of the sleep exceeds the requested sleep time. Using nanosleep
+ * is a more portable approach than clock_nanosleep.
+ */
+ struct timespec delay;
+ struct timespec remain;
+ struct instr_time start_time;
+
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while(nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ struct instr_time current_time;
+ struct instr_time elapsed_time;
+
+ INSTR_TIME_SET_CURRENT(current_time);
+
+ INSTR_TIME_SET_ZERO(elapsed_time);
+ INSTR_TIME_ACCUM_DIFF(elapsed_time, current_time, start_time);
+
+ if (INSTR_TIME_GET_MICROSEC(elapsed_time) >= microsec)
+ break;
+
+ delay = remain;
+ }
+
+#else
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+#endif
+ }
+}
/*
* GUC check function to ensure GUC value specified is within the allowable
@@ -2384,7 +2431,7 @@ vacuum_delay_point(void)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
+ vacuum_sleep(msec);
pgstat_report_wait_end();
/*
--
2.43.0
+ /*
+ * We allow nanosleep to handle interrupts and retry with the remaining time.
+ * However, since nanosleep is susceptible to time drift when interrupted
+ * frequently, we add a safeguard to break out of the nanosleep whenever the
+ * total time of the sleep exceeds the requested sleep time. Using nanosleep
+ * is a more portable approach than clock_nanosleep.
+ */
I'm curious why we wouldn't just subtract "elapsed_time" from "delay" at
the bottom of the while loop to avoid needing this extra check. Also, I
think we need some commentary about why we want to retry after an interrupt
in this case.
--
nathan
I'm curious why we wouldn't just subtract "elapsed_time" from "delay" at
the bottom of the while loop to avoid needing this extra check.
Can you elaborate further? I am not sure how this will work since delay is a timespec
and elapsed time is an instr_time.
Also, in every iteration of the loop, the delay must be set to the remaining time. The
purpose of the elapsed_time is to make sure that we don’t surpass requested time
delay as an additional safeguard.
Also, I
think we need some commentary about why we want to retry after an interrupt
in this case.
I will elaborate further in the comments for the next revision.
Regards,
Sami
On Thu, Jul 11, 2024 at 01:10:25PM -0500, Sami Imseih wrote:
I'm curious why we wouldn't just subtract "elapsed_time" from "delay" at
the bottom of the while loop to avoid needing this extra check.Can you elaborate further? I am not sure how this will work since delay is a timespec
and elapsed time is an instr_time.Also, in every iteration of the loop, the delay must be set to the remaining time. The
purpose of the elapsed_time is to make sure that we don�t surpass requested time
delay as an additional safeguard.
I'm imagining something like this:
struct timespec delay;
TimestampTz end_time;
end_time = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), msec);
do
{
long secs;
int microsecs;
TimestampDifference(GetCurrentTimestamp(), end_time,
&secs, µsecs);
delay.tv_sec = secs;
delay.tv_nsec = microsecs * 1000;
} while (nanosleep(&delay, NULL) == -1 && errno == EINTR);
--
nathan
Hi,
On Thu, Jul 11, 2024 at 10:15:41AM -0500, Sami Imseih wrote:
I did a few tests with the patch and did not see any "large" drifts like the
ones observed above.Thanks for testing.
I think it could be "simplified" by making use of instr_time instead of timespec
for current and absolute. Then I think it would be enough to compare their
ticks.Correct I attached a v2 of this patch that uses instr_time to check the elapsed
time and break out of the loop. It needs some more benchmarking.
Thanks!
Outside of Nathan's comment:
1 ===
+ * However, since nanosleep is susceptible to time drift when interrupted
+ * frequently, we add a safeguard to break out of the nanosleep whenever the
I'm not sure that "nanosleep is susceptible to time drift when interrupted frequently"
is a correct wording.
What about?
"
However, since the time between frequent interruptions and restarts of the
nanosleep calls can substantially lead to drift in the time when the sleep
finally completes, we add...."
2 ===
+static void vacuum_sleep(double msec);
What about a more generic name that could be used outside of vacuum?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2024-Jul-11, Nathan Bossart wrote:
I'm imagining something like this:
struct timespec delay;
TimestampTz end_time;end_time = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), msec);
do
{
long secs;
int microsecs;TimestampDifference(GetCurrentTimestamp(), end_time,
&secs, µsecs);delay.tv_sec = secs;
delay.tv_nsec = microsecs * 1000;} while (nanosleep(&delay, NULL) == -1 && errno == EINTR);
This looks nicer. We could deal with clock drift easily (in case the
sysadmin winds the clock back) by testing that tv_sec+tv_nsec is not
higher than the initial time to sleep. I don't know how common this
situation is nowadays, but I remember debugging a system years ago where
autovacuum was sleeping for a very long time because of that. I can't
remember now if we did anything in the code to cope, or just told
sysadmins not to do that anymore :-)
FWIW my (Linux's) nanosleep() manpage contains this note:
If the interval specified in req is not an exact multiple of the granu‐
larity underlying clock (see time(7)), then the interval will be rounded
up to the next multiple. Furthermore, after the sleep completes, there
may still be a delay before the CPU becomes free to once again execute
the calling thread.
It's not clear to me what happens if the time to sleep is zero, so maybe
there should be a "if tv_sec == 0 && tv_nsec == 0 then break" statement
at the bottom of the loop, to quit without sleeping one more time than
needed.
For Windows, this [1]https://randomascii.wordpress.com/2020/10/04/windows-timer-resolution-the-great-rule-change/ looks like an interesting and possibly relevant
read (though maybe SleepEx already does what we want to do here.)
[1]: https://randomascii.wordpress.com/2020/10/04/windows-timer-resolution-the-great-rule-change/
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today" (Mary Gardiner)
I'm imagining something like this:
struct timespec delay;
TimestampTz end_time;end_time = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), msec);
do
{
long secs;
int microsecs;TimestampDifference(GetCurrentTimestamp(), end_time,
&secs, µsecs);delay.tv_sec = secs;
delay.tv_nsec = microsecs * 1000;} while (nanosleep(&delay, NULL) == -1 && errno == EINTR);
I do agree that this is cleaner code, but I am not sure I like this.
1/ TimestampDifference has a dependency on gettimeofday,
while my proposal utilizes clock_gettime. There are old discussions
that did not reach a conclusion comparing both mechanisms.
My main conclusion from these hacker discussions [1]/messages/by-id/31856.1400021891@sss.pgh.pa.us, [2]/messages/by-id/E1cO7fR-0003y0-9E@gemulon.postgresql.org and other
online discussions on the topic is clock_gettime should replace
getimeofday when possible. Precision is the main reason.
2/ It no longer uses the remain time. I think the remain time
is still required here. I did a unrealistic stress test which shows
the original proposal can handle frequent interruptions much better.
#1 in one session kicked off a vacuum
set vacuum_cost_delay = 10;
set vacuum_cost_limit = 1;
set client_min_messages = log;
update large_tbl set version = 1;
vacuum (verbose, parallel 4) large_tbl;
#2 in another session, ran a loop to continually
interrupt the vacuum leader. This was during the
“heap scan” phase of the vacuum.
PID=< pid of vacuum leader >
while :
do
kill -USR1 $PID
done
Using the proposed loop with the remainder, I noticed that
the actual time reported remains close to the requested
delay time.
LOG: 10.000000,10.013420
LOG: 10.000000,10.011188
LOG: 10.000000,10.010860
LOG: 10.000000,10.014839
LOG: 10.000000,10.004542
LOG: 10.000000,10.006035
LOG: 10.000000,10.012230
LOG: 10.000000,10.014535
LOG: 10.000000,10.009645
LOG: 10.000000,10.000817
LOG: 10.000000,10.002162
LOG: 10.000000,10.011721
LOG: 10.000000,10.011655
Using the approach mentioned by Nathan, there
are large differences between requested and actual time.
LOG: 10.000000,17.801778
LOG: 10.000000,12.795450
LOG: 10.000000,11.793723
LOG: 10.000000,11.796317
LOG: 10.000000,13.785993
LOG: 10.000000,11.803775
LOG: 10.000000,15.782767
LOG: 10.000000,31.783901
LOG: 10.000000,19.792440
LOG: 10.000000,21.795795
LOG: 10.000000,18.800412
LOG: 10.000000,16.782886
LOG: 10.000000,10.795197
LOG: 10.000000,14.793333
LOG: 10.000000,29.806556
LOG: 10.000000,18.810784
LOG: 10.000000,11.804956
LOG: 10.000000,24.809812
LOG: 10.000000,25.815600
LOG: 10.000000,22.809493
LOG: 10.000000,22.790908
LOG: 10.000000,19.699097
LOG: 10.000000,23.795613
LOG: 10.000000,24.797078
Let me know what you think?
[1]: /messages/by-id/31856.1400021891@sss.pgh.pa.us
[2]: /messages/by-id/E1cO7fR-0003y0-9E@gemulon.postgresql.org
Regards,
Sami
On Fri, Jul 12, 2024 at 12:14:56PM -0500, Sami Imseih wrote:
1/ TimestampDifference has a dependency on gettimeofday,
while my proposal utilizes clock_gettime. There are old discussions
that did not reach a conclusion comparing both mechanisms.
My main conclusion from these hacker discussions [1], [2] and other
online discussions on the topic is clock_gettime should replace
getimeofday when possible. Precision is the main reason.2/ It no longer uses the remain time. I think the remain time
is still required here. I did a unrealistic stress test which shows
the original proposal can handle frequent interruptions much better.
My comment was mostly about coding style and not about gettimeofday()
versus clock_gettime(). What does your testing show when you don't have
the extra check, i.e.,
struct timespec delay;
struct timespec remain;
delay.tv_sec = microsec / 1000000L;
delay.tv_nsec = (microsec % 1000000L) * 1000;
while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
delay = remain;
--
nathan
What does your testing show when you don't have
the extra check, i.e.,struct timespec delay;
struct timespec remain;delay.tv_sec = microsec / 1000000L;
delay.tv_nsec = (microsec % 1000000L) * 1000;while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
delay = remain;
This is similar to the first attempt [1]/messages/by-id/7D50DC5B-80C6-47B5-8DA8-A6C68A115EE5@gmail.com,
+pg_usleep_handle_interrupt(long microsec, bool force)
{
if (microsec > 0)
{
#ifndef WIN32
struct timespec delay;
+ struct timespec remaining;
delay.tv_sec = microsec / 1000000L;
delay.tv_nsec = (microsec % 1000000L) * 1000;
- (void) nanosleep(&delay, NULL);
+
+ if (force)
+ while (nanosleep(&delay, &remaining) == -1 && errno == EINTR)
+ delay = remaining;
but Bertrand found long drifts [2[ which I could not reproduce.
To safeguard the long drifts, continue to use the &remain time with an
additional safeguard to make sure the actual sleep does not exceed the
requested sleep time.
[1]: /messages/by-id/7D50DC5B-80C6-47B5-8DA8-A6C68A115EE5@gmail.com
[2]: /messages/by-id/ZoPC5IeP4k7sZpki@ip-10-97-1-34.eu-west-3.compute.internal
Regards,
Sami
On Fri, Jul 12, 2024 at 03:39:57PM -0500, Sami Imseih wrote:
but Bertrand found long drifts [2[ which I could not reproduce.
To safeguard the long drifts, continue to use the &remain time with an
additional safeguard to make sure the actual sleep does not exceed the
requested sleep time.
Bertrand, does this safeguard fix the long drifts you saw?
--
nathan
Hi,
On Mon, Jul 15, 2024 at 12:20:29PM -0500, Nathan Bossart wrote:
On Fri, Jul 12, 2024 at 03:39:57PM -0500, Sami Imseih wrote:
but Bertrand found long drifts [2[ which I could not reproduce.
To safeguard the long drifts, continue to use the &remain time with an
additional safeguard to make sure the actual sleep does not exceed the
requested sleep time.Bertrand, does this safeguard fix the long drifts you saw?
Yeah, it was the case with the first version using the safeguard (see [1]/messages/by-id/Zo0UdeE3i9d0Wt5E@ip-10-97-1-34.eu-west-3.compute.internal) and
it's also the case with the last one shared in [2]/messages/by-id/C18017A1-EDFD-4F2F-BCA1-0572D4CCC92B@gmail.com.
[1]: /messages/by-id/Zo0UdeE3i9d0Wt5E@ip-10-97-1-34.eu-west-3.compute.internal
[2]: /messages/by-id/C18017A1-EDFD-4F2F-BCA1-0572D4CCC92B@gmail.com
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
I am attaching v3 of the patch which addresses the comments made
earlier by Bertrand about the comment in the patch [1]/messages/by-id/ZpDhS4nFX66ItAze@ip-10-97-1-34.eu-west-3.compute.internal. Also I will stick with
vacuum_sleep as the name as the function will be inside vacuum.c. I am not
sure we should make this function available outside of vacuum, but I would like
to hear other thoughts.
Also, earlier in the thread, Alvaro mentions what happens
if the sleep time is 0 [2]/messages/by-id/202407120939.pr6wpjffmxov@alvherre.pgsql. In that case, we do not do anything as we check
if sleep time is > 0 microseconds before proceeding with the sleep
[1]: /messages/by-id/ZpDhS4nFX66ItAze@ip-10-97-1-34.eu-west-3.compute.internal
[2]: /messages/by-id/202407120939.pr6wpjffmxov@alvherre.pgsql
Regards,
Sami

Attachments:
v3-0001-vaccum_delay-with-absolute-time-nanosleep.patchapplication/octet-stream; name=v3-0001-vaccum_delay-with-absolute-time-nanosleep.patch; x-unix-mode=0644Download
From f77cf0f831df74e5e2d11910cea395f5f4541d25 Mon Sep 17 00:00:00 2001
From: Sami Imseih <samimseih@gmail.com>
Date: Thu, 25 Jul 2024 20:28:35 +0000
Subject: [PATCH 1/1] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 49 ++++++++++++++++++++++++++++++++++-
1 file changed, 48 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..a753d6984f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -116,6 +116,53 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
+static void vacuum_sleep(double msec);
+
+static
+void
+vacuum_sleep(double msec)
+{
+ long microsec = msec * 1000;
+
+ if (microsec > 0)
+ {
+#ifndef WIN32
+ /*
+ * We allow nanosleep to handle interrupts and retry with the
+ * remaining time. However, frequent interruptions and restarts of the
+ * nanosleep calls can substantially lead to drift in the time when
+ * the sleep finally completes. To deal with this, we break out of the
+ * loop whenever the elapsed time exceeds the requested time.
+ */
+ struct timespec delay;
+ struct timespec remain;
+ struct instr_time start_time;
+
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ struct instr_time current_time;
+ struct instr_time elapsed_time;
+
+ INSTR_TIME_SET_CURRENT(current_time);
+
+ INSTR_TIME_SET_ZERO(elapsed_time);
+ INSTR_TIME_ACCUM_DIFF(elapsed_time, current_time, start_time);
+
+ if (INSTR_TIME_GET_MICROSEC(elapsed_time) >= microsec)
+ break;
+
+ delay = remain;
+ }
+#else
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+#endif
+ }
+}
/*
* GUC check function to ensure GUC value specified is within the allowable
@@ -2384,7 +2431,7 @@ vacuum_delay_point(void)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
+ vacuum_sleep(msec);
pgstat_report_wait_end();
/*
--
2.43.0
attaching the patch again. Something is strange with my email client.
Regards,
Sami
Attachments:
v3-0001-vaccum_delay-with-absolute-time-nanosleep.patchapplication/octet-stream; name=v3-0001-vaccum_delay-with-absolute-time-nanosleep.patch; x-unix-mode=0644Download
From f77cf0f831df74e5e2d11910cea395f5f4541d25 Mon Sep 17 00:00:00 2001
From: Sami Imseih <samimseih@gmail.com>
Date: Thu, 25 Jul 2024 20:28:35 +0000
Subject: [PATCH 1/1] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 49 ++++++++++++++++++++++++++++++++++-
1 file changed, 48 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..a753d6984f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -116,6 +116,53 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
+static void vacuum_sleep(double msec);
+
+static
+void
+vacuum_sleep(double msec)
+{
+ long microsec = msec * 1000;
+
+ if (microsec > 0)
+ {
+#ifndef WIN32
+ /*
+ * We allow nanosleep to handle interrupts and retry with the
+ * remaining time. However, frequent interruptions and restarts of the
+ * nanosleep calls can substantially lead to drift in the time when
+ * the sleep finally completes. To deal with this, we break out of the
+ * loop whenever the elapsed time exceeds the requested time.
+ */
+ struct timespec delay;
+ struct timespec remain;
+ struct instr_time start_time;
+
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ struct instr_time current_time;
+ struct instr_time elapsed_time;
+
+ INSTR_TIME_SET_CURRENT(current_time);
+
+ INSTR_TIME_SET_ZERO(elapsed_time);
+ INSTR_TIME_ACCUM_DIFF(elapsed_time, current_time, start_time);
+
+ if (INSTR_TIME_GET_MICROSEC(elapsed_time) >= microsec)
+ break;
+
+ delay = remain;
+ }
+#else
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+#endif
+ }
+}
/*
* GUC check function to ensure GUC value specified is within the allowable
@@ -2384,7 +2431,7 @@ vacuum_delay_point(void)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
+ vacuum_sleep(msec);
pgstat_report_wait_end();
/*
--
2.43.0
Hi,
On Thu, Jul 25, 2024 at 05:27:15PM -0500, Sami Imseih wrote:
I am attaching v3 of the patch which addresses the comments made
earlier by Bertrand about the comment in the patch [1].
Thanks!
Looking at it:
1 ===
+ struct instr_time start_time;
I think we can get rid of the "struct" keyword here.
2 ===
+ struct instr_time current_time;
+ struct instr_time elapsed_time;
Same as above.
3 ===
I gave more thoughts and I think it can be simplified a bit to reduce the
number of operations in the while loop.
What about relying on a "absolute" time that way:
instr_time absolute;
absolute.ticks = start_time.ticks + msec * 1000000;
and then in the while loop:
while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
{
instr_time current_time;
INSTR_TIME_SET_CURRENT(current_time);
if (current_time.ticks > absolute.ticks)
{
break;
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Jul 26, 2024, at 3:27 AM, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
Hi,
On Thu, Jul 25, 2024 at 05:27:15PM -0500, Sami Imseih wrote:
I am attaching v3 of the patch which addresses the comments made
earlier by Bertrand about the comment in the patch [1].Thanks!
Looking at it:
1 ===
+ struct instr_time start_time;
I think we can get rid of the "struct" keyword here.
2 ===
+ struct instr_time current_time; + struct instr_time elapsed_time;Same as above.
Will fix those 2.
3 ===
I gave more thoughts and I think it can be simplified a bit to reduce the
number of operations in the while loop.What about relying on a "absolute" time that way:
instr_time absolute;
absolute.ticks = start_time.ticks + msec * 1000000;and then in the while loop:
while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
{
instr_time current_time;
INSTR_TIME_SET_CURRENT(current_time);if (current_time.ticks > absolute.ticks)
{
break;
While I agree this code is cleaner, myy hesitation there is we don’t
have any other place in which we access .ticks directly and the
common practice is to use the intsr_time.h APIs.
What do you think?
Regards,
Sami
Hi,
On Mon, Jul 29, 2024 at 06:15:49PM -0500, Sami Imseih wrote:
On Jul 26, 2024, at 3:27 AM, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
3 ===I gave more thoughts and I think it can be simplified a bit to reduce the
number of operations in the while loop.What about relying on a "absolute" time that way:
instr_time absolute;
absolute.ticks = start_time.ticks + msec * 1000000;and then in the while loop:
while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
{
instr_time current_time;
INSTR_TIME_SET_CURRENT(current_time);if (current_time.ticks > absolute.ticks)
{
break;While I agree this code is cleaner, myy hesitation there is we don’t
have any other place in which we access .ticks directly and the
common practice is to use the intsr_time.h APIs.
yeah, we already have a few macros that access the .ticks, so maybe we could add
2 new ones, say:
1. INSTR_TIME_ADD_MS(t1, msec)
2. INSTR_TIME_IS_GREATER(t1, t2)
I think the less operations is done in the while loop the better.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
yeah, we already have a few macros that access the .ticks, so maybe we could add
2 new ones, say:1. INSTR_TIME_ADD_MS(t1, msec)
2. INSTR_TIME_IS_GREATER(t1, t2)I think the less operations is done in the while loop the better.
See v4. it includes 2 new instr_time.h macros to simplify the
code insidethe while loop.
Regards,
Sami
Attachments:
v4-0001-vaccum_delay-with-absolute-time-nanosleep.patchapplication/octet-stream; name=v4-0001-vaccum_delay-with-absolute-time-nanosleep.patch; x-unix-mode=0644Download
From 033273772ecd180f6686ae79ef861f784684822e Mon Sep 17 00:00:00 2001
From: Sami Imseih <simseih@amazon.com>
Date: Mon, 5 Aug 2024 14:38:24 -0500
Subject: [PATCH v4 1/1] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 47 +++++++++++++++++++++++++++-
src/include/portability/instr_time.h | 6 ++++
2 files changed, 52 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..0ed188a083 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -116,6 +116,51 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
+static void vacuum_sleep(double msec);
+
+static
+void
+vacuum_sleep(double msec)
+{
+ long microsec = msec * 1000;
+
+ if (microsec > 0)
+ {
+#ifndef WIN32
+ /*
+ * We allow nanosleep to handle interrupts and retry with the
+ * remaining time. However, frequent interruptions and restarts of the
+ * nanosleep calls can substantially lead to drift in the time when
+ * the sleep finally completes. To deal with this, we break out of the
+ * loop whenever the current time is past the expected end time of the
+ * sleep.
+ */
+ struct timespec delay;
+ struct timespec remain;
+ instr_time end_time;
+
+ INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_ADD_MICROSEC(end_time, microsec);
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ instr_time current_time;
+
+ INSTR_TIME_SET_CURRENT(current_time);
+
+ if (INSTR_TIME_IS_GREATER(current_time, end_time))
+ break;
+
+ delay = remain;
+ }
+#else
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+#endif
+ }
+}
/*
* GUC check function to ensure GUC value specified is within the allowable
@@ -2384,7 +2429,7 @@ vacuum_delay_point(void)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
+ vacuum_sleep(msec);
pgstat_report_wait_end();
/*
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index a6fc1922f2..36f136b08f 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -194,4 +194,10 @@ GetTimerFrequency(void)
#define INSTR_TIME_GET_MICROSEC(t) \
(INSTR_TIME_GET_NANOSEC(t) / NS_PER_US)
+#define INSTR_TIME_ADD_MICROSEC(x,t) \
+ ((x).ticks += t * NS_PER_US)
+
+#define INSTR_TIME_IS_GREATER(x,y) \
+ ((bool) (x).ticks > (y).ticks)
+
#endif /* INSTR_TIME_H */
--
2.39.3 (Apple Git-146)
Hi,
On Mon, Aug 05, 2024 at 03:07:34PM -0500, Sami Imseih wrote:
yeah, we already have a few macros that access the .ticks, so maybe we could add
2 new ones, say:1. INSTR_TIME_ADD_MS(t1, msec)
2. INSTR_TIME_IS_GREATER(t1, t2)I think the less operations is done in the while loop the better.
See v4. it includes 2 new instr_time.h macros to simplify the
code insidethe while loop.
Thanks!
1 ===
+#define INSTR_TIME_IS_GREATER(x,y) \
+ ((bool) (x).ticks > (y).ticks)
Around parentheses are missing, that should be ((bool) ((x).ticks > (y).ticks)).
I did not pay attention to it initially but found it was the culprit of breaks
not occuring (while my test case produces some).
That said, I don't think the cast is necessary here and that we could get rid of
it.
2 ===
What about providing a quick comment about the 2 new macros in header of
instr_time.h? (like it is done for the others macros)
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
v5 takes care of the latest comments by Bertrand.
Regards,
Sami
Attachments:
v5-0001-vaccum_delay-with-absolute-time-nanosleep.patchapplication/octet-stream; name=v5-0001-vaccum_delay-with-absolute-time-nanosleep.patch; x-unix-mode=0644Download
From db6922fd79d86057f241bfa949bea6e209fa9a8e Mon Sep 17 00:00:00 2001
From: Sami Imseih <simseih@amazon.com>
Date: Tue, 6 Aug 2024 12:02:47 -0500
Subject: [PATCH v5 1/1] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 47 +++++++++++++++++++++++++++-
src/include/portability/instr_time.h | 10 ++++++
2 files changed, 56 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..0ed188a083 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -116,6 +116,51 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
+static void vacuum_sleep(double msec);
+
+static
+void
+vacuum_sleep(double msec)
+{
+ long microsec = msec * 1000;
+
+ if (microsec > 0)
+ {
+#ifndef WIN32
+ /*
+ * We allow nanosleep to handle interrupts and retry with the
+ * remaining time. However, frequent interruptions and restarts of the
+ * nanosleep calls can substantially lead to drift in the time when
+ * the sleep finally completes. To deal with this, we break out of the
+ * loop whenever the current time is past the expected end time of the
+ * sleep.
+ */
+ struct timespec delay;
+ struct timespec remain;
+ instr_time end_time;
+
+ INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_ADD_MICROSEC(end_time, microsec);
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ instr_time current_time;
+
+ INSTR_TIME_SET_CURRENT(current_time);
+
+ if (INSTR_TIME_IS_GREATER(current_time, end_time))
+ break;
+
+ delay = remain;
+ }
+#else
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+#endif
+ }
+}
/*
* GUC check function to ensure GUC value specified is within the allowable
@@ -2384,7 +2429,7 @@ vacuum_delay_point(void)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
+ vacuum_sleep(msec);
pgstat_report_wait_end();
/*
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 8f9ba2f151..f968be2d8a 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -36,6 +36,10 @@
*
* INSTR_TIME_GET_NANOSEC(t) get t in nanoseconds
*
+ * INSTR_TIME_ADD_MICROSEC(x,t) add t in microseconds to a instr_time
+ *
+ * INSTR_TIME_IS_GREATER(x,y) is x greater than y?
+ *
* Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert
* absolute times to intervals. The INSTR_TIME_GET_xxx operations are
* only useful on intervals.
@@ -194,4 +198,10 @@ GetTimerFrequency(void)
#define INSTR_TIME_GET_MICROSEC(t) \
(INSTR_TIME_GET_NANOSEC(t) / NS_PER_US)
+#define INSTR_TIME_ADD_MICROSEC(x,t) \
+ ((x).ticks += t * NS_PER_US)
+
+#define INSTR_TIME_IS_GREATER(x,y) \
+ ((x).ticks > (y).ticks)
+
#endif /* INSTR_TIME_H */
--
2.39.3 (Apple Git-146)
Hi,
On Tue, Aug 06, 2024 at 12:36:44PM -0500, Sami Imseih wrote:
v5 takes care of the latest comments by Bertrand.
Thanks!
Please find attached a rebase version (due to 39a138fbef) and in passing I changed
one comment:
"add t in microseconds to a instr_time" -> "add t (in microseconds) to x"
Does that make sense to you?
That said, it looks good to me.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v6-0001-vaccum_delay-with-absolute-time-nanosleep.patchtext/x-diff; charset=us-asciiDownload
From 619d4fea1eada6fb65fec673bc9600f8502f9b00 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Wed, 7 Aug 2024 05:04:10 +0000
Subject: [PATCH v6] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 47 +++++++++++++++++++++++++++-
src/include/portability/instr_time.h | 10 ++++++
2 files changed, 56 insertions(+), 1 deletion(-)
81.6% src/backend/commands/
18.3% src/include/portability/
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..0ed188a083 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -116,6 +116,51 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
+static void vacuum_sleep(double msec);
+
+static
+void
+vacuum_sleep(double msec)
+{
+ long microsec = msec * 1000;
+
+ if (microsec > 0)
+ {
+#ifndef WIN32
+ /*
+ * We allow nanosleep to handle interrupts and retry with the
+ * remaining time. However, frequent interruptions and restarts of the
+ * nanosleep calls can substantially lead to drift in the time when
+ * the sleep finally completes. To deal with this, we break out of the
+ * loop whenever the current time is past the expected end time of the
+ * sleep.
+ */
+ struct timespec delay;
+ struct timespec remain;
+ instr_time end_time;
+
+ INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_ADD_MICROSEC(end_time, microsec);
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ instr_time current_time;
+
+ INSTR_TIME_SET_CURRENT(current_time);
+
+ if (INSTR_TIME_IS_GREATER(current_time, end_time))
+ break;
+
+ delay = remain;
+ }
+#else
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+#endif
+ }
+}
/*
* GUC check function to ensure GUC value specified is within the allowable
@@ -2384,7 +2429,7 @@ vacuum_delay_point(void)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
+ vacuum_sleep(msec);
pgstat_report_wait_end();
/*
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index e66ecf34cd..754ea77c43 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -36,6 +36,10 @@
*
* INSTR_TIME_GET_NANOSEC(t) convert t to int64 (in nanoseconds)
*
+ * INSTR_TIME_ADD_MICROSEC(x,t) add t (in microseconds) to x
+ *
+ * INSTR_TIME_IS_GREATER(x,y) is x greater than y?
+ *
* Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert
* absolute times to intervals. The INSTR_TIME_GET_xxx operations are
* only useful on intervals.
@@ -194,4 +198,10 @@ GetTimerFrequency(void)
#define INSTR_TIME_GET_MICROSEC(t) \
(INSTR_TIME_GET_NANOSEC(t) / NS_PER_US)
+#define INSTR_TIME_ADD_MICROSEC(x,t) \
+ ((x).ticks += t * NS_PER_US)
+
+#define INSTR_TIME_IS_GREATER(x,y) \
+ ((x).ticks > (y).ticks)
+
#endif /* INSTR_TIME_H */
--
2.34.1
On Aug 7, 2024, at 1:00 AM, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
add t (in microseconds) to x”
I was attempting to be more verbose in the comment,
but what you have above matches the format of
the other comments. I am ok with your revision.
Regards.
Sami
Hi,
On Wed, Aug 07, 2024 at 09:11:19AM -0500, Sami Imseih wrote:
On Aug 7, 2024, at 1:00 AM, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
add t (in microseconds) to x”
I was attempting to be more verbose in the comment,
but what you have above matches the format of
the other comments. I am ok with your revision.
Thanks!
I'm marking the CF entry [1]https://commitfest.postgresql.org/49/5161/, as RFC.
[1]: https://commitfest.postgresql.org/49/5161/
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Aug 07, 2024 at 06:00:53AM +0000, Bertrand Drouvot wrote:
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
I think this deserves a comment.
+#define INSTR_TIME_ADD_MICROSEC(x,t) \
+ ((x).ticks += t * NS_PER_US)
I'd add parentheses around "t" to ensure any expressions given as "t" are
evaluated first.
Also, do we need to worry about overflow here? It looks like the rest of
instr_time.h is oblivious about overflow, so maybe this is better discussed
in a separate thread...
--
nathan
On Wed, Aug 07, 2024 at 06:00:53AM +0000, Bertrand Drouvot wrote:
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
I think this deserves a comment.
Done
+#define INSTR_TIME_ADD_MICROSEC(x,t) \
+ ((x).ticks += t * NS_PER_US)I'd add parentheses around "t" to ensure any expressions given as "t" are
evaluated first.
Done
Also, do we need to worry about overflow here? It looks like the rest of
instr_time.h is oblivious about overflow, so maybe this is better discussed
in a separate thread...
I agree, this needs to be handled in a different thread.
Please see v7.
Regards,
Sami
Attachments:
v7-0001-vaccum_delay-with-absolute-time-nanosleep.patchapplication/octet-stream; name=v7-0001-vaccum_delay-with-absolute-time-nanosleep.patch; x-unix-mode=0644Download
From 83ef3f489e39f031d70f779582cdc276d0d415d1 Mon Sep 17 00:00:00 2001
From: Sami Imseih <simseih@amazon.com>
Date: Wed, 7 Aug 2024 10:09:00 -0500
Subject: [PATCH v7 1/1] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 57 +++++++++++++++++++++++++++-
src/include/portability/instr_time.h | 10 +++++
2 files changed, 66 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..4edffd3bc5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -116,6 +116,61 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
+static void vacuum_sleep(double msec);
+
+/*
+ * Delay vacuum for a specified time.
+ *
+ * Unlike pg_usleep, This function continues the delay in case of an
+ * interrupt.
+ */
+static void
+vacuum_sleep(double msec)
+{
+ long microsec = msec * 1000;
+
+ if (microsec > 0)
+ {
+#ifndef WIN32
+ /*
+ * We allow nanosleep to handle interrupts and retry with the
+ * remaining time. However, frequent interruptions and restarts of the
+ * nanosleep calls can substantially lead to drift in the time when
+ * the sleep finally completes. To deal with this, we break out of the
+ * loop whenever the current time is past the expected end time of the
+ * sleep.
+ */
+ struct timespec delay;
+ struct timespec remain;
+ instr_time end_time;
+
+ INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_ADD_MICROSEC(end_time, microsec);
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ instr_time current_time;
+
+ INSTR_TIME_SET_CURRENT(current_time);
+
+ if (INSTR_TIME_IS_GREATER(current_time, end_time))
+ break;
+
+ delay = remain;
+ }
+#else
+ /*
+ * Win32 uses SleepEx to perform the delay. It is non-interruptable,
+ * therefore no special handling is required to restart the sleep with
+ * remaining time.
+ */
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+#endif
+ }
+}
/*
* GUC check function to ensure GUC value specified is within the allowable
@@ -2384,7 +2439,7 @@ vacuum_delay_point(void)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
+ vacuum_sleep(msec);
pgstat_report_wait_end();
/*
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index e66ecf34cd..6e4b0f1b17 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -36,6 +36,10 @@
*
* INSTR_TIME_GET_NANOSEC(t) convert t to int64 (in nanoseconds)
*
+ * INSTR_TIME_ADD_MICROSEC(x,t) add t (in microseconds) to x
+ *
+ * INSTR_TIME_IS_GREATER(x,y) is x greater than y?
+ *
* Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert
* absolute times to intervals. The INSTR_TIME_GET_xxx operations are
* only useful on intervals.
@@ -194,4 +198,10 @@ GetTimerFrequency(void)
#define INSTR_TIME_GET_MICROSEC(t) \
(INSTR_TIME_GET_NANOSEC(t) / NS_PER_US)
+#define INSTR_TIME_ADD_MICROSEC(x,t) \
+ ((x).ticks += (t) * NS_PER_US)
+
+#define INSTR_TIME_IS_GREATER(x,y) \
+ ((x).ticks > (y).ticks)
+
#endif /* INSTR_TIME_H */
--
2.39.3 (Apple Git-146)
Hi,
On Wed, Aug 07, 2024 at 09:36:59AM -0500, Nathan Bossart wrote:
Also, do we need to worry about overflow here? It looks like the rest of
instr_time.h is oblivious about overflow, so maybe this is better discussed
in a separate thread...
Yeah, a separate thread would be better.
FWIW and just out of curiosity:
1. it seems to me that most of the time (always?) we are manipulating instr_time(s)
as interval(s) which (with int64) gives “space” for about 292 years interval time
measurement (if my maths are correct).
2. for "absolute" manipulation (if any) it would depend of the PG_INSTR_CLOCK.
A "man clock_gettime" mentions:
2.1 CLOCK_MONOTONIC: on Linux, time since the system was booted. Not sure what
the longest Linux uptime record is but can't be more than since the 90's.
2.2 CLOCK_REALTIME: Its time represents seconds and nanoseconds since the Epoch.
It means that we’re currently about 237 years away from the limit. So even,
if we were to say add 2 "recents" of them we are still about 183 years away from
the limit.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Aug 08, 2024 at 06:49:27AM +0000, Bertrand Drouvot wrote:
2.2 CLOCK_REALTIME: Its time represents seconds and nanoseconds since the Epoch.
It means that we�re currently about 237 years away from the limit. So even,
if we were to say add 2 "recents" of them we are still about 183 years away from
the limit.
I hope to be retired before then.
--
nathan
On Wed, Aug 07, 2024 at 04:22:22PM -0500, Sami Imseih wrote:
Please see v7.
Thanks. This one looks pretty good to me, and so I plan to commit it in
the near future unless anyone voices concerns about the approach.
--
nathan
On Thu, Aug 08, 2024 at 03:01:20PM -0500, Nathan Bossart wrote:
Thanks. This one looks pretty good to me, and so I plan to commit it in
the near future unless anyone voices concerns about the approach.
As I am preparing this for commit, I'm wondering whether it makes sense to
name the new function vacuum_sleep() and keep it private to vacuum.c.
Nothing about this function is terribly specific to vacuum, and it's not
inconceivable that it might be useful elsewhere. Perhaps we should move it
to pgsleep.c and rename it to something to the effect of
pg_usleep_non_interruptable().
--
nathan
Hi,
On Fri, Aug 09, 2024 at 02:03:55PM -0500, Nathan Bossart wrote:
On Thu, Aug 08, 2024 at 03:01:20PM -0500, Nathan Bossart wrote:
Thanks. This one looks pretty good to me, and so I plan to commit it in
the near future unless anyone voices concerns about the approach.As I am preparing this for commit, I'm wondering whether it makes sense to
name the new function vacuum_sleep() and keep it private to vacuum.c.
Nothing about this function is terribly specific to vacuum, and it's not
inconceivable that it might be useful elsewhere. Perhaps we should move it
to pgsleep.c and rename it to something to the effect of
pg_usleep_non_interruptable().
Yeah, I had the same thought in [1]/messages/by-id/ZpDhS4nFX66ItAze@ip-10-97-1-34.eu-west-3.compute.internal, so +1.
[1]: /messages/by-id/ZpDhS4nFX66ItAze@ip-10-97-1-34.eu-west-3.compute.internal
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Yeah, I had the same thought in [1], so +1.
[1]: /messages/by-id/ZpDhS4nFX66ItAze@ip-10-97-1-34.eu-west-3.compute.internal
The intention ( see start of the thread ) was to make this a general API,
but I was not sure if there are use cases outside of vacuum.c.
In v8, I moved the function to pgsleep.c/signals.c and called it pg_usleep_non_interruptible.
The function, unlike vacuum_sleep, will take in micros as an argument to match with pg_usleep.
Regards
Sami
Attachments:
v8-0001-vaccum_delay-with-absolute-time-nanosleep.patchapplication/octet-stream; name=v8-0001-vaccum_delay-with-absolute-time-nanosleep.patch; x-unix-mode=0644Download
From 794e7fa0ccb0b419f031293fa5cb10c7dafd6bc6 Mon Sep 17 00:00:00 2001
From: Sami Imseih <simseih@amazon.com>
Date: Fri, 9 Aug 2024 18:02:12 -0500
Subject: [PATCH v8 1/1] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 2 +-
src/backend/port/win32/signal.c | 9 ++++++
src/include/port.h | 1 +
src/include/portability/instr_time.h | 10 +++++++
src/port/pgsleep.c | 41 ++++++++++++++++++++++++++++
5 files changed, 62 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..43333c4698 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2384,7 +2384,7 @@ vacuum_delay_point(void)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
+ pg_usleep_non_interruptible(msec * 1000);
pgstat_report_wait_end();
/*
diff --git a/src/backend/port/win32/signal.c b/src/backend/port/win32/signal.c
index 285cb611b4..96eb281964 100644
--- a/src/backend/port/win32/signal.c
+++ b/src/backend/port/win32/signal.c
@@ -73,6 +73,15 @@ pg_usleep(long microsec)
}
}
+/*
+ * pg_usleep_non_interruptible --- delay the specified number of microseconds.
+ * Unlike pg_usleep, this relies on a non-interruptible sleep.
+ */
+void
+pg_usleep_non_interruptible(long microsec)
+{
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+}
/* Initialization */
void
diff --git a/src/include/port.h b/src/include/port.h
index c740005267..c8ff23e5ee 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -162,6 +162,7 @@ extern int pg_disable_aslr(void);
/* Portable delay handling */
extern void pg_usleep(long microsec);
+extern void pg_usleep_non_interruptible(long microsec);
/* Portable SQL-like case-independent comparisons and conversions */
extern int pg_strcasecmp(const char *s1, const char *s2);
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index e66ecf34cd..6e4b0f1b17 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -36,6 +36,10 @@
*
* INSTR_TIME_GET_NANOSEC(t) convert t to int64 (in nanoseconds)
*
+ * INSTR_TIME_ADD_MICROSEC(x,t) add t (in microseconds) to x
+ *
+ * INSTR_TIME_IS_GREATER(x,y) is x greater than y?
+ *
* Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert
* absolute times to intervals. The INSTR_TIME_GET_xxx operations are
* only useful on intervals.
@@ -194,4 +198,10 @@ GetTimerFrequency(void)
#define INSTR_TIME_GET_MICROSEC(t) \
(INSTR_TIME_GET_NANOSEC(t) / NS_PER_US)
+#define INSTR_TIME_ADD_MICROSEC(x,t) \
+ ((x).ticks += (t) * NS_PER_US)
+
+#define INSTR_TIME_IS_GREATER(x,y) \
+ ((x).ticks > (y).ticks)
+
#endif /* INSTR_TIME_H */
diff --git a/src/port/pgsleep.c b/src/port/pgsleep.c
index 1284458bfc..e8df4bfb3f 100644
--- a/src/port/pgsleep.c
+++ b/src/port/pgsleep.c
@@ -14,6 +14,8 @@
#include <time.h>
+#include "portability/instr_time.h"
+
/*
* In a Windows backend, we don't use this implementation, but rather
* the signal-aware version in src/backend/port/win32/signal.c.
@@ -54,4 +56,43 @@ pg_usleep(long microsec)
}
}
+/*
+ * pg_usleep_non_interruptible --- delay the specified number of microseconds.
+ * Unlike pg_usleep, this relies on a non-interruptible sleep.
+ */
+void
+pg_usleep_non_interruptible(long microsec)
+{
+ /*
+ * We allow nanosleep to handle interrupts and retry with the remaining
+ * time. However, frequent interruptions and restarts of the nanosleep
+ * calls can substantially lead to drift in the time when the sleep
+ * finally completes. To deal with this, we break out of the loop whenever
+ * the current time is past the expected end time of the sleep.
+ */
+
+ struct timespec delay;
+ struct timespec remain;
+ instr_time end_time;
+
+ INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_ADD_MICROSEC(end_time, microsec);
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ instr_time current_time;
+
+ INSTR_TIME_SET_CURRENT(current_time);
+
+ if (INSTR_TIME_IS_GREATER(current_time, end_time))
+ break;
+
+ delay = remain;
+ }
+}
+
+
#endif /* defined(FRONTEND) || !defined(WIN32) */
--
2.39.3 (Apple Git-146)
v9 has some has some minor corrections to the comments.
Regards,
Sami
Attachments:
v9-0001-vaccum_delay-with-absolute-time-nanosleep.patchapplication/octet-stream; name=v9-0001-vaccum_delay-with-absolute-time-nanosleep.patch; x-unix-mode=0644Download
From 0fecbea4fc4c49a9f8657005095e1cab081451be Mon Sep 17 00:00:00 2001
From: Sami Imseih <simseih@amazon.com>
Date: Fri, 9 Aug 2024 18:02:12 -0500
Subject: [PATCH v9 1/1] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 2 +-
src/backend/port/win32/signal.c | 10 +++++++
src/include/port.h | 1 +
src/include/portability/instr_time.h | 10 +++++++
src/port/pgsleep.c | 43 ++++++++++++++++++++++++++++
5 files changed, 65 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..43333c4698 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2384,7 +2384,7 @@ vacuum_delay_point(void)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
+ pg_usleep_non_interruptible(msec * 1000);
pgstat_report_wait_end();
/*
diff --git a/src/backend/port/win32/signal.c b/src/backend/port/win32/signal.c
index 285cb611b4..edcb181215 100644
--- a/src/backend/port/win32/signal.c
+++ b/src/backend/port/win32/signal.c
@@ -73,6 +73,16 @@ pg_usleep(long microsec)
}
}
+/*
+ * pg_usleep_non_interruptible --- delay the specified number of microseconds.
+ *
+ * Unlike pg_usleep, this relies on a non-interruptible sleep.
+ */
+void
+pg_usleep_non_interruptible(long microsec)
+{
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+}
/* Initialization */
void
diff --git a/src/include/port.h b/src/include/port.h
index c740005267..c8ff23e5ee 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -162,6 +162,7 @@ extern int pg_disable_aslr(void);
/* Portable delay handling */
extern void pg_usleep(long microsec);
+extern void pg_usleep_non_interruptible(long microsec);
/* Portable SQL-like case-independent comparisons and conversions */
extern int pg_strcasecmp(const char *s1, const char *s2);
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index e66ecf34cd..6e4b0f1b17 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -36,6 +36,10 @@
*
* INSTR_TIME_GET_NANOSEC(t) convert t to int64 (in nanoseconds)
*
+ * INSTR_TIME_ADD_MICROSEC(x,t) add t (in microseconds) to x
+ *
+ * INSTR_TIME_IS_GREATER(x,y) is x greater than y?
+ *
* Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert
* absolute times to intervals. The INSTR_TIME_GET_xxx operations are
* only useful on intervals.
@@ -194,4 +198,10 @@ GetTimerFrequency(void)
#define INSTR_TIME_GET_MICROSEC(t) \
(INSTR_TIME_GET_NANOSEC(t) / NS_PER_US)
+#define INSTR_TIME_ADD_MICROSEC(x,t) \
+ ((x).ticks += (t) * NS_PER_US)
+
+#define INSTR_TIME_IS_GREATER(x,y) \
+ ((x).ticks > (y).ticks)
+
#endif /* INSTR_TIME_H */
diff --git a/src/port/pgsleep.c b/src/port/pgsleep.c
index 1284458bfc..9525156abc 100644
--- a/src/port/pgsleep.c
+++ b/src/port/pgsleep.c
@@ -14,6 +14,8 @@
#include <time.h>
+#include "portability/instr_time.h"
+
/*
* In a Windows backend, we don't use this implementation, but rather
* the signal-aware version in src/backend/port/win32/signal.c.
@@ -54,4 +56,45 @@ pg_usleep(long microsec)
}
}
+/*
+ * pg_usleep_non_interruptible --- delay the specified number of microseconds.
+ *
+ * Unlike pg_usleep, This function continues the delay in case of an
+ * interrupt.
+ */
+void
+pg_usleep_non_interruptible(long microsec)
+{
+ /*
+ * We allow nanosleep to handle interrupts and retry with the remaining
+ * time. However, frequent interruptions and restarts of the nanosleep
+ * calls can substantially lead to drift in the time when the sleep
+ * finally completes. To deal with this, we break out of the loop whenever
+ * the current time is past the expected end time of the sleep.
+ */
+
+ struct timespec delay;
+ struct timespec remain;
+ instr_time end_time;
+
+ INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_ADD_MICROSEC(end_time, microsec);
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ instr_time current_time;
+
+ INSTR_TIME_SET_CURRENT(current_time);
+
+ if (INSTR_TIME_IS_GREATER(current_time, end_time))
+ break;
+
+ delay = remain;
+ }
+}
+
+
#endif /* defined(FRONTEND) || !defined(WIN32) */
--
2.39.3 (Apple Git-146)
Hi,
On Sat, Aug 10, 2024 at 08:27:56AM -0500, Sami Imseih wrote:
v9 has some has some minor corrections to the comments.
Thanks!
1 ===
+ * Unlike pg_usleep, This function continues
s/This/this/?
Apart from the above, LGTM.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
mazon Web Services: https://aws.amazon.com
Show quoted text
+ * Unlike pg_usleep, This function continues
s/This/this/?
Apart from the above, LGTM.
Attachments:
v10-0001-vaccum_delay-with-absolute-time-nanosleep.patchapplication/octet-stream; name=v10-0001-vaccum_delay-with-absolute-time-nanosleep.patch; x-unix-mode=0644Download
From fc1a3c0038b712c5ac2aa846ca78a2b80cfa47ac Mon Sep 17 00:00:00 2001
From: Sami Imseih <simseih@amazon.com>
Date: Mon, 12 Aug 2024 10:09:43 -0500
Subject: [PATCH v10 1/1] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 2 +-
src/backend/port/win32/signal.c | 10 +++++++
src/include/port.h | 1 +
src/include/portability/instr_time.h | 10 +++++++
src/port/pgsleep.c | 43 ++++++++++++++++++++++++++++
5 files changed, 65 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..43333c4698 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2384,7 +2384,7 @@ vacuum_delay_point(void)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
+ pg_usleep_non_interruptible(msec * 1000);
pgstat_report_wait_end();
/*
diff --git a/src/backend/port/win32/signal.c b/src/backend/port/win32/signal.c
index 285cb611b4..edcb181215 100644
--- a/src/backend/port/win32/signal.c
+++ b/src/backend/port/win32/signal.c
@@ -73,6 +73,16 @@ pg_usleep(long microsec)
}
}
+/*
+ * pg_usleep_non_interruptible --- delay the specified number of microseconds.
+ *
+ * Unlike pg_usleep, this relies on a non-interruptible sleep.
+ */
+void
+pg_usleep_non_interruptible(long microsec)
+{
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+}
/* Initialization */
void
diff --git a/src/include/port.h b/src/include/port.h
index c740005267..c8ff23e5ee 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -162,6 +162,7 @@ extern int pg_disable_aslr(void);
/* Portable delay handling */
extern void pg_usleep(long microsec);
+extern void pg_usleep_non_interruptible(long microsec);
/* Portable SQL-like case-independent comparisons and conversions */
extern int pg_strcasecmp(const char *s1, const char *s2);
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index e66ecf34cd..6e4b0f1b17 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -36,6 +36,10 @@
*
* INSTR_TIME_GET_NANOSEC(t) convert t to int64 (in nanoseconds)
*
+ * INSTR_TIME_ADD_MICROSEC(x,t) add t (in microseconds) to x
+ *
+ * INSTR_TIME_IS_GREATER(x,y) is x greater than y?
+ *
* Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert
* absolute times to intervals. The INSTR_TIME_GET_xxx operations are
* only useful on intervals.
@@ -194,4 +198,10 @@ GetTimerFrequency(void)
#define INSTR_TIME_GET_MICROSEC(t) \
(INSTR_TIME_GET_NANOSEC(t) / NS_PER_US)
+#define INSTR_TIME_ADD_MICROSEC(x,t) \
+ ((x).ticks += (t) * NS_PER_US)
+
+#define INSTR_TIME_IS_GREATER(x,y) \
+ ((x).ticks > (y).ticks)
+
#endif /* INSTR_TIME_H */
diff --git a/src/port/pgsleep.c b/src/port/pgsleep.c
index 1284458bfc..d558783ff9 100644
--- a/src/port/pgsleep.c
+++ b/src/port/pgsleep.c
@@ -14,6 +14,8 @@
#include <time.h>
+#include "portability/instr_time.h"
+
/*
* In a Windows backend, we don't use this implementation, but rather
* the signal-aware version in src/backend/port/win32/signal.c.
@@ -54,4 +56,45 @@ pg_usleep(long microsec)
}
}
+/*
+ * pg_usleep_non_interruptible --- delay the specified number of microseconds.
+ *
+ * Unlike pg_usleep, this function continues the delay in case of an
+ * interrupt.
+ */
+void
+pg_usleep_non_interruptible(long microsec)
+{
+ /*
+ * We allow nanosleep to handle interrupts and retry with the remaining
+ * time. However, frequent interruptions and restarts of the nanosleep
+ * calls can substantially lead to drift in the time when the sleep
+ * finally completes. To deal with this, we break out of the loop whenever
+ * the current time is past the expected end time of the sleep.
+ */
+
+ struct timespec delay;
+ struct timespec remain;
+ instr_time end_time;
+
+ INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_ADD_MICROSEC(end_time, microsec);
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ instr_time current_time;
+
+ INSTR_TIME_SET_CURRENT(current_time);
+
+ if (INSTR_TIME_IS_GREATER(current_time, end_time))
+ break;
+
+ delay = remain;
+ }
+}
+
+
#endif /* defined(FRONTEND) || !defined(WIN32) */
--
2.39.3 (Apple Git-146)
My email client attached the last response for
some reason :(
v10 attached in the previous message addresses
Bertrands last comment and replaces “This” with “this"
Regards,
Sami
Hi,
On Mon, Aug 12, 2024 at 10:19:56AM -0500, Sami Imseih wrote:
v10 attached in the previous message addresses
Bertrands last comment and replaces “This” with “this"
Thanks, v10 LGTM.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Aug 12, 2024 at 03:56:18PM +0000, Bertrand Drouvot wrote:
Thanks, v10 LGTM.
As cfbot points out [0]https://cirrus-ci.com/task/6393555868975104, this is missing a Windows/frontend implementation.
[0]: https://cirrus-ci.com/task/6393555868975104
--
nathan
As cfbot points out [0], this is missing a Windows/frontend implementation.
Looks like the pgsleep implementation is missing the
#ifndef WIN32
I followed what is done in pg_usleep.
v11 should now build on Windows, hopefully.
Strangely, v10 build on a Windows machine I have locally.
Regards,
Sami
Attachments:
v11-0001-vaccum_delay-with-absolute-time-nanosleep.patchtext/plain; charset=UTF-8; name=v11-0001-vaccum_delay-with-absolute-time-nanosleep.patchDownload
From 60a6058c1960246b9fd431398bd68a77a0247a9a Mon Sep 17 00:00:00 2001
From: Sami Imseih <simseih@amazon.com>
Date: Mon, 12 Aug 2024 15:47:55 -0500
Subject: [PATCH v11 1/1] vaccum_delay with absolute time nanosleep
---
src/backend/commands/vacuum.c | 2 +-
src/backend/port/win32/signal.c | 10 ++++++
src/include/port.h | 1 +
src/include/portability/instr_time.h | 10 ++++++
src/port/pgsleep.c | 49 ++++++++++++++++++++++++++++
5 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..43333c4698 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2384,7 +2384,7 @@ vacuum_delay_point(void)
msec = vacuum_cost_delay * 4;
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
- pg_usleep(msec * 1000);
+ pg_usleep_non_interruptible(msec * 1000);
pgstat_report_wait_end();
/*
diff --git a/src/backend/port/win32/signal.c b/src/backend/port/win32/signal.c
index 285cb611b4..edcb181215 100644
--- a/src/backend/port/win32/signal.c
+++ b/src/backend/port/win32/signal.c
@@ -73,6 +73,16 @@ pg_usleep(long microsec)
}
}
+/*
+ * pg_usleep_non_interruptible --- delay the specified number of microseconds.
+ *
+ * Unlike pg_usleep, this relies on a non-interruptible sleep.
+ */
+void
+pg_usleep_non_interruptible(long microsec)
+{
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+}
/* Initialization */
void
diff --git a/src/include/port.h b/src/include/port.h
index c740005267..c8ff23e5ee 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -162,6 +162,7 @@ extern int pg_disable_aslr(void);
/* Portable delay handling */
extern void pg_usleep(long microsec);
+extern void pg_usleep_non_interruptible(long microsec);
/* Portable SQL-like case-independent comparisons and conversions */
extern int pg_strcasecmp(const char *s1, const char *s2);
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index e66ecf34cd..6e4b0f1b17 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -36,6 +36,10 @@
*
* INSTR_TIME_GET_NANOSEC(t) convert t to int64 (in nanoseconds)
*
+ * INSTR_TIME_ADD_MICROSEC(x,t) add t (in microseconds) to x
+ *
+ * INSTR_TIME_IS_GREATER(x,y) is x greater than y?
+ *
* Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert
* absolute times to intervals. The INSTR_TIME_GET_xxx operations are
* only useful on intervals.
@@ -194,4 +198,10 @@ GetTimerFrequency(void)
#define INSTR_TIME_GET_MICROSEC(t) \
(INSTR_TIME_GET_NANOSEC(t) / NS_PER_US)
+#define INSTR_TIME_ADD_MICROSEC(x,t) \
+ ((x).ticks += (t) * NS_PER_US)
+
+#define INSTR_TIME_IS_GREATER(x,y) \
+ ((x).ticks > (y).ticks)
+
#endif /* INSTR_TIME_H */
diff --git a/src/port/pgsleep.c b/src/port/pgsleep.c
index 1284458bfc..63523a6e41 100644
--- a/src/port/pgsleep.c
+++ b/src/port/pgsleep.c
@@ -14,6 +14,8 @@
#include <time.h>
+#include "portability/instr_time.h"
+
/*
* In a Windows backend, we don't use this implementation, but rather
* the signal-aware version in src/backend/port/win32/signal.c.
@@ -54,4 +56,51 @@ pg_usleep(long microsec)
}
}
+/*
+ * pg_usleep_non_interruptible --- delay the specified number of microseconds.
+ *
+ * Unlike pg_usleep, this function continues the delay in case of an
+ * interrupt.
+ */
+void
+pg_usleep_non_interruptible(long microsec)
+{
+ /*
+ * We allow nanosleep to handle interrupts and retry with the remaining
+ * time. However, frequent interruptions and restarts of the nanosleep
+ * calls can substantially lead to drift in the time when the sleep
+ * finally completes. To deal with this, we break out of the loop whenever
+ * the current time is past the expected end time of the sleep.
+ */
+ if (microsec > 0)
+ {
+#ifndef WIN32
+ struct timespec delay;
+ struct timespec remain;
+ instr_time end_time;
+
+ INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_ADD_MICROSEC(end_time, microsec);
+
+ delay.tv_sec = microsec / 1000000L;
+ delay.tv_nsec = (microsec % 1000000L) * 1000;
+
+ while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
+ {
+ instr_time current_time;
+
+ INSTR_TIME_SET_CURRENT(current_time);
+
+ if (INSTR_TIME_IS_GREATER(current_time, end_time))
+ break;
+
+ delay = remain;
+ }
+#else
+ SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+#endif
+ }
+}
+
+
#endif /* defined(FRONTEND) || !defined(WIN32) */
--
2.39.3 (Apple Git-146)
I'm trying to understand what the point of this patch is, so I went to
read this thread from the beginning:
In the proposal by Bertrand [1] to implement vacuum cost delay tracking
in pg_stat_progress_vacuum, it was discovered that the vacuum cost delay
ends early on the leader process of a parallel vacuum due to parallel workers > reporting progress on index vacuuming, which was introduced in 17
with commit [2]. With this patch, everytime a parallel worker
completes a vacuum index, it will send a completion message to the leader.
Ok, so we might sometimes skip the sleep, if an interrupt is received. I
agree that's a bit sloppy, but probably won't make any difference in
practice.
The facility that allows a parallel worker to report progress to the leader was
introduced in commit [3].In the case of the patch being proposed by Bertrand, the number of interrupts > will be much more frequent as parallel workers would send a message
to the leader
to update the vacuum delay counters every vacuum_delay_point call.
Hmm, I wonder if that's a good design, if it results in a lot of interrupts.
On the patch itself: Making the sleeps in vacuum uninterruptible means
that vacuum will be more slow to respond to interrupts. If you SIGTERM a
vacuum process, or hit CTRL-C, you *would* want to exit the sleep ASAP.
Tom raised that concern earlier in this thread
(/messages/by-id/2100439.1719610468@sss.pgh.pa.us),
but it seems the discussion wandered off to the details of how to do the
sleep, and left that unaddressed.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Tue, Aug 13, 2024 at 12:04:28AM +0300, Heikki Linnakangas wrote:
In the case of the patch being proposed by Bertrand, the number of
interrupts > will be much more frequent as parallel workers would send a
messageto the leader
to update the vacuum delay counters every vacuum_delay_point call.
Hmm, I wonder if that's a good design, if it results in a lot of interrupts.
Skimming the last few messages of that thread [0]https://commitfest.postgresql.org/49/5027/, it looks like Bertrand
is exploring ways to avoid so many interrupts. I guess the unavoidable
question is whether this work is still worthwhile given that improvement.
On the patch itself: Making the sleeps in vacuum uninterruptible means that
vacuum will be more slow to respond to interrupts. If you SIGTERM a vacuum
process, or hit CTRL-C, you *would* want to exit the sleep ASAP.
Since the delay will typically be pretty small (2 milliseconds by default
for autovacuum), I'm assuming this won't ordinarily be noticeable. But I
do think it is an important consideration.
[0]: https://commitfest.postgresql.org/49/5027/
--
nathan
Skimming the last few messages of that thread [0], it looks like Bertrand
is exploring ways to avoid so many interrupts. I guess the unavoidable
question is whether this work is still worthwhile given that improvement.
The way the instrumentation in [0]https://commitfest.postgresql.org/49/5027/ dealt with interrupts was too complex,
which is why it seemed better to handle the restart the remainder of the
sleep in the sleep function
On the patch itself: Making the sleeps in vacuum uninterruptible means that
vacuum will be more slow to respond to interrupts. If you SIGTERM a vacuum
process, or hit CTRL-C, you *would* want to exit the sleep ASAP.Since the delay will typically be pretty small (2 milliseconds by default
for autovacuum), I'm assuming this won't ordinarily be noticeable. But I
do think it is an important consideration.
At most, the sleep will be 100ms for vacuum.
Tom raised that concern earlier in this thread
(/messages/by-id/2100439.1719610468@sss.pgh.pa.us),
but it seems the discussion wandered off to the details of how to do
the sleep, and left that unaddressed.
Doing something like pg_sleep, using WaitLatch [1]/messages/by-id/67072E39-3B4E-4240-8373-AC45E23721E7@gmail.com, was explored.
However this
does not support microsecond sleeps which was allowed in 720de00af49
Thomas shared WaitLatchUs [2]/messages/by-id/CA+hUKGKVbJE59JkwnUj5XMY+-rzcTFciV9vVC7i=LUfWPds8Xw@mail.gmail.com, which supports higher precision sleeps, but
it requires epoll_pwait(2) on the system, thus it's not very portable.
Using nanosleep with remain time and checking for drift was the most
portable
approach.
Regards,
Sami
[0]: https://commitfest.postgresql.org/49/5027/
[1]: /messages/by-id/67072E39-3B4E-4240-8373-AC45E23721E7@gmail.com
[2]: /messages/by-id/CA+hUKGKVbJE59JkwnUj5XMY+-rzcTFciV9vVC7i=LUfWPds8Xw@mail.gmail.com
Hi,
On Mon, Aug 12, 2024 at 05:04:02PM -0500, Nathan Bossart wrote:
On Tue, Aug 13, 2024 at 12:04:28AM +0300, Heikki Linnakangas wrote:
In the case of the patch being proposed by Bertrand, the number of
interrupts > will be much more frequent as parallel workers would send a
messageto the leader
to update the vacuum delay counters every vacuum_delay_point call.
Hmm, I wonder if that's a good design, if it results in a lot of interrupts.
Skimming the last few messages of that thread [0], it looks like Bertrand
is exploring ways to avoid so many interrupts.
Yeah, that was mainly to avoid the side effect of the interrupts making the
vacuum faster as compared to the master branch (as in [0]https://commitfest.postgresql.org/49/5027/, the leader is not
honouring the delays when the parallel workers report their delayed time): that
could be noticeable depending of the amount of work and the number of parallel
workers involved in the vacuum.
That could be solved thanks to this thread. Once this thread is over (and whatever
the outcome is), I'll resume my testing as far the cost delay report is concerned.
I guess the unavoidable
question is whether this work is still worthwhile given that improvement.
The delay not being honored already affects the vacuum since we allow a parallel
worker to report progress to the leader (see [1]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f1889729dd). The interrupts are far less
frequent (as compare to [1]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f1889729dd) though.
[0]: https://commitfest.postgresql.org/49/5027/
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f1889729dd
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Aug 12, 2024 at 05:35:08PM -0500, Imseih (AWS), Sami wrote:
Skimming the last few messages of that thread [0], it looks like Bertrand
is exploring ways to avoid so many interrupts. I guess the unavoidable
question is whether this work is still worthwhile given that improvement.The way the instrumentation in [0] dealt with interrupts was too complex,
which is why it seemed better to handle the restart the remainder of the
sleep in the sleep function
Can you elaborate on how it is too complex?
--
nathan
On 8/13/24 10:09 AM, Nathan Bossart wrote:
On Mon, Aug 12, 2024 at 05:35:08PM -0500, Imseih (AWS), Sami wrote:
Skimming the last few messages of that thread [0], it looks like Bertrand
is exploring ways to avoid so many interrupts. I guess the unavoidable
question is whether this work is still worthwhile given that improvement.The way the instrumentation in [0] dealt with interrupts was too complex,
which is why it seemed better to handle the restart the remainder of the
sleep in the sleep functionCan you elaborate on how it is too complex?
[0]: https://commitfest.postgresql.org/49/5027/
instrument cost_delay at an interval to reduce the number
of interrupts to the leader.
On the other hand, with allowing the sleep to deal with
interrupts,no additional logic to space out instrumentation
is required.
Regards,
Sami
On Tue, Aug 13, 2024 at 10:47:51AM -0500, Imseih (AWS), Sami wrote:
On 8/13/24 10:09 AM, Nathan Bossart wrote:
On Mon, Aug 12, 2024 at 05:35:08PM -0500, Imseih (AWS), Sami wrote:
Skimming the last few messages of that thread [0], it looks like Bertrand
is exploring ways to avoid so many interrupts. I guess the unavoidable
question is whether this work is still worthwhile given that improvement.The way the instrumentation in [0] dealt with interrupts was too complex,
which is why it seemed better to handle the restart the remainder of the
sleep in the sleep functionCan you elaborate on how it is too complex?
[0] made vacuum_delay_point more complex as it has to
instrument cost_delay at an interval to reduce the number
of interrupts to the leader.
Sure, but looking at the patch [0]/messages/by-id/ZnlPZZZJCRu/8fka@ip-10-97-1-34.eu-west-3.compute.internal, it adds maybe an extra 10 lines of code
to limit the reports to 1 Hz. That doesn't strike me as too complex...
[0]: /messages/by-id/ZnlPZZZJCRu/8fka@ip-10-97-1-34.eu-west-3.compute.internal
--
nathan
On 8/13/24 10:57 AM, Nathan Bossart wrote:
On Tue, Aug 13, 2024 at 10:47:51AM -0500, Imseih (AWS), Sami wrote:
On 8/13/24 10:09 AM, Nathan Bossart wrote:
On Mon, Aug 12, 2024 at 05:35:08PM -0500, Imseih (AWS), Sami wrote:
Skimming the last few messages of that thread [0], it looks like Bertrand
is exploring ways to avoid so many interrupts. I guess the unavoidable
question is whether this work is still worthwhile given that improvement.The way the instrumentation in [0] dealt with interrupts was too complex,
which is why it seemed better to handle the restart the remainder of the
sleep in the sleep functionCan you elaborate on how it is too complex?
[0] made vacuum_delay_point more complex as it has to
instrument cost_delay at an interval to reduce the number
of interrupts to the leader.Sure, but looking at the patch [0], it adds maybe an extra 10 lines of code
to limit the reports to 1 Hz. That doesn't strike me as too complex...[0] /messages/by-id/ZnlPZZZJCRu/8fka@ip-10-97-1-34.eu-west-3.compute.internal
Perhaps "complex" may not be the correct way to describe it.
Having to add special handling to space out instrumentation
directly in vacuum_delay_point seems very odd to me. I don't
think vacuum_delay_point should have to worry about this.
Also,
1/ what is an appropriate interval to collect these stats?
2/ What if there are other callers in the future that wish
to instrument parallel vacuum workers? they will need to implement
similar logic.
Regards,
Sami
Please disregards this point from the last reply:
"""
2/ What if there are other callers in the future that wish
to instrument parallel vacuum workers? they will need to implement
similar logic.
"""
I misspoke about this and this point does not matter since only
vacuum sleep matters for this current discussion.
Regards,
Sami
On Tue, Aug 13, 2024 at 11:07:46AM -0500, Imseih (AWS), Sami wrote:
Having to add special handling to space out instrumentation
directly in vacuum_delay_point seems very odd to me. I don't
think vacuum_delay_point should have to worry about this.Also,
1/ what is an appropriate interval to collect these stats?
2/ What if there are other callers in the future that wish
to instrument parallel vacuum workers? they will need to implement
similar logic.
None of this seems intractable to me. 1 Hz seems like an entirely
reasonable place to start, and it is very easy to change (or to even make
configurable). pg_stat_progress_vacuum might show slightly old values in
this column, but that should be easy enough to explain in the docs if we
are really concerned about it. If other callers want to do something
similar, maybe we should add a more generic implementation in
backend_progress.c.
--
nathan
None of this seems intractable to me. 1 Hz seems like an entirely
reasonable place to start, and it is very easy to change (or to even make
configurable). pg_stat_progress_vacuum might show slightly old values in
this column, but that should be easy enough to explain in the docs if we
are really concerned about it. If other callers want to do something
similar, maybe we should add a more generic implementation in
backend_progress.c.
I don't know if I agree. Making vacuum sleep more robust to handle
interrupts seems like a cleaner general solution than to add
even more code to handle this case or have to explain the behavior of
cost delay instrumentation in docs.
Regards,
Sami
On Tue, Aug 13, 2024 at 01:12:30PM -0500, Imseih (AWS), Sami wrote:
None of this seems intractable to me. 1 Hz seems like an entirely
reasonable place to start, and it is very easy to change (or to even make
configurable). pg_stat_progress_vacuum might show slightly old values in
this column, but that should be easy enough to explain in the docs if we
are really concerned about it. If other callers want to do something
similar, maybe we should add a more generic implementation in
backend_progress.c.I don't know if I agree. Making vacuum sleep more robust to handle
interrupts seems like a cleaner general solution than to add
even more code to handle this case or have to explain the behavior of
cost delay instrumentation in docs.
Another concern is the huge number of PqMsg_Progress messages sent by
parallel workers with that approach. In Bertrand's tests, he was seeing
nearly 350K interrupts for a ~19 minute vacuum (~300 interrupts per
second). That seems a bit extreme to me. I don't see how anyone could
possibly need stats about vacuum delays with that level of accuracy.
--
nathan
On Tue, Aug 13, 2024 at 4:30 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:
On Tue, Aug 13, 2024 at 01:12:30PM -0500, Imseih (AWS), Sami wrote:
None of this seems intractable to me. 1 Hz seems like an entirely
reasonable place to start, and it is very easy to change (or to evenmake
configurable). pg_stat_progress_vacuum might show slightly old values
in
this column, but that should be easy enough to explain in the docs if we
are really concerned about it. If other callers want to do something
similar, maybe we should add a more generic implementation in
backend_progress.c.I don't know if I agree. Making vacuum sleep more robust to handle
interrupts seems like a cleaner general solution than to add
even more code to handle this case or have to explain the behavior of
cost delay instrumentation in docs.Another concern is the huge number of PqMsg_Progress messages sent by
parallel workers with that approach. In Bertrand's tests, he was seeing
nearly 350K interrupts for a ~19 minute vacuum (~300 interrupts per
second). That seems a bit extreme to me. I don't see how anyone could
possibly need stats about vacuum delays with that level of accuracy.--
nathan
Fair point. If there is a clear benefit to spacing out the vacuum sleep
delay instrumentation, that could be taken up in that thread. This will
reduce the interrupts, but not eliminate them.
There could still be benefit to ensure that vacuum sleeps can deal
with interrupts and sleep the requested time consistently.
Regards,
Sami
Hi,
On Tue, Aug 13, 2024 at 04:30:46PM -0500, Nathan Bossart wrote:
On Tue, Aug 13, 2024 at 01:12:30PM -0500, Imseih (AWS), Sami wrote:
None of this seems intractable to me. 1 Hz seems like an entirely
reasonable place to start, and it is very easy to change (or to even make
configurable). pg_stat_progress_vacuum might show slightly old values in
this column, but that should be easy enough to explain in the docs if we
are really concerned about it. If other callers want to do something
similar, maybe we should add a more generic implementation in
backend_progress.c.I don't know if I agree. Making vacuum sleep more robust to handle
interrupts seems like a cleaner general solution than to add
even more code to handle this case or have to explain the behavior of
cost delay instrumentation in docs.Another concern is the huge number of PqMsg_Progress messages sent by
parallel workers with that approach. In Bertrand's tests, he was seeing
nearly 350K interrupts for a ~19 minute vacuum (~300 interrupts per
second). That seems a bit extreme to me. I don't see how anyone could
possibly need stats about vacuum delays with that level of accuracy.
I gave it more thoughts and I don't think we have to choose between the two.
The 1 Hz approach reduces the number of interrupts and Sami's patch provides a
way to get "accurate" delay in case of interrupts. I think both have their own
benefit.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Aug 14, 2024 at 06:00:06AM +0000, Bertrand Drouvot wrote:
I gave it more thoughts and I don't think we have to choose between the two.
The 1 Hz approach reduces the number of interrupts and Sami's patch provides a
way to get "accurate" delay in case of interrupts. I think both have their own
benefit.
Is it really that important to delay with that level of accuracy? In most
cases, the chances of actually interrupting a given vacuum delay point are
pretty small. Even in the extreme scenario you tested with ~350K
interrupts in a 19 minute vacuum, you only saw a 10-15% difference in total
time. I wouldn't say I'm diametrically opposed to this patch, but I do
think we need to carefully consider whether it's worth the extra code.
Separately, I've been wondering whether it's worth allowing the sleep to be
interrupted in certain cases, such as SIGINT and SIGTERM. That should
address one of Heikki's points.
--
nathan
time. I wouldn't say I'm diametrically opposed to this patch, but I do
think we need to carefully consider whether it's worth the extra code.
FWIW, besides the patch that Bertrand is proposing [1]/messages/by-id/ZmaXmWDL829fzAVX@ip-10-97-1-34.eu-west-3.compute.internal, there is another parallel
vacuum case being discussed to allow for parallel heap scan [2]/messages/by-id/CAD21AoAEfCNv-GgaDheDJ+s-p_Lv1H24AiJeNoPGCmZNSwL1YA@mail.gmail.com.
Being able to support both instrumentation of sleep time by a parallel workers
and ensuring that actual sleep times are as close as possible to the
requested times is a good think, IMO.
Separately, I've been wondering whether it's worth allowing the sleep to be
interrupted in certain cases, such as SIGINT and SIGTERM. That should
address one of Heikki's points.
An idea may be to check for pending interrupts inside the
pg_usleep_non_interruptible nanosleep loop. If there is a
pending interrupt and the interrupt is QueryCancelPending or
ClientConnectionLost, we can break out immediately.
I am not sure yet how this can work for Windows, since for
this patch, we are using a simple SleepEx call which is
non-interruptible anyhow.
Is it worth the effort and even more code to deal with specific
Interrupts for such short sleeps ( less than 100ms for vacuum at most )?
I am also thinking that pg_usleep_non_interruptuble routine should have
a cap on the sleep time allowed. That cap can be 100ms to match the
max vacuum_cost_delay. This will prevent anyone from trying to use
this API for much longer sleeps.
What do you think?
[1]: /messages/by-id/ZmaXmWDL829fzAVX@ip-10-97-1-34.eu-west-3.compute.internal
[2]: /messages/by-id/CAD21AoAEfCNv-GgaDheDJ+s-p_Lv1H24AiJeNoPGCmZNSwL1YA@mail.gmail.com
Regards,
Sami
On Wed, Aug 14, 2024 at 9:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
Another concern is the huge number of PqMsg_Progress messages sent by
parallel workers with that approach. In Bertrand's tests, he was seeing
nearly 350K interrupts for a ~19 minute vacuum (~300 interrupts per
second). That seems a bit extreme to me. I don't see how anyone could
possibly need stats about vacuum delays with that level of accuracy.
I suspect CF #5118 would fix lots of cases of ProcSignal() senders
going berserk, because it deletes SendProcSignal(), and introduces
SendInterrupt(), which calls SetLatch(), which doesn't send a signal if
the latch is already set. Even if the latch is not already set, it
only sends a signal if the latch is currently being waited on
("maybe_sleeping" flag). Even when it sends a signal, it goes to a
signalfd, kqueue or NT event flag on common platforms.
Of course that is only talking about the receiving side. I'm sure we
can improve the senders too. There's nothing we can do about NOTIFY,
because that's under user control, but that PqMsg_Progress case sounds
pretty bad, and the recovery conflict system could probably be made
more precise in its logic about who to wake up and when, etc.
Other backends going bananas with SendProcSignal() is the reason
dsm_impl_posix_resize() has to block signals while calling
posix_fallocate(). Unlike nanosleep(), which you can fix by tracking
remaining time, posix_fallocate() is all-or-nothing, it has no way to
report partial progress, so it must therefore undo its work if
interrupted, so its EINTR retry loop could get stuck forever when
other backends are trigger-happy with signals, which was a real
production issue. I guess both of these issues go away in practice if
CF #5118 goes in.
On Sun, Aug 18, 2024 at 11:12 AM Thomas Munro <thomas.munro@gmail.com> wrote:
I guess both of these issues go away in practice if
CF #5118 goes in.
To be more precise, if you just keep doing pg_usleep() the issue goes
away, and likewise for posix_fallocate() it goes away... But if you
switch to WaitLatchUs() so you can handle latch wakeups in vacuum
delays, which really you should because the latch might be an urgent
request for you to CHECK_FOR_INTERRUPTS(), because another backend is
waiting for all backends to service a ProcSignalBarrier (we need a new
name for that), well now you'll get wakeups, so you're back to square
one if someone is sending them very fast.
Hi,
On Thu, Aug 15, 2024 at 04:13:29PM -0500, Nathan Bossart wrote:
On Wed, Aug 14, 2024 at 06:00:06AM +0000, Bertrand Drouvot wrote:
I gave it more thoughts and I don't think we have to choose between the two.
The 1 Hz approach reduces the number of interrupts and Sami's patch provides a
way to get "accurate" delay in case of interrupts. I think both have their own
benefit.Is it really that important to delay with that level of accuracy? In most
cases, the chances of actually interrupting a given vacuum delay point are
pretty small. Even in the extreme scenario you tested with ~350K
interrupts in a 19 minute vacuum, you only saw a 10-15% difference in total
time. I wouldn't say I'm diametrically opposed to this patch, but I do
think we need to carefully consider whether it's worth the extra code.
I'm not 100% sure that it is worth it but on OTOH I'm thinking that could be the
case for someone that cares enough to change the cost delay from say 2ms to 3ms.
I mean, given the granularity we expose in the cost delay, one could expect to
get "accurate" delay. The doc is cautious enough to mention that "such delays may
not be measured accurately on older platforms" which makes me think that could
be worth to implement it.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Tue, Aug 13, 2024 at 11:40:27AM -0500, Nathan Bossart wrote:
On Tue, Aug 13, 2024 at 11:07:46AM -0500, Imseih (AWS), Sami wrote:
Having to add special handling to space out instrumentation
directly in vacuum_delay_point seems very odd to me. I don't
think vacuum_delay_point should have to worry about this.Also,
1/ what is an appropriate interval to collect these stats?
2/ What if there are other callers in the future that wish
to instrument parallel vacuum workers? they will need to implement
similar logic.None of this seems intractable to me. 1 Hz seems like an entirely
reasonable place to start, and it is very easy to change (or to even make
configurable). pg_stat_progress_vacuum might show slightly old values in
this column, but that should be easy enough to explain in the docs if we
are really concerned about it. If other callers want to do something
similar, maybe we should add a more generic implementation in
backend_progress.c.
As it looks like we have a consensus that reducing the number of interrupts also
makes sense, I just provided a rebase version of the 1 Hz version (see [0]/messages/by-id/ZsSQnS9OW9EWSOk4@ip-10-97-1-34.eu-west-3.compute.internal, that
also makes clear in the doc that the new field might show slightly old values).
[0]: /messages/by-id/ZsSQnS9OW9EWSOk4@ip-10-97-1-34.eu-west-3.compute.internal
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
As it looks like we have a consensus that reducing the number of interrupts also
makes sense, I just provided a rebase version of the 1 Hz version (see [0], that
also makes clear in the doc that the new field might show slightly old values).
That makes sense. However, I suspect the "1 Hz" work code will no longer
be needed if CF entry 5118 [1]https://commitfest.postgresql.org/49/5118/ mentioned by Thomas [2]/messages/by-id/CA+hUKG+f-nEc_SowDLW1JMUa6Of5sCK-JZ=v-KhL1xgXk83fiw@mail.gmail.com a few days back
goes through. Maybe this extra work can be removed if [1]https://commitfest.postgresql.org/49/5118/ goes through.
What do you think?
With regards to CF 5118 and what it means to the current discussion, below
are my thoughts.
I tested with both CF 5118 [1]https://commitfest.postgresql.org/49/5118/ and the cost-delay tracking patch. With that in place,
pg_usleep is able to sleep the full requested, as mentioned by Thomas [3]/messages/by-id/CA+hUKGKpo3fM=rnfdxHqt+jNGh_zUNcL1ob4hMsb=jFfKn9Aag@mail.gmail.com. This is
because certain interrupts like Parallel Message and others are not signaled
by SIGUSR1 any longer, but with latches.
From this discussion, there is desire for a sleep function that:
1/ Sleeps for the full duration of the requested time
2/ Continues to handle important interrupts during the sleep
While something like CF 5118 will take care of point #1, it will not deal
with #2. Also, the v11 [4]/messages/by-id/e3297b5e-0b33-4f45-afcd-4b00ba0b3547@gmail.com patch does not do #2 either. So I think
in the sleep loop, we need a C_F_I call. The same type of loop can
also be used to call WaitForSingleObject.
If CF 5118 gets committed, will still need similar loop that calls C_F_I,
but the function will need to call WaitLatchUs [5]/messages/by-id/CA+hUKGKVbJE59JkwnUj5XMY+-rzcTFciV9vVC7i=LUfWPds8Xw@mail.gmail.com.
Thoughts?
--
Sami
[1]: https://commitfest.postgresql.org/49/5118/
[2]: /messages/by-id/CA+hUKG+f-nEc_SowDLW1JMUa6Of5sCK-JZ=v-KhL1xgXk83fiw@mail.gmail.com
[3]: /messages/by-id/CA+hUKGKpo3fM=rnfdxHqt+jNGh_zUNcL1ob4hMsb=jFfKn9Aag@mail.gmail.com
[4]: /messages/by-id/e3297b5e-0b33-4f45-afcd-4b00ba0b3547@gmail.com
[5]: /messages/by-id/CA+hUKGKVbJE59JkwnUj5XMY+-rzcTFciV9vVC7i=LUfWPds8Xw@mail.gmail.com
Hi,
On Tue, Aug 20, 2024 at 02:25:19PM -0500, Sami Imseih wrote:
As it looks like we have a consensus that reducing the number of interrupts also
makes sense, I just provided a rebase version of the 1 Hz version (see [0], that
also makes clear in the doc that the new field might show slightly old values).That makes sense. However, I suspect the "1 Hz" work code will no longer
be needed if CF entry 5118 [1] mentioned by Thomas [2] a few days back
goes through. Maybe this extra work can be removed if [1] goes through.
What do you think?
Yeah agree that the "1 Hz" work wouldn't be needed anymore if the CF entry 5118
goes through *and* if vacuum delays keep doing pg_usleep() (as the leader won't
receive SIGUSR1 from the parallel workers while executing nanosleep() anymore).
It could still receive SIGHUP or such but that's outside of the PqMsg_Progress
case though.
With regards to CF 5118 and what it means to the current discussion, below
are my thoughts.I tested with both CF 5118 [1] and the cost-delay tracking patch. With that in place,
pg_usleep is able to sleep the full requested, as mentioned by Thomas [3]. This is
because certain interrupts like Parallel Message and others are not signaled
by SIGUSR1 any longer, but with latches.
Yeah.
From this discussion, there is desire for a sleep function that:
1/ Sleeps for the full duration of the requested time
2/ Continues to handle important interrupts during the sleepWhile something like CF 5118 will take care of point #1,
I'm not sure, even with CF entry 5118, nanosleep() could be interrupted. But I
agree that the leader won't be interrupted by PqMsg_Progress anymore.
it will not deal
with #2. Also, the v11 [4] patch does not do #2 either. So I think
in the sleep loop, we need a C_F_I call. The same type of loop can
also be used to call WaitForSingleObject.If CF 5118 gets committed, will still need similar loop that calls C_F_I,
but the function will need to call WaitLatchUs [5].Thoughts?
If CF 5118 gets committed, then I think we would need to move to using WaitLatchUs()
to react to urgent request. We'd also need to find a way to ensure that we'd
wait for the full duration of the requested time depending of the reason why we
waked up (well, only if we agree that 1/ is needed and I'm not sure we got a
consensus).
So I think that:
1. we should still implement the "1 Hz" stuff as 1.1/ it could be useful if CF
5118 gets committed and we move to WaitLatchUs() and 2.2/ it won't break anything
if CF gets committed and we don't move to WaitLatchUs(). For 1.1 it would still
prevent the leader to be waked up too frequently by the parallel workers.
2. still discuss the "need" and get a consensus regarding a sleep that could
ensure the wait duration (in some cases and depending of the reason why the
process is waked up).
Thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From this discussion, there is desire for a sleep function that:
1/ Sleeps for the full duration of the requested time
2/ Continues to handle important interrupts during the sleepWhile something like CF 5118 will take care of point #1,
I'm not sure, even with CF entry 5118, nanosleep() could be interrupted. But I
agree that the leader won't be interrupted by PqMsg_Progress anymore.
Correct.
1. we should still implement the "1 Hz" stuff as 1.1/ it could be useful if CF
5118 gets committed and we move to WaitLatchUs() and 2.2/ it won't break anything
if CF gets committed and we don't move to WaitLatchUs(). For 1.1 it would still
prevent the leader to be waked up too frequently by the parallel workers.
Yes, regardless of any changes that may occur in the future that change the behaior
of pg_usleep, preventing a leader from being woken up too frequently is
good to have. The "1 Hz" work is still good to have.
2. still discuss the "need" and get a consensus regarding a sleep that could
ensure the wait duration (in some cases and depending of the reason why the
process is waked up).
Unless there is objection, I will withdraw the CF [1]https://commitfest.postgresql.org/49/5161/ entry for this patch next week.
This discussion however should be one of the points that CF 5118 must deal with.
That is, 5118 will change the behavior of pg_usleep when dealing with interrupts
previously signaled by SIGUSR1.
[1]: https://commitfest.postgresql.org/49/5161/
Regards,
Sami